Commit Graph

Tejun Heo
54d2e717bc sched_ext: Remove scx_ops_cpu_preempt static_key
scx_ops_cpu_preempt is used to encode whether ops.cpu_acquire/release() are
implemented into a static_key. These tests aren't hot enough for static_key
usage to make any meaningful difference and are made to use a static_key
mostly because there was no reason not to. However, global static_keys can't
work with the planned hierarchical multiple scheduler support. Remove the
static_key and instead use an internal ops flag SCX_OPS_HAS_CPU_PREEMPT to
record and test whether ops.cpu_acquire/release() are implemented.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.
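
As a rough sketch of the shape of the change (identifiers other than
SCX_OPS_HAS_CPU_PREEMPT are illustrative placeholders, not the actual kernel
symbols):

	/* before: implementation status encoded in a global static_key */
	if (static_branch_unlikely(&scx_ops_cpu_preempt))
		run_cpu_acquire_release(rq);

	/* after: test an internal ops flag recorded at enable time; per-ops
	 * state keeps working with hierarchical multi-scheduler support */
	if (ops_flags & SCX_OPS_HAS_CPU_PREEMPT)
		run_cpu_acquire_release(rq);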

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:06:00 -10:00
Tejun Heo
cc39454c34 sched_ext: Remove scx_ops_enq_* static_keys
scx_ops_enq_last/exiting/migration_disabled are used to encode the
corresponding SCX_OPS_ flags into static_keys. These flags aren't hot enough
for static_key usage to make any meaningful difference and are made
static_keys mostly because there was no reason not to. However, global
static_keys can't work with the planned hierarchical multiple scheduler
support. Remove the static_keys and test the ops flags directly.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:05:59 -10:00
Tejun Heo
d75ee2d678 sched_ext: Indentation updates
Purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:05:59 -10:00
Tejun Heo
294f5ff474 sched_ext: Merge branch 'for-6.15-fixes' into for-6.16
Pull for-6.15-fixes to receive:

 e776b26e37 ("sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings")

which conflicts with:

 1a7ff7216c ("sched_ext: Drop "ops" from scx_ops_enable_state and friends")

The former removes code updated by the latter. Resolved by removing the
updated section.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08 08:56:57 -10:00
Tejun Heo
bc08b15b54 sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation
SCX_OPS_HAS_CGROUP_WEIGHT was only used to suppress the missing cgroup
weight support warnings. Now that the warnings are removed, the flag doesn't
do anything. Mark it for deprecation and remove its usage from scx_flatcg.

v2: Actually include the scx_flatcg update.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-04-08 08:53:52 -10:00
Tejun Heo
e776b26e37 sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings
sched_ext generates warnings when cpu.weight / cpu.idle are set to
non-default values if the BPF scheduler doesn't implement weight support.
These warnings don't provide much value while adding constant annoyance. A
BPF scheduler may not implement any particular behavior and there's nothing
particularly special about missing cgroup weight support. Drop the warnings.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08 08:00:49 -10:00
Breno Leitao
47068309b5 sched_ext: Use kvzalloc for large exit_dump allocation
Replace kzalloc with kvzalloc for the exit_dump buffer allocation, which
can require large contiguous memory depending on the implementation.
This change prevents allocation failures by allowing the system to fall
back to vmalloc when contiguous memory allocation fails.

Since this buffer is only used for debugging purposes, physical memory
contiguity is not required, making vmalloc a suitable alternative.
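
A minimal sketch of the allocation pattern (dump_len is a placeholder, not
the actual field name):

	char *buf;

	/* kvzalloc() tries kmalloc first and transparently falls back to
	 * vmalloc for large sizes; fine here since the dump buffer doesn't
	 * need physically contiguous memory */
	buf = kvzalloc(dump_len, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* ... fill and emit the debug dump ... */

	kvfree(buf);	/* handles both kmalloc- and vmalloc-backed memory */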

Cc: stable@vger.kernel.org
Fixes: 07814a9439 ("sched_ext: Print debug dump after an error exit")
Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08 07:53:27 -10:00
Andrea Righi
683d2d0fab sched_ext: idle: Introduce scx_bpf_select_cpu_and()
Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply
the built-in idle CPU selection policy to a subset of allowed CPUs.

This new helper is basically an extension of scx_bpf_select_cpu_dfl().
However, when an idle CPU can't be found, it returns a negative value
instead of @prev_cpu, aligning its behavior more closely with
scx_bpf_pick_idle_cpu().

It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce
strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to
request only a full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying
the built-in selection logic.

With this helper, BPF schedulers can apply the built-in idle CPU
selection policy restricted to any arbitrary subset of CPUs.

Example usage
=============

Possible usage in ops.select_cpu():

s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
	s32 cpu;

	cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
	if (cpu >= 0) {
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
		return cpu;
	}

	return prev_cpu;
}

Results
=======

Load distribution on a 4-socket, 4-core-per-socket system, simulated
using virtme-ng, running a modified version of scx_bpfland that uses
scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs:

 $ vng --cpu 16,sockets=4,cores=4,threads=1
 ...
 $ stress-ng -c 16
 ...
 $ htop
 ...
   0[                         0.0%]   8[||||||||||||||||||||||||100.0%]
   1[                         0.0%]   9[||||||||||||||||||||||||100.0%]
   2[                         0.0%]  10[||||||||||||||||||||||||100.0%]
   3[                         0.0%]  11[||||||||||||||||||||||||100.0%]
   4[                         0.0%]  12[||||||||||||||||||||||||100.0%]
   5[                         0.0%]  13[||||||||||||||||||||||||100.0%]
   6[                         0.0%]  14[||||||||||||||||||||||||100.0%]
   7[                         0.0%]  15[||||||||||||||||||||||||100.0%]

With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across
all the available CPUs.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07 07:13:52 -10:00
Andrea Righi
23c63a9652 sched_ext: idle: Explicitly pass allowed cpumask to scx_select_cpu_dfl()
Modify scx_select_cpu_dfl() to take the allowed cpumask as an explicit
argument, instead of implicitly using @p->cpus_ptr.

This prepares for future changes where arbitrary cpumasks may be passed
to the built-in idle CPU selection policy.

This is a pure refactoring with no functional changes.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07 07:13:52 -10:00
Tejun Heo
29b49be6c9 sched_ext: Drop "ops" from SCX_OPS_TASK_ITER_BATCH
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from SCX_OPS_TASK_ITER_BATCH.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 10:10:39 -10:00
Tejun Heo
1a2469403e sched_ext: Drop "ops" from scx_ops_{init|exit|enable|disable}[_task]() and friends
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_{init|exit|enable|disable}[_task]() and friends.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:49 -10:00
Tejun Heo
c5f22258f5 sched_ext: Drop "ops" from scx_ops_exit(), scx_ops_error() and friends
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_exit(), scx_ops_error() and friends.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:49 -10:00
Tejun Heo
8c6ee86246 sched_ext: Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends. Update
scx_show_state.py accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:49 -10:00
Tejun Heo
a50c365f99 sched_ext: Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled.
Update scx_show_state.py accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:48 -10:00
Tejun Heo
1a7ff7216c sched_ext: Drop "ops" from scx_ops_enable_state and friends
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_enable_state and friends. Update scx_show_state.py
accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:48 -10:00
Andrea Righi
f0c6eab5e4 sched_ext: initialize built-in idle state before ops.init()
A BPF scheduler may want to use the built-in idle cpumasks in ops.init()
before the scheduler is fully initialized, either directly or through a
BPF timer for example.

However, this would result in an error, since the idle state has not
been properly initialized yet.

This can be easily verified by modifying scx_simple to call
scx_bpf_get_idle_cpumask() in ops.init():

$ sudo scx_simple

DEBUG DUMP
===========================================================================

scx_simple[121] triggered exit kind 1024:
  runtime error (built-in idle tracking is disabled)
...

Fix this by properly initializing the idle state before ops.init() is
called. With this change applied:

$ sudo scx_simple
local=2 global=0
local=19 global=11
local=23 global=11
...
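
For reference, a reproducer along the lines described above could look like
this sketch (not the exact diff; SHARED_DSQ and the init shape follow stock
scx_simple):

s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
	const struct cpumask *idle;

	/* before this fix, calling the kfunc from ops.init() exited with
	 * "built-in idle tracking is disabled" */
	idle = scx_bpf_get_idle_cpumask();
	bpf_printk("cpu0 idle at init: %d", bpf_cpumask_test_cpu(0, idle));
	scx_bpf_put_idle_cpumask(idle);

	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}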

Fixes: d73249f887 ("sched_ext: idle: Make idle static keys private")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-26 10:47:50 -10:00
Jake Hillion
a8897ed852 sched_ext: create_dsq: Return -EEXIST on duplicate request
create_dsq and therefore the scx_bpf_create_dsq kfunc currently silently
ignore duplicate entries. As a sched_ext scheduler creates each DSQ for a
different purpose, this is surprising behaviour.

Replace rhashtable_insert_fast, which ignores duplicates, with
rhashtable_lookup_insert_fast, which reports duplicates (though it doesn't
return their value). The rest of the code is structured correctly and
this now returns -EEXIST.

Tested by adding an extra scx_bpf_create_dsq to scx_simple. Previously
this was ignored; now init fails with -EEXIST (-17). Also ran scx_lavd
which continued to work well.
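
A sketch of the test described above (SHARED_DSQ as in scx_simple):

s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
	s32 ret;

	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
	if (ret)
		return ret;

	/* a second create of the same DSQ id used to be silently ignored;
	 * it now fails init with -EEXIST (-17) */
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}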

Signed-off-by: Jake Hillion <jake@hillion.co.uk>
Acked-by: Andrea Righi <arighi@nvidia.com>
Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-26 10:30:36 -10:00
Linus Torvalds
32b22538be Scheduler updates for v6.15:
[ Merge note, these two commits are identical:
 
    - f3fa0e40df ("sched/clock: Don't define sched_clock_irqtime as static key")
    - b9f2b29b94 ("sched: Don't define sched_clock_irqtime as static key")
 
   The first one is a cherry-picked version of the second, and the first one
   is already upstream. ]
 
 Core & fair scheduler changes:
 
   - Cancel the slice protection of the idle entity (Zihan Zhou)
   - Reduce the default slice to avoid tasks getting an extra tick
     (Zihan Zhou)
   - Force propagating min_slice of cfs_rq when {en,de}queue tasks
     (Tianchen Ding)
   - Refactor can_migrate_task() to eliminate looping (I Hsin Cheng)
   - Add unlikely branch hints to several system calls (Colin Ian King)
   - Optimize current_clr_polling() on certain architectures (Yujun Dong)
 
 Deadline scheduler: (Juri Lelli)
 
   - Remove redundant dl_clear_root_domain call
   - Move dl_rebuild_rd_accounting to cpuset.h
 
 Uclamp:
 
   - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan)
   - Optimize sched_uclamp_used static key enabling (Xuewen Yan)
 
 Scheduler topology support: (Juri Lelli)
 
   - Ignore special tasks when rebuilding domains
   - Add wrappers for sched_domains_mutex
   - Generalize unique visiting of root domains
   - Rebuild root domain accounting after every update
   - Remove partition_and_rebuild_sched_domains
   - Stop exposing partition_sched_domains_locked
 
 RSEQ: (Michael Jeanson)
 
   - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
   - Fix segfault on registration when rseq_cs is non-zero
   - selftests: Add rseq syscall errors test
   - selftests: Ensure the rseq ABI TLS is actually 1024 bytes
 
 Membarriers:
 
   - Fix redundant load of membarrier_state (Nysal Jan K.A.)
 
 Scheduler debugging:
 
   - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
   - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)
 
 Fixes and cleanups:
 
   - Always save/restore x86 TSC sched_clock() on suspend/resume
    (Guilherme G. Piccoli)
 
   - Misc fixes and cleanups (Thorsten Blum, Juri Lelli,
     Sebastian Andrzej Siewior)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfejsoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ivkhAAwBF2tYRBS1oIHcC/OKK3JJoHVDp2LFbU
 9sm5S3ZlGD/Ns2fbpY+9A8UFgUFfjYiTSV7hvf2B9Vge0XSxTmMNFu/MdxLBbo9r
 w6GSeNcNDQKpjEGLkrmPFsa2fiYI4dmH0IzDbS9V2cNPk470QBKjAKXNPaSER691
 n2wLnQq+m5o4gXnPjnSz6RrrzisRnm2GOWnDV/iqR47pZFNlX2wWlo3s5r7//Hw0
 a+QfEfpgKehhy/VSDXmSAgpqnNffjc78yBV6LNoVUddwahnOWiQMS3XViOqgy5VO
 jUGBrzW+sKkdRMBppxwJ/0XWgHGC27amIgnU0ZE5u+eiUEu8H9qWl1cRCFxyeB0O
 8+WNfwmkH+FPWUdsn84kdePhSsZy6HfM6h44Xe0hx1V7tQXEXfbPzK3TnQg8Ktt1
 Ky6ctbZt4cGpqGQuIqvba21A/racrD/DgvB7mHeZksnqZoKTDwxhT/nlQGpuwPoy
 SJYd1ynFVJvfC69SwMdwnaglimvEZx1GfT0o5XtCMslY5NkWCou5u+e65WX7ccU5
 94wBCwI1/+KiFMJZp6TlPw07Q/Hsj9dDryxLc3OunU3zMVt3++1GZnBS/eb5FN1A
 5TlAEpxgH9c5Q4/XJFCvx21DrTVHSuIrR6naK91bgqHo0cEfaHGtO/ejPtJGSfxe
 YIFnnu1dhRw=
 =SuQE
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core & fair scheduler changes:

   - Cancel the slice protection of the idle entity (Zihan Zhou)
   - Reduce the default slice to avoid tasks getting an extra tick
     (Zihan Zhou)
   - Force propagating min_slice of cfs_rq when {en,de}queue tasks
     (Tianchen Ding)
   - Refactor can_migrate_task() to eliminate looping (I Hsin Cheng)
   - Add unlikely branch hints to several system calls (Colin Ian King)
   - Optimize current_clr_polling() on certain architectures (Yujun
     Dong)

  Deadline scheduler: (Juri Lelli)
   - Remove redundant dl_clear_root_domain call
   - Move dl_rebuild_rd_accounting to cpuset.h

  Uclamp:
   - Use the uclamp_is_used() helper instead of open-coding it (Xuewen
     Yan)
   - Optimize sched_uclamp_used static key enabling (Xuewen Yan)

  Scheduler topology support: (Juri Lelli)
   - Ignore special tasks when rebuilding domains
   - Add wrappers for sched_domains_mutex
   - Generalize unique visiting of root domains
   - Rebuild root domain accounting after every update
   - Remove partition_and_rebuild_sched_domains
   - Stop exposing partition_sched_domains_locked

  RSEQ: (Michael Jeanson)
   - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
   - Fix segfault on registration when rseq_cs is non-zero
   - selftests: Add rseq syscall errors test
   - selftests: Ensure the rseq ABI TLS is actually 1024 bytes

  Membarriers:
   - Fix redundant load of membarrier_state (Nysal Jan K.A.)

  Scheduler debugging:
   - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
   - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)

  Fixes and cleanups:
   - Always save/restore x86 TSC sched_clock() on suspend/resume
     (Guilherme G. Piccoli)
   - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian
     Andrzej Siewior)"

* tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling()
  sched/debug: Remove CONFIG_SCHED_DEBUG
  sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files
  sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation
  sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
  sched/debug: Make 'const_debug' tunables unconditional __read_mostly
  sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
  rseq/selftests: Fix namespace collision with rseq UAPI header
  include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h
  sched/topology: Stop exposing partition_sched_domains_locked
  cgroup/cpuset: Remove partition_and_rebuild_sched_domains
  sched/topology: Remove redundant dl_clear_root_domain call
  sched/deadline: Rebuild root domain accounting after every update
  sched/deadline: Generalize unique visiting of root domains
  sched/topology: Wrappers for sched_domains_mutex
  sched/deadline: Ignore special tasks when rebuilding domains
  tracing: Use preempt_model_str()
  xtensa: Rely on generic printing of preemption model
  x86: Rely on generic printing of preemption model
  s390: Rely on generic printing of preemption model
  ...
2025-03-24 21:28:12 -07:00
Ingo Molnar
f7d2728cc0 sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
The scheduler has this special SCHED_WARN() facility that
depends on CONFIG_SCHED_DEBUG.

Since CONFIG_SCHED_DEBUG is getting removed, convert
SCHED_WARN() to WARN_ON_ONCE().

Note that the warning output isn't 100% equivalent:

   #define SCHED_WARN_ON(x)      WARN_ONCE(x, #x)

Because SCHED_WARN_ON() would output the 'x' condition
as well, while WARN_ON_ONCE() will only show a backtrace.

Hopefully these are rare enough to not really matter.

If it does, we should probably introduce a new WARN_ON()
variant that outputs the condition in stringified form,
or improve WARN_ON() itself.
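
One possible shape for such a variant, shown purely as an illustration (this
macro does not exist in the kernel):

/* warn once and include the stringified condition in the output */
#define WARN_ON_ONCE_STR(x)	WARN_ONCE(x, "%s\n", #x)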

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250317104257.3496611-2-mingo@kernel.org
2025-03-19 22:20:53 +01:00
Andrea Righi
e4855fc90e sched_ext: idle: Refactor scx_select_cpu_dfl()
Make scx_select_cpu_dfl() more consistent with the other idle-related
APIs by returning a negative value when an idle CPU isn't found.

No functional changes, this is purely a refactoring.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14 08:17:11 -10:00
Andrea Righi
c414c2171c sched_ext: idle: Honor idle flags in the built-in idle selection policy
Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(),
to enforce strict selection criteria, such as selecting an idle CPU
strictly within @prev_cpu's node or choosing only a fully idle SMT core.

This functionality will be exposed through a dedicated kfunc in a
separate patch.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14 08:17:01 -10:00
Andrea Righi
97e13ecb02 sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()
scx_bpf_reenqueue_local() can be invoked from ops.cpu_release() to give
tasks that are queued to the local DSQ a chance to migrate to other
CPUs, when a CPU is taken by a higher scheduling class.

However, there is no point re-enqueuing tasks that can only run on that
particular CPU, as they would simply be re-added to the same local DSQ
without any benefit.

Therefore, skip per-CPU tasks in scx_bpf_reenqueue_local().
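
Conceptually, the skip amounts to a check along these lines inside the
re-enqueue loop (a sketch, not the exact kernel hunk):

	/* a task that can only run on this CPU would just be re-added to
	 * the same local DSQ, so re-enqueueing it is pointless */
	if (p->nr_cpus_allowed == 1)
		continue;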

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-13 07:07:27 -10:00
Changwoo Min
71d078803c sched_ext: Add trace point to track sched_ext core events
Add tracing support to track sched_ext core events
(/sched_ext/sched_ext_event). This may be useful for debugging sched_ext
schedulers that trigger a particular event.

The trace point can be used like any other trace point, for example in
`perf trace` and BPF programs, as follows:

======
$> sudo perf trace -e sched_ext:sched_ext_event --filter 'name == "SCX_EV_ENQ_SLICE_DFL"'
======

======
struct tp_sched_ext_event {
	struct trace_entry ent;
	u32 __data_loc_name;
	s64 delta;
};

SEC("tracepoint/sched_ext/sched_ext_event")
int rtp_add_event(struct tp_sched_ext_event *ctx)
{
	char event_name[128];
	unsigned short offset = ctx->__data_loc_name & 0xFFFF;
	bpf_probe_read_str((void *)event_name, 128, (char *)ctx + offset);

	bpf_printk("name %s   delta %lld", event_name, ctx->delta);
	return 0;
}
======

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-04 08:06:17 -10:00
Changwoo Min
038730dc12 sched_ext: Change the event type from u64 to s64
The event count could be negative in the future,
so change the event type from u64 to s64.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-04 08:05:23 -10:00
Tejun Heo
8a9b1585e2 sched_ext: Merge branch 'for-6.14-fixes' into for-6.15
Pull for-6.14-fixes to receive:

  9360dfe4cb ("sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()")

which conflicts with:

  337d1b354a ("sched_ext: Move built-in idle CPU selection policy to a separate file")

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03 08:02:22 -10:00
Andrea Righi
9360dfe4cb sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids
range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel
crash.

To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and
trigger an scx error if an invalid CPU is specified.
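
Conceptually the added validation is along these lines (a sketch; the actual
code uses the kernel's internal helpers):

	if (prev_cpu < 0 || prev_cpu >= nr_cpu_ids) {
		/* report an scx error instead of dereferencing per-CPU
		 * data for a bogus CPU and crashing */
		scx_ops_error("invalid prev_cpu %d from the BPF scheduler",
			      prev_cpu);
		return prev_cpu;
	}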

Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03 07:55:48 -10:00
Tejun Heo
8fef0a3b17 sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()
a6250aa251 ("sched_ext: Handle cases where pick_task_scx() is called
without preceding balance_scx()") added a workaround to handle the cases
where pick_task_scx() is called without preceding balance_scx(), which is due
to a fair class bug where pick_task_fair() may return NULL after a true
return from balance_fair().

The workaround detects when pick_task_scx() is called without preceding
balance_scx() and emulates SCX_RQ_BAL_KEEP and triggers kicking to avoid
stalling. Unfortunately, the workaround code was testing whether @prev was
on SCX to decide whether to keep the task running. This is incorrect as the
task may be on SCX but no longer runnable.

This could lead to a non-runnable task being returned from pick_task_scx(),
which causes interesting confusions and failures. e.g. A common failure mode
is the task ending up with (!on_rq && on_cpu) state which can cause
potential wakers to busy loop, which can easily lead to deadlocks.

Fix it by testing whether @prev has SCX_TASK_QUEUED set. This makes
@prev_on_scx only used in one place. Open code the usage and improve the
comment while at it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Pat Cody <patcody@meta.com>
Fixes: a6250aa251 ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-25 08:28:52 -10:00
Andrea Righi
0e9b4c10e8 sched_ext: idle: Introduce scx_bpf_nr_node_ids()
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc
scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the
system.

BPF schedulers can use this information together with the new node-aware
kfuncs, for example to create per-node DSQs, validate node IDs, etc.
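
For example, a scheduler could use it to create one DSQ per NUMA node (a
sketch; NODE_DSQ_BASE and foo_init are arbitrary example names):

s32 BPF_STRUCT_OPS_SLEEPABLE(foo_init)
{
	u32 node;
	s32 ret;

	bpf_for(node, 0, scx_bpf_nr_node_ids()) {
		/* DSQ id encodes the node; the second argument hints that
		 * the DSQ's memory should be allocated on that node */
		ret = scx_bpf_create_dsq(NODE_DSQ_BASE + node, node);
		if (ret)
			return ret;
	}

	return 0;
}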

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-24 07:37:42 -10:00
Andrea Righi
48849271e6 sched_ext: idle: Per-node idle cpumasks
Using a single global idle mask can lead to inefficiencies and a lot of
stress on the cache coherency protocol on large systems with multiple
NUMA nodes, since all the CPUs can create really intense read/write
activity on the single global cpumask.

Therefore, split the global cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on large systems.

The concept is that each cpumask will track only the idle CPUs within
its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
In this way concurrent access to the idle cpumask will be restricted
within each NUMA node.

The split of multiple per-node idle cpumasks can be controlled using the
SCX_OPS_BUILTIN_IDLE_PER_NODE flag.

By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global
host-wide idle cpumask is used, maintaining the previous behavior.

NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via
SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will
trigger an scx error, since there are no system-wide cpumasks.

= Test =

Hardware:
 - System: DGX B200
    - CPUs: 224 SMT threads (112 physical cores)
    - Processor: INTEL(R) XEON(R) PLATINUM 8570
    - 2 NUMA nodes

Scheduler:
 - scx_simple [1] (so that we can focus on the built-in idle selection
   policy and not on the scheduling policy itself)

Test:
 - Run a parallel kernel build `make -j $(nproc)` and measure the average
   elapsed time over 10 runs:

          avg time | stdev
          ---------+------
 before:   52.431s | 2.895
  after:   50.342s | 2.895

= Conclusion =

Splitting the global cpumask into multiple per-NUMA cpumasks helped to
achieve a speedup of approximately +4% with this particular architecture
and test case.

The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @
2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.

On smaller systems, I haven't noticed any measurable regressions or
improvements with the same test (parallel kernel build) and scheduler
(scx_simple).

Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs
I observed an additional +2-2.5% performance improvement with the same
test.

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:20 -10:00
Andrea Righi
0aaaf89df8 sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows
BPF schedulers to select between using a global flat idle cpumask or
multiple per-node cpumasks.

This only introduces the flag and the mechanism to enable/disable this
feature without affecting any scheduling behavior.
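
A scheduler opts in from its struct_ops definition, roughly like this (names
other than the flag are illustrative):

SEC(".struct_ops.link")
struct sched_ext_ops foo_ops = {
	.select_cpu	= (void *)foo_select_cpu,
	.enqueue	= (void *)foo_enqueue,
	.init		= (void *)foo_init,
	/* use per-NUMA-node idle cpumasks instead of the global one; the
	 * system-wide scx_bpf_get_idle_cpu/smtmask() are then unavailable */
	.flags		= SCX_OPS_BUILTIN_IDLE_PER_NODE,
	.name		= "foo",
};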

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:20 -10:00
Andrea Righi
d73249f887 sched_ext: idle: Make idle static keys private
Make all the static keys used by the idle CPU selection policy private
to ext_idle.c. This avoids unnecessary exposure in headers and improves
code encapsulation.

Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:20 -10:00
Changwoo Min
ad3b301aa0 sched_ext: Provides a sysfs 'events' to expose core event counters
Add a sysfs entry at /sys/kernel/sched_ext/root/events to expose core
event counters through the file system interface. Each line of the file
shows the event name and its counter value.

In addition, the format of scx_dump_event() is adjusted as the event name
gets longer.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14 07:15:41 -10:00
Tejun Heo
3539c6411a sched_ext: Implement SCX_OPS_ALLOW_QUEUED_WAKEUP
A task wakeup can be either processed on the waker's CPU or bounced to the
wakee's previous CPU using an IPI (ttwu_queue). Bouncing to the wakee's CPU
avoids the waker's CPU locking and accessing the wakee's rq which can be
expensive across cache and node boundaries.

When the ttwu_queue path is taken, select_task_rq() and thus ops.select_cpu()
may be skipped in some cases (racing against the wakee switching out). As
this confused some BPF schedulers, there wasn't a good way for a BPF
scheduler to tell whether idle CPU selection has been skipped, ops.enqueue()
couldn't insert tasks into foreign local DSQs, and the performance
difference on machines with simple topologies was minimal, sched_ext
disabled ttwu_queue.

However, this optimization makes a noticeable difference on more complex
topologies and a BPF scheduler now has an easy way to tell whether
ops.select_cpu() was skipped since 9b671793c7 ("sched_ext, scx_qmap: Add
and use SCX_ENQ_CPU_SELECTED") and can insert tasks into foreign local DSQs
since 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct
dispatches").

Implement SCX_OPS_ALLOW_QUEUED_WAKEUP which allows BPF schedulers to choose
to enable ttwu_queue optimization.
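
A scheduler that sets SCX_OPS_ALLOW_QUEUED_WAKEUP can detect a skipped
ops.select_cpu() from ops.enqueue(), roughly as follows (SHARED_DSQ and the
foo_* names are illustrative):

void BPF_STRUCT_OPS(foo_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* with queued wakeups enabled, ops.select_cpu() may have been
	 * skipped for this wakeup; SCX_ENQ_CPU_SELECTED says whether it ran */
	if (!(enq_flags & SCX_ENQ_CPU_SELECTED)) {
		/* no CPU was picked - fall back to a shared DSQ (or pick a
		 * CPU here and use SCX_DSQ_LOCAL_ON) */
		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
		return;
	}

	scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
}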

v2: Update the patch description and comment re. ops.select_cpu() being
    skipped in some cases as opposed to always as per Neel.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Neel Natu <neelnatu@google.com>
Reported-by: Barret Rhoden <brho@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-13 07:30:46 -10:00
Chuyi Zhou
f5717c93a1 sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx
Currently, when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup
of the current task, the following error occurs:

scx_foo[3795244] triggered exit kind 1024:
  runtime error (called on a task not being operated on)

The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK()
when calling ops.tick(), which triggers the error during the subsequent
scx_kf_allowed_on_arg_tasks() check.

SCX_CALL_OP_TASK() was first introduced in commit 36454023f5 ("sched_ext:
Track tasks that are subjects of the in-flight SCX operation") to ensure
the task's rq lock is held when accessing the task's sched_group. Since ops.tick()
is marked as SCX_KF_TERMINAL and task_tick_scx() is protected by the rq
lock, we can use SCX_CALL_OP_TASK() to avoid the above issue. Similarly,
the same changes should be made for ops.disable() and ops.exit_task(), as
they are also protected by task_rq_lock() and it's safe to access the
task's task_group.
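
For illustration, the kind of usage that used to trigger the error (foo_tick
is a placeholder name):

void BPF_STRUCT_OPS(foo_tick, struct task_struct *p)
{
	struct cgroup *cgrp;

	/* now safe: ops.tick() runs with p's rq lock held and is invoked
	 * via SCX_CALL_OP_TASK(), so p is a valid kfunc target */
	cgrp = scx_bpf_task_cgroup(p);
	bpf_printk("tick: pid %d cgid %llu", p->pid, cgrp->kn->id);
	bpf_cgroup_release(cgrp);
}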

Fixes: 36454023f5 ("sched_ext: Track tasks that are subjects of the in-flight SCX operation")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-13 06:57:33 -10:00
Tejun Heo
78e4690de4 Merge branch 'for-6.14-fixes' into for-6.15
Pull to receive f3f08c3acf ("sched_ext: Fix incorrect assumption about
migration disabled tasks in task_can_run_on_remote_rq()") which conflicts
with 26176116d9 ("sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in
the right spot") in for-6.15.
2025-02-10 10:45:43 -10:00
Tejun Heo
f3f08c3acf sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()
While fixing migration disabled task handling, 3296682157 ("sched_ext: Fix
migration disabled handling in targeted dispatches") assumed that a
migration disabled task's ->cpus_ptr would only have the pinned CPU. While
this is eventually true for migration disabled tasks that are switched out,
->cpus_ptr update is performed by migrate_disable_switch() which is called
right before context_switch() in __schedule(). However, the task is
enqueued earlier during pick_next_task() via put_prev_task_scx(), so there
is a race window where another CPU can see the task on a DSQ.

If the CPU tries to dispatch the migration disabled task while in that
window, task_allowed_on_cpu() will succeed and task_can_run_on_remote_rq()
will subsequently trigger SCHED_WARN(is_migration_disabled()).

  WARNING: CPU: 8 PID: 1837 at kernel/sched/ext.c:2466 task_can_run_on_remote_rq+0x12e/0x140
  Sched_ext: layered (enabled+all), task: runnable_at=-10ms
  RIP: 0010:task_can_run_on_remote_rq+0x12e/0x140
  ...
   <TASK>
   consume_dispatch_q+0xab/0x220
   scx_bpf_dsq_move_to_local+0x58/0xd0
   bpf_prog_84dd17b0654b6cf0_layered_dispatch+0x290/0x1cfa
   bpf__sched_ext_ops_dispatch+0x4b/0xab
   balance_one+0x1fe/0x3b0
   balance_scx+0x61/0x1d0
   prev_balance+0x46/0xc0
   __pick_next_task+0x73/0x1c0
   __schedule+0x206/0x1730
   schedule+0x3a/0x160
   __do_sys_sched_yield+0xe/0x20
   do_syscall_64+0xbb/0x1e0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fix it by converting the SCHED_WARN() back to a regular failure path. Also,
perform the migration disabled test before task_allowed_on_cpu() test so
that BPF schedulers which fail to handle migration disabled tasks can be
noticed easily.

While at it, adjust scx_ops_error() message for !task_allowed_on_cpu() case
for brevity and consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")
Acked-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Jake Hillion <jakehillion@meta.com>
2025-02-10 10:40:47 -10:00
Li RongQing
1a4e0d8682 sched_ext: Take NUMA node into account when allocating per-CPU cpumasks
per-CPU cpumasks are dominantly accessed from their own local CPUs,
so allocate them node-local to improve performance.
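
The pattern is roughly the following (a sketch, not the actual hunk):

	int cpu;

	for_each_possible_cpu(cpu) {
		/* place each CPU's scratch cpumask on that CPU's NUMA node */
		struct cpumask *mask = kzalloc_node(cpumask_size(), GFP_KERNEL,
						    cpu_to_node(cpu));

		if (!mask)
			return -ENOMEM;
		/* ... store mask in the CPU's per-CPU slot ... */
	}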

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-10 07:20:13 -10:00
Tejun Heo
eace54dff0 sched_ext: Add SCX_EV_ENQ_SKIP_MIGRATION_DISABLED
Count the number of times a migration disabled task is automatically
dispatched to its local DSQ.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08 20:39:11 -10:00
Tejun Heo
26176116d9 sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in the right spot
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE wasn't quite right in two aspects:

- It counted both migration disabled and offline events.

- It didn't count events from scx_bpf_dsq_move() path.

Fix it by moving the counting into task_can_run_on_remote_rq() which is
shared by both paths and can distinguish the different rejection conditions.
The argument @trigger_error is renamed to @enforce as it now does more than
just triggering an error.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08 20:39:11 -10:00
Tejun Heo
29ef4a2fcf Merge branch 'for-6.14-fixes' into for-6.15
Pull to receive:

- 2fa0fbeb69 ("sched_ext: Implement auto local dispatching of migration disabled tasks")
- 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")

as planned for-6.15 changes depend on them (e.g. adding event counter for
implicit migration disabled task handling).
2025-02-08 20:34:43 -10:00
Tejun Heo
3296682157 sched_ext: Fix migration disabled handling in targeted dispatches
A dispatch operation that can target a specific local DSQ -
scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task
can be migrated to the target CPU using task_can_run_on_remote_rq(). If the
task can't be migrated to the targeted CPU, it is bounced through a global
DSQ.

task_can_run_on_remote_rq() assumes that the task is on a CPU that's
different from the targeted CPU but the callers don't uphold the
assumption and may call the function when the task is already on the target
CPU. When such a task has migration disabled, task_can_run_on_remote_rq() ends
up incorrectly returning %false, unnecessarily bouncing the task to a global
DSQ.

Fix it by updating the callers to only call task_can_run_on_remote_rq() when
the task is on a different CPU than the target CPU. As this is a bit subtle,
for clarity and documentation:

- Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on
  the same CPU as the target CPU.

- is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger
  if the task is on a different CPU than the target CPU as the preceding
  task_allowed_on_cpu() test should fail beforehand. Convert the test into
  SCHED_WARN_ON().

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Fixes: 0366017e09 ("sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()")
Cc: stable@vger.kernel.org # v6.12+
2025-02-08 20:33:25 -10:00
Tejun Heo
2fa0fbeb69 sched_ext: Implement auto local dispatching of migration disabled tasks
Migration disabled tasks are special and pinned to their previous CPUs. They
tripped up some unsuspecting BPF schedulers as their ->nr_cpus_allowed may
not agree with the bits set in ->cpus_ptr. Make it easier for BPF schedulers
by automatically dispatching them to the pinned local DSQs by default. If a
BPF scheduler wants to handle migration disabled tasks explicitly, it can
set SCX_OPS_ENQ_MIGRATION_DISABLED.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-08 20:32:54 -10:00
Changwoo Min
6d3f8fb4b2 sched_ext: Add an event, SCX_EV_ENQ_SLICE_DFL
Add a core event, SCX_EV_ENQ_SLICE_DFL, which represents how many
tasks have been enqueued (or pick_task-ed or select_cpu-ed) with
a default time slice (SCX_SLICE_DFL).

Scheduling a task with SCX_SLICE_DFL unintentionally would be a source
of latency spikes because SCX_SLICE_DFL is relatively long (20 msec).
Thus, a soaring SCX_EV_ENQ_SLICE_DFL value would be a sign of BPF
scheduler bugs that cause latency spikes, especially when ops.select_cpu()
is provided.

__scx_add_event() is used since the caller holds an rq lock or p->pi_lock,
so preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-07 11:24:59 -10:00
Changwoo Min
d46457c31c sched_ext: Add an event, SCX_EV_BYPASS_DURATION
Add a core event, SCX_EV_BYPASS_DURATION, which represents the
total duration of bypass modes in nanoseconds.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
5c605cd33c sched_ext: Add an event, SCX_EV_BYPASS_DISPATCH
Add a core event, SCX_EV_BYPASS_DISPATCH, which represents how many
tasks have been dispatched in the bypass mode.

__scx_add_event() is used since the caller holds an rq lock or
p->pi_lock, so preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
4f7a38c7c9 sched_ext: Add an event, SCX_EV_BYPASS_ACTIVATE
Add a core event, SCX_EV_BYPASS_ACTIVATE, which represents how many
times the bypass mode has been triggered.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
824d4f2dce sched_ext: Add an event, SCX_EV_ENQ_SKIP_EXITING
Add a core event, SCX_EV_ENQ_SKIP_EXITING, which represents how many
times a task is enqueued to a local DSQ when exiting if
SCX_OPS_ENQ_EXITING is not set.

__scx_add_event() is used since the caller holds an rq lock,
so preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
7125660bc1 sched_ext: Add an event, SCX_EV_DISPATCH_KEEP_LAST
Add a core event, SCX_EV_DISPATCH_KEEP_LAST, which represents how many
times a task is continued to run without ops.enqueue() when
SCX_OPS_ENQ_LAST is not set.

__scx_add_event() is used since the caller holds an rq lock,
so preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:23:18 -10:00
Changwoo Min
9be0a1b0c8 sched_ext: Add an event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE
Add a core event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which represents how
many times a BPF scheduler tries to dispatch to an offlined local DSQ.

__scx_add_event() is used since the caller holds an rq lock,
so preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:23:18 -10:00
Changwoo Min
f7f6142107 sched_ext: Add an event, SCX_EV_SELECT_CPU_FALLBACK
Add a core event, SCX_EV_SELECT_CPU_FALLBACK, which represents how many times
ops.select_cpu() returns a CPU that the task can't use.

__scx_add_event() is used since the caller holds an rq lock,
so preemption has already been disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:23:18 -10:00
Changwoo Min
17103b8504 sched_ext: Implement event counter infrastructure
Collect the statistics of specific types of behavior in the sched_ext core,
which are not easily visible but still interesting to an scx scheduler.

An event type is defined in 'struct scx_event_stats.' When an event occurs,
its counter is accumulated using 'scx_add_event()' and '__scx_add_event()'
to per-CPU 'struct scx_event_stats' for efficiency. 'scx_bpf_events()'
aggregates all the per-CPU counters and exposes system-wide counters.

For convenience and readability of the code, 'scx_agg_event()' and
'scx_dump_event()' are provided.

The collected events can be observed after a BPF scheduler is unloaded;
before a new BPF scheduler is loaded, the per-CPU 'struct scx_event_stats'
are reset.
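
Conceptual kernel-side usage (illustrative only; call sites and exact
arguments follow the real macros' definitions):

	/* scx_add_event() takes care of preemption around the per-CPU
	 * update; callers that already hold an rq lock or otherwise run
	 * with preemption disabled use __scx_add_event() directly */
	scx_add_event(SCX_EV_BYPASS_ACTIVATE, 1);
	__scx_add_event(SCX_EV_SELECT_CPU_FALLBACK, 1);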

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:23:18 -10:00
Andrea Righi
337d1b354a sched_ext: Move built-in idle CPU selection policy to a separate file
As ext.c is becoming quite large, move the idle CPU selection policy to
separate files (ext_idle.c / ext_idle.h) for better code readability.

Moreover, group all the idle CPU selection kfuncs into the same
btf_kfunc_id_set block.

No functional changes, this is purely code reorganization.

Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 12:43:43 -10:00
Andrea Righi
1626e5ef0b sched_ext: Fix lock imbalance in dispatch_to_local_dsq()
While performing the rq locking dance in dispatch_to_local_dsq(), we may
trigger the following lock imbalance condition, in particular when
multiple tasks are rapidly changing CPU affinity (i.e., running a
`stress-ng --race-sched 0`):

[   13.413579] =====================================
[   13.413660] WARNING: bad unlock balance detected!
[   13.413729] 6.13.0-virtme #15 Not tainted
[   13.413792] -------------------------------------
[   13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at:
[   13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0
[   13.414111] but there are no more locks to release!
[   13.414176]
[   13.414176] other info that might help us debug this:
[   13.414258] 1 lock held by kworker/1:1/80:
[   13.414318]  #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[   13.414612]
[   13.414612] stack backtrace:
[   13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
[   13.415505] Workqueue:  0x0 (events)
[   13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
[   13.415570] Call Trace:
[   13.415700]  <TASK>
[   13.415744]  dump_stack_lvl+0x78/0xe0
[   13.415806]  ? dispatch_to_local_dsq+0x108/0x1a0
[   13.415884]  print_unlock_imbalance_bug+0x11b/0x130
[   13.415965]  ? dispatch_to_local_dsq+0x108/0x1a0
[   13.416226]  lock_release+0x231/0x2c0
[   13.416326]  _raw_spin_unlock+0x1b/0x40
[   13.416422]  dispatch_to_local_dsq+0x108/0x1a0
[   13.416554]  flush_dispatch_buf+0x199/0x1d0
[   13.416652]  balance_one+0x194/0x370
[   13.416751]  balance_scx+0x61/0x1e0
[   13.416848]  prev_balance+0x43/0xb0
[   13.416947]  __pick_next_task+0x6b/0x1b0
[   13.417052]  __schedule+0x20d/0x1740

This happens because dispatch_to_local_dsq() is racing with
dispatch_dequeue() and, when the latter wins, we incorrectly assume that
the task has been moved to dst_rq.

Fix by properly tracking the currently locked rq.

Fixes: 4d3ca89bdd ("sched_ext: Refactor consume_remote_task()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 12:41:12 -10:00
Tejun Heo
d6f3e7d564 sched_ext: Fix incorrect autogroup migration detection
scx_move_task() is called from sched_move_task() and tells the BPF scheduler
that cgroup migration is being committed. sched_move_task() is used by both
cgroup and autogroup migrations and scx_move_task() tried to filter out
autogroup migrations by testing the destination cgroup and PF_EXITING but
this is not enough. In fact, without explicitly tagging the thread which is
doing the cgroup migration, there is no good way to tell apart
scx_move_task() invocations for a racing migration to the root cgroup and an
autogroup migration.

This led to scx_move_task() incorrectly ignoring a migration from non-root
cgroup to an autogroup of the root cgroup triggering the following warning:

  WARNING: CPU: 7 PID: 1 at kernel/sched/ext.c:3725 scx_cgroup_can_attach+0x196/0x340
  ...
  Call Trace:
  <TASK>
    cgroup_migrate_execute+0x5b1/0x700
    cgroup_attach_task+0x296/0x400
    __cgroup_procs_write+0x128/0x140
    cgroup_procs_write+0x17/0x30
    kernfs_fop_write_iter+0x141/0x1f0
    vfs_write+0x31d/0x4a0
    __x64_sys_write+0x72/0xf0
    do_syscall_64+0x82/0x160
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix it by adding an argument to sched_move_task() that indicates whether the
moving is for a cgroup or autogroup migration. After the change,
scx_move_task() is called only for cgroup migrations and renamed to
scx_cgroup_move_task().

Link: https://github.com/sched-ext/scx/issues/370
Fixes: 8195136669 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 08:31:50 -10:00
Andrea Righi
2279563e3a sched_ext: Include task weight in the error state dump
Report the task weight when dumping the task state during an error exit.
Moreover, adjust the output format to display dsq_vtime, slice, and
weight on the same line.

This can help identify whether certain tasks were excessively
prioritized or de-prioritized due to large niceness gaps.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-24 08:07:42 -10:00
Atul Kumar Pant
be8ee18152 sched_ext: Fixes typos in comments
Fixes some spelling errors in the comments.

Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-24 08:06:05 -10:00
Linus Torvalds
bc8198dc7e sched_ext: Changes for v6.14
- scx_bpf_now() added so that BPF scheduler can access the cached timestamp
   in struct rq to avoid reading TSC multiple times within a locked
   scheduling operation.
 
 - Minor updates to the built-in idle CPU selection logic.
 
 - tool/sched_ext updates and other misc changes.
 
 Pulling sched_ext/for-6.14 into master causes a merge conflict between the
 following two commits (first commit in master, second in for-6.14):
 
   a2a3374c47 sched_ext: idle: Refresh idle masks during idle-to-idle transitions
   9cf9aceed2 sched_ext: idle: use assign_cpu() to update the idle cpumask
 
   static void update_builtin_idle(int cpu, bool idle)
   {
   <<<<<<< HEAD
           if (idle)
                   cpumask_set_cpu(cpu, idle_masks.cpu);
           else
                   cpumask_clear_cpu(cpu, idle_masks.cpu);
   =======
           int cpu = cpu_of(rq);
 
           if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
                   SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
                   if (!static_branch_unlikely(&scx_builtin_idle_enabled))
                           return;
           }
 
           assign_cpu(cpu, idle_masks.cpu, idle);
   >>>>>>> 987ce79b52
 
 The first commit factored out update_builtin_idle() and the second replaced
 cpumask_set/clear_cpu() calls with assign_cpu(). The conflict can be
 resolved by taking the code from the first and then replacing the
 cpumask_set/clear_cpu() calls with assign_cpu():
 
   static void update_builtin_idle(int cpu, bool idle)
   {
           assign_cpu(cpu, idle_masks.cpu, idle);
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ4/yjw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGc/3AQChJ2zxBLMtQOfNVWgqhAyatCnFquMncl/IeGAs
 Sf4eawD/QHNWYBPx0nOFQS8RZNmUcXuEDeHMy+1MvTUaIDL5MQM=
 =6t0t
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

 - scx_bpf_now() added so that BPF scheduler can access the cached
   timestamp in struct rq to avoid reading TSC multiple times within a
   locked scheduling operation.

 - Minor updates to the built-in idle CPU selection logic.

 - tool/sched_ext updates and other misc changes.

* tag 'sched_ext-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: fix kernel-doc warnings
  sched_ext: Use time helpers in BPF schedulers
  sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now()
  sched_ext: Add time helpers for BPF schedulers
  sched_ext: Add scx_bpf_now() for BPF scheduler
  sched_ext: Implement scx_bpf_now()
  sched_ext: Relocate scx_enabled() related code
  sched_ext: Add option -l in selftest runner to list all available tests
  sched_ext: Include remaining task time slice in error state dump
  sched_ext: update scx_bpf_dsq_insert() doc for SCX_DSQ_LOCAL_ON
  sched_ext: idle: small CPU iteration refactoring
  sched_ext: idle: introduce check_builtin_idle_enabled() helper
  sched_ext: idle: clarify comments
  sched_ext: idle: use assign_cpu() to update the idle cpumask
  sched_ext: Use str_enabled_disabled() helper in update_selcpu_topology()
  sched_ext: Use sizeof_field for key_len in dsq_hash_params
  tools/sched_ext: Receive updates from SCX repo
  sched_ext: Use the NUMA scheduling domain for NUMA optimizations
2025-01-23 18:49:43 -08:00
Linus Torvalds
1d6d399223 Kthreads affinity follow either of 4 existing different patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never execute
    relevant code on any other CPU. This is currently handled by smpboot
    code which takes care of CPU-hotplug operations. Affinity here is
    a correctness constraint.
 
 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't
    run anywhere else. The affinity is set through kthread_bind_mask()
    and the subsystem takes care by itself to handle CPU-hotplug
    operations. Affinity here is assumed to be a correctness constraint.
 
 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This
    is not a correctness constraint but merely a preference in terms of
    memory locality. kswapd and kcompactd both fall into this category.
    The affinity is set manually like for any other task and CPU-hotplug
    is supposed to be handled by the relevant subsystem so that the task
    is properly reaffined whenever a given CPU from the node comes up.
    Also care should be taken so that the node affinity doesn't cross
    isolated (nohz_full) cpumask boundaries.
 
 4) Similar to the previous point except kthreads have a _preferred_
    affinity different than a node. Both RCU boost kthreads and RCU
    exp kworkers fall into this category as they refer to "RCU nodes"
    from a distinctly distributed tree.
 
 Currently the preferred affinity patterns (3 and 4) have at least 4
 identified users, with more or less success when it comes to handle
 CPU-hotplug operations and CPU isolation. Each of which do it in its own
 ad-hoc way.
 
 This is an infrastructure proposal to handle this with the following API
 changes:
 
 _ kthread_create_on_node() automatically affines the created kthread to
   its target node unless it has been set as per-cpu or bound with
   kthread_bind[_mask]() before the first wake-up.
 
 - kthread_affine_preferred() is a new function that can be called right
   after kthread_create_on_node() to specify a preferred affinity
   different than the specified node.
 
 When the preferred affinity can't be applied because the possible
 targets are offline or isolated (nohz_full), the kthread is affine
 to the housekeeping CPUs (which means to all online CPUs most of the
 time or only the non-nohz_full CPUs when nohz_full= is set).
 
 kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
 converted, along with a few old drivers.
 
 Summary of the changes:
 
 * Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu()
 
 * Introduce task_cpu_fallback_mask() that defines the default last
   resort affinity of a task to become nohz_full aware
 
 * Add some correctness check to ensure kthread_bind() is always called
   before the first kthread wake up.
 
 * Default affine kthread to its preferred node.
 
 * Convert kswapd / kcompactd and remove their halfway working ad-hoc
   affinity implementation
 
 * Implement kthreads preferred affinity
 
 * Unify kthread worker and kthread API's style
 
 * Convert RCU kthreads to the new API and remove the ad-hoc affinity
   implementation.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmeNf8gACgkQhSRUR1CO
 jHedQQ/+IxTjjqQiItzrq41TES2S0desHDq8lNJFb7rsR/DtKFyLx3s67cOYV+cM
 Yx54QHg2m/Fz4nXMQ7Po5ygOtJGCKBc5C5QQy7y0lVKeTQK+daDfEtBSa3oG7j3C
 u+E3tTY6qxkbCzymUyaKkHN4/ay2vLvjFS50luV7KMyI3x47Aji+t7VdCX4LCPP2
 eAwOALWD0+7qLJ/VF6gsmQLKA4Qx7PQAzBa3KSBmUN9UcN8Gk1bQHCTIQKDHP9LQ
 v8BXrNZtYX1o2+snNYpX2z6/ECjxkdwriOgqqZY5306hd9RAQ1u46Dx3byrIqjGn
 ULG/XQ2istPyhTqb/h+RbrobdOcwEUIeqk8hRRbBXE8bPpqUz9EMuaCMxWDbQjgH
 NTuKG4ifKJ/IqstkkuDkdOiByE/ysMmwqrTXgSnu2ITNL9yY3BEgFbvA95hgo42s
 f7QCxEfZb1MHcNEMENSMwM3xw5lLMGMpxVZcMQ3gLwyotMBRrhFZm1qZJG7TITYW
 IDIeCbH4JOMdQwLs3CcWTXio0N5/85NhRNFV+IDn96OrgxObgnMtV8QwNgjXBAJ5
 wGeJWt8s34W1Zo3qS9gEuVzEhW4XaxISQQMkHe8faKkK6iHmIB/VjSQikDwwUNQ/
 AspYj82RyWBCDZsqhiYh71kpxjvS6Xp0bj39Ce1sNsOnuksxKkQ=
 =g8In
 -----END PGP SIGNATURE-----

Merge tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks

Pull kthread updates from Frederic Weisbecker:
 "Kthreads affinity follow either of 4 existing different patterns:

   1) Per-CPU kthreads must stay affine to a single CPU and never
      execute relevant code on any other CPU. This is currently handled
      by smpboot code which takes care of CPU-hotplug operations.
      Affinity here is a correctness constraint.

   2) Some kthreads _have_ to be affine to a specific set of CPUs and
      can't run anywhere else. The affinity is set through
      kthread_bind_mask() and the subsystem takes care by itself to
      handle CPU-hotplug operations. Affinity here is assumed to be a
      correctness constraint.

   3) Per-node kthreads _prefer_ to be affine to a specific NUMA node.
      This is not a correctness constraint but merely a preference in
      terms of memory locality. kswapd and kcompactd both fall into this
      category. The affinity is set manually like for any other task and
      CPU-hotplug is supposed to be handled by the relevant subsystem so
      that the task is properly reaffined whenever a given CPU from the
      node comes up. Also care should be taken so that the node affinity
      doesn't cross isolated (nohz_full) cpumask boundaries.

   4) Similar to the previous point except kthreads have a _preferred_
      affinity different than a node. Both RCU boost kthreads and RCU
      exp kworkers fall into this category as they refer to "RCU nodes"
      from a distinctly distributed tree.

  Currently the preferred affinity patterns (3 and 4) have at least 4
  identified users, with more or less success when it comes to handling
  CPU-hotplug operations and CPU isolation, each of which does it in its
  own ad-hoc way.

  This is an infrastructure proposal to handle this with the following
  API changes:

   - kthread_create_on_node() automatically affines the created kthread
     to its target node unless it has been set as per-cpu or bound with
     kthread_bind[_mask]() before the first wake-up.

   - kthread_affine_preferred() is a new function that can be called
     right after kthread_create_on_node() to specify a preferred
     affinity different than the specified node.

  When the preferred affinity can't be applied because the possible
  targets are offline or isolated (nohz_full), the kthread is affine to
  the housekeeping CPUs (which means to all online CPUs most of the time
  or only the non-nohz_full CPUs when nohz_full= is set).

  kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
  converted, along with a few old drivers.

  Summary of the changes:

   - Consolidate a bunch of ad-hoc implementations of
     kthread_run_on_cpu()

   - Introduce task_cpu_fallback_mask() that defines the default last
     resort affinity of a task to become nohz_full aware

   - Add some correctness checks to ensure kthread_bind() is always
     called before the first kthread wake-up.

   - Default affine kthread to its preferred node.

   - Convert kswapd / kcompactd and remove their halfway working ad-hoc
     affinity implementation

   - Implement kthreads preferred affinity

   - Unify kthread worker and kthread API's style

   - Convert RCU kthreads to the new API and remove the ad-hoc affinity
     implementation"

* tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
  kthread: modify kernel-doc function name to match code
  rcu: Use kthread preferred affinity for RCU exp kworkers
  treewide: Introduce kthread_run_worker[_on_cpu]()
  kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format
  rcu: Use kthread preferred affinity for RCU boost
  kthread: Implement preferred affinity
  mm: Create/affine kswapd to its preferred node
  mm: Create/affine kcompactd to its preferred node
  kthread: Default affine kthread to its preferred NUMA node
  kthread: Make sure kthread hasn't started while binding it
  sched,arm64: Handle CPU isolation on last resort fallback rq selection
  arm64: Exclude nohz_full CPUs from 32bits el0 support
  lib: test_objpool: Use kthread_run_on_cpu()
  kallsyms: Use kthread_run_on_cpu()
  soc/qman: test: Use kthread_run_on_cpu()
  arm/bL_switcher: Use kthread_run_on_cpu()
2025-01-21 17:10:05 -08:00
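
As a rough illustration of the API described in the pull message above, a
driver that wants a NUMA-preferring kthread could do something like the
following minimal sketch. The thread function, node and cpumask are
illustrative; kthread_affine_preferred() is assumed to take the task and the
preferred cpumask, as introduced by this series.

  #include <linux/kthread.h>
  #include <linux/cpumask.h>

  static int my_thread_fn(void *data)
  {
          while (!kthread_should_stop())
                  schedule_timeout_interruptible(HZ);
          return 0;
  }

  static struct task_struct *start_my_kthread(int node,
                                              const struct cpumask *pref)
  {
          struct task_struct *t;

          /* affined to @node by default once it is first woken up */
          t = kthread_create_on_node(my_thread_fn, NULL, node, "my_kthread");
          if (IS_ERR(t))
                  return t;

          /* optionally narrow the preferred affinity before the first wake-up */
          kthread_affine_preferred(t, pref);

          wake_up_process(t);
          return t;
  }
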
Randy Dunlap
987ce79b52 sched_ext: fix kernel-doc warnings
Use the correct function parameter names and function names.
Use the correct kernel-doc comment format for struct sched_ext_ops
to eliminate a bunch of warnings.

ext.c:1418: warning: Excess function parameter 'include_dead' description in 'scx_task_iter_next_locked'
ext.c:7261: warning: expecting prototype for scx_bpf_dump(). Prototype was for scx_bpf_dump_bstr() instead
ext.c:7352: warning: Excess function parameter 'flags' description in 'scx_bpf_cpuperf_set'

ext.c:3150: warning: Function parameter or struct member 'in_fi' not described in 'scx_prio_less'
ext.c:4711: warning: Function parameter or struct member 'dur_s' not described in 'scx_softlockup'
ext.c:4775: warning: Function parameter or struct member 'bypass' not described in 'scx_ops_bypass'
ext.c:7453: warning: Function parameter or struct member 'idle_mask' not described in 'scx_bpf_put_idle_cpumask'

ext.c:209: warning: Incorrect use of kernel-doc format:          * select_cpu - Pick the target CPU for a task which is being woken up
ext.c:236: warning: Incorrect use of kernel-doc format:          * enqueue - Enqueue a task on the BPF scheduler
ext.c:251: warning: Incorrect use of kernel-doc format:          * dequeue - Remove a task from the BPF scheduler
ext.c:267: warning: Incorrect use of kernel-doc format:          * dispatch - Dispatch tasks from the BPF scheduler and/or user DSQs
ext.c:290: warning: Incorrect use of kernel-doc format:          * tick - Periodic tick
ext.c:300: warning: Incorrect use of kernel-doc format:          * runnable - A task is becoming runnable on its associated CPU
ext.c:327: warning: Incorrect use of kernel-doc format:          * running - A task is starting to run on its associated CPU
ext.c:335: warning: Incorrect use of kernel-doc format:          * stopping - A task is stopping execution
ext.c:346: warning: Incorrect use of kernel-doc format:          * quiescent - A task is becoming not runnable on its associated CPU
ext.c:366: warning: Incorrect use of kernel-doc format:          * yield - Yield CPU
ext.c:381: warning: Incorrect use of kernel-doc format:          * core_sched_before - Task ordering for core-sched
ext.c:399: warning: Incorrect use of kernel-doc format:          * set_weight - Set task weight
ext.c:408: warning: Incorrect use of kernel-doc format:          * set_cpumask - Set CPU affinity
ext.c:418: warning: Incorrect use of kernel-doc format:          * update_idle - Update the idle state of a CPU
ext.c:439: warning: Incorrect use of kernel-doc format:          * cpu_acquire - A CPU is becoming available to the BPF scheduler
ext.c:449: warning: Incorrect use of kernel-doc format:          * cpu_release - A CPU is taken away from the BPF scheduler
ext.c:461: warning: Incorrect use of kernel-doc format:          * init_task - Initialize a task to run in a BPF scheduler
ext.c:476: warning: Incorrect use of kernel-doc format:          * exit_task - Exit a previously-running task from the system
ext.c:485: warning: Incorrect use of kernel-doc format:          * enable - Enable BPF scheduling for a task
ext.c:494: warning: Incorrect use of kernel-doc format:          * disable - Disable BPF scheduling for a task
ext.c:504: warning: Incorrect use of kernel-doc format:          * dump - Dump BPF scheduler state on error
ext.c:512: warning: Incorrect use of kernel-doc format:          * dump_cpu - Dump BPF scheduler state for a CPU on error
ext.c:524: warning: Incorrect use of kernel-doc format:          * dump_task - Dump BPF scheduler state for a runnable task on error
ext.c:535: warning: Incorrect use of kernel-doc format:          * cgroup_init - Initialize a cgroup
ext.c:550: warning: Incorrect use of kernel-doc format:          * cgroup_exit - Exit a cgroup
ext.c:559: warning: Incorrect use of kernel-doc format:          * cgroup_prep_move - Prepare a task to be moved to a different cgroup
ext.c:574: warning: Incorrect use of kernel-doc format:          * cgroup_move - Commit cgroup move
ext.c:585: warning: Incorrect use of kernel-doc format:          * cgroup_cancel_move - Cancel cgroup move
ext.c:597: warning: Incorrect use of kernel-doc format:          * cgroup_set_weight - A cgroup's weight is being changed
ext.c:611: warning: Incorrect use of kernel-doc format:          * cpu_online - A CPU became online
ext.c:620: warning: Incorrect use of kernel-doc format:          * cpu_offline - A CPU is going offline
ext.c:633: warning: Incorrect use of kernel-doc format:          * init - Initialize the BPF scheduler
ext.c:638: warning: Incorrect use of kernel-doc format:          * exit - Clean up after the BPF scheduler
ext.c:648: warning: Incorrect use of kernel-doc format:          * dispatch_max_batch - Max nr of tasks that dispatch() can dispatch
ext.c:653: warning: Incorrect use of kernel-doc format:          * flags - %SCX_OPS_* flags
ext.c:658: warning: Incorrect use of kernel-doc format:          * timeout_ms - The maximum amount of time, in milliseconds, that a
ext.c:667: warning: Incorrect use of kernel-doc format:          * exit_dump_len - scx_exit_info.dump buffer length. If 0, the default
ext.c:673: warning: Incorrect use of kernel-doc format:          * hotplug_seq - A sequence number that may be set by the scheduler to
ext.c:682: warning: Incorrect use of kernel-doc format:          * name - BPF scheduler's name

ext.c:689: warning: Function parameter or struct member 'select_cpu' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'enqueue' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dequeue' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dispatch' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'tick' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'runnable' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'running' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'stopping' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'quiescent' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'yield' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'core_sched_before' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'set_weight' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'set_cpumask' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'update_idle' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cpu_acquire' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cpu_release' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'init_task' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'exit_task' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'enable' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'disable' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dump' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dump_cpu' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dump_task' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_init' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_exit' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_prep_move' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_move' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_cancel_move' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_set_weight' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cpu_online' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cpu_offline' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'init' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'exit' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dispatch_max_batch' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'flags' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'timeout_ms' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'exit_dump_len' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'hotplug_seq' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'name' not described in 'sched_ext_ops'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <changwoo@igalia.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-13 08:23:29 -10:00
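
For reference, the comment format that scripts/kernel-doc expects for a
struct (and which the fix above converts sched_ext_ops to) looks roughly like
this hypothetical example; member descriptions go in the block above the
struct as @member: lines rather than as separate comments inside the struct
body:

  /**
   * struct my_ops - operation table for a hypothetical subsystem
   * @select_cpu: Pick the target CPU for a task which is being woken up
   * @enqueue: Enqueue a task on the scheduler
   * @timeout_ms: Maximum delay in milliseconds before aborting
   * @name: Scheduler's name
   */
  struct my_ops {
          s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);
          void (*enqueue)(struct task_struct *p, u64 enq_flags);
          u32 timeout_ms;
          char name[128];
  };
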
Andrea Righi
a2a3374c47 sched_ext: idle: Refresh idle masks during idle-to-idle transitions
With the consolidation of put_prev_task/set_next_task(), see
commit 436f3eed5c ("sched: Combine the last put_prev_task() and the
first set_next_task()"), we are now skipping the transition between
these two functions when the previous and the next tasks are the same.

As a result, the scx idle state of a CPU is updated only when
transitioning to or from the idle thread. While this is generally
correct, it can lead to uneven and inefficient core utilization in
certain scenarios [1].

A typical scenario involves proactive wake-ups: scx_bpf_pick_idle_cpu()
selects and marks an idle CPU as busy, followed by a wake-up via
scx_bpf_kick_cpu(), without dispatching any tasks. In this case, the CPU
continues running the idle thread, returns to idle, but remains marked
as busy, preventing it from being selected again as an idle CPU (until a
task eventually runs on it and releases the CPU).

For example, running a workload that uses 20% of each CPU, combined with
an scx scheduler using proactive wake-ups, results in the following core
utilization:

 CPU 0: 25.7%
 CPU 1: 29.3%
 CPU 2: 26.5%
 CPU 3: 25.5%
 CPU 4:  0.0%
 CPU 5: 25.5%
 CPU 6:  0.0%
 CPU 7: 10.5%

To address this, refresh the idle state also in pick_task_idle(), during
idle-to-idle transitions, but only trigger ops.update_idle() on actual
state changes to prevent unnecessary updates to the scx scheduler and
maintain balanced state transitions.

With this change in place, the core utilization in the previous example
becomes the following:

 CPU 0: 18.8%
 CPU 1: 19.4%
 CPU 2: 18.0%
 CPU 3: 18.7%
 CPU 4: 19.3%
 CPU 5: 18.9%
 CPU 6: 18.7%
 CPU 7: 19.3%

[1] https://github.com/sched-ext/scx/pull/1139

Fixes: 7c65ae81ea ("sched_ext: Don't call put_prev_task_scx() before picking the next task")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10 12:40:42 -10:00
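
The proactive wake-up pattern described above looks roughly like the
following sketch in a BPF scheduler (the function is illustrative and
context/locking requirements are omitted):

  static void proactive_kick(struct task_struct *p)
  {
          s32 cpu;

          /* selects an idle CPU and marks it busy in the idle masks */
          cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
          if (cpu >= 0)
                  /*
                   * Wake the CPU without dispatching anything to it.
                   * Before this fix, the CPU would return to idle while
                   * still marked busy and never be picked again until a
                   * task ran on it.
                   */
                  scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
  }
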
Changwoo Min
3a9910b590 sched_ext: Implement scx_bpf_now()
Returns a high-performance monotonically non-decreasing clock for the current
CPU. The clock returned is in nanoseconds.

It provides the following properties:

1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently
 to account for execution time and track tasks' runtime properties.
 Unfortunately, in some hardware platforms, bpf_ktime_get_ns() -- which
 eventually reads a hardware timestamp counter -- is neither performant nor
 scalable. scx_bpf_now() aims to provide a high-performance clock by
 using the rq clock in the scheduler core whenever possible.

2) High enough resolution for the BPF scheduler use cases: In most BPF
 scheduler use cases, the required clock resolution is lower than the most
 accurate hardware clock (e.g., rdtsc in x86). scx_bpf_now() basically
 uses the rq clock in the scheduler core whenever it is valid. It considers
 that the rq clock is valid from the time the rq clock is updated
 (update_rq_clock) until the rq is unlocked (rq_unpin_lock).

3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now()
 guarantees the clock never goes backward when compared on the same CPU.
 On the other hand, when comparing clocks across different CPUs, there is
 no such guarantee -- the clock can go backward. The clock is
 monotonically *non-decreasing*, so two scx_bpf_now() calls on the same
 CPU return the same value for as long as the rq clock stays valid.

An rq clock becomes valid when it is updated using update_rq_clock()
and invalidated when the rq is unlocked using rq_unpin_lock().

Let's suppose the following timeline in the scheduler core:

   T1. rq_lock(rq)
   T2. update_rq_clock(rq)
   T3. a sched_ext BPF operation
   T4. rq_unlock(rq)
   T5. a sched_ext BPF operation
   T6. rq_lock(rq)
   T7. update_rq_clock(rq)

For [T2, T4), we consider that rq clock is valid (SCX_RQ_CLK_VALID is
set), so scx_bpf_now() calls during [T2, T4) (including T3) will
return the rq clock updated at T2. For duration [T4, T7), when a BPF
scheduler can still call scx_bpf_now() (T5), we consider the rq clock
is invalid (SCX_RQ_CLK_VALID is unset at T4). So when calling
scx_bpf_now() at T5, we will return a fresh clock value by calling
sched_clock_cpu() internally. Also, to prevent getting outdated rq clocks
from a previous scx scheduler, invalidate all the rq clocks when unloading
a BPF scheduler.

One example of calling scx_bpf_now(), when the rq clock is invalid
(like T5), is in scx_central [1]. The scx_central scheduler uses a BPF
timer for preemptive scheduling. In every msec, the timer callback checks
if the currently running tasks exceed their timeslice. At the beginning of
the BPF timer callback (central_timerfn in scx_central.bpf.c), scx_central
gets the current time. When the BPF timer callback runs, the rq clock could
be invalid, the same as T5. In this case, scx_bpf_now() returns a fresh
clock value rather than returning the old one (T2).

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_central.bpf.c

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10 08:04:40 -10:00
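
For example, mirroring the scx_central case mentioned above, a BPF timer
callback can read the clock as in this minimal sketch (the callback body and
map layout are illustrative):

  static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
  {
          /*
           * The rq clock may be invalid here since the timer runs outside
           * the scheduler path; in that case scx_bpf_now() falls back to a
           * fresh sched_clock_cpu() read instead of a stale rq clock.
           */
          u64 now = scx_bpf_now();

          bpf_printk("central timer fired at %llu ns", now);
          /* ... check whether running tasks have exceeded their slice ... */
          return 0;
  }
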
Frederic Weisbecker
b04e317b52 treewide: Introduce kthread_run_worker[_on_cpu]()
kthread_create() creates a kthread without running it yet. kthread_run()
creates a kthread and runs it.

On the other hand, kthread_create_worker() creates a kthread worker and
runs it.

This difference in behaviours is confusing. Also there is no way to
create a kthread worker and affine it using kthread_bind_mask() or
kthread_affine_preferred() before starting it.

Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]()
that behaves just like kthread_run(). kthread_create_worker[_on_cpu]()
will now only create a kthread worker without starting it.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
2025-01-08 18:15:03 +01:00
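
Assuming the new helpers keep kthread_create_worker()'s (flags, namefmt, ...)
argument convention, the resulting usage looks roughly like this sketch
(worker and pref_mask are illustrative):

  struct kthread_worker *worker;

  /* create the worker without starting it, affine it, then start it */
  worker = kthread_create_worker(0, "my_worker");
  if (!IS_ERR(worker)) {
          kthread_affine_preferred(worker->task, pref_mask);
          wake_up_process(worker->task);
  }

  /* or create and start it in one go, mirroring kthread_run() */
  worker = kthread_run_worker(0, "my_worker");
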
Honglei Wang
68e449d849 sched_ext: switch class when preempted by higher priority scheduler
The ops.cpu_release() callback, if defined, must be invoked when a CPU is
preempted by a task from a higher priority scheduler class. This case was
skipped in commit f422316d74 ("sched_ext: Remove switch_class_scx()").
Let's fix it.

Fixes: f422316d74 ("sched_ext: Remove switch_class_scx()")
Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08 06:51:40 -10:00
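
For context, the callback in question is typically implemented along these
lines in a BPF scheduler (a sketch; the struct-ops name is illustrative):

  void BPF_STRUCT_OPS(myscx_cpu_release, s32 cpu,
                      struct scx_cpu_release_args *args)
  {
          /*
           * A higher priority sched class took the CPU; push the tasks
           * sitting in its local DSQ back through ops.enqueue() so they
           * can be placed elsewhere.
           */
          scx_bpf_reenqueue_local();
  }
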
Changwoo Min
6268d5bc10 sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass()
scx_ops_bypass() iterates all CPUs to re-enqueue all the scx tasks.
For each CPU, it acquires a lock using rq_lock() regardless of whether
the CPU is offline or is currently running a task in a higher scheduler
class (e.g., deadline). rq_lock() is supposed to be used on online CPUs,
and using it here may trigger an unnecessary warning in rq_pin_lock().
Therefore, replace rq_lock() with raw_spin_rq_lock() in scx_ops_bypass().

Without this change, we observe the following warning:

===== START =====
[    6.615205] rq->balance_callback && rq->balance_callback != &balance_push_callback
[    6.615208] WARNING: CPU: 2 PID: 0 at kernel/sched/sched.h:1730 __schedule+0x1130/0x1c90
=====  END  =====

Fixes: 0e7ffff1b8 ("scx: Fix raciness in scx_ops_bypass()")
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08 06:48:53 -10:00
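
The difference between the two helpers, roughly (rq_lock() additionally pins
the rq and performs the rq_pin_lock() sanity checks that produced the warning
above; the surrounding code is elided):

  struct rq_flags rf;

  /* before: pins the rq and runs rq_pin_lock() sanity checks */
  rq_lock(rq, &rf);
  /* ... critical section ... */
  rq_unlock(rq, &rf);

  /* after: takes only the raw rq spinlock, usable on offline CPUs too */
  raw_spin_rq_lock(rq);
  /* ... critical section ... */
  raw_spin_rq_unlock(rq);
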
Henry Huang
30dd3b13f9 sched_ext: keep running prev when prev->scx.slice != 0
When %SCX_OPS_ENQ_LAST is set and prev->scx.slice != 0,
@prev will be dispatched into the local DSQ in put_prev_task_scx().
However, pick_task_scx() is executed before put_prev_task_scx(),
so it will not pick @prev.
Set %SCX_RQ_BAL_KEEP in balance_one() to ensure that pick_task_scx()
can pick @prev.

Signed-off-by: Henry Huang <henry.hj@antgroup.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08 06:48:33 -10:00
Andrea Righi
382d7efc14 sched_ext: Include remaining task time slice in error state dump
Report the remaining time slice when dumping task information during an
error exit.

This information can be useful for tracking incorrect or excessively
long time slices in schedulers that implement dynamic time slice logic.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-06 08:56:38 -10:00
Andrea Righi
e4975ac535 sched_ext: update scx_bpf_dsq_insert() doc for SCX_DSQ_LOCAL_ON
With commit 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct
dispatches"), scx_bpf_dsq_insert() can use SCX_DSQ_LOCAL_ON for direct
dispatch from ops.enqueue() to target the local DSQ of any CPU.

Update the documentation accordingly.

Fixes: 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-06 08:56:00 -10:00
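
A minimal sketch of the direct-dispatch form that the updated documentation
now covers (the ops name and the CPU choice are illustrative):

  void BPF_STRUCT_OPS(myscx_enqueue, struct task_struct *p, u64 enq_flags)
  {
          s32 cpu = scx_bpf_task_cpu(p);  /* any CPU the scheduler picks */

          /* insert directly into the local DSQ of @cpu from ops.enqueue() */
          scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL,
                             enq_flags);
  }
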
Andrea Righi
d9071ecb31 sched_ext: idle: small CPU iteration refactoring
Replace the loop to check if all SMT CPUs are idle with
cpumask_subset(). This simplifies the code and slightly improves
efficiency, while preserving the original behavior.

Note that idle_masks.smt handling remains racy, which is acceptable as
it serves as an optimization and is self-correcting.

Suggested-and-reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-06 08:48:38 -10:00
Andrea Righi
c0cf353009 sched_ext: idle: introduce check_builtin_idle_enabled() helper
Minor refactoring to add a helper function for checking if the built-in
idle CPU selection policy is enabled.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-29 12:45:11 -10:00
Andrea Righi
02f034dcbf sched_ext: idle: clarify comments
Add comments to clarify the usage of cpumask_intersects().

Moreover, update the scx_select_cpu_dfl() description, clarifying that
the final step of the idle selection logic involves searching for any
idle CPU in the system that the task can use.

Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-29 12:44:15 -10:00
Andrea Righi
9cf9aceed2 sched_ext: idle: use assign_cpu() to update the idle cpumask
Use the assign_cpu() helper to set or clear the CPU in the idle mask,
based on the idle condition.

Acked-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-29 12:43:07 -10:00
Henry Huang
35bf430e08 sched_ext: initialize kit->cursor.flags
struct bpf_iter_scx_dsq *it may not be initialized.
If we don't call scx_bpf_dsq_move_set_vtime() and scx_bpf_dsq_move_set_slice()
before scx_bpf_dsq_move(), it can cause unexpected behaviors:
1. Assigning a huge slice to p->scx.slice
2. Assigning an invalid vtime to p->scx.dsq_vtime

Signed-off-by: Henry Huang <henry.hj@antgroup.com>
Fixes: 6462dd53a2 ("sched_ext: Compact struct bpf_iter_scx_dsq_kern")
Cc: stable@vger.kernel.org # v6.12
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-24 10:56:08 -10:00
Thorsten Blum
bc3a116a44 sched_ext: Use str_enabled_disabled() helper in update_selcpu_topology()
Remove hard-coded strings by using the str_enabled_disabled() helper
function.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-24 10:47:55 -10:00
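
str_enabled_disabled() simply maps a bool to "enabled"/"disabled", so the
update boils down to something like this sketch (message and variable are
illustrative):

  /* before */
  pr_debug("sched_ext: LLC idle selection %s\n",
           enable_llc ? "enabled" : "disabled");

  /* after */
  pr_debug("sched_ext: LLC idle selection %s\n",
           str_enabled_disabled(enable_llc));
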
Liang Jie
e197f5ec3a sched_ext: Use sizeof_field for key_len in dsq_hash_params
Update the `dsq_hash_params` initialization to use `sizeof_field`
for the `key_len` field instead of a hardcoded value.

This improves code readability and ensures the key length dynamically
matches the size of the `id` field in the `scx_dispatch_q` structure.

Signed-off-by: Liang Jie <liangjie@lixiang.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-13 06:51:19 -10:00
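
The resulting initializer looks roughly like the following; the field names
mirror the description above and should be treated as illustrative:

  static const struct rhashtable_params dsq_hash_params = {
          .key_len        = sizeof_field(struct scx_dispatch_q, id),
          .key_offset     = offsetof(struct scx_dispatch_q, id),
          .head_offset    = offsetof(struct scx_dispatch_q, hash_node),
  };
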
Tejun Heo
18b2093f45 sched_ext: Fix invalid irq restore in scx_ops_bypass()
While adding outer irqsave/restore locking, 0e7ffff1b8 ("scx: Fix raciness
in scx_ops_bypass()") forgot to convert an inner rq_unlock_irqrestore() to
rq_unlock() which could re-enable IRQ prematurely leading to the following
warning:

  raw_local_irq_restore() called with IRQs enabled
  WARNING: CPU: 1 PID: 96 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x30/0x40
  ...
  Sched_ext: create_dsq (enabling)
  pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : warn_bogus_irq_restore+0x30/0x40
  lr : warn_bogus_irq_restore+0x30/0x40
  ...
  Call trace:
   warn_bogus_irq_restore+0x30/0x40 (P)
   warn_bogus_irq_restore+0x30/0x40 (L)
   scx_ops_bypass+0x224/0x3b8
   scx_ops_enable.isra.0+0x2c8/0xaa8
   bpf_scx_reg+0x18/0x30
  ...
  irq event stamp: 33739
  hardirqs last  enabled at (33739): [<ffff8000800b699c>] scx_ops_bypass+0x174/0x3b8
  hardirqs last disabled at (33738): [<ffff800080d48ad4>] _raw_spin_lock_irqsave+0xb4/0xd8

Drop the stray _irqrestore().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ihor Solodrai <ihor.solodrai@pm.me>
Link: http://lkml.kernel.org/r/qC39k3UsonrBYD_SmuxHnZIQLsuuccoCrkiqb_BT7DvH945A1_LZwE4g-5Pu9FcCtqZt4lY1HhIPi0homRuNWxkgo1rgP3bkxa0donw8kV4=@pm.me
Fixes: 0e7ffff1b8 ("scx: Fix raciness in scx_ops_bypass()")
Cc: stable@vger.kernel.org # v6.12
2024-12-11 11:02:35 -10:00
Andrea Righi
4572541892 sched_ext: Use the NUMA scheduling domain for NUMA optimizations
Rely on the NUMA scheduling domain topology, instead of accessing NUMA
topology information directly.

There is basically no functional change, but in this way we ensure
consistent use of the same topology information determined by the
scheduling subsystem.

Fixes: f6ce6b9493 ("sched_ext: Do not enable LLC/NUMA optimizations when domains overlap")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-04 09:49:56 -10:00
Linus Torvalds
8f7c8b88bd sched_ext: Change for v6.13
- Improve the default select_cpu() implementation making it topology aware
   and handle WAKE_SYNC better.
 
 - set_arg_maybe_null() was used to inform the verifier which ops args could
   be NULL in a rather hackish way. Use the new __nullable CFI stub tags
   instead.
 
 - On Sapphire Rapids multi-socket systems, a BPF scheduler, by hammering on
   the same queue across sockets, could live-lock the system to the point
   where the system couldn't make reasonable forward progress. This could
   lead to soft-lockup triggered resets or stalling out bypass mode switch
   and thus BPF scheduler ejection for tens of minutes if not hours. After
   trying a number of mitigations, the following set worked reliably:
 
   - Injecting artificial cpu_relax() loops in two places while sched_ext is
     trying to turn on the bypass mode.
 
   - Triggering scheduler ejection when soft-lockup detection is imminent (a
     quarter of threshold left).
 
   While not the prettiest, the impact both in terms of code complexity and
   overhead is minimal.
 
 - A common complaint on the API is the overuse of the word "dispatch" and
   the confusion around "consume". This is due to how the dispatch queues
   became more generic over time. Rename the affected kfuncs for clarity.
   Thanks to BPF's compatibility features, this change can be made in a way
   that's both forward and backward compatible. The compatibility code will
   be dropped in a few releases.
 
 - Pull sched_ext/for-6.12-fixes to receive a prerequisite change. Other misc
   changes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZztuXA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGePUAP4nFTDaUDngVlxGv5hpYz8/Gcv1bPsWEydRRmH/
 3F+pNgEAmGIGAEwFYfc9Zn8Kbjf0eJAduf2RhGRatQO6F/+GSwo=
 =AcyC
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

 - Improve the default select_cpu() implementation making it topology
   aware and handle WAKE_SYNC better.

 - set_arg_maybe_null() was used to inform the verifier which ops args
   could be NULL in a rather hackish way. Use the new __nullable CFI
   stub tags instead.

 - On Sapphire Rapids multi-socket systems, a BPF scheduler, by
   hammering on the same queue across sockets, could live-lock the
   system to the point where the system couldn't make reasonable forward
   progress.

   This could lead to soft-lockup triggered resets or stalling out
   bypass mode switch and thus BPF scheduler ejection for tens of
   minutes if not hours. After trying a number of mitigations, the
   following set worked reliably:

     - Injecting artificial cpu_relax() loops in two places while
       sched_ext is trying to turn on the bypass mode.

     - Triggering scheduler ejection when soft-lockup detection is
       imminent (a quarter of threshold left).

   While not the prettiest, the impact both in terms of code complexity
   and overhead is minimal.

 - A common complaint on the API is the overuse of the word "dispatch"
   and the confusion around "consume". This is due to how the dispatch
   queues became more generic over time. Rename the affected kfuncs for
   clarity. Thanks to BPF's compatibility features, this change can be
   made in a way that's both forward and backward compatible. The
   compatibility code will be dropped in a few releases.

 - Other misc changes

* tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (21 commits)
  sched_ext: Replace scx_next_task_picked() with switch_class() in comment
  sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
  sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()
  sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()
  sched_ext: scx_bpf_dispatch_from_dsq_set_*() are allowed from unlocked context
  sched_ext: add a missing rcu_read_lock/unlock pair at scx_select_cpu_dfl()
  sched_ext: Clarify sched_ext_ops table for userland scheduler
  sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
  sched_ext: Avoid live-locking bypass mode switching
  sched_ext: Fix incorrect use of bitwise AND
  sched_ext: Do not enable LLC/NUMA optimizations when domains overlap
  sched_ext: Introduce NUMA awareness to the default idle selection policy
  sched_ext: Replace set_arg_maybe_null() with __nullable CFI stub tags
  sched_ext: Rename CFI stubs to names that are recognized by BPF
  sched_ext: Introduce LLC awareness to the default idle selection policy
  sched_ext: Clarify ops.select_cpu() for single-CPU tasks
  sched_ext: improve WAKE_SYNC behavior for default idle CPU selection
  sched_ext: Use btf_ids to resolve task_struct
  sched/ext: Use tg_cgroup() to elieminate duplicate code
  sched/ext: Fix unmatch trailing comment of CONFIG_EXT_GROUP_SCHED
  ...
2024-11-20 10:08:00 -08:00
Linus Torvalds
3f020399e4 Scheduler changes for v6.13:
- Core facilities:
 
     - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which optimizes
       fair-class preemption by delaying preemption requests to the
       tick boundary, while working as full preemption for RR/FIFO/DEADLINE
       classes. (Peter Zijlstra)
 
         - x86: Enable Lazy preemption (Peter Zijlstra)
         - riscv: Enable Lazy preemption (Jisheng Zhang)
 
     - Initialize idle tasks only once (Thomas Gleixner)
 
     - sched/ext: Remove sched_fork() hack (Thomas Gleixner)
 
  - Fair scheduler:
     - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)
 
  - Idle loop:
       Optimize the generic idle loop by removing unnecessary
       memory barrier (Zhongqiu Han)
 
  - RSEQ:
     - Improve cache locality of RSEQ concurrency IDs for
       intermittent workloads (Mathieu Desnoyers)
 
  - Waitqueues:
     - Make wake_up_{bit,var} less fragile (Neil Brown)
 
  - PSI:
     - Pass enqueue/dequeue flags to psi callbacks directly (Johannes Weiner)
 
  - Preparatory patches for proxy execution:
     - core: Add move_queued_task_locked helper (Connor O'Brien)
     - core: Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)
     - core: Split out __schedule() deactivate task logic into a helper (John Stultz)
     - core: Split scheduler and execution contexts (Peter Zijlstra)
     - locking/mutex: Make mutex::wait_lock irq safe (Juri Lelli)
     - locking/mutex: Expose __mutex_owner() (Juri Lelli)
     - locking/mutex: Remove wakeups from under mutex::wait_lock (Peter Zijlstra)
 
  - Misc fixes and cleanups:
     - core: Remove unused __HAVE_THREAD_FUNCTIONS hook support (David Disseldorp)
     - core: Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej Siewior)
     - wait: Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)
     - fair: remove the DOUBLE_TICK feature (Huang Shijie)
     - fair: fix the comment for PREEMPT_SHORT (Huang Shijie)
     - uclamp: Fix unnused variable warning (Christian Loehle)
     - rt: No PREEMPT_RT=y for all{yes,mod}config
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmc7fnQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hZTBAAozVdWA2m51aNa67HvAZta/olmrIagVbW
 inwbTgqa8b+UfeWEuKOfrZr5khjEh6pLgR3dBTib1uH6xxYj/Okds+qbPWSBPVLh
 yzavlm/zJZM1U1XtxE3eyVfqWik4GrY7DoIMDQQr+YH7rNXonJeJkll38OI2E5MC
 q3Q01qyMo8RJJX8qkf3f8ObOoP/51NsVniTw0Zb2fzEhXz8FjezLlxk6cMfgSkJG
 lg9gfIwUZ7Xg5neRo4kJcc3Ht31KYOhWSiupBJzRD1hss/N/AybvMcTX/Cm8d07w
 HIAdDDAn84o46miFo/a0V/hsJZ72idWbqxVJUCtaezrpOUiFkG+uInRvG/ynr0lF
 5dEI9f+6PUw8Nc7L72IyHkobjPqS2IefSaxYYCBKmxMX2qrenfTor/pKiWzzhBIl
 rX3MZSuUJ8NjV4rNGD/qXRM1IsMJrsDwxDyv+sRec3XdH33x286ds6aAUEPDQ6N7
 96VS0sOKcNUJN8776ErNjlIxRl8HTlpkaO3nZlQIfXgTlXUpRvOuKbEWqP+606lo
 oANgJTKgUhgJPWZnvmdRxDjSiOp93QcImjus9i1tN81FGiEDleONsJUxu2Di1E5+
 s1nCiytjq+cdvzCqFyiOZUh+g6kSZ4yXxNgLg2UvbXzX1zOeUQT3WtyKUhMPXhU8
 esh1TgbUbpE=
 =Zcqj
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core facilities:

   - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which
     optimizes fair-class preemption by delaying preemption requests to
     the tick boundary, while working as full preemption for
     RR/FIFO/DEADLINE classes. (Peter Zijlstra)
        - x86: Enable Lazy preemption (Peter Zijlstra)
        - riscv: Enable Lazy preemption (Jisheng Zhang)

   - Initialize idle tasks only once (Thomas Gleixner)

   - sched/ext: Remove sched_fork() hack (Thomas Gleixner)

  Fair scheduler:

   - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)

  Idle loop:

   - Optimize the generic idle loop by removing unnecessary memory
     barrier (Zhongqiu Han)

  RSEQ:

   - Improve cache locality of RSEQ concurrency IDs for intermittent
     workloads (Mathieu Desnoyers)

  Waitqueues:

   - Make wake_up_{bit,var} less fragile (Neil Brown)

  PSI:

   - Pass enqueue/dequeue flags to psi callbacks directly (Johannes
     Weiner)

  Preparatory patches for proxy execution:

   - Add move_queued_task_locked helper (Connor O'Brien)

   - Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)

   - Split out __schedule() deactivate task logic into a helper (John
     Stultz)

   - Split scheduler and execution contexts (Peter Zijlstra)

   - Make mutex::wait_lock irq safe (Juri Lelli)

   - Expose __mutex_owner() (Juri Lelli)

   - Remove wakeups from under mutex::wait_lock (Peter Zijlstra)

  Misc fixes and cleanups:

   - Remove unused __HAVE_THREAD_FUNCTIONS hook support (David
     Disseldorp)

   - Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej
     Siewior)

   - Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)

   - remove the DOUBLE_TICK feature (Huang Shijie)

   - fix the comment for PREEMPT_SHORT (Huang Shijie)

   - Fix unnused variable warning (Christian Loehle)

   - No PREEMPT_RT=y for all{yes,mod}config"

* tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY.
  sched: No PREEMPT_RT=y for all{yes,mod}config
  riscv: add PREEMPT_LAZY support
  sched, x86: Enable Lazy preemption
  sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
  sched: Add Lazy preemption model
  sched: Add TIF_NEED_RESCHED_LAZY infrastructure
  sched/ext: Remove sched_fork() hack
  sched: Initialize idle tasks only once
  sched: psi: pass enqueue/dequeue flags to psi callbacks directly
  sched/uclamp: Fix unnused variable warning
  sched: Split scheduler and execution contexts
  sched: Split out __schedule() deactivate task logic into a helper
  sched: Consolidate pick_*_task to task_is_pushable helper
  sched: Add move_queued_task_locked helper
  locking/mutex: Expose __mutex_owner()
  locking/mutex: Make mutex::wait_lock irq safe
  locking/mutex: Remove wakeups from under mutex::wait_lock
  sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
  sched: idle: Optimize the generic idle loop by removing needless memory barrier
  ...
2024-11-19 14:16:06 -08:00
Linus Torvalds
d79944b094 sched_ext: One more fix for v6.12-rc7
ops.cpu_acquire() was being invoked with the wrong kfunc mask allowing the
 operation to call kfuncs which shouldn't be allowed. Fix it by using
 SCX_KF_REST instead, which is trivial and low risk.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZzamXw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGRReAP4/JQ1mKkJv+9nTZkW9OcFFHGVVhrprOUEEFk5j
 pmHwPAD8DTBMMS/BCQOoXDdiB9uU7ut6M8VdsIj1jmJkMja+eQI=
 =942J
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.12-rc7-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fix from Tejun Heo:
 "One more fix for v6.12-rc7

  ops.cpu_acquire() was being invoked with the wrong kfunc mask allowing
  the operation to call kfuncs which shouldn't be allowed. Fix it by
  using SCX_KF_REST instead, which is trivial and low risk"

* tag 'sched_ext-for-6.12-rc7-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: ops.cpu_acquire() should be called with SCX_KF_REST
2024-11-15 09:59:51 -08:00
Zhao Mengmeng
6b8950ef99 sched_ext: Replace scx_next_task_picked() with switch_class() in comment
scx_next_task_picked() has been replaced with switch_class(), but a
comment still references the old name, so update it.

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-14 15:30:24 -10:00
Tejun Heo
a4af89cc50 sched_ext: ops.cpu_acquire() should be called with SCX_KF_REST
ops.cpu_acquire() is currently called with a 0 kf_mask, which is interpreted
as SCX_KF_UNLOCKED and allows all unlocked kfuncs. However, ops.cpu_acquire()
is called from balance_one() under the rq lock and should only be allowed to
call kfuncs that are safe under the rq lock. Update it to use SCX_KF_REST.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Zhao Mengmeng <zhaomzhao@126.com>
Link: http://lkml.kernel.org/r/ZzYvf2L3rlmjuKzh@slm.duckdns.org
Fixes: 245254f708 ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
2024-11-14 08:50:58 -10:00
Linus Torvalds
3022e9d00e sched_ext: Fixes for v6.12-rc7
- The fair sched class currently has a bug where its balance() returns true
   telling the sched core that it has tasks to run but then NULL from
   pick_task(). This makes sched core call sched_ext's pick_task() without
   preceding balance() which can lead to stalls in partial mode. For now,
   work around by detecting the condition and forcing the CPU to go through
   another scheduling cycle.
 
 - Add a missing newline to an error message and fix drgn introspection tool
   which went out of sync.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZzI8sw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGb5KAP40b/o6TyAFDG+Hn6GxyxQT7rcAUMXsdB2bcEpg
 /IjmzQEAwbHU5KP5vQXV6XHv+2V7Rs7u6ZqFtDnL88N0A9hf3wk=
 =7hL8
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - The fair sched class currently has a bug where its balance() returns
   true telling the sched core that it has tasks to run but then NULL
   from pick_task(). This makes sched core call sched_ext's pick_task()
   without preceding balance() which can lead to stalls in partial mode.

   For now, work around by detecting the condition and forcing the CPU
   to go through another scheduling cycle.

 - Add a missing newline to an error message and fix drgn introspection
   tool which went out of sync.

* tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
  sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type
  sched_ext: Add a missing newline at the end of an error message
2024-11-11 14:09:57 -08:00
Tejun Heo
5cbb302880 sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
In sched_ext API, a repeatedly reported pain point is the overuse of the
verb "dispatch" and confusion around "consume":

- ops.dispatch()
- scx_bpf_dispatch[_vtime]()
- scx_bpf_consume()
- scx_bpf_dispatch[_vtime]_from_dsq*()

This overloading of the term is historical. Originally, there were only
built-in DSQs and moving a task into a DSQ always dispatched it for
execution. Using the verb "dispatch" for the kfuncs to move tasks into these
DSQs made sense.

Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be
able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was
from a non-local DSQ to a local DSQ and this operation was named "consume".
This was already confusing as a task could be dispatched to a user DSQ from
ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch().
Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse
as "dispatch" in this context meant moving a task to an arbitrary DSQ from a
user DSQ.

Clean up the API with the following renames:

1. scx_bpf_dispatch[_vtime]()		-> scx_bpf_dsq_insert[_vtime]()
2. scx_bpf_consume()			-> scx_bpf_dsq_move_to_local()
3. scx_bpf_dispatch[_vtime]_from_dsq*()	-> scx_bpf_dsq_move[_vtime]*()

This patch performs the third set of renames. Compatibility is maintained
by:

- The previous kfunc names are still provided by the kernel so that old
  binaries can run. Kernel generates a warning when the old names are used.

- compat.bpf.h provides wrappers for the new names which automatically fall
  back to the old names when running on older kernels. They also trigger
  build error if old names are used for new builds.

- scx_bpf_dispatch[_vtime]_from_dsq*() were already wrapped in __COMPAT
  macros as they were introduced during v6.12 cycle. Wrap new API in
  __COMPAT macros too and trigger build errors on both __COMPAT prefixed and
  naked usages of the old names.

The compat features will be dropped after v6.15.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Johannes Bechberger <me@mostlynerdless.de>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Ming Yang <yougmark94@gmail.com>
2024-11-11 07:06:16 -10:00
Tejun Heo
5209c03c8e sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()
In sched_ext API, a repeatedly reported pain point is the overuse of the
verb "dispatch" and confusion around "consume":

- ops.dispatch()
- scx_bpf_dispatch[_vtime]()
- scx_bpf_consume()
- scx_bpf_dispatch[_vtime]_from_dsq*()

This overloading of the term is historical. Originally, there were only
built-in DSQs and moving a task into a DSQ always dispatched it for
execution. Using the verb "dispatch" for the kfuncs to move tasks into these
DSQs made sense.

Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be
able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was
from a non-local DSQ to a local DSQ and this operation was named "consume".
This was already confusing as a task could be dispatched to a user DSQ from
ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch().
Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse
as "dispatch" in this context meant moving a task to an arbitrary DSQ from a
user DSQ.

Clean up the API with the following renames:

1. scx_bpf_dispatch[_vtime]()		-> scx_bpf_dsq_insert[_vtime]()
2. scx_bpf_consume()			-> scx_bpf_dsq_move_to_local()
3. scx_bpf_dispatch[_vtime]_from_dsq*()	-> scx_bpf_dsq_move[_vtime]*()

This patch performs the second rename. Compatibility is maintained by:

- The previous kfunc names are still provided by the kernel so that old
  binaries can run. Kernel generates a warning when the old names are used.

- compat.bpf.h provides wrappers for the new names which automatically fall
  back to the old names when running on older kernels. They also trigger
  build error if old names are used for new builds.

The compat features will be dropped after v6.15.

v2: Comment and documentation updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Johannes Bechberger <me@mostlynerdless.de>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Ming Yang <yougmark94@gmail.com>
2024-11-11 07:06:16 -10:00
Tejun Heo
cc26abb1a1 sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()
In sched_ext API, a repeatedly reported pain point is the overuse of the
verb "dispatch" and confusion around "consume":

- ops.dispatch()
- scx_bpf_dispatch[_vtime]()
- scx_bpf_consume()
- scx_bpf_dispatch[_vtime]_from_dsq*()

This overloading of the term is historical. Originally, there were only
built-in DSQs and moving a task into a DSQ always dispatched it for
execution. Using the verb "dispatch" for the kfuncs to move tasks into these
DSQs made sense.

Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be
able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was
from a non-local DSQ to a local DSQ and this operation was named "consume".
This was already confusing as a task could be dispatched to a user DSQ from
ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch().
Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse
as "dispatch" in this context meant moving a task to an arbitrary DSQ from a
user DSQ.

Clean up the API with the following renames:

1. scx_bpf_dispatch[_vtime]()		-> scx_bpf_dsq_insert[_vtime]()
2. scx_bpf_consume()			-> scx_bpf_dsq_move_to_local()
3. scx_bpf_dispatch[_vtime]_from_dsq*()	-> scx_bpf_dsq_move[_vtime]*()

This patch performs the first set of renames. Compatibility is maintained
by:

- The previous kfunc names are still provided by the kernel so that old
  binaries can run. Kernel generates a warning when the old names are used.

- compat.bpf.h provides wrappers for the new names which automatically fall
  back to the old names when running on older kernels. They also trigger
  build error if old names are used for new builds.

The compat features will be dropped after v6.15.

v2: Documentation updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Johannes Bechberger <me@mostlynerdless.de>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Ming Yang <yougmark94@gmail.com>
2024-11-11 07:06:16 -10:00
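
The compat.bpf.h wrappers mentioned in the renames above follow a pattern
roughly like this sketch (not the actual macros; it assumes both kfuncs are
declared __ksym __weak so that bpf_ksym_exists() can test for the new name):

  static inline void compat_dsq_insert(struct task_struct *p, u64 dsq_id,
                                       u64 slice, u64 enq_flags)
  {
          if (bpf_ksym_exists(scx_bpf_dsq_insert))
                  /* new name, available on v6.13+ kernels */
                  scx_bpf_dsq_insert(p, dsq_id, slice, enq_flags);
          else
                  /* fall back to the old name on older kernels */
                  scx_bpf_dispatch(p, dsq_id, slice, enq_flags);
  }
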
Tejun Heo
a6250aa251 sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
sched_ext dispatches tasks from the BPF scheduler from balance_scx() and
thus every pick_task_scx() call must be preceded by balance_scx(). While
this usually holds, due to a bug, there are cases where the fair class's
balance() returns true, indicating that it has tasks to run on the CPU and
thus terminating the balance() calls, but then fails to actually find the
next task to run when pick_task() is called. In such cases, pick_task_scx()
can be called without a preceding balance_scx().

Detect this condition using SCX_RQ_BAL_PENDING flags. If detected, keep
running the previous task if possible and avoid stalling from entering idle
without balancing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/Ztj_h5c2LYsdXYbA@slm.duckdns.org
2024-11-09 10:43:55 -10:00
Tejun Heo
72b85bf6a7 sched_ext: scx_bpf_dispatch_from_dsq_set_*() are allowed from unlocked context
4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
added four kfuncs for dispatching while iterating. They are allowed from the
dispatch and unlocked contexts but two of the kfuncs were only added in the
dispatch section. Add missing declarations in the unlocked section.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
2024-11-09 09:40:25 -10:00
Changwoo Min
f39489fea6 sched_ext: add a missing rcu_read_lock/unlock pair at scx_select_cpu_dfl()
When getting an LLC CPU mask in the default CPU selection policy,
scx_select_cpu_dfl(), a pointer to the sched_domain is dereferenced with
rcu_dereference() without holding rcu_read_lock(). Such an unprotected
dereference often causes the following warning and can cause an invalid
memory access in the worst case.

Therefore, protect the dereference of the sched_domain pointer with a
pair of rcu_read_lock() and rcu_read_unlock().

[   20.996135] =============================
[   20.996345] WARNING: suspicious RCU usage
[   20.996563] 6.11.0-virtme #17 Tainted: G        W
[   20.996576] -----------------------------
[   20.996576] kernel/sched/ext.c:3323 suspicious rcu_dereference_check() usage!
[   20.996576]
[   20.996576] other info that might help us debug this:
[   20.996576]
[   20.996576]
[   20.996576] rcu_scheduler_active = 2, debug_locks = 1
[   20.996576] 4 locks held by kworker/8:1/140:
[   20.996576]  #0: ffff8b18c00dd348 ((wq_completion)pm){+.+.}-{0:0}, at: process_one_work+0x4a0/0x590
[   20.996576]  #1: ffffb3da01f67e58 ((work_completion)(&dev->power.work)){+.+.}-{0:0}, at: process_one_work+0x1ba/0x590
[   20.996576]  #2: ffffffffa316f9f0 (&rcu_state.gp_wq){..-.}-{2:2}, at: swake_up_one+0x15/0x60
[   20.996576]  #3: ffff8b1880398a60 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x59/0x7d0
[   20.996576]
[   20.996576] stack backtrace:
[   20.996576] CPU: 8 UID: 0 PID: 140 Comm: kworker/8:1 Tainted: G        W          6.11.0-virtme #17
[   20.996576] Tainted: [W]=WARN
[   20.996576] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   20.996576] Workqueue: pm pm_runtime_work
[   20.996576] Sched_ext: simple (disabling+all), task: runnable_at=-6ms
[   20.996576] Call Trace:
[   20.996576]  <IRQ>
[   20.996576]  dump_stack_lvl+0x6f/0xb0
[   20.996576]  lockdep_rcu_suspicious.cold+0x4e/0x96
[   20.996576]  scx_select_cpu_dfl+0x234/0x260
[   20.996576]  select_task_rq_scx+0xfb/0x190
[   20.996576]  select_task_rq+0x47/0x110
[   20.996576]  try_to_wake_up+0x110/0x7d0
[   20.996576]  swake_up_one+0x39/0x60
[   20.996576]  rcu_core+0xb08/0xe50
[   20.996576]  ? srso_alias_return_thunk+0x5/0xfbef5
[   20.996576]  ? mark_held_locks+0x40/0x70
[   20.996576]  handle_softirqs+0xd3/0x410
[   20.996576]  irq_exit_rcu+0x78/0xa0
[   20.996576]  sysvec_apic_timer_interrupt+0x73/0x80
[   20.996576]  </IRQ>
[   20.996576]  <TASK>
[   20.996576]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   20.996576] RIP: 0010:_raw_spin_unlock_irqrestore+0x36/0x70
[   20.996576] Code: f5 53 48 8b 74 24 10 48 89 fb 48 83 c7 18 e8 11 b4 36 ff 48 89 df e8 99 0d 37 ff f7 c5 00 02 00 00 75 17 9c 58 f6 c4 02 75 2b <65> ff 0d 5b 55 3c 5e 74 16 5b 5d e9 95 8e 28 00 e8 a5 ee 44 ff 9c
[   20.996576] RSP: 0018:ffffb3da01f67d20 EFLAGS: 00000246
[   20.996576] RAX: 0000000000000002 RBX: ffffffffa4640220 RCX: 0000000000000040
[   20.996576] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa1c7b27b
[   20.996576] RBP: 0000000000000246 R08: 0000000000000001 R09: 0000000000000000
[   20.996576] R10: 0000000000000001 R11: 000000000000021c R12: 0000000000000246
[   20.996576] R13: ffff8b1881363958 R14: 0000000000000000 R15: ffff8b1881363800
[   20.996576]  ? _raw_spin_unlock_irqrestore+0x4b/0x70
[   20.996576]  serial_port_runtime_resume+0xd4/0x1a0
[   20.996576]  ? __pfx_serial_port_runtime_resume+0x10/0x10
[   20.996576]  __rpm_callback+0x44/0x170
[   20.996576]  ? __pfx_serial_port_runtime_resume+0x10/0x10
[   20.996576]  rpm_callback+0x55/0x60
[   20.996576]  ? __pfx_serial_port_runtime_resume+0x10/0x10
[   20.996576]  rpm_resume+0x582/0x7b0
[   20.996576]  pm_runtime_work+0x7c/0xb0
[   20.996576]  process_one_work+0x1fb/0x590
[   20.996576]  worker_thread+0x18e/0x350
[   20.996576]  ? __pfx_worker_thread+0x10/0x10
[   20.996576]  kthread+0xe2/0x110
[   20.996576]  ? __pfx_kthread+0x10/0x10
[   20.996576]  ret_from_fork+0x34/0x50
[   20.996576]  ? __pfx_kthread+0x10/0x10
[   20.996576]  ret_from_fork_asm+0x1a/0x30
[   20.996576]  </TASK>
[   21.056592] sched_ext: BPF scheduler "simple" disabled (unregistered from user space)

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-09 05:47:00 -10:00
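
The fix amounts to wrapping the lookup in an RCU read-side critical section,
roughly as in this sketch of the pattern (not the exact ext.c code; prev_cpu
is whatever CPU the lookup targets):

  struct sched_domain *sd;

  rcu_read_lock();
  sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
  if (sd) {
          const struct cpumask *llc_cpus = sched_domain_span(sd);

          /* ... use llc_cpus while still inside the read-side section ... */
  }
  rcu_read_unlock();
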
Changwoo Min
153591f703 sched_ext: Clarify sched_ext_ops table for userland scheduler
Update the comments in sched_ext_ops to clarify that this table is for
a BPF scheduler and that a userland scheduler should also rely on the
sched_ext_ops table through the BPF scheduler.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-08 16:40:27 -10:00
Tejun Heo
e32c260195 sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
On 2 x Intel Sapphire Rapids machines with 224 logical CPUs, a poorly
behaving BPF scheduler can live-lock the system by making multiple CPUs bang
on the same DSQ to the point where soft-lockup detection triggers before
SCX's own watchdog can take action. It also seems possible that the machine
can be live-locked enough to prevent scx_ops_helper, which is an RT task,
from running in a timely manner.

Implement scx_softlockup(), which is called when three quarters of the
soft-lockup threshold has passed. The function immediately enables the ops
breather and triggers an ops error to initiate ejection of the BPF
scheduler.

The previous patch and this one combined enable the kernel to reliably
recover the system from live-lock conditions that can be triggered by a
poorly behaving BPF scheduler on Intel dual socket systems.
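
Roughly, the handler described above might look like the following. This is
a hedged sketch: scx_breather_depth and the error/message details are
assumptions, not necessarily the upstream identifiers.

  /* Sketch only, not the upstream implementation. */
  void scx_softlockup(u32 dur_s)
  {
          /* Only meaningful while a BPF scheduler is loaded. */
          if (!scx_enabled())
                  return;

          /* Let CPUs stuck on contended DSQ locks back off immediately. */
          atomic_inc(&scx_breather_depth);

          /* Trigger an ops error so the BPF scheduler gets ejected. */
          scx_ops_error("soft lockup - CPU#%d stuck for %us",
                        smp_processor_id(), dur_s);
  }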

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
2024-11-08 10:42:22 -10:00
Tejun Heo
62dcbab8b0 sched_ext: Avoid live-locking bypass mode switching
A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly
banging on the same DSQ on a large NUMA system to the point where switching
to the bypass mode can take a long time. Turning on the bypass mode requires
dequeueing and re-enqueueing currently runnable tasks; if the DSQs that they
are on are live-locked, this can take tens of seconds, cascading into other
failures. This was observed on 2 x Intel Sapphire Rapids machines with 224
logical CPUs.

Inject artificial delays while the bypass mode is switching to guarantee
timely completion.

While at it, move __scx_ops_bypass_lock into scx_ops_bypass() and rename it
to bypass_lock.
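
One plausible shape of the delay injection, as a hedged sketch
(scx_breather_depth and scx_breathe() are made-up names; the real code
differs):

  static atomic_t scx_breather_depth = ATOMIC_INIT(0);

  /*
   * Called from the contended paths with the rq lock held: while a bypass
   * switch is in flight, drop the lock and spin briefly so the switching
   * CPU can make forward progress.
   */
  static void scx_breathe(struct rq *rq)
  {
          if (likely(!atomic_read(&scx_breather_depth)))
                  return;

          raw_spin_rq_unlock(rq);
          while (atomic_read(&scx_breather_depth))
                  cpu_relax();
          raw_spin_rq_lock(rq);
  }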

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Valentin Andrei <vandrei@meta.com>
Reported-by: Patrick Lu <patlu@meta.com>
2024-11-08 10:42:13 -10:00
Tejun Heo
f07b806ad8 Merge branch 'for-6.12-fixes' into for-6.13
Pull sched_ext/for-6.12-fixes to receive 0e7ffff1b8 ("scx: Fix raciness in
scx_ops_bypass()"). Planned updates for scx_ops_bypass() depends on it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-08 10:40:44 -10:00
Andrea Righi
6d594af5bf sched_ext: Fix incorrect use of bitwise AND
There is no reason to use a bitwise AND when checking the conditions to
enable NUMA optimization for the built-in CPU idle selection policy, so
use a logical AND instead.
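
The difference in miniature (the condition below is illustrative;
enable_numa, num_online_nodes() and llc_numa_mismatch() stand in for
whatever the real check uses):

  /* Bitwise AND on boolean operands happens to work, but defeats
   * short-circuit evaluation and obscures the intent: */
  enable_numa = (num_online_nodes() > 1) & llc_numa_mismatch();

  /* Logical AND states the intended condition: */
  enable_numa = (num_online_nodes() > 1) && llc_numa_mismatch();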

Fixes: f6ce6b9493 ("sched_ext: Do not enable LLC/NUMA optimizations when domains overlap")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/lkml/20241108181753.GA2681424@thelio-3990X/
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-08 09:56:38 -10:00
Andrea Righi
f6ce6b9493 sched_ext: Do not enable LLC/NUMA optimizations when domains overlap
When the LLC and NUMA domains fully overlap, enabling both optimizations
in the built-in idle CPU selection policy is redundant, as it leads to
searching for an idle CPU within the same domain twice.

Likewise, if all online CPUs are within a single LLC domain, LLC
optimization is unnecessary.

Therefore, detect overlapping domains and enable topology optimizations
only when necessary.

Moreover, rely on the online CPUs for this detection logic, instead of
using the possible CPUs.
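
A condensed sketch of that detection (helper names such as llc_span(),
numa_span() and online_llc_domains() are placeholders, not the upstream
API):

  /* The two levels are redundant if every online CPU's LLC span matches
   * its NUMA node span. */
  static bool llc_numa_mismatch(void)
  {
          int cpu;

          for_each_online_cpu(cpu)
                  if (!cpumask_equal(llc_span(cpu), numa_span(cpu)))
                          return true;
          return false;
  }

  static void update_topology_flags(void)
  {
          /* LLC awareness only pays off with more than one LLC online. */
          enable_llc = online_llc_domains() > 1;

          /* NUMA awareness only pays off when the nodes don't coincide
           * with the LLC domains. */
          enable_numa = num_online_nodes() > 1 && llc_numa_mismatch();
  }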

Fixes: 860a45219b ("sched_ext: Introduce NUMA awareness to the default idle selection policy")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-07 14:56:39 -10:00
Tejun Heo
f7d1b585e1 sched_ext: Add a missing newline at the end of an error message
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-05 11:45:24 -10:00
Thomas Gleixner
0f0d1b8e50 sched/ext: Remove sched_fork() hack
Instead of solving the underlying problem of the double invocation of
__sched_fork() for idle tasks, sched-ext decided to hack around the issue
by partially clearing out the entity struct to preserve the already
enqueued node. A provided analysis and solution has been ignored for four
months.

Now that someone else has taken care of cleaning it up, remove the
disgusting hack and clear out the full structure. Remove the comment in the
structure declaration as well, as there is no requirement for @node being
the last element anymore.
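
The cleanup boils down to replacing the partial clear that skipped the tail
node with a full clear. A sketch, assuming the preserved member was the
trailing tasks_node:

  /* before (sketch): preserve the already linked tail node */
  memset(scx, 0, offsetof(struct sched_ext_entity, tasks_node));

  /* after (sketch): the double __sched_fork() is gone, clear it all */
  memset(scx, 0, sizeof(*scx));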

Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/87ldy82wkc.ffs@tglx
2024-11-05 12:55:37 +01:00
Linus Torvalds
33e83ffe4c A few scheduler fixes:
- Plug a race between pick_next_task_fair() and try_to_wake_up() where
    both try to write to the same task, even though both paths hold a
    runqueue lock, but obviously from different runqueues.
 
    The problem is that the store to task::on_rq in __block_task() is
    visible to try_to_wake_up() which assumes that the task is not queued.
    Both sides then operate on the same task.
 
    Cure it by rearranging __block_task() so that the store to task::on_rq is
    the last operation on the task.
 
  - Prevent a potential NULL pointer dereference in task_numa_work()
 
    task_numa_work() iterates the VMAs of a process. A concurrent unmap of
    the address space can result in a NULL pointer return from vma_next()
    which is unchecked.
 
    Add the missing NULL pointer check to prevent this.
 
  - Operate on the correct scheduler policy in task_should_scx()
 
    task_should_scx() returns true when a task should be handled by sched
    EXT. It checks the task's scheduling policy.
 
    This fails when the check is done before a policy has been set.
 
    Cure it by handing the policy into task_should_scx() so it operates
    on the requested value.
 
  - Add the missing handling of sched EXT in the delayed dequeue
    mechanism. This was simply forgotten.

Merge tag 'sched-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:

 - Plug a race between pick_next_task_fair() and try_to_wake_up() where
   both try to write to the same task, even though both paths hold a
   runqueue lock, but obviously from different runqueues.

   The problem is that the store to task::on_rq in __block_task() is
   visible to try_to_wake_up() which assumes that the task is not
   queued. Both sides then operate on the same task.

   Cure it by rearranging __block_task() so that the store to task::on_rq
   is the last operation on the task.

 - Prevent a potential NULL pointer dereference in task_numa_work()

   task_numa_work() iterates the VMAs of a process. A concurrent unmap
   of the address space can result in a NULL pointer return from
   vma_next() which is unchecked.

   Add the missing NULL pointer check to prevent this.

 - Operate on the correct scheduler policy in task_should_scx()

   task_should_scx() returns true when a task should be handled by sched
   EXT. It checks the task's scheduling policy.

   This fails when the check is done before a policy has been set.

   Cure it by handing the policy into task_should_scx() so it operates
   on the requested value (a hedged sketch follows this list).

 - Add the missing handling of sched EXT in the delayed dequeue
   mechanism. This was simply forgotten.
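
For the task_should_scx() item above, the fix amounts to deciding on the
policy value handed in by the caller instead of reading it from the task.
A hedged sketch, with the enable-state checks abbreviated:

  bool task_should_scx(int policy)
  {
          if (!scx_enabled())
                  return false;
          if (READ_ONCE(scx_switching_all))
                  return true;
          return policy == SCHED_EXT;
  }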

* tag 'sched-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/ext: Fix scx vs sched_delayed
  sched: Pass correct scheduling policy to __setscheduler_class
  sched/numa: Fix the potential null pointer dereference in task_numa_work()
  sched: Fix pick_next_task_fair() vs try_to_wake_up() race
2024-11-03 08:18:28 -10:00
Peter Zijlstra
69d5e722be sched/ext: Fix scx vs sched_delayed
Commit 98442f0ccd ("sched: Fix delayed_dequeue vs
switched_from_fair()") forgot about scx :/

Fixes: 98442f0ccd ("sched: Fix delayed_dequeue vs switched_from_fair()")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20241030104934.GK14555@noisy.programming.kicks-ass.net
2024-10-30 22:42:12 +01:00
Linus Torvalds
daa9f66fe1 sched_ext: Fixes for v6.12-rc5
- Instances of scx_ops_bypass() could race each other leading to
   misbehavior. Fix by protecting the operation with a spinlock.
 
 - selftest and userspace header fixes.

Merge tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Instances of scx_ops_bypass() could race each other leading to
   misbehavior. Fix by protecting the operation with a spinlock.

 - selftest and userspace header fixes

* tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix enq_last_no_enq_fails selftest
  sched_ext: Make cast_mask() inline
  scx: Fix raciness in scx_ops_bypass()
  scx: Fix exit selftest to use custom DSQ
  sched_ext: Fix function pointer type mismatches in BPF selftests
  selftests/sched_ext: add order-only dependency of runner.o on BPFOBJ
2024-10-29 16:35:40 -10:00
Andrea Righi
860a45219b sched_ext: Introduce NUMA awareness to the default idle selection policy
Similarly to commit dfa4ed29b1 ("sched_ext: Introduce LLC awareness to
the default idle selection policy"), extend the built-in idle CPU
selection policy to also prioritize CPUs within the same NUMA node.

With this change applied, the built-in CPU idle selection policy follows
this logic:
 - always prioritize CPUs from fully idle SMT cores,
 - select the same CPU if possible,
 - select a CPU within the same LLC domain,
 - select a CPU within the same NUMA node.

Both NUMA and LLC awareness features are enabled only when the system
has multiple NUMA nodes or multiple LLC domains.

In the future, we may want to improve the NUMA node selection to account for
the node distance from prev_cpu. Currently, the logic only tries to keep
tasks running on the same NUMA node. If all CPUs within a node are busy,
the next NUMA node is chosen randomly.
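
In pseudo-C, the resulting order of preference looks roughly like this (the
helpers are illustrative stand-ins, not the upstream API):

  static s32 pick_idle_cpu_sketch(struct task_struct *p, s32 prev_cpu)
  {
          s32 cpu;

          /* 1) a CPU from a fully idle SMT core, ideally prev_cpu's core */
          cpu = pick_idle_smt(p, prev_cpu);
          if (cpu >= 0)
                  return cpu;

          /* 2) prev_cpu itself, if it is idle */
          if (cpu_is_idle(prev_cpu))
                  return prev_cpu;

          /* 3) an idle CPU sharing prev_cpu's LLC */
          cpu = pick_idle_in_llc(p, prev_cpu);
          if (cpu >= 0)
                  return cpu;

          /* 4) an idle CPU on prev_cpu's NUMA node */
          cpu = pick_idle_in_node(p, prev_cpu);
          if (cpu >= 0)
                  return cpu;

          /* otherwise, any idle CPU allowed by p->cpus_ptr */
          return pick_any_idle(p);
  }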

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-29 09:36:35 -10:00