Commit Graph

4774 Commits

Author SHA1 Message Date
Tejun Heo
c201ea1578 sched_ext: Move event_stats_cpu into scx_sched
The event counters are going to become per scheduler instance. Move
event_stats_cpu into scx_sched.

- [__]scx_add_event() are updated to take @sch. While at it, add missing
  parentheses around @cnt expansions.

- scx_read_events() is updated to take @sch.

- scx_bpf_events() accesses scx_root under RCU read lock.

v2: - Replace stale scx_bpf_get_event_stat() reference in a comment with
      scx_bpf_events().

    - Trivial goto label rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
3a8facc424 sched_ext: Factor out scx_read_events()
In preparation for moving event_stats_cpu into scx_sched, factor
scx_read_events() out of scx_bpf_events() and update the in-kernel users -
scx_attr_events_show() and scx_dump_state() - to use scx_read_events()
instead of scx_bpf_events(). No observable behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
f97a79156a sched_ext: Relocate scx_event_stats definition
In preparation for moving event_stats_cpu into scx_sched, move the
scx_event_stats definitions above the scx_sched definition. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
8409b800a0 sched_ext: Move global_dsqs into scx_sched
Global DSQs are going to become per scheduler instance. Move global_dsqs
into scx_sched. find_global_dsq() already takes a task_struct pointer as an
argument and should later be able to determine the scx_sched to use from
that. For now, assume scx_root.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
cdf5a6faa8 sched_ext: Move dsq_hash into scx_sched
User DSQs are going to become per scheduler instance. Move dsq_hash into
scx_sched. This shifts the code that assumes scx_root to be the only
scx_sched instance up the call stack but doesn't remove them yet.

v2: Add missing rcu_read_lock() in scx_bpf_destroy_dsq() as per Andrea.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
d9f7546310 sched_ext: Factor out scx_alloc_and_add_sched()
More will be moved into scx_sched. Factor out the allocation and kobject
addition path into scx_alloc_and_add_sched().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
392b7e08de sched_ext: Inline create_dsq() into scx_bpf_create_dsq()
create_dsq() is only used by scx_bpf_create_dsq() and the separation gets in
the way of making dsq_hash per scx_sched. Inline it into
scx_bpf_create_dsq(). While at it, add unlikely() around
SCX_DSQ_FLAG_BUILTIN test.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
17108735b4 sched_ext: Use dynamic allocation for scx_sched
To prepare for supporting multiple schedulers, make scx_sched allocated
dynamically. scx_sched->kobj is now an embedded field and the kobj's
lifetime determines the lifetime of the containing scx_sched.

- Enable path is updated so that kobj init and addition are performed later.

- scx_sched freeing is initiated in scx_kobj_release() and also goes through
  an rcu_work so that scx_root can be accessed from an unsynchronized path -
  scx_disable().

- sched_ext_ops->priv is added and used to point to scx_sched instance
  created for the ops instance. This is used by bpf_scx_unreg() to determine
  the scx_sched instance to disable and put.

No behavior changes intended.

v2: Andrea reported kernel oops due to scx_bpf_unreg() trying to deref NULL
    scx_root after scheduler init failure. sched_ext_ops->priv added so that
    scx_bpf_unreg() can always find the scx_sched instance to unregister
    even if it failed early during init.
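
A minimal user-space sketch of the ops->priv back-pointer pattern described
above (purely illustrative; the names and layout are invented, not the
kernel's):

#include <stdio.h>
#include <stdlib.h>

struct toy_ops { void *priv; };
struct toy_sched { struct toy_ops *ops; int id; };

static struct toy_sched *toy_reg(struct toy_ops *ops, int id)
{
        struct toy_sched *sch = calloc(1, sizeof(*sch));

        if (!sch)
                return NULL;
        sch->ops = ops;
        sch->id = id;
        ops->priv = sch;        /* remember the instance on the ops struct */
        return sch;
}

static void toy_unreg(struct toy_ops *ops)
{
        struct toy_sched *sch = ops->priv;      /* no global lookup needed */

        if (sch) {
                printf("unregistering sched %d\n", sch->id);
                ops->priv = NULL;
                free(sch);
        }
}

int main(void)
{
        struct toy_ops ops = { 0 };

        toy_reg(&ops, 42);
        toy_unreg(&ops);
        return 0;
}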

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
a77d10d032 sched_ext: Avoid NULL scx_root deref through SCX_HAS_OP()
SCX_HAS_OP() tests scx_root->has_op bitmap. The bitmap is currently in a
statically allocated struct scx_sched and initialized while loading the BPF
scheduler and cleared while unloading, and thus can be tested anytime.
However, scx_root will be switched to dynamic allocation and thus won't
always be dereferenceable.

Most usages of SCX_HAS_OP() are already protected by scx_enabled() either
directly or indirectly (e.g. through a task which is on SCX). However, there
are a couple places that could try to dereference NULL scx_root. Update them
so that scx_root is guaranteed to be valid before SCX_HAS_OP() is called.

- In handle_hotplug(), test whether scx_root is NULL before doing anything
  else. This is safe because scx_root updates will be protected by
  cpus_read_lock().

- In scx_tg_offline(), test scx_cgroup_enabled before invoking SCX_HAS_OP(),
  which should guarantee that scx_root won't turn NULL. This is also in line
  with other cgroup operations. As the code path is synchronized against
  scx_cgroup_init/exit() through scx_cgroup_rwsem, this shouldn't cause any
  behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
48e1267773 sched_ext: Introduce scx_sched
To support multiple scheduler instances, collect some of the global
variables that should be specific to a scheduler instance into the new
struct scx_sched. scx_root is the root scheduler instance and points to a
static instance of struct scx_sched. Except for an extra dereference through
the scx_root pointer, this patch makes no functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00
Tejun Heo
ce565f839c Merge branch 'for-6.15-fixes' into for-6.16
To receive e38be1c764 ("sched_ext: Fix rq lock state in hotplug ops") to
avoid conflicts with scx_sched related patches pending for for-6.16.
2025-04-29 08:24:58 -10:00
Andrea Righi
e38be1c764 sched_ext: Fix rq lock state in hotplug ops
The ops.cpu_online() and ops.cpu_offline() callbacks incorrectly assume
that the rq involved in the operation is locked, which is not the case
during hotplug, triggering the following warning:

  WARNING: CPU: 1 PID: 20 at kernel/sched/sched.h:1504 handle_hotplug+0x280/0x340

Fix by not tracking the target rq as locked in the context of
ops.cpu_online() and ops.cpu_offline().

Fixes: 18853ba782 ("sched_ext: Track currently locked rq")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Tested-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-29 08:01:32 -10:00
Linus Torvalds
3d23ef05c3 Fix sporadic crashes in dequeue_entities() due to ... bad math.
[ Arguably if pick_eevdf()/pick_next_entity() was less trusting
   of complex math being correct it could have de-escalated a
   crash into a warning, but that's for a different patch. ]
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Fix sporadic crashes in dequeue_entities() due to ... bad math.

  [ Arguably if pick_eevdf()/pick_next_entity() was less trusting of
    complex math being correct it could have de-escalated a crash into
    a warning, but that's for a different patch ]"

* tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
2025-04-26 09:23:20 -07:00
Omar Sandoval
bbce3de72b sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
There is a code path in dequeue_entities() that can set the slice of a
sched_entity to U64_MAX, which sometimes results in a crash.

The offending case is when dequeue_entities() is called to dequeue a
delayed group entity, and then the entity's parent's dequeue is delayed.
In that case:

1. In the if (entity_is_task(se)) else block at the beginning of
   dequeue_entities(), slice is set to
   cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then
   it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
2. The first for_each_sched_entity() loop dequeues the entity.
3. If the entity was its parent's only child, then the next iteration
   tries to dequeue the parent.
4. If the parent's dequeue needs to be delayed, then it breaks from the
   first for_each_sched_entity() loop _without updating slice_.
5. The second for_each_sched_entity() loop sets the parent's ->slice to
   the saved slice, which is still U64_MAX.

This throws off subsequent calculations with potentially catastrophic
results. A manifestation we saw in production was:

6. In update_entity_lag(), se->slice is used to calculate limit, which
   ends up as a huge negative number.
7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit
   is negative, vlag > limit, so se->vlag is set to the same huge
   negative number.
8. In place_entity(), se->vlag is scaled, which overflows and results in
   another huge (positive or negative) number.
9. The adjusted lag is subtracted from se->vruntime, which increases or
   decreases se->vruntime by a huge number.
10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which
    incorrectly returns false because the vruntime is so far from the
    other vruntimes on the queue, causing the
    (vruntime - cfs_rq->min_vruntime) * load calculation to overflow.
11. Nothing appears to be eligible, so pick_eevdf() returns NULL.
12. pick_next_entity() tries to dereference the return value of
    pick_eevdf() and crashes.

Dumping the cfs_rq states from the core dumps with drgn showed tell-tale
huge vruntime ranges and bogus vlag values, and I also traced se->slice
being set to U64_MAX on live systems (which was usually "benign" since
the rest of the runqueue needed to be in a particular state to crash).

Fix it in dequeue_entities() by always setting slice from the first
non-empty cfs_rq.
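
For illustration, a tiny self-contained user-space demo (not kernel code; the
kernel's limit additionally goes through calc_delta_fair() scaling) of how a
U64_MAX slice flips the signed clamp limit negative, as in steps 6-7 above:

#include <stdint.h>
#include <stdio.h>

#define clamp(v, lo, hi) ((v) < (lo) ? (lo) : (v) > (hi) ? (hi) : (v))

int main(void)
{
        uint64_t slice = UINT64_MAX;    /* the bogus value from step 5 */
        /* doubling wraps and the signed cast flips the sign on typical
         * two's-complement targets (here to -2; in the kernel the scaling
         * turns it into a huge negative number) */
        int64_t limit = (int64_t)(2 * slice);
        int64_t vlag = 123456;          /* an ordinary lag value */

        /* with a negative limit, clamp() pins vlag to the negative bound
         * instead of bounding it, as in step 7 */
        printf("limit=%lld vlag=%lld\n",
               (long long)limit, (long long)clamp(vlag, -limit, limit));
        return 0;
}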

Fixes: aef6987d89 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com
2025-04-26 10:44:36 +02:00
Andrea Righi
e7dcd1304b sched_ext: Remove duplicate BTF_ID_FLAGS definitions
Some kfuncs specific to the idle CPU selection policy are registered in
both the scx_kfunc_ids_any and scx_kfunc_ids_idle blocks, even though
they should only be defined in the latter.

Remove the duplicates from scx_kfunc_ids_any.

Fixes: 337d1b354a ("sched_ext: Move built-in idle CPU selection policy to a separate file")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-25 08:41:02 -10:00
Andrea Righi
069ac9e161 sched_ext: Clarify CPU context for running/stopping callbacks
The ops.running() and ops.stopping() callbacks can be invoked from a CPU
other than the one the task is assigned to, particularly when a task
property is changed, as both scx_next_task_scx() and dequeue_task_scx() may
run on CPUs different from the task's target CPU.

This behavior can lead to confusion or incorrect assumptions if not
properly clarified, potentially resulting in bugs (see [1]).

Therefore, update the documentation to clarify this aspect and advise
users to use scx_bpf_task_cpu() to determine the actual CPU the task
will run on or was running on.

[1] https://github.com/sched-ext/scx/pull/1728

Cc: Jake Hillion <jake@hillion.co.uk>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-23 13:34:32 -10:00
Tejun Heo
ac47c272b2 Merge branch 'for-6.15-fixes' into for-6.16
a11d6784d7 ("sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()")
added a call to scx_ops_error() which was renamed to scx_error() in
for-6.16. Fix it up.
2025-04-22 09:29:44 -10:00
Andrea Righi
a11d6784d7 sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
scx_bpf_cpuperf_set() can be used to set a performance target level on
any CPU. However, it doesn't correctly acquire the corresponding rq
lock, which may lead to unsafe behavior and trigger the following
warning, due to the lockdep_assert_rq_held() check:

[   51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0
...
[   51.713836] Call trace:
[   51.713837]  scx_bpf_cpuperf_set+0x1a0/0x1e0 (P)
[   51.713839]  bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440
[   51.713841]  bpf__sched_ext_ops_init+0x54/0x8c
[   51.713843]  scx_ops_enable.constprop.0+0x2c0/0x10f0
[   51.713845]  bpf_scx_reg+0x18/0x30
[   51.713847]  bpf_struct_ops_link_create+0x154/0x1b0
[   51.713849]  __sys_bpf+0x1934/0x22a0

Fix by properly acquiring the rq lock when possible or raising an error
if we try to operate on a CPU that is not the one currently locked.

Fixes: d86adb4fc0 ("sched_ext: Add cpuperf support")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-22 09:28:12 -10:00
Andrea Righi
18853ba782 sched_ext: Track currently locked rq
Some kfuncs provided by sched_ext may need to operate on a struct rq,
but they can be invoked from various contexts, specifically, different
scx callbacks.

While some of these callbacks are invoked with a particular rq already
locked, others are not. This makes it impossible for a kfunc to reliably
determine whether it's safe to access a given rq, triggering potential
bugs or unsafe behaviors, see for example [1].

To address this, track the currently locked rq whenever a sched_ext
callback is invoked via SCX_CALL_OP*().

This allows kfuncs that need to operate on an arbitrary rq to retrieve
the currently locked one and apply the appropriate action as needed.

[1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-22 09:27:50 -10:00
Linus Torvalds
a33b5a08cb sched_ext: Fixes for v6.15-rc3
- Use kvzalloc() so that large exit_dump buffer allocations don't fail
   easily.
 
 - Remove cpu.weight / cpu.idle unimplemented warnings which are more
   annoying than helpful. This makes SCX_OPS_HAS_CGROUP_WEIGHT unnecessary.
   Mark it for deprecation.

Merge tag 'sched_ext-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Use kvzalloc() so that large exit_dump buffer allocations don't fail
   easily

 - Remove cpu.weight / cpu.idle unimplemented warnings which are more
   annoying than helpful.

   This makes SCX_OPS_HAS_CGROUP_WEIGHT unnecessary. Mark it for
   deprecation

* tag 'sched_ext-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation
  sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings
  sched_ext: Use kvzalloc for large exit_dump allocation
2025-04-21 19:16:29 -07:00
Honglei Wang
69120f8228 sched_ext: add helper for refill task with default slice
Add a helper for refilling a task with the default slice and updating
event statistics accordingly.

Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-18 17:27:00 -10:00
Honglei Wang
f203683c3e sched_ext: change the variable name for slice refill event
SCX_EV_ENQ_SLICE_DFL gives the impression that the event only occurs
when tasks are enqueued, which is not accurate. What it actually means
is refilling with the default slice, and this can occur either at
enqueue or at pick_task. Let's rename it to
SCX_EV_REFILL_SLICE_DFL.

Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-18 17:25:39 -10:00
Rafael J. Wysocki
75da043d8f cpufreq/sched: Set need_freq_update in ignore_dl_rate_limit()
Notice that ignore_dl_rate_limit() need not piggy back on the
limits_changed handling to achieve its goal (which is to enforce a
frequency update before its due time).

Namely, if sugov_should_update_freq() is updated to check
sg_policy->need_freq_update and return 'true' if it is set when
sg_policy->limits_changed is not set, ignore_dl_rate_limit() may
set the former directly instead of setting the latter, so it can
avoid hitting the memory barrier in sugov_should_update_freq().

Update the code accordingly.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/10666429.nUPlyArG6x@rjwysocki.net
2025-04-17 17:54:44 +02:00
Rafael J. Wysocki
79443a7e9d cpufreq/sched: Explicitly synchronize limits_changed flag handling
The handling of the limits_changed flag in struct sugov_policy needs to
be explicitly synchronized to ensure that cpufreq policy limits updates
will not be missed in some cases.

Without that synchronization it is theoretically possible that
the limits_changed update in sugov_should_update_freq() will be
reordered with respect to the reads of the policy limits in
cpufreq_driver_resolve_freq() and in that case, if the limits_changed
update in sugov_limits() clobbers the one in sugov_should_update_freq(),
the new policy limits may not take effect for a long time.

Likewise, the limits_changed update in sugov_limits() may theoretically
get reordered with respect to the updates of the policy limits in
cpufreq_set_policy() and if sugov_should_update_freq() runs between
them, the policy limits change may be missed.

To ensure that the above situations will not take place, add memory
barriers preventing the reordering in question from taking place and
add READ_ONCE() and WRITE_ONCE() annotations around all of the
limits_changed flag updates to prevent the compiler from messing up
with that code.
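
A rough user-space analogue of the required ordering, using C11 atomics in
place of the READ_ONCE()/WRITE_ONCE() annotations and smp_* barriers the
commit actually adds (illustrative sketch only):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static _Atomic bool limits_changed;
static _Atomic long policy_min, policy_max;

static void writer_set_limits(long lo, long hi)         /* ~sugov_limits() */
{
        atomic_store_explicit(&policy_min, lo, memory_order_relaxed);
        atomic_store_explicit(&policy_max, hi, memory_order_relaxed);
        /* release: the new limits are visible before the flag is set */
        atomic_store_explicit(&limits_changed, true, memory_order_release);
}

static bool reader_should_update(long *lo, long *hi)    /* ~sugov_should_update_freq() */
{
        if (!atomic_exchange_explicit(&limits_changed, false,
                                      memory_order_acquire))
                return false;
        /* acquire above keeps these reads from moving before the flag clear */
        *lo = atomic_load_explicit(&policy_min, memory_order_relaxed);
        *hi = atomic_load_explicit(&policy_max, memory_order_relaxed);
        return true;
}

int main(void)
{
        long lo, hi;

        writer_set_limits(400000, 1800000);
        if (reader_should_update(&lo, &hi))
                printf("apply limits %ld..%ld\n", lo, hi);
        return 0;
}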

Fixes: 600f5badb7 ("cpufreq: schedutil: Don't skip freq update when limits change")
Cc: 5.3+ <stable@vger.kernel.org> # 5.3+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/3376719.44csPzL39Z@rjwysocki.net
2025-04-17 17:54:44 +02:00
Rafael J. Wysocki
cfde542df7 cpufreq/sched: Fix the usage of CPUFREQ_NEED_UPDATE_LIMITS
Commit 8e461a1cb4 ("cpufreq: schedutil: Fix superfluous updates caused
by need_freq_update") modified sugov_should_update_freq() to set the
need_freq_update flag only for drivers with CPUFREQ_NEED_UPDATE_LIMITS
set, but that flag generally needs to be set when the policy limits
change because the driver callback may need to be invoked for the new
limits to take effect.

However, if the return value of cpufreq_driver_resolve_freq() after
applying the new limits is still equal to the previously selected
frequency, the driver callback needs to be invoked only in the case
when CPUFREQ_NEED_UPDATE_LIMITS is set (which means that the driver
specifically wants its callback to be invoked every time the policy
limits change).

Update the code accordingly to avoid missing policy limits changes for
drivers without CPUFREQ_NEED_UPDATE_LIMITS.

Fixes: 8e461a1cb4 ("cpufreq: schedutil: Fix superfluous updates caused by need_freq_update")
Closes: https://lore.kernel.org/lkml/Z_Tlc6Qs-tYpxWYb@linaro.org/
Reported-by: Stephan Gerhold <stephan.gerhold@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/3010358.e9J7NaK4W3@rjwysocki.net
2025-04-17 17:54:43 +02:00
Peter Zijlstra
c70fc32f44 sched/fair: Adhere to place_entity() constraints
Mike reports that commit 6d71a9c616 ("sched/fair: Fix EEVDF entity
placement bug causing scheduling lag") relies on commit 4423af84b2
("sched/fair: optimize the PLACE_LAG when se->vlag is zero") to not
trip a WARN in place_entity().

What happens is that the lag of the very last entity is 0 per
definition -- the average of one element matches the value of that
element. Therefore place_entity() will match the condition skipping
the lag adjustment:

  if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {

Without the 'se->vlag' condition -- it will attempt to adjust the zero
lag even though we're inserting into an empty tree.

Notably, we should have failed the 'cfs_rq->nr_queued' condition, but
don't because they didn't get updated.

Additionally, move update_load_add() after placement() as is
consistent with other place_entity() users -- this change is
non-functional, place_entity() does not use cfs_rq->load.

Fixes: 6d71a9c616 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/c216eb4ef0e0e0029c600aefc69d56681cee5581.camel@gmx.de
2025-04-16 21:09:12 +02:00
K Prateek Nayak
44671e21e3 sched/debug: Print the local group's asym_prefer_cpu
Add a file to read local group's "asym_prefer_cpu" from debugfs. This
information was useful when debugging issues where "asym_prefer_cpu" was
incorrectly set to a CPU with a lower asym priority.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409053446.23367-5-kprateek.nayak@amd.com
2025-04-16 21:09:11 +02:00
K Prateek Nayak
0e3f6c3696 sched/topology: Introduce sched_update_asym_prefer_cpu()
A subset of AMD Processors supporting Preferred Core Rankings also
feature the ability to dynamically switch these rankings at runtime to
bias load balancing towards or away from the LLC domain with larger
cache.

To support dynamically updating "sg->asym_prefer_cpu" without needing to
rebuild the sched domain, introduce sched_update_asym_prefer_cpu() which
recomputes the "asym_prefer_cpu" when the core ranking of a CPU changes.

sched_update_asym_prefer_cpu() swaps the "sg->asym_prefer_cpu" with the
CPU whose ranking has changed if the new ranking is greater than that of
the "asym_prefer_cpu". If CPU whose ranking has changed is the current
"asym_prefer_cpu", it scans the CPUs of the sched groups to find the new
"asym_prefer_cpu" and sets it accordingly.

get_group() for non-overlapping sched domains returns the sched group
for the first CPU in the sched_group_span() which ensures all CPUs in
the group see the updated value of "asym_prefer_cpu".

Overlapping groups are allocated differently and will require moving the
"asym_prefer_cpu" to "sg->sgc" but since the current implementations do
not set "SD_ASYM_PACKING" at NUMA domains, skip additional
indirection and place a SCHED_WARN_ON() to alert any future users.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409053446.23367-3-kprateek.nayak@amd.com
2025-04-16 21:09:11 +02:00
K Prateek Nayak
872aa4de18 sched/fair: Use READ_ONCE() to read sg->asym_prefer_cpu
Subsequent commits add the support to dynamically update the sched_group
struct's "asym_prefer_cpu" member from a remote CPU. Use READ_ONCE()
when reading the "sg->asym_prefer_cpu" to ensure load balancer always
reads the latest value.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409053446.23367-2-kprateek.nayak@amd.com
2025-04-16 21:09:11 +02:00
Tejun Heo
0b30461793 sched_ext: Make scx_has_op a bitmap
scx_has_op is used to encode which ops are implemented by the BPF scheduler
into an array of static_keys. While this saves a bit of branching overhead,
that is unlikely to be noticeable compared to the overall cost. As the
global static_keys can't work with the planned hierarchical multiple
scheduler support, replace the static_key array with a bitmap.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.
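
Conceptually, the bitmap test looks like the following toy sketch (user-space
illustration, not the kernel code):

#include <stdio.h>

/* encode which ops a scheduler implements as a per-instance bitmap
 * instead of an array of global static_keys */
enum { OP_SELECT_CPU, OP_ENQUEUE, OP_DISPATCH, NR_OPS };

struct toy_sched {
        unsigned long has_op;           /* bit i set => op i implemented */
};

#define TOY_HAS_OP(sch, op)     (((sch)->has_op >> (op)) & 1UL)

int main(void)
{
        struct toy_sched sch = {
                .has_op = (1UL << OP_ENQUEUE) | (1UL << OP_DISPATCH),
        };

        printf("select_cpu=%lu enqueue=%lu dispatch=%lu\n",
               TOY_HAS_OP(&sch, OP_SELECT_CPU),
               TOY_HAS_OP(&sch, OP_ENQUEUE),
               TOY_HAS_OP(&sch, OP_DISPATCH));
        return 0;
}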

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:06:00 -10:00
Tejun Heo
743354e3bb sched_ext: Remove scx_ops_allow_queued_wakeup static_key
scx_ops_allow_queued_wakeup is used to encode SCX_OPS_ALLOW_QUEUED_WAKEUP
into a static_key. The test is gated behind scx_enabled(), and, even when
sched_ext is enabled, is unlikely for the static_key usage to make any
meaningful difference. It is made to use a static_key mostly because there
was no reason not to. However, global static_keys can't work with the
planned hierarchical multiple scheduler support. Remove the static_key and
instead test SCX_OPS_ALLOW_QUEUED_WAKEUP directly.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:06:00 -10:00
Tejun Heo
54d2e717bc sched_ext: Remove scx_ops_cpu_preempt static_key
scx_ops_cpu_preempt is used to encode whether ops.cpu_acquire/release() are
implemented into a static_key. These tests aren't hot enough for static_key
usage to make any meaningful difference and are made to use a static_key
mostly because there was no reason not to. However, global static_keys can't
work with the planned hierarchical multiple scheduler support. Remove the
static_key and instead use an internal ops flag SCX_OPS_HAS_CPU_PREEMPT to
record and test whether ops.cpu_acquire/release() are implemented.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:06:00 -10:00
Tejun Heo
cc39454c34 sched_ext: Remove scx_ops_enq_* static_keys
scx_ops_enq_last/exiting/migration_disabled are used to encode the
corresponding SCX_OPS_ flags into static_keys. These flags aren't hot enough
for static_key usage to make any meaningful difference and are made
static_keys mostly because there was no reason not to. However, global
static_keys can't work with the planned hierarchical multiple scheduler
support. Remove the static_keys and test the ops flags directly.

In repeated hackbench runs before and after static_keys removal on an AMD
Ryzen 3900X, I couldn't tell any measurable performance difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:05:59 -10:00
Tejun Heo
d75ee2d678 sched_ext: Indentation updates
Purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09 09:05:59 -10:00
Tejun Heo
294f5ff474 sched_ext: Merge branch 'for-6.15-fixes' into for-6.16
Pull for-6.15-fixes to receive:

 e776b26e37 ("sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings")

which conflicts with:

 1a7ff7216c ("sched_ext: Drop "ops" from scx_ops_enable_state and friends")

The former removes code updated by the latter. Resolved by removing the
updated section.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08 08:56:57 -10:00
Phil Auld
6432e163ba sched/isolation: Make use of more than one housekeeping cpu
The existing code uses housekeeping_any_cpu() to select a cpu for
a given housekeeping task. However, this often ends up calling
cpumask_any_and(), which is defined as cpumask_first_and() and thus has
the effect of always using the first cpu among those available.

The same applies when multiple NUMA nodes are involved. In that
case the first cpu in the local node is chosen which does provide
a bit of spreading but with multiple HK cpus per node the same
issues arise.

We have numerous cases where a single HK cpu just cannot keep up
and the remote_tick warning fires. It can also lead to other things
(orchestration software, HA keepalives, etc.) on the HK cpus getting
starved, which leads to further issues.  In these cases we recommend
increasing the number of HK cpus.  But... that only helps the
userspace tasks somewhat. It does not help the actual housekeeping
part.

Spread the HK work out by having housekeeping_any_cpu() and
sched_numa_find_closest() use cpumask_any_and_distribute()
instead of cpumask_any_and().
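
The difference can be illustrated with a toy user-space model ("any" is really
"first", while "distribute" remembers the previous pick and rotates; this is
not the kernel implementation):

#include <stdint.h>
#include <stdio.h>

static int pick_first(uint64_t mask)
{
        return __builtin_ctzll(mask);
}

static int pick_distribute(uint64_t mask, int *prev)
{
        for (int i = 1; i <= 64; i++) {
                int cpu = (*prev + i) % 64;

                if (mask & (1ULL << cpu)) {
                        *prev = cpu;
                        return cpu;
                }
        }
        return -1;
}

int main(void)
{
        uint64_t hk_mask = 0x0f;        /* housekeeping CPUs 0-3 */
        int prev = 0;

        for (int i = 0; i < 6; i++)
                printf("first=%d distribute=%d\n",
                       pick_first(hk_mask), pick_distribute(hk_mask, &prev));
        return 0;
}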

Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20250218184618.1331715-1-pauld@redhat.com
2025-04-08 20:55:55 +02:00
Harshit Agarwal
690e47d140 sched/rt: Fix race in push_rt_task
Overview
========
When a CPU chooses to call push_rt_task and picks a task to push to
another CPU's runqueue then it will call find_lock_lowest_rq method
which would take a double lock on both CPUs' runqueues. If one of the
locks aren't readily available, it may lead to dropping the current
runqueue lock and reacquiring both the locks at once. During this window
it is possible that the task is already migrated and is running on some
other CPU. These cases are already handled. However, if the task is
migrated and has already been executed and another CPU is now trying to
wake it up (ttwu) such that it is queued again on the runqueue
(on_rq is 1) and also if the task was run by the same CPU, then the
current checks will pass even though the task was migrated out and is no
longer in the pushable tasks list.

Crashes
=======
This bug resulted in quite a few flavors of crashes triggering kernel
panics with various crash signatures such as assert failures, page
faults, null pointer dereferences, and queue corruption errors all
coming from scheduler itself.

Some of the crashes:
-> kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx >= MAX_RT_PRIO)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? pick_next_task_rt+0x6e/0x1d0
   ? do_error_trap+0x64/0xa0
   ? pick_next_task_rt+0x6e/0x1d0
   ? exc_invalid_op+0x4c/0x60
   ? pick_next_task_rt+0x6e/0x1d0
   ? asm_exc_invalid_op+0x12/0x20
   ? pick_next_task_rt+0x6e/0x1d0
   __schedule+0x5cb/0x790
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: kernel NULL pointer dereference, address: 00000000000000c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? __warn+0x8a/0xe0
   ? exc_page_fault+0x3d6/0x520
   ? asm_exc_page_fault+0x1e/0x30
   ? pick_next_task_rt+0xb5/0x1d0
   ? pick_next_task_rt+0x8c/0x1d0
   __schedule+0x583/0x7e0
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: unable to handle page fault for address: ffff9464daea5900
   kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq->cpu != task_cpu(p))

-> kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq->nr_running)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? do_error_trap+0x64/0xa0
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? exc_invalid_op+0x4c/0x60
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? asm_exc_invalid_op+0x12/0x20
   ? dequeue_top_rt_rq+0xa2/0xb0
   dequeue_rt_entity+0x1f/0x70
   dequeue_task_rt+0x2d/0x70
   __schedule+0x1a8/0x7e0
   ? blk_finish_plug+0x25/0x40
   schedule+0x3c/0xb0
   futex_wait_queue_me+0xb6/0x120
   futex_wait+0xd9/0x240
   do_futex+0x344/0xa90
   ? get_mm_exe_file+0x30/0x60
   ? audit_exe_compare+0x58/0x70
   ? audit_filter_rules.constprop.26+0x65e/0x1220
   __x64_sys_futex+0x148/0x1f0
   do_syscall_64+0x30/0x80
   entry_SYSCALL_64_after_hwframe+0x62/0xc7

-> BUG: unable to handle page fault for address: ffff8cf3608bc2c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? spurious_kernel_fault+0x171/0x1c0
   ? exc_page_fault+0x3b6/0x520
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? asm_exc_page_fault+0x1e/0x30
   ? _cond_resched+0x15/0x30
   ? futex_wait_queue_me+0xc8/0x120
   ? futex_wait+0xd9/0x240
   ? try_to_wake_up+0x1b8/0x490
   ? futex_wake+0x78/0x160
   ? do_futex+0xcd/0xa90
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? plist_del+0x6a/0xd0
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? dequeue_pushable_task+0x20/0x70
   ? __schedule+0x382/0x7e0
   ? asm_sysvec_reschedule_ipi+0xa/0x20
   ? schedule+0x3c/0xb0
   ? exit_to_user_mode_prepare+0x9e/0x150
   ? irqentry_exit_to_user_mode+0x5/0x30
   ? asm_sysvec_reschedule_ipi+0x12/0x20

Above are some of the common examples of the crashes that were observed
due to this issue.

Details
=======
Let's look at the following scenario to understand this race.

1) CPU A enters push_rt_task
  a) CPU A has chosen next_task = task p.
  b) CPU A calls find_lock_lowest_rq(Task p, CPU Z’s rq).
  c) CPU A identifies CPU X as a destination CPU (X < Z).
  d) CPU A enters double_lock_balance(CPU Z’s rq, CPU X’s rq).
  e) Since X is lower than Z, CPU A unlocks CPU Z’s rq. Someone else has
     locked CPU X’s rq, and thus, CPU A must wait.

2) At CPU Z
  a) Previous task has completed execution and thus, CPU Z enters
     schedule, locks its own rq after CPU A releases it.
  b) CPU Z dequeues previous task and begins executing task p.
  c) CPU Z unlocks its rq.
  d) Task p yields the CPU (ex. by doing IO or waiting to acquire a
     lock) which triggers the schedule function on CPU Z.
  e) CPU Z enters schedule again, locks its own rq, and dequeues task p.
  f) As part of dequeue, it sets p.on_rq = 0 and unlocks its rq.

3) At CPU B
  a) CPU B enters try_to_wake_up with input task p.
  b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B updates
     B.state = WAKING.
  c) CPU B via select_task_rq determines CPU Y as the target CPU.

4) The race
  a) CPU A acquires CPU X’s lock and relocks CPU Z.
  b) CPU A reads task p.cpu = Z and incorrectly concludes task p is
     still on CPU Z.
  c) CPU A failed to notice task p had been dequeued from CPU Z while
     CPU A was waiting for locks in double_lock_balance. If CPU A knew
     that task p had been dequeued, it would return NULL forcing
     push_rt_task to give up the task p's migration.
  d) CPU B updates task p.cpu = Y and calls ttwu_queue.
  e) CPU B locks Y's rq. CPU B enqueues task p onto Y and sets task
     p.on_rq = 1.
  f) CPU B unlocks CPU Y, triggering memory synchronization.
  g) CPU A reads task p.on_rq = 1, cementing its assumption that task p
     has not migrated.
  h) CPU A decides to migrate p to CPU X.

This leads to A dequeuing p from Y's queue and various crashes down the
line.

Solution
========
The solution here is fairly simple. After obtaining the lock (at 4a),
the check is enhanced to make sure that the task is still at the head of
the pushable tasks list. If not, then it is anyway not suitable for
being pushed out.
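
A toy illustration of that re-validation (not the actual patch; per the
description above, the enhanced check verifies after the locks are re-taken
that the chosen task is still the head of the pushable tasks list):

#include <stdio.h>

struct task { int pid; };

int main(void)
{
        struct task p1 = { .pid = 1 }, p2 = { .pid = 2 };
        struct task *pushable_head = &p1;       /* p1 was picked for pushing */

        /* ... double_lock_balance() dropped and re-took the locks;
         * meanwhile p1 was migrated and woken elsewhere, p2 is now head */
        pushable_head = &p2;

        printf("push p1? %s\n",
               pushable_head == &p1 ? "yes" : "no - give up the push");
        return 0;
}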

Testing
=======
The fix is tested on a cluster of 3 nodes, where the panics due to this
are hit every couple of days. A fix similar to this was deployed on such
cluster and was stable for more than 30 days.

Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Co-developed-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
Signed-off-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
Co-developed-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
Signed-off-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Will Ton <william.ton@nutanix.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250225180553.167995-1-harshit@nutanix.com
2025-04-08 20:55:55 +02:00
Michal Koutný
0ab94c3242 sched: Add annotations to RT_GROUP_SCHED fields
Update comments to ease RT throttling understanding.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-10-mkoutny@suse.com
2025-04-08 20:55:55 +02:00
Michal Koutný
87f1fb77d8 sched: Add RT_GROUP WARN checks for non-root task_groups
With CONFIG_RT_GROUP_SCHED but runtime disabling of RT_GROUPs we expect
the existence of the root task_group only and all rt_sched_entities
should be queued on root's rt_rq.

If we get a non-root RT_GROUP something went wrong.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-9-mkoutny@suse.com
2025-04-08 20:55:54 +02:00
Michal Koutný
d6809c2f60 sched: Do not construct nor expose RT_GROUP_SCHED structures if disabled
Thanks to kernel cmdline being available early, before any
cgroup hierarchy exists, we can achieve the RT_GROUP_SCHED boottime
disabling goal by simply skipping any creation (and destruction) of
RT_GROUP data and its exposure via RT attributes.

We can do this thanks to previously placed runtime guards that would
redirect all operations to root_task_group's data when RT_GROUP_SCHED
is disabled.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-8-mkoutny@suse.com
2025-04-08 20:55:54 +02:00
Michal Koutný
277e090975 sched: Bypass bandwitdh checks with runtime disabled RT_GROUP_SCHED
When RT_GROUPs are compiled but not exposed, their bandwidth cannot
be configured (and it is not initialized for non-root task_groups either).
Therefore, bypass any checks of task vs task_group bandwidth.

This will achieve behavior very similar to setups that have
!CONFIG_RT_GROUP_SCHED and attach cpu controller to cgroup v2 hierarchy.
(On a related note, this may allow having RT tasks with
CONFIG_RT_GROUP_SCHED and cgroup v2 hierarchy.)

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-7-mkoutny@suse.com
2025-04-08 20:55:54 +02:00
Michal Koutný
61d3164fec sched: Skip non-root task_groups with disabled RT_GROUP_SCHED
First, we want to prevent placement of RT tasks on non-root rt_rqs which
we achieve in the task migration code that'd fall back to
root_task_group's rt_rq.

Second, we want to work with only root_task_group's rt_rq when iterating
all "real" rt_rqs when RT_GROUP is disabled. To achieve this we keep
root_task_group as the first one on the task_groups and break out
quickly.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-6-mkoutny@suse.com
2025-04-08 20:55:53 +02:00
Michal Koutný
e34e0131fe sched: Add commadline option for RT_GROUP_SCHED toggling
Only a simple implementation with a static key wrapper; it will be wired
in later.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-5-mkoutny@suse.com
2025-04-08 20:55:53 +02:00
Michal Koutný
a5a25b32c0 sched: Always initialize rt_rq's task_group
rt_rq->tg may be NULL, which denotes the root task_group.
Store the pointer to root_task_group directly so that callers may use
rt_rq->tg homogeneously.

root_task_group exists always with CONFIG_CGROUPS_SCHED,
CONFIG_RT_GROUP_SCHED depends on that.

This changes root level rt_rq's default limit from infinity to the
value of (originally) global RT throttling.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-4-mkoutny@suse.com
2025-04-08 20:55:53 +02:00
Michal Koutný
e285313f08 sched: Remove unneeed macro wrap
rt_entity_is_task has split definitions based on CONFIG_RT_GROUP_SCHED,
therefore we can use it always. No functional change intended.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-3-mkoutny@suse.com
2025-04-08 20:55:53 +02:00
Michal Koutný
433bce5dad sched: Convert CONFIG_RT_GROUP_SCHED macros to code conditions
Convert the blocks guarded by macros to regular code so that the RT
group code gets more compile validation. Reasoning is in
Documentation/process/coding-style.rst 21) Conditional Compilation.
With that, no functional change is expected.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250310170442.504716-2-mkoutny@suse.com
2025-04-08 20:55:52 +02:00
Pierre Gondois
f2d650618b sched/fair: Allow decaying util_est when util_avg > CPU capa
commit 10a35e6812 ("sched/pelt: Skip updating util_est when
utilization is higher than CPU's capacity")
prevents util_est from being updated if util_avg is higher than the
underlying CPU capacity to avoid overestimating the task when the CPU
is capped (due to thermal issue for instance). In this scenario, the
task will miss its deadlines and start overlapping its wake-up events
for instance. The task will appear as always running when the CPU is
just not powerful enough to allow having a good estimation of the
task.

commit b8c9636140 ("sched/fair/util_est: Implement faster ramp-up
EWMA on utilization increases")
sets ewma to util_avg when ewma > util_avg, allowing ewma to quickly
grow instead of slowly converge to the new util_avg value when a task
profile changes from small to big.

However, the 2 conditions:
- Check util_avg against max CPU capacity
- Check whether util_est > util_avg
are placed in an order such that it is possible to set util_est to a
value higher than the CPU capacity if util_est > util_avg, but
util_est is prevented from decaying as long as:
CPU capacity < util_avg < util_est.

Just remove the check as either:
1.
There is idle time on the CPU. In that case the util_avg value of the
task is actually correct. It is possible that the task missed a
deadline and appears bigger, but this is also the case when the
util_avg of the task is lower than the maximum CPU capacity.
2.
There is no idle time. In that case, the util_avg value might as well
be an underestimation of the size of the task.
It is possible that undesired frequency spikes will appear when the
task is later enqueued with an inflated util_est value, but the
frequency spike might as well be deserved. The absence of idle time
prevents drawing any conclusion.
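
Toy numbers illustrating the stuck case (the decay formula below is a
stand-in, not the real util_est EWMA math):

#include <stdio.h>

static unsigned int old_update(unsigned int est, unsigned int avg,
                               unsigned int cap)
{
        if (avg > cap)                          /* skip update when capped */
                return est;
        return avg < est ? (est + avg) / 2 : avg;
}

static unsigned int new_update(unsigned int est, unsigned int avg)
{
        return avg < est ? (est + avg) / 2 : avg;       /* always may decay */
}

int main(void)
{
        unsigned int cap = 300, avg = 400, est = 800;

        /* cap < avg < est: the old check keeps est stuck at 800 forever */
        printf("old=%u new=%u\n", old_update(est, avg, cap),
               new_update(est, avg));
        return 0;
}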

Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250325150542.1077344-1-pierre.gondois@arm.com
2025-04-08 20:55:52 +02:00
Steve Wahl
ce29a7da84 sched/topology: Refinement to topology_span_sane speedup
Simplify the topology_span_sane code further, removing the need to
allocate an array and the gotos used to make sure the array gets freed.

This version is in a separate commit because it could return a
different sanity result than the previous code, but only in odd
circumstances that are not expected to actually occur; for example,
when a CPU is not listed in its own mask.

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304160844.75373-3-steve.wahl@hpe.com
2025-04-08 20:55:52 +02:00
Steve Wahl
f55dac1daf sched/topology: improve topology_span_sane speed
Use a different approach to topology_span_sane(), that checks for the
same constraint of no partial overlaps for any two CPU sets for
non-NUMA topology levels, but does so in a way that is O(N) rather
than O(N^2).

Instead of comparing with all other masks to detect collisions, keep
one mask that includes all CPUs seen so far and detect collisions with
a single cpumask_intersects test.

If the current mask has no collisions with previously seen masks, it
should be a new mask, which can be uniquely identified by the lowest
bit set in this mask.  Keep a pointer to this mask for future
reference (in an array indexed by the lowest bit set), and add the
CPUs in this mask to the list of those seen.

If the current mask does collide with previously seen masks, it should
be exactly equal to a mask seen before, looked up in the same array
indexed by the lowest bit set in the mask, a single comparison.

Move the topology_span_sane() check out of the existing topology level
loop, let it use its own loop so that the array allocation can be done
only once, shared across levels.

On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
the average time to take one processor offline is reduced from 2.18
seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
34m49.765s without this change, 16m10.038s with this change in place.)
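
A toy user-space model of the O(N) scheme described above, with uint64_t
bitmasks standing in for struct cpumask (illustrative only):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool span_sane(const uint64_t *masks, int n)
{
        const uint64_t *id_seen[64] = { 0 };    /* indexed by lowest set bit */
        uint64_t seen = 0;

        for (int i = 0; i < n; i++) {
                uint64_t m = masks[i];
                int first = __builtin_ctzll(m); /* lowest CPU in the mask */

                if (!(m & seen)) {
                        /* new group: remember it and mark its CPUs as seen */
                        id_seen[first] = &masks[i];
                        seen |= m;
                } else if (!id_seen[first] || *id_seen[first] != m) {
                        /* partial overlap with a previously seen group */
                        return false;
                }
        }
        return true;
}

int main(void)
{
        uint64_t ok[]  = { 0x0f, 0x0f, 0xf0, 0xf0 };    /* two disjoint groups */
        uint64_t bad[] = { 0x0f, 0x3c };                /* partial overlap */

        printf("ok: %d, bad: %d\n", span_sane(ok, 4), span_sane(bad, 2));
        return 0;
}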

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304160844.75373-2-steve.wahl@hpe.com
2025-04-08 20:55:51 +02:00
Peter Zijlstra
8feb053d53 sched: Fix trace_sched_switch(.prev_state)
Gabriele noted that in case of signal_pending_state(), the tracepoint
sees a stale task-state.

Fixes: fa2c3254d7 ("sched/tracing: Don't re-read p->state when emitting sched_switch event")
Reported-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
2025-04-08 20:55:51 +02:00
Tejun Heo
bc08b15b54 sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation
SCX_OPS_HAS_CGROUP_WEIGHT was only used to suppress the missing cgroup
weight support warnings. Now that the warnings are removed, the flag doesn't
do anything. Mark it for deprecation and remove its usage from scx_flatcg.

v2: Actually include the scx_flatcg update.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-04-08 08:53:52 -10:00
Tejun Heo
e776b26e37 sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings
sched_ext generates warnings when cpu.weight / cpu.idle are set to
non-default values if the BPF scheduler doesn't implement weight support.
These warnings don't provide much value while adding constant annoyance. A
BPF scheduler may not implement any particular behavior and there's nothing
particularly special about missing cgroup weight support. Drop the warnings.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08 08:00:49 -10:00
Breno Leitao
47068309b5 sched_ext: Use kvzalloc for large exit_dump allocation
Replace kzalloc with kvzalloc for the exit_dump buffer allocation, which
can require large contiguous memory depending on the implementation.
This change prevents allocation failures by allowing the system to fall
back to vmalloc when contiguous memory allocation fails.

Since this buffer is only used for debugging purposes, physical memory
contiguity is not required, making vmalloc a suitable alternative.

Cc: stable@vger.kernel.org
Fixes: 07814a9439 ("sched_ext: Print debug dump after an error exit")
Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08 07:53:27 -10:00
Andrea Righi
683d2d0fab sched_ext: idle: Introduce scx_bpf_select_cpu_and()
Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply
the built-in idle CPU selection policy to a subset of allowed CPUs.

This new helper is basically an extension of scx_bpf_select_cpu_dfl().
However, when an idle CPU can't be found, it returns a negative value
instead of @prev_cpu, aligning its behavior more closely with
scx_bpf_pick_idle_cpu().

It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce
strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to
request only a full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying
the built-in selection logic.

With this helper, BPF schedulers can apply the built-in idle CPU
selection policy restricted to any arbitrary subset of CPUs.

Example usage
=============

Possible usage in ops.select_cpu():

s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
	s32 cpu;

	cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
	if (cpu >= 0) {
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
		return cpu;
	}

	return prev_cpu;
}

Results
=======

Load distribution on a 4 sockets, 4 cores per socket system, simulated
using virtme-ng, running a modified version of scx_bpfland that uses
scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs:

 $ vng --cpu 16,sockets=4,cores=4,threads=1
 ...
 $ stress-ng -c 16
 ...
 $ htop
 ...
   0[                         0.0%]   8[||||||||||||||||||||||||100.0%]
   1[                         0.0%]   9[||||||||||||||||||||||||100.0%]
   2[                         0.0%]  10[||||||||||||||||||||||||100.0%]
   3[                         0.0%]  11[||||||||||||||||||||||||100.0%]
   4[                         0.0%]  12[||||||||||||||||||||||||100.0%]
   5[                         0.0%]  13[||||||||||||||||||||||||100.0%]
   6[                         0.0%]  14[||||||||||||||||||||||||100.0%]
   7[                         0.0%]  15[||||||||||||||||||||||||100.0%]

With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across
all the available CPUs.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07 07:13:52 -10:00
Andrea Righi
c2d8b2a57c sched_ext: idle: Accept an arbitrary cpumask in scx_select_cpu_dfl()
Many scx schedulers implement their own hard or soft-affinity rules
to support topology characteristics, such as heterogeneous architectures
(e.g., big.LITTLE, P-cores/E-cores), or to categorize tasks based on
specific properties (e.g., running certain tasks only in a subset of
CPUs).

Currently, there is no mechanism that allows applying the built-in idle
CPU selection policy to an arbitrary subset of CPUs. As a result,
schedulers often implement their own idle CPU selection policies, which
are typically similar to one another, leading to a lot of code
duplication.

To address this, modify scx_select_cpu_dfl() to accept an arbitrary
cpumask that can be used by BPF schedulers to apply the existing
built-in idle CPU selection policy to a subset of allowed CPUs.

With this concept the idle CPU selection policy becomes the following:
 - always prioritize CPUs from fully idle SMT cores (if SMT is enabled),
 - select the same CPU if it's idle and in the allowed CPUs,
 - select an idle CPU within the same LLC, if the LLC cpumask is a
   subset of the allowed CPUs,
 - select an idle CPU within the same node, if the node cpumask is a
   subset of the allowed CPUs,
 - select an idle CPU within the allowed CPUs.

This functionality will be exposed through a dedicated kfunc in a
separate patch.
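
A toy bitmask model of that cascade (illustrative only; it ignores the
SMT-core preference and the real cpumask and idle-tracking machinery):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool subset(uint64_t a, uint64_t b)      /* a is contained in b */
{
        return (a & ~b) == 0;
}

static int pick_idle(uint64_t idle, uint64_t allowed, uint64_t llc,
                     uint64_t node, int prev_cpu)
{
        uint64_t scopes[] = { 1ULL << prev_cpu, llc, node, ~0ULL };

        for (int i = 0; i < 4; i++) {
                uint64_t scope = scopes[i] & allowed;

                /* use the LLC/node scope only if it is a subset of allowed */
                if (i == 1 && !subset(llc, allowed))
                        continue;
                if (i == 2 && !subset(node, allowed))
                        continue;
                if (scope & idle)
                        return __builtin_ctzll(scope & idle);
        }
        return -1;
}

int main(void)
{
        uint64_t idle = 0xf0, allowed = 0xff, llc = 0x0f, node = 0xff;

        printf("picked CPU %d\n", pick_idle(idle, allowed, llc, node, 2));
        return 0;
}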

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07 07:13:52 -10:00
Andrea Righi
23c63a9652 sched_ext: idle: Explicitly pass allowed cpumask to scx_select_cpu_dfl()
Modify scx_select_cpu_dfl() to take the allowed cpumask as an explicit
argument, instead of implicitly using @p->cpus_ptr.

This prepares for future changes where arbitrary cpumasks may be passed
to the built-in idle CPU selection policy.

This is a pure refactoring with no functional changes.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07 07:13:52 -10:00
Andrea Righi
29f512f555 sched_ext: idle: Extend topology optimizations to all tasks
The built-in idle selection policy, scx_select_cpu_dfl(), always
prioritizes picking idle CPUs within the same LLC or NUMA node, but
these optimizations are currently applied only when a task has no CPU
affinity constraints.

This is done primarily for efficiency, as it avoids the overhead of
updating a cpumask every time we need to select an idle CPU (which can
be costly in large SMP systems).

However, this approach limits the effectiveness of the built-in idle
policy and results in inconsistent behavior, as affinity-restricted
tasks don't benefit from topology-aware optimizations.

To address this, modify the policy to apply LLC and NUMA-aware
optimizations even when a task is constrained to a subset of CPUs.

We can still avoid updating the cpumasks by checking whether the LLC and
node CPU subsets are contained in the subset of allowed CPUs usable by
the task (which is true in most cases - for tasks that don't have
affinity constraints).

Moreover, use temporary local per-CPU cpumasks to determine the LLC and
node subsets, minimizing potential overhead even on large SMP systems.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07 07:13:52 -10:00
Linus Torvalds
302deb109d Miscellaneous scheduler fixes/updates:
- Fix a nonsensical Kconfig combination
  - Remove an unnecessary rseq-notification
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'sched-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:

 - Fix a nonsensical Kconfig combination

 - Remove an unnecessary rseq-notification

* tag 'sched-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rseq: Eliminate useless task_work on execve
  sched/isolation: Make CONFIG_CPU_ISOLATION depend on CONFIG_SMP
2025-04-06 10:44:58 -07:00
Thomas Gleixner
8fa7292fee treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.
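
For a typical driver the conversion is purely mechanical, e.g. (hypothetical
example; &priv->poll_timer is an illustrative name):

  /* Before */
  del_timer(&priv->poll_timer);
  del_timer_sync(&priv->poll_timer);

  /* After */
  timer_delete(&priv->poll_timer);
  timer_delete_sync(&priv->poll_timer);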

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-05 10:30:12 +02:00
Tejun Heo
29b49be6c9 sched_ext: Drop "ops" from SCX_OPS_TASK_ITER_BATCH
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from SCX_OPS_TASK_ITER_BATCH.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 10:10:39 -10:00
Tejun Heo
1a2469403e sched_ext: Drop "ops" from scx_ops_{init|exit|enable|disable}[_task]() and friends
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_{init|exit|enable|disable}[_task]() and friends.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:49 -10:00
Tejun Heo
c5f22258f5 sched_ext: Drop "ops" from scx_ops_exit(), scx_ops_error() and friends
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_exit(), scx_ops_error() and friends.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:49 -10:00
Tejun Heo
8c6ee86246 sched_ext: Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends. Update
scx_show_state.py accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:49 -10:00
Tejun Heo
a50c365f99 sched_ext: Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled.
Update scx_show_state.py accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:48 -10:00
Tejun Heo
1a7ff7216c sched_ext: Drop "ops" from scx_ops_enable_state and friends
The tag "ops" is used for two different purposes. First, to indicate that
the entity is directly related to the operations such as flags carried in
sched_ext_ops. Second, to indicate that the entity applies to something
global such as enable or bypass states. The second usage is historical and
causes confusion rather than clarifying anything. For example,
scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with
scx_ops_flags. Let's drop the second usages.

Drop "ops" from scx_ops_enable_state and friends. Update scx_show_state.py
accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04 08:52:48 -10:00
Linus Torvalds
ea59cb7423 sched_ext: Fixes for v6.15-rc0
- Calling scx_bpf_create_dsq() with the same ID would succeed, creating
   duplicate DSQs. Fix it to return -EEXIST.
 
 - scx_select_cpu_dfl() fixes and cleanups.
 
 - Synchronize tools/sched_ext with the external scheduler repo. While this
   isn't a fix, there's no risk to the kernel and it's better if they stay
   more closely in sync.

Merge tag 'sched_ext-for-6.15-rc0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Calling scx_bpf_create_dsq() with the same ID would succeed, creating
   duplicate DSQs. Fix it to return -EEXIST.

 - scx_select_cpu_dfl() fixes and cleanups.

 - Synchronize tools/sched_ext with the external scheduler repo. While this
   isn't a fix, there's no risk to the kernel and it's better if they
   stay more closely in sync.

* tag 'sched_ext-for-6.15-rc0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  tools/sched_ext: Sync with scx repo
  sched_ext: initialize built-in idle state before ops.init()
  sched_ext: create_dsq: Return -EEXIST on duplicate request
  sched_ext: Remove a meaningless conditional goto in scx_select_cpu_dfl()
  sched_ext: idle: Fix return code of scx_select_cpu_dfl()
2025-04-03 10:03:38 -07:00
Mathieu Desnoyers
169eae7711 rseq: Eliminate useless task_work on execve
Eliminate a useless task_work on execve by moving the call to
rseq_set_notify_resume() from sched_mm_cid_after_execve() to the error
path of bprm_execve().

The call to rseq_set_notify_resume() from sched_mm_cid_after_execve() is
pointless in the success case, because rseq_execve() will clear the rseq
pointer before returning to userspace.

sched_mm_cid_after_execve() is called from both the success and error
paths of bprm_execve(). The call to rseq_set_notify_resume() is needed
on error because the mm_cid may have changed.

Also move the rseq_execve() to right after sched_mm_cid_after_execve()
in bprm_execve().
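
The resulting flow is roughly the following (simplified sketch, not the
actual fs/exec.c code; exec_binprm() stands in for the real exec work):

  static int bprm_execve_sketch(struct linux_binprm *bprm)
  {
          int ret = exec_binprm(bprm);            /* placeholder for the exec work */

          sched_mm_cid_after_execve(current);
          if (!ret) {
                  rseq_execve(current);           /* success: rseq pointer is cleared */
                  return 0;
          }
          /* error: the mm_cid may have changed, so notify-resume is still needed */
          rseq_set_notify_resume(current);
          return ret;
  }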

[ mingo: Merged to a recent upstream kernel, extended the changelog. ]

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250327132945.1558783-1-mathieu.desnoyers@efficios.com
2025-04-03 13:10:47 +02:00
Linus Torvalds
88221ac0d5 Latency tracing changes for v6.15:
- Add some trace events to osnoise and timerlat sample generation
 
   This adds more information to the osnoise and timerlat tracers as well as
   allows BPF programs to be attached to these locations to extract even more
   data.
 
 - Fix to DECLARE_TRACE_CONDITION() macro
 
   It wasn't used before but now will be, and it happened to be broken,
   causing the build to fail.
 
 - Add scheduler specification monitors to runtime verifier (RV)
 
   This is a continuation of Daniel Bristot's work.
 
   RV allows monitors to run and react concurrently. Running the cumulative
   model is equivalent to running single components using the same
   reactors, with the advantage that it's easier to point out which
   specification failed in case of error.
 
   This update introduces nested monitors to RV. In short, the sysfs
   monitor folder will contain a monitor named sched, which is nothing but
   an empty container for other monitors. Controlling the sched monitor
   (enable, disable, set reactors) controls all nested monitors.
 
   The following scheduling monitors are added:
 
   * sco: scheduling context operations
       Monitor to ensure sched_set_state happens only in thread context
   * tss: task switch while scheduling
       Monitor to ensure sched_switch happens only in scheduling context
   * snroc: set non runnable on its own context
       Monitor to ensure set_state happens only in the respective task's context
   * scpd: schedule called with preemption disabled
       Monitor to ensure schedule is called with preemption disabled
   * snep: schedule does not enable preempt
       Monitor to ensure schedule does not enable preempt
   * sncid: schedule not called with interrupt disabled
       Monitor to ensure schedule is not called with interrupt disabled

Merge tag 'trace-latency-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull latency tracing updates from Steven Rostedt:

 - Add some trace events to osnoise and timerlat sample generation

   This adds more information to the osnoise and timerlat tracers as
   well as allows BPF programs to be attached to these locations to
   extract even more data.

 - Fix to DECLARE_TRACE_CONDITION() macro

   It wasn't used before but now will be, and it happened to be broken,
   causing the build to fail.

 - Add scheduler specification monitors to runtime verifier (RV)

   This is a continuation of Daniel Bristot's work.

   RV allows monitors to run and react concurrently. Running the
   cumulative model is equivalent to running single components using the
   same reactors, with the advantage that it's easier to point out which
   specification failed in case of error.

   This update introduces nested monitors to RV. In short, the sysfs
   monitor folder will contain a monitor named sched, which is nothing
   but an empty container for other monitors. Controlling the sched
   monitor (enable, disable, set reactors) controls all nested monitors.

   The following scheduling monitors are added:

     - sco: scheduling context operations
       Monitor to ensure sched_set_state happens only in thread context

     - tss: task switch while scheduling
       Monitor to ensure sched_switch happens only in scheduling context

     - snroc: set non runnable on its own context
       Monitor to ensure set_state happens only in the respective task's context

     - scpd: schedule called with preemption disabled
       Monitor to ensure schedule is called with preemption disabled

     - snep: schedule does not enable preempt
       Monitor to ensure schedule does not enable preempt

     - sncid: schedule not called with interrupt disabled
       Monitor to ensure schedule is not called with interrupt disabled

* tag 'trace-latency-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tools/rv: Allow rv list to filter for container
  Documentation/rv: Add docs for the sched monitors
  verification/dot2k: Add support for nested monitors
  tools/rv: Add support for nested monitors
  rv: Add scpd, snep and sncid per-cpu monitors
  rv: Add snroc per-task monitor
  rv: Add sco and tss per-cpu monitors
  rv: Add option for nested monitors and include sched
  sched: Add sched tracepoints for RV task model
  rv: Add license identifiers to monitor files
  tracing: Fix DECLARE_TRACE_CONDITION
  trace/osnoise: Add trace events for samples
2025-03-27 16:03:52 -07:00
Andrea Righi
f0c6eab5e4 sched_ext: initialize built-in idle state before ops.init()
A BPF scheduler may want to use the built-in idle cpumasks in ops.init()
before the scheduler is fully initialized, either directly or through a
BPF timer for example.

However, this would result in an error, since the idle state has not
been properly initialized yet.

This can be easily verified by modifying scx_simple to call
scx_bpf_get_idle_cpumask() in ops.init():

$ sudo scx_simple

DEBUG DUMP
===========================================================================

scx_simple[121] triggered exit kind 1024:
  runtime error (built-in idle tracking is disabled)
...

Fix this by properly initializing the idle state before ops.init() is
called. With this change applied:

$ sudo scx_simple
local=2 global=0
local=19 global=11
local=23 global=11
...
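
The ops.init() modification used for the reproduction above looks roughly
like this (hypothetical sketch, not the exact scx_simple patch):

  s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
  {
          const struct cpumask *idle;

          /* Before this fix, querying the built-in idle state here tripped
           * the "built-in idle tracking is disabled" error shown above. */
          idle = scx_bpf_get_idle_cpumask();
          scx_bpf_put_idle_cpumask(idle);
          return 0;
  }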

Fixes: d73249f887 ("sched_ext: idle: Make idle static keys private")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-26 10:47:50 -10:00
Jake Hillion
a8897ed852 sched_ext: create_dsq: Return -EEXIST on duplicate request
create_dsq() and therefore the scx_bpf_create_dsq() kfunc currently silently
ignore duplicate entries. As a sched_ext scheduler creates each DSQ
for a different purpose, this is surprising behaviour.

Replace rhashtable_insert_fast which ignores duplicates with
rhashtable_lookup_insert_fast that reports duplicates (though doesn't
return their value). The rest of the code is structured correctly and
this now returns -EEXIST.
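
The resulting insert path then looks roughly like this (simplified; the
names follow the surrounding sched_ext code but the snippet is
illustrative, not the exact patch):

  ret = rhashtable_lookup_insert_fast(&dsq_hash, &dsq->hash_node,
                                      dsq_hash_params);
  if (ret) {
          /* -EEXIST if a DSQ with this ID already exists */
          kfree(dsq);
          return ret;
  }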

Tested by adding an extra scx_bpf_create_dsq() call to scx_simple. Previously
this was ignored; now init fails with -17 (-EEXIST). Also ran scx_lavd,
which continued to work well.

Signed-off-by: Jake Hillion <jake@hillion.co.uk>
Acked-by: Andrea Righi <arighi@nvidia.com>
Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-26 10:30:36 -10:00
Tejun Heo
883cc35e9f sched_ext: Remove a meaningless conditional goto in scx_select_cpu_dfl()
scx_select_cpu_dfl() has a meaningless conditional goto at the end. Remove
it. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <arighi@nvidia.com>
2025-03-26 10:09:47 -10:00
Andrea Righi
37477d9eca sched_ext: idle: Fix return code of scx_select_cpu_dfl()
Return -EBUSY when using %SCX_PICK_IDLE_CORE with scx_select_cpu_dfl()
if a fully idle SMT core cannot be found, instead of falling back to
@prev_cpu, which is not a fully idle SMT core in this case.
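
In other words, the tail of the selection logic now behaves roughly as
follows (illustrative pseudologic; pick_any_idle_cpu() is a placeholder,
not the actual helper):

  cpu = pick_any_idle_cpu(p->cpus_ptr, flags);
  if (cpu >= 0)
          return cpu;
  if (flags & SCX_PICK_IDLE_CORE)
          return -EBUSY;  /* no fully idle SMT core: don't pretend success */
  return prev_cpu;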

Fixes: c414c2171c ("sched_ext: idle: Honor idle flags in the built-in idle selection policy")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-26 10:08:01 -10:00
Linus Torvalds
a50b4fe095 A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
   the callback pointer. This turned out to be suboptimal for the upcoming
   Rust integration and is obviously a silly implementation to begin with.
 
   This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
   with hrtimer_setup(T, cb);
 
   The conversion was done with Coccinelle and a few manual fixups.
 
   Once the conversion has completely landed in mainline, hrtimer_init()
   will be removed and the hrtimer::function becomes a private member.

Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"
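
For reference, a typical conversion looks like this (hypothetical example;
dev->watchdog and watchdog_timeout are illustrative names):

  /* Before: two-step initialization */
  hrtimer_init(&dev->watchdog, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
  dev->watchdog.function = watchdog_timeout;

  /* After: hrtimer_setup() takes the callback directly */
  hrtimer_setup(&dev->watchdog, watchdog_timeout, CLOCK_MONOTONIC,
                HRTIMER_MODE_REL);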

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
Linus Torvalds
32b22538be Scheduler updates for v6.15:
[ Merge note, these two commits are identical:
 
    - f3fa0e40df ("sched/clock: Don't define sched_clock_irqtime as static key")
    - b9f2b29b94 ("sched: Don't define sched_clock_irqtime as static key")
 
   The first one is a cherry-picked version of the second, and the first one
   is already upstream. ]
 
 Core & fair scheduler changes:
 
   - Cancel the slice protection of the idle entity (Zihan Zhou)
   - Reduce the default slice to avoid tasks getting an extra tick
     (Zihan Zhou)
   - Force propagating min_slice of cfs_rq when {en,de}queue tasks
     (Tianchen Ding)
   - Refactor can_migrate_task() to eliminate looping (I Hsin Cheng)
   - Add unlikely branch hints to several system calls (Colin Ian King)
   - Optimize current_clr_polling() on certain architectures (Yujun Dong)
 
 Deadline scheduler: (Juri Lelli)
 
   - Remove redundant dl_clear_root_domain call
   - Move dl_rebuild_rd_accounting to cpuset.h
 
 Uclamp:
 
   - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan)
   - Optimize sched_uclamp_used static key enabling (Xuewen Yan)
 
 Scheduler topology support: (Juri Lelli)
 
   - Ignore special tasks when rebuilding domains
   - Add wrappers for sched_domains_mutex
   - Generalize unique visiting of root domains
   - Rebuild root domain accounting after every update
   - Remove partition_and_rebuild_sched_domains
   - Stop exposing partition_sched_domains_locked
 
 RSEQ: (Michael Jeanson)
 
   - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
   - Fix segfault on registration when rseq_cs is non-zero
   - selftests: Add rseq syscall errors test
   - selftests: Ensure the rseq ABI TLS is actually 1024 bytes
 
 Membarriers:
 
   - Fix redundant load of membarrier_state (Nysal Jan K.A.)
 
 Scheduler debugging:
 
   - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
   - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)
 
 Fixes and cleanups:
 
   - Always save/restore x86 TSC sched_clock() on suspend/resume
    (Guilherme G. Piccoli)
 
   - Misc fixes and cleanups (Thorsten Blum, Juri Lelli,
     Sebastian Andrzej Siewior)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core & fair scheduler changes:

   - Cancel the slice protection of the idle entity (Zihan Zhou)
   - Reduce the default slice to avoid tasks getting an extra tick
     (Zihan Zhou)
   - Force propagating min_slice of cfs_rq when {en,de}queue tasks
     (Tianchen Ding)
   - Refactor can_migrate_task() to eliminate looping (I Hsin Cheng)
   - Add unlikely branch hints to several system calls (Colin Ian King)
   - Optimize current_clr_polling() on certain architectures (Yujun
     Dong)

  Deadline scheduler: (Juri Lelli)
   - Remove redundant dl_clear_root_domain call
   - Move dl_rebuild_rd_accounting to cpuset.h

  Uclamp:
   - Use the uclamp_is_used() helper instead of open-coding it (Xuewen
     Yan)
   - Optimize sched_uclamp_used static key enabling (Xuewen Yan)

  Scheduler topology support: (Juri Lelli)
   - Ignore special tasks when rebuilding domains
   - Add wrappers for sched_domains_mutex
   - Generalize unique visiting of root domains
   - Rebuild root domain accounting after every update
   - Remove partition_and_rebuild_sched_domains
   - Stop exposing partition_sched_domains_locked

  RSEQ: (Michael Jeanson)
   - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
   - Fix segfault on registration when rseq_cs is non-zero
   - selftests: Add rseq syscall errors test
   - selftests: Ensure the rseq ABI TLS is actually 1024 bytes

  Membarriers:
   - Fix redundant load of membarrier_state (Nysal Jan K.A.)

  Scheduler debugging:
   - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
   - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)

  Fixes and cleanups:
   - Always save/restore x86 TSC sched_clock() on suspend/resume
     (Guilherme G. Piccoli)
   - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian
     Andrzej Siewior)"

* tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling()
  sched/debug: Remove CONFIG_SCHED_DEBUG
  sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files
  sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation
  sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
  sched/debug: Make 'const_debug' tunables unconditional __read_mostly
  sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
  rseq/selftests: Fix namespace collision with rseq UAPI header
  include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h
  sched/topology: Stop exposing partition_sched_domains_locked
  cgroup/cpuset: Remove partition_and_rebuild_sched_domains
  sched/topology: Remove redundant dl_clear_root_domain call
  sched/deadline: Rebuild root domain accounting after every update
  sched/deadline: Generalize unique visiting of root domains
  sched/topology: Wrappers for sched_domains_mutex
  sched/deadline: Ignore special tasks when rebuilding domains
  tracing: Use preempt_model_str()
  xtensa: Rely on generic printing of preemption model
  x86: Rely on generic printing of preemption model
  s390: Rely on generic printing of preemption model
  ...
2025-03-24 21:28:12 -07:00
Linus Torvalds
5a658afd46 Objtool changes for v6.15:
- The biggest change is the new option to automatically fail
    the build on objtool warnings: CONFIG_OBJTOOL_WERROR.
 
    While there are no currently known unfixed false positives
    left, such an expansion in the severity of objtool warnings
    inevitably creates a risk of build failures, so it's disabled by
    default and depends on !COMPILE_TEST, so it shouldn't be enabled
    on allyesconfig/allmodconfig builds and won't be forced on people
    who just accept build-time defaults in 'make oldconfig'.
 
    While the option is strongly recommended, only people who enable
    it explicitly should see it.
 
    (Josh Poimboeuf)
 
  - Disable branch profiling in noinstr code with a broad
    brush that includes all of arch/x86/ and kernel/sched/. (Josh Poimboeuf)
 
  - Create backup object files on objtool errors and print exact
    objtool arguments to make failure analysis easier (Josh Poimboeuf)
 
  - Improve noreturn handling (Josh Poimboeuf)
 
  - Improve rodata handling (Tiezhu Yang)
 
  - Support jump tables, switch tables and goto tables on LoongArch (Tiezhu Yang)
 
  - Misc cleanups and fixes (Josh Poimboeuf, David Engraf, Ingo Molnar)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'objtool-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:

 - The biggest change is the new option to automatically fail the build
   on objtool warnings: CONFIG_OBJTOOL_WERROR.

   While there are no currently known unfixed false positives left, such
   an expansion in the severity of objtool warnings inevitably creates a
   risk of build failures, so it's disabled by default and depends on
   !COMPILE_TEST, so it shouldn't be enabled on
   allyesconfig/allmodconfig builds and won't be forced on people who
   just accept build-time defaults in 'make oldconfig'.

   While the option is strongly recommended, only people who enable it
   explicitly should see it.

   (Josh Poimboeuf)

 - Disable branch profiling in noinstr code with a broad brush that
   includes all of arch/x86/ and kernel/sched/. (Josh Poimboeuf)

 - Create backup object files on objtool errors and print exact objtool
   arguments to make failure analysis easier (Josh Poimboeuf)

 - Improve noreturn handling (Josh Poimboeuf)

 - Improve rodata handling (Tiezhu Yang)

 - Support jump tables, switch tables and goto tables on LoongArch
   (Tiezhu Yang)

 - Misc cleanups and fixes (Josh Poimboeuf, David Engraf, Ingo Molnar)

* tag 'objtool-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  tracing: Disable branch profiling in noinstr code
  objtool: Use O_CREAT with explicit mode mask
  objtool: Add CONFIG_OBJTOOL_WERROR
  objtool: Create backup on error and print args
  objtool: Change "warning:" to "error:" for --Werror
  objtool: Add --Werror option
  objtool: Add --output option
  objtool: Upgrade "Linked object detected" warning to error
  objtool: Consolidate option validation
  objtool: Remove --unret dependency on --rethunk
  objtool: Increase per-function WARN_FUNC() rate limit
  objtool: Update documentation
  objtool: Improve __noreturn annotation warning
  objtool: Fix error handling inconsistencies in check()
  x86/traps: Make exc_double_fault() consistently noreturn
  LoongArch: Enable jump table for objtool
  objtool/LoongArch: Add support for goto table
  objtool/LoongArch: Add support for switch table
  objtool: Handle PC relative relocation type
  objtool: Handle different entry size of rodata
  ...
2025-03-24 21:18:05 -07:00
Linus Torvalds
3ba7dfb8da RCU pull request for v6.15
This pull request contains the following branches:
 
 docs.2025.02.04a:
  - Add broken-timing possibility to stallwarn.rst.
  - Improve discussion of this_cpu_ptr(), add raw_cpu_ptr().
  - Document self-propagating callbacks.
  - Point call_srcu() to call_rcu() for detailed memory ordering.
  - Add CONFIG_RCU_LAZY delays to call_rcu() kernel-doc header.
  - Clarify RCU_LAZY and RCU_LAZY_DEFAULT_OFF help text.
  - Remove references to old grace-period-wait primitives.
 
 srcu.2025.02.05a:
  - Introduce srcu_read_{un,}lock_fast(), which is similar to
    srcu_read_{un,}lock_lite(): avoid smp_mb()s in lock and unlock at the
    cost of calling synchronize_rcu() in synchronize_srcu(). Moreover, by
    returning the percpu offset of the counter at srcu_read_lock_fast()
    time, srcu_read_unlock_fast() can save extra pointer dereferencing,
    which makes it faster than srcu_read_{un,}lock_lite().
    srcu_read_{un,}lock_fast() are intended to replace
    rcu_read_{un,}lock_trace() if possible.
 
 torture.2025.02.05a:
  - Add get_torture_init_jiffies() to return the start time of the test.
  - Add a test_boost_holdoff module parameter to allow delaying boosting
    tests when building rcutorture as built-in.
  - Add grace period sequence number logging at the beginning and end of
    failure/close-call results.
  - Switch to hexadecimal for the expedited grace period sequence number
    in the rcu_exp_grace_period trace point.
  - Make cur_ops->format_gp_seqs take buffer length.
  - Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool.
  - Complain when invalid SRCU reader_flavor is specified.
  - Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing, which forces SRCU
    uses atomics even when percpu ops are NMI safe, and use the Kconfig
    for SRCU lockdep testing.
 
 misc.2025.03.04a:
  - Split rcu_report_exp_cpu_mult() mask parameter and use for tracing.
  - Remove READ_ONCE() for rdp->gpwrap access in __note_gp_changes().
  - Fix get_state_synchronize_rcu_full() GP-start detection.
  - Move RCU Tasks self-tests to core_initcall().
  - Print segment lengths in show_rcu_nocb_gp_state().
  - Make RCU watch ct_kernel_exit_state() warning.
  - Flush console log from kernel_power_off().
  - rcutorture: Allow a negative value for nfakewriters.
  - rcu: Update TREE05.boot to test normal synchronize_rcu().
  - rcu: Use _full() API to debug synchronize_rcu().
 
 lazypreempt.2025.03.04a: Make RCU handle PREEMPT_LAZY better:
  - Fix header guard for rcu_all_qs().
  - rcu: Rename PREEMPT_AUTO to PREEMPT_LAZY.
  - Update __cond_resched comment about RCU quiescent states.
  - Handle unstable rdp in rcu_read_unlock_strict().
  - Handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y.
  - osnoise: Provide quiescent states.
  - Adjust rcutorture with possible PREEMPT_RCU=n && PREEMPT_COUNT=y
    combination.
  - Limit PREEMPT_RCU configurations.
  - Make rcutorture scenario TREE07 and scenario TREE10 use PREEMPT_LAZY=y.

Merge tag 'rcu-next-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux

Pull RCU updates from Boqun Feng:
 "Documentation:
   - Add broken-timing possibility to stallwarn.rst
   - Improve discussion of this_cpu_ptr(), add raw_cpu_ptr()
   - Document self-propagating callbacks
   - Point call_srcu() to call_rcu() for detailed memory ordering
   - Add CONFIG_RCU_LAZY delays to call_rcu() kernel-doc header
   - Clarify RCU_LAZY and RCU_LAZY_DEFAULT_OFF help text
   - Remove references to old grace-period-wait primitives

  srcu:
   - Introduce srcu_read_{un,}lock_fast(), which is similar to
     srcu_read_{un,}lock_lite(): avoid smp_mb()s in lock and unlock
     at the cost of calling synchronize_rcu() in synchronize_srcu()

     Moreover, by returning the percpu offset of the counter at
     srcu_read_lock_fast() time, srcu_read_unlock_fast() can avoid
     extra pointer dereferencing, which makes it faster than
     srcu_read_{un,}lock_lite()

     srcu_read_{un,}lock_fast() are intended to replace
     rcu_read_{un,}lock_trace() if possible

  RCU torture:
   - Add get_torture_init_jiffies() to return the start time of the test
   - Add a test_boost_holdoff module parameter to allow delaying
     boosting tests when building rcutorture as built-in
   - Add grace period sequence number logging at the beginning and end
     of failure/close-call results
   - Switch to hexadecimal for the expedited grace period sequence
     number in the rcu_exp_grace_period trace point
   - Make cur_ops->format_gp_seqs take buffer length
   - Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool
   - Complain when invalid SRCU reader_flavor is specified
   - Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing, which forces SRCU
     uses atomics even when percpu ops are NMI safe, and use the Kconfig
     for SRCU lockdep testing

  Misc:
   - Split rcu_report_exp_cpu_mult() mask parameter and use for tracing
   - Remove READ_ONCE() for rdp->gpwrap access in __note_gp_changes()
   - Fix get_state_synchronize_rcu_full() GP-start detection
   - Move RCU Tasks self-tests to core_initcall()
   - Print segment lengths in show_rcu_nocb_gp_state()
   - Make RCU watch ct_kernel_exit_state() warning
   - Flush console log from kernel_power_off()
   - rcutorture: Allow a negative value for nfakewriters
   - rcu: Update TREE05.boot to test normal synchronize_rcu()
   - rcu: Use _full() API to debug synchronize_rcu()

  Make RCU handle PREEMPT_LAZY better:
   - Fix header guard for rcu_all_qs()
   - rcu: Rename PREEMPT_AUTO to PREEMPT_LAZY
   - Update __cond_resched comment about RCU quiescent states
   - Handle unstable rdp in rcu_read_unlock_strict()
   - Handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y
   - osnoise: Provide quiescent states
   - Adjust rcutorture with possible PREEMPT_RCU=n && PREEMPT_COUNT=y
     combination
   - Limit PREEMPT_RCU configurations
   - Make rcutorture scenario TREE07 and scenario TREE10 use
      PREEMPT_LAZY=y"

* tag 'rcu-next-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (59 commits)
  rcutorture: Make scenario TREE07 build CONFIG_PREEMPT_LAZY=y
  rcutorture: Make scenario TREE10 build CONFIG_PREEMPT_LAZY=y
  rcu: limit PREEMPT_RCU configurations
  rcutorture: Update ->extendables check for lazy preemption
  rcutorture: Update rcutorture_one_extend_check() for lazy preemption
  osnoise: provide quiescent states
  rcu: Use _full() API to debug synchronize_rcu()
  rcu: Update TREE05.boot to test normal synchronize_rcu()
  rcutorture: Allow a negative value for nfakewriters
  Flush console log from kernel_power_off()
  context_tracking: Make RCU watch ct_kernel_exit_state() warning
  rcu/nocb: Print segment lengths in show_rcu_nocb_gp_state()
  rcu-tasks: Move RCU Tasks self-tests to core_initcall()
  rcu: Fix get_state_synchronize_rcu_full() GP-start detection
  torture: Make SRCU lockdep testing use srcu_read_lock_nmisafe()
  srcu: Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing
  rcutorture: Complain when invalid SRCU reader_flavor is specified
  rcutorture: Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool
  rcutorture: Make cur_ops->format_gp_seqs take buffer length
  rcutorture: Add ftrace-compatible timestamp to GP# failure/close-call output
  ...
2025-03-24 19:41:37 -07:00
Linus Torvalds
bcb044256d sched_ext: Changes for v6.15
- Add mechanism to count and report internal events. This significantly
   improves visibility on subtle corner conditions.
 
 - The default idle CPU selection logic is revamped and improved in multiple
   ways including being made topology aware.
 
 - sched_ext was disabling ttwu_queue for simplicity, which can be costly
   when hardware topology is more complex. Implement
   SCX_OPS_ALLOWED_QUEUED_WAKEUP so that BPF schedulers can selectively
   enable ttwu_queue.
 
 - tools/sched_ext updates to improve compatibility among others.
 
 - Other misc updates and fixes.
 
 - sched_ext/for-6.14-fixes were pulled a few times to receive prerequisite
   fixes and resolve conflicts.

Merge tag 'sched_ext-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

 - Add mechanism to count and report internal events. This significantly
   improves visibility on subtle corner conditions.

 - The default idle CPU selection logic is revamped and improved in
   multiple ways including being made topology aware.

 - sched_ext was disabling ttwu_queue for simplicity, which can be
   costly when hardware topology is more complex. Implement
   SCX_OPS_ALLOWED_QUEUED_WAKEUP so that BPF schedulers can selectively
   enable ttwu_queue.

 - tools/sched_ext updates to improve compatibility among others.

 - Other misc updates and fixes.

 - sched_ext/for-6.14-fixes were pulled a few times to receive
   prerequisite fixes and resolve conflicts.

* tag 'sched_ext-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (42 commits)
  sched_ext: idle: Refactor scx_select_cpu_dfl()
  sched_ext: idle: Honor idle flags in the built-in idle selection policy
  sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()
  sched_ext: Add trace point to track sched_ext core events
  sched_ext: Change the event type from u64 to s64
  sched_ext: Documentation: add task lifecycle summary
  tools/sched_ext: Provide a compatible helper for scx_bpf_events()
  selftests/sched_ext: Add NUMA-aware scheduler test
  tools/sched_ext: Provide consistent access to scx flags
  sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behavior
  sched_ext: idle: Introduce scx_bpf_nr_node_ids()
  sched_ext: idle: Introduce node-aware idle cpu kfunc helpers
  sched_ext: idle: Per-node idle cpumasks
  sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE
  sched_ext: idle: Make idle static keys private
  sched/topology: Introduce for_each_node_numadist() iterator
  mm/numa: Introduce nearest_node_nodemask()
  nodemask: numa: reorganize inclusion path
  nodemask: add nodes_copy()
  tools/sched_ext: Sync with scx repo
  ...
2025-03-24 17:23:48 -07:00
Gabriele Monaco
26f80681a0 sched: Add sched tracepoints for RV task model
Add the following tracepoints:
* sched_entry(bool preempt, ip)
    Called while entering __schedule
* sched_exit(bool is_switch, ip)
    Called while exiting __schedule
* sched_set_state(task, curr_state, state)
    Called when a task changes its state (to and from running)

These tracepoints are useful to describe the Linux task model and are
adapted from the patches by Daniel Bristot de Oliveira
(https://bristot.me/linux-task-model/).
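
A probe can be attached with the usual DECLARE_TRACE-generated helpers,
e.g. (hypothetical sketch; the probe prototype is assumed from the
argument list above, not taken from the patch):

  static void probe_set_state(void *data, struct task_struct *tsk,
                              int curr_state, int state)
  {
          /* feed the RV monitor's reference model here */
  }

  static int __init rv_sketch_init(void)
  {
          return register_trace_sched_set_state(probe_set_state, NULL);
  }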

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/20250305140406.350227-2-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-24 17:27:39 -04:00
Josh Poimboeuf
2cbb20b008 tracing: Disable branch profiling in noinstr code
CONFIG_TRACE_BRANCH_PROFILING inserts a call to ftrace_likely_update()
for each use of likely() or unlikely().  That breaks noinstr rules if
the affected function is annotated as noinstr.

Disable branch profiling for files with noinstr functions.  In addition
to some individual files, this also includes the entire arch/x86
subtree, as well as the kernel/entry, drivers/cpuidle, and drivers/idle
directories, all of which are noinstr-heavy.

Due to the nature of how sched binaries are built by combining multiple
.c files into one, branch profiling is disabled more broadly across the
sched code than would otherwise be needed.

This fixes many warnings like the following:

  vmlinux.o: warning: objtool: do_syscall_64+0x40: call to ftrace_likely_update() leaves .noinstr.text section
  vmlinux.o: warning: objtool: __rdgsbase_inactive+0x33: call to ftrace_likely_update() leaves .noinstr.text section
  vmlinux.o: warning: objtool: handle_bug.isra.0+0x198: call to ftrace_likely_update() leaves .noinstr.text section
  ...

Reported-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/fb94fc9303d48a5ed370498f54500cc4c338eb6d.1742586676.git.jpoimboe@kernel.org
2025-03-22 09:49:26 +01:00
Linus Torvalds
cb90c8df91 Revert a scheduler performance optimization that regressed other workloads.
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'sched-urgent-2025-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Revert a scheduler performance optimization that regressed other
  workloads"

* tag 'sched-urgent-2025-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "sched/core: Reduce cost of sched_move_task when config autogroup"
2025-03-21 08:48:40 -07:00
Ingo Molnar
dd5bdaf2b7 sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
All the big Linux distros enable CONFIG_SCHED_DEBUG, because
the various features it provides help not just with kernel
development, but with system administration and user-space
software development as well.

Reflect this reality and enable this functionality
unconditionally.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250317104257.3496611-4-mingo@kernel.org
2025-03-19 22:20:53 +01:00
Ingo Molnar
57903f72f2 sched/debug: Make 'const_debug' tunables unconditional __read_mostly
With CONFIG_SCHED_DEBUG becoming unconditional, remove the
extra 'const_debug' indirection towards __read_mostly.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250317104257.3496611-3-mingo@kernel.org
2025-03-19 22:20:53 +01:00
Ingo Molnar
f7d2728cc0 sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
The scheduler has this special SCHED_WARN() facility that
depends on CONFIG_SCHED_DEBUG.

Since CONFIG_SCHED_DEBUG is getting removed, convert
SCHED_WARN() to WARN_ON_ONCE().

Note that the warning output isn't 100% equivalent:

   #define SCHED_WARN_ON(x)      WARN_ONCE(x, #x)

Because SCHED_WARN_ON() would output the 'x' condition
as well, while WARN_ONCE() will only show a backtrace.

Hopefully these are rare enough to not really matter.

If it does, we should probably introduce a new WARN_ON()
variant that outputs the condition in stringified form,
or improve WARN_ON() itself.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250317104257.3496611-2-mingo@kernel.org
2025-03-19 22:20:53 +01:00
Juri Lelli
d128130f48 sched/topology: Stop exposing partition_sched_domains_locked
There are no callers of partition_sched_domains_locked() outside
topology.c.

Stop exposing this function.

Suggested-by: Waiman Long <llong@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Waiman Long <longman@redhat.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/Z9MSC96a8FcqWV3G@jlelli-thinkpadt14gen4.remote.csb
2025-03-17 11:23:43 +01:00
Juri Lelli
d735bab3d5 sched/topology: Remove redundant dl_clear_root_domain call
We completely clean and restore root domains bandwidth accounting after
every root domains change, so the dl_clear_root_domain() call in
partition_sched_domains_locked() is redundant.

Remove it.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Waiman Long <llong@redhat.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Waiman Long <longman@redhat.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/Z9MRtcX4tz4tcLRR@jlelli-thinkpadt14gen4.remote.csb
2025-03-17 11:23:42 +01:00
Juri Lelli
2ff899e351 sched/deadline: Rebuild root domain accounting after every update
Rebuilding of root domain accounting information (total_bw) is
currently broken in some cases, e.g. suspend/resume on aarch64. The
problem is that the way we keep track of domain changes and try to add
bandwidth back is convoluted and fragile.

Fix it by simplifying things: make sure bandwidth accounting is cleared
and completely restored after root domain changes (once root domains
are stable again).

To make sure we always call dl_rebuild_rd_accounting() while holding
cpuset_mutex, we also add a cpuset_reset_sched_domains() wrapper.
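
A plausible shape for that wrapper (hypothetical sketch; the actual
kernel/cgroup/cpuset.c code may differ):

  void cpuset_reset_sched_domains(void)
  {
          mutex_lock(&cpuset_mutex);
          /* Rebuilding the domains ends up re-running the deadline
           * bandwidth accounting with cpuset_mutex held. */
          partition_sched_domains(1, NULL, NULL);
          mutex_unlock(&cpuset_mutex);
  }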

Fixes: 53916d5fd3 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Co-developed-by: Waiman Long <llong@redhat.com>
Signed-off-by: Waiman Long <llong@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/Z9MRfeJKJUOyUSto@jlelli-thinkpadt14gen4.remote.csb
2025-03-17 11:23:42 +01:00
Juri Lelli
45007c6fb5 sched/deadline: Generalize unique visiting of root domains
Bandwidth checks and updates that work on root domains currently employ
a cookie mechanism for efficiency. This mechanism is very much tied to
when root domains are first created and initialized.

Generalize the cookie mechanism so that it can be used also later at
runtime while updating root domains. Also, additionally guard it with
sched_domains_mutex, since domains need to be stable while updating them
(and it will be required for further dynamic changes).

Fixes: 53916d5fd3 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Waiman Long <longman@redhat.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/Z9MQaiXPvEeW_v7x@jlelli-thinkpadt14gen4.remote.csb
2025-03-17 11:23:41 +01:00
Juri Lelli
56209334dd sched/topology: Wrappers for sched_domains_mutex
Create wrappers for sched_domains_mutex so that it can transparently be
used on both CONFIG_SMP and !CONFIG_SMP, as some functions will need to
do.
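
A sketch of the wrapper idea (names assumed for illustration; the
!CONFIG_SMP variants collapse to no-ops so callers need no #ifdefs):

  #ifdef CONFIG_SMP
  void sched_domains_mutex_lock(void)
  {
          mutex_lock(&sched_domains_mutex);
  }

  void sched_domains_mutex_unlock(void)
  {
          mutex_unlock(&sched_domains_mutex);
  }
  #else
  static inline void sched_domains_mutex_lock(void) { }
  static inline void sched_domains_mutex_unlock(void) { }
  #endif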

Fixes: 53916d5fd3 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Waiman Long <longman@redhat.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/Z9MP5Oq9RB8jBs3y@jlelli-thinkpadt14gen4.remote.csb
2025-03-17 11:23:41 +01:00
Juri Lelli
f6147af176 sched/deadline: Ignore special tasks when rebuilding domains
SCHED_DEADLINE special tasks get a fake bandwidth that is only used to
make sure sleeping and priority inheritance 'work', but it is ignored
for runtime enforcement and admission control.

Be consistent with it also when rebuilding root domains.
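
Illustratively, the rebuild loop simply skips such tasks (sketch using the
existing dl_entity_is_special() helper; the surrounding accounting code is
omitted):

  if (!dl_task(p) || dl_entity_is_special(&p->dl))
          continue;       /* fake bandwidth: not subject to admission control */
  /* ... otherwise add p->dl.dl_bw back into the root domain ... */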

Fixes: 53916d5fd3 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Waiman Long <longman@redhat.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20250313170011.357208-2-juri.lelli@redhat.com
2025-03-17 11:23:41 +01:00
Sebastian Andrzej Siewior
8bdc5daaa0 sched: Add a generic function to return the preemption string
The individual architectures often add the preemption model to the beginning
of the backtrace. This is the case on X86 or ARM64 for the "die" case
but not for regular warnings. With the addition of PREEMPT_DYNAMIC for
PREEMPT_RT we end up with CONFIG_PREEMPT and CONFIG_PREEMPT_RT set
simultaneously. That means that everyone who tried to add that piece of
information gets it wrong for PREEMPT_RT because PREEMPT is checked
first.

Provide a generic function which returns the current scheduling model
considering LAZY preempt and the current state of PREEMPT_DYNAMIC.

The resulting strings are:
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃   Model   ┃  -RT -DYN    ┃     +RT -DYN      ┃     -RT +DYN       ┃     +RT +DYN      ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│NONE       │ NONE         │ n/a               │ PREEMPT(none)      │ n/a               │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│VOLUNTARY  │ VOLUNTARY    │ n/a               │ PREEMPT(voluntary) │ n/a               │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│FULL       │ PREEMPT      │ PREEMPT_RT        │ PREEMPT(full)      │ PREEMPT_{RT,full} │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│LAZY       │ PREEMPT_LAZY │ PREEMPT_{RT,LAZY} │ PREEMPT(lazy)      │ PREEMPT_{RT,lazy} │
└───────────┴──────────────┴───────────────────┴────────────────────┴───────────────────┘

[ The dynamic building of the string can lead to an empty string if the
  function is invoked simultaneously on two CPUs. ]
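
Callers then just print the returned string, e.g. (illustrative call site):

  pr_emerg("Preemption model: %s\n", preempt_model_str());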

Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Co-developed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Signed-off-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20250314160810.2373416-2-bigeasy@linutronix.de
2025-03-17 11:23:38 +01:00
Dietmar Eggemann
76f970ce51 Revert "sched/core: Reduce cost of sched_move_task when config autogroup"
This reverts commit eff6c8ce8d.

Hazem reported a 30% drop in UnixBench spawn test with commit
eff6c8ce8d ("sched/core: Reduce cost of sched_move_task when config
autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
(aarch64) (single level MC sched domain):

  https://lkml.kernel.org/r/20250205151026.13061-1-hagarhem@amazon.com

There is an early bail from sched_move_task() if p->sched_task_group is
equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
(Ubuntu '22.04.5 LTS').

So in:

  do_exit()

    sched_autogroup_exit_task()

      sched_move_task()

        if sched_get_task_group(p) == p->sched_task_group
          return

        /* p is enqueued */
        dequeue_task()              \
        sched_change_group()        |
          task_change_group_fair()  |
            detach_task_cfs_rq()    |                              (1)
            set_task_rq()           |
            attach_task_cfs_rq()    |
        enqueue_task()              /

(1) isn't called for p anymore.

Turns out that the regression is related to sgs->group_util in
group_is_overloaded() and group_has_capacity(). If (1) isn't called for
all the 'spawn' tasks then sgs->group_util is ~900 and
sgs->group_capacity = 1024 (single CPU sched domain) and this leads to
group_is_overloaded() returning true (2) and group_has_capacity() false
(3) much more often compared to the case when (1) is called.

I.e. there are many more cases of 'group_is_overloaded' and
'group_fully_busy' in the WF_FORK wakeup sched_balance_find_dst_cpu() path,
which then much more often returns a CPU != smp_processor_id() (5).

This isn't good for these extremely short running tasks (FORK + EXIT) and
also involves calling sched_balance_find_dst_group_cpu() unnecessarily
(single CPU sched domain).

Instead if (1) is called for 'p->flags & PF_EXITING' then the path
(4),(6) is taken much more often.

  select_task_rq_fair(..., wake_flags = WF_FORK)

    cpu = smp_processor_id()

    new_cpu = sched_balance_find_dst_cpu(..., cpu, ...)

      group = sched_balance_find_dst_group(..., cpu)

        do {

          update_sg_wakeup_stats()

            sgs->group_type = group_classify()

              if group_is_overloaded()                             (2)
                return group_overloaded

              if !group_has_capacity()                             (3)
                return group_fully_busy

              return group_has_spare                               (4)

        } while group

        if local_sgs.group_type > idlest_sgs.group_type
          return idlest                                            (5)

        case group_has_spare:

          if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
            return NULL                                            (6)

Unixbench Tests './Run -c 4 spawn' on:

(a) VM AWS instance (m7gd.16xlarge) with v6.13 ('maxcpus=4 nr_cpus=4')
    and Ubuntu 22.04.5 LTS (aarch64).

    Shell & test run in '/user.slice/user-1000.slice/session-1.scope'.

    w/o patch	w/ patch
    21005	27120

(b) i7-13700K with tip/sched/core ('nosmt maxcpus=8 nr_cpus=8') and
    Ubuntu 22.04.5 LTS (x86_64).

    Shell & test run in '/A'.

    w/o patch	w/ patch
    67675	88806

CONFIG_SCHED_AUTOGROUP=y & /proc/sys/kernel/sched_autogroup_enabled set to
0 or 1.

Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Hagar Hemdan <hagarhem@amazon.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250314151345.275739-1-dietmar.eggemann@arm.com
2025-03-15 10:34:27 +01:00
Xuewen Yan
4bc4582414 sched/uclamp: Optimize sched_uclamp_used static key enabling
Repeated calls of static_branch_enable() on an already enabled
static key introduce overhead, because each call takes cpus_read_lock().

Users may frequently set the uclamp value of tasks, triggering
repeated enabling of the sched_uclamp_used static key.

Optimize this and avoid repeated calls to static_branch_enable()
by checking whether the key is already enabled.

[ mingo: Rewrote the changelog for legibility ]
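
A minimal sketch of the resulting check-before-enable pattern (the wrapper
name is made up; the real call sites are the uclamp update paths):

  /* Illustrative: skip static_branch_enable(), and with it
   * cpus_read_lock(), when the key is already enabled.
   */
  static void sched_uclamp_enable(void)
  {
          if (!static_branch_unlikely(&sched_uclamp_used))
                  static_branch_enable(&sched_uclamp_used);
  }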

Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250219093747.2612-2-xuewen.yan@unisoc.com
2025-03-15 10:28:50 +01:00
Xuewen Yan
5fca5a4cf9 sched/uclamp: Use the uclamp_is_used() helper instead of open-coding it
Don't open-code static_branch_unlikely(&sched_uclamp_used), we have
the uclamp_is_used() wrapper around it.

[ mingo: Clean up the changelog ]

Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250219093747.2612-1-xuewen.yan@unisoc.com
2025-03-15 10:26:37 +01:00
Linus Torvalds
a22ea738f4 Fix a sleeping-while-atomic bug caused by a recent optimization
utilizing static keys that didn't consider that the
 static_key_disable() call could be triggered in atomic context.
 
 Revert the optimization.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfT8m4RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h8IRAAlvqGw8HmdN5VGdco95ejRK8zdPLWQFvI
 m7FvjSZu5ciZrxz72QVmpIoprfNSb7WiYFhrVi7x6qIHgc/g9frIWxRP/OnXsMSg
 hWAHystWPVrLirD6lN46nOKvVaqaACa6NyndNnwGtLEwTyf3Rc87XU/gRjiCwCBG
 hFt7HAEbgSi0YnKcPm79Le0Es33ZfH1Z47LSOSLlEyZbp+dcoliOzveJmbxhcfTM
 3yUO9HtoFRKsUbDZCYHE/ZWgDJbaGNPokwHf9XMSFrWbTXPHxvJvKgZKMfaN+rtr
 NKXzDvUW7D3Au3KHrBk39TcBZMkAKLnnze5RV8xAurKzMZBRWD7ybKetzeMt8gzg
 os/Ge9M5AxldiuPxK8s8S+X+L5tQ/uNwzbQEaC0w4ouH3ZBjgoHTWNJM6Lpwpl42
 3c+zBC+2TdC2dDj0elL2Z34T6X6FiHMoBHOIKaTvX7Lh5G4lTLC9+GIBAt/vSAFv
 vqpo3Bc61qq2jgpnmWRVVfvHKrvA7dvfvId0oE8JQfGB446glLlEKJ7Dyn65NJp6
 9rc2wPAAs+oaaxi47p94CKwBvYzyKkMXSnyB5WXQJOk4O6QD7uv9qGXHzy+8gAom
 QAFTJGpTr7mnXcu/ZOdGGBqoq5Kcbvf15SnnkYqduE0f/omoW+BqgEyGxvbP+qsV
 /7D7ZCXbyk0=
 =fXc6
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Fix a sleeping-while-atomic bug caused by a recent optimization
  utilizing static keys that didn't consider that the
  static_key_disable() call could be triggered in atomic context.

  Revert the optimization"

* tag 'sched-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/clock: Don't define sched_clock_irqtime as static key
2025-03-14 09:56:46 -10:00
Andrea Righi
e4855fc90e sched_ext: idle: Refactor scx_select_cpu_dfl()
Make scx_select_cpu_dfl() more consistent with the other idle-related
APIs by returning a negative value when an idle CPU isn't found.

No functional changes, this is purely a refactoring.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14 08:17:11 -10:00
Andrea Righi
c414c2171c sched_ext: idle: Honor idle flags in the built-in idle selection policy
Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(),
to enforce strict selection criteria, such as selecting an idle CPU
strictly within @prev_cpu's node or choosing only a fully idle SMT core.

This functionality will be exposed through a dedicated kfunc in a
separate patch.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14 08:17:01 -10:00
Andrea Righi
97e13ecb02 sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()
scx_bpf_reenqueue_local() can be invoked from ops.cpu_release() to give
tasks that are queued to the local DSQ a chance to migrate to other
CPUs, when a CPU is taken by a higher scheduling class.

However, there is no point re-enqueuing tasks that can only run on that
particular CPU, as they would simply be re-added to the same local DSQ
without any benefit.

Therefore, skip per-CPU tasks in scx_bpf_reenqueue_local().
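
A minimal sketch of the skip condition inside the re-enqueue loop (the
surrounding loop is omitted and the exact predicate may differ):

  /* Illustrative: a task that can only run on this CPU would just be
   * re-added to the same local DSQ, so skip it.
   */
  if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
          continue;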

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-13 07:07:27 -10:00
Linus Torvalds
b7f94fcf55 sched_ext: A fix for v6.14-rc6
BPF schedulers could trigger a crash by passing in an invalid CPU to the
 helper scx_bpf_select_cpu_dfl(). Fix it by verifying input validity.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ9H4tw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGY2PAP95rqNEACvl5uz/HQM+T0WpwGDaIJ3fmKYd3GZY
 3XJjhwD/YmKMLmth0xeDLkAtVUNsMp4EjpssKdzi0CJq+Nl4nQw=
 =OxEw
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.14-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fix from Tejun Heo:
 "BPF schedulers could trigger a crash by passing in an invalid CPU to
  the scx_bpf_select_cpu_dfl() helper.

  Fix it by verifying input validity"

* tag 'sched_ext-for-6.14-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()
2025-03-12 11:52:04 -10:00
Yafang Shao
f3fa0e40df sched/clock: Don't define sched_clock_irqtime as static key
The sched_clock_irqtime was defined as a static key in:

  8722903cbb ("sched: Define sched_clock_irqtime as static key")

However, this change introduces a 'sleeping in atomic context' warning:

	arch/x86/kernel/tsc.c:1214 mark_tsc_unstable()
	warn: sleeping in atomic context

As analyzed by Dan, the affected code path is as follows:

vcpu_load() <- disables preempt
-> kvm_arch_vcpu_load()
   -> mark_tsc_unstable() <- sleeps

virt/kvm/kvm_main.c
   166  void vcpu_load(struct kvm_vcpu *vcpu)
   167  {
   168          int cpu = get_cpu();
                          ^^^^^^^^^^
This get_cpu() disables preemption.

   169
   170          __this_cpu_write(kvm_running_vcpu, vcpu);
   171          preempt_notifier_register(&vcpu->preempt_notifier);
   172          kvm_arch_vcpu_load(vcpu, cpu);
   173          put_cpu();
   174  }

arch/x86/kvm/x86.c
  4979          if (unlikely(vcpu->cpu != cpu) || kvm_check_tsc_unstable()) {
  4980                  s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
  4981                                  rdtsc() - vcpu->arch.last_host_tsc;
  4982                  if (tsc_delta < 0)
  4983                          mark_tsc_unstable("KVM discovered backwards TSC");

arch/x86/kernel/tsc.c
    1206 void mark_tsc_unstable(char *reason)
    1207 {
    1208         if (tsc_unstable)
    1209                 return;
    1210
    1211         tsc_unstable = 1;
    1212         if (using_native_sched_clock())
    1213                 clear_sched_clock_stable();
--> 1214         disable_sched_clock_irqtime();
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
kernel/jump_label.c
   245  void static_key_disable(struct static_key *key)
   246  {
   247          cpus_read_lock();
                ^^^^^^^^^^^^^^^^
This lock has a might_sleep() in it which triggers the static checker
warning.

   248          static_key_disable_cpuslocked(key);
   249          cpus_read_unlock();
   250  }

Let's revert this change for now, as {disable,enable}_sched_clock_irqtime
are used in many places, as pointed out by Sean, including the following:

The code path in clocksource_watchdog():

  clocksource_watchdog()
  |
  -> spin_lock(&watchdog_lock);
     |
     -> __clocksource_unstable()
        |
        -> clocksource.mark_unstable() == tsc_cs_mark_unstable()
           |
           -> disable_sched_clock_irqtime()

And the code path in sched_clock_register():

	/* Cannot register a sched_clock with interrupts on */
	local_irq_save(flags);

	...

	/* Enable IRQ time accounting if we have a fast enough sched_clock() */
	if (irqtime > 0 || (irqtime == -1 && rate >= 1000000))
		enable_sched_clock_irqtime();

	local_irq_restore(flags);

[ lkp@intel.com: reported a build error in the prev version ]
[ mingo: cherry-picked it over into sched/urgent ]

Closes: https://lore.kernel.org/kvm/37a79ba3-9ce0-479c-a5b0-2bd75d573ed3@stanley.mountain/
Fixes: 8722903cbb ("sched: Define sched_clock_irqtime as static key")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Debugged-by: Dan Carpenter <dan.carpenter@linaro.org>
Debugged-by: Sean Christopherson <seanjc@google.com>
Debugged-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250205032438.14668-1-laoar.shao@gmail.com
2025-03-10 14:22:58 +01:00
Ingo Molnar
82354fce16 Merge branch 'sched/urgent' into sched/core, to pick up dependent commits
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-06 22:26:36 +01:00
Shrikanth Hegde
14672f059d sched/deadline: Use online cpus for validating runtime
The ftrace selftest reported a failure because writing -1 to
sched_rt_runtime_us returns -EBUSY. This happens when the possible
CPUs are different from active CPUs.

Active CPUs are part of one root domain, while the remaining CPUs are part
of def_root_domain. Since the active cpumask is being used, this results in
cpus=0 when a non-active CPU is used in the loop.

Fix it by looping over the online CPUs instead for validating the
bandwidth calculations.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20250306052954.452005-2-sshegde@linux.ibm.com
2025-03-06 10:21:31 +01:00
Zecheng Li
3b4035ddbf sched/fair: Fix potential memory corruption in child_cfs_rq_on_list
child_cfs_rq_on_list attempts to convert a 'prev' pointer to a cfs_rq.
This 'prev' pointer can originate from struct rq's leaf_cfs_rq_list,
making the conversion invalid and potentially leading to memory
corruption. Depending on the relative positions of leaf_cfs_rq_list and
the task group (tg) pointer within the struct, this can cause a memory
fault or access garbage data.

The issue arises in list_add_leaf_cfs_rq, where both
cfs_rq->leaf_cfs_rq_list and rq->leaf_cfs_rq_list are added to the same
leaf list. Also, rq->tmp_alone_branch can be set to rq->leaf_cfs_rq_list.

This adds a check `if (prev == &rq->leaf_cfs_rq_list)` after the main
conditional in child_cfs_rq_on_list. This ensures that the container_of
operation will convert a correct cfs_rq struct.

This check is sufficient because only cfs_rqs on the same CPU are added
to the list, so verifying the 'prev' pointer against the current rq's list
head is enough.
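
A sketch of the guarded conversion based on the description above (the
surrounding function is abridged and may differ in detail from the actual
patch):

  static inline bool child_cfs_rq_on_list(struct cfs_rq *cfs_rq)
  {
          struct cfs_rq *prev_cfs_rq;
          struct list_head *prev;
          struct rq *rq = rq_of(cfs_rq);

          if (cfs_rq->on_list)
                  prev = cfs_rq->leaf_cfs_rq_list.prev;
          else
                  prev = rq->tmp_alone_branch;

          /* 'prev' may be the rq's list head rather than a cfs_rq */
          if (prev == &rq->leaf_cfs_rq_list)
                  return false;

          prev_cfs_rq = container_of(prev, struct cfs_rq, leaf_cfs_rq_list);

          return (prev_cfs_rq->tg->parent == cfs_rq->tg);
  }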

This fixes a potential memory corruption issue that, due to the current
struct layout, might not manifest as a crash but could lead to unpredictable
behavior when the layout changes.

Fixes: fdaba61ef8 ("sched/fair: Ensure that the CFS parent is added after unthrottling")
Signed-off-by: Zecheng Li <zecheng@google.com>
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250304214031.2882646-1-zecheng@google.com
2025-03-05 17:30:54 +01:00
Changwoo Min
71d078803c sched_ext: Add trace point to track sched_ext core events
Add tracing support to track sched_ext core events
(/sched_ext/sched_ext_event). This may be useful for debugging sched_ext
schedulers that trigger a particular event.

The trace point can be used like other trace points, so it can be used in,
for example, `perf trace` and BPF programs, as follows:

======
$> sudo perf trace -e sched_ext:sched_ext_event --filter 'name == "SCX_EV_ENQ_SLICE_DFL"'
======

======
struct tp_sched_ext_event {
	struct trace_entry ent;
	u32 __data_loc_name;
	s64 delta;
};

SEC("tracepoint/sched_ext/sched_ext_event")
int rtp_add_event(struct tp_sched_ext_event *ctx)
{
	char event_name[128];
	unsigned short offset = ctx->__data_loc_name & 0xFFFF;
	bpf_probe_read_str((void *)event_name, 128, (char *)ctx + offset);

	bpf_printk("name %s   delta %lld", event_name, ctx->delta);
	return 0;
}
======

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-04 08:06:17 -10:00
Changwoo Min
038730dc12 sched_ext: Change the event type from u64 to s64
The event count could be negative in the future,
so change the event type from u64 to s64.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-04 08:05:23 -10:00
Tejun Heo
8a9b1585e2 sched_ext: Merge branch 'for-6.14-fixes' into for-6.15
Pull for-6.14-fixes to receive:

  9360dfe4cb ("sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()")

which conflicts with:

  337d1b354a ("sched_ext: Move built-in idle CPU selection policy to a separate file")

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03 08:02:22 -10:00
Andrea Righi
9360dfe4cb sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids
range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel
crash.

To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and
trigger an scx error if an invalid CPU is specified.
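
A minimal sketch of the validation at the top of scx_bpf_select_cpu_dfl()
(the exact error handling, e.g. which value is returned after flagging the
error, is illustrative):

  /* Illustrative: reject an out-of-range prev_cpu before using it. */
  if (unlikely(prev_cpu < 0 || prev_cpu >= nr_cpu_ids)) {
          scx_ops_error("invalid prev_cpu %d", prev_cpu);
          *is_idle = false;
          return prev_cpu;
  }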

Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03 07:55:48 -10:00
Linus Torvalds
d203484f25 Prevent cond_resched() based preemption when interrupts are disabled,
on PREEMPT_NONE and PREEMPT_VOLUNTARY kernels.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfCDDMRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gf+RAAvFNXelLgrNbILZ6ckp/ikWnjCbf2QOIk
 aCm6JMQm7WrFvgo1u6CM4vQQYZdEqf8+KiEjJJnoq2P4jvYzhO1/1pLfEDNaHeiH
 GneosmKAwSMR8lgDlw5DXxhXsfeuYYhG5VMe2ia+kyiIA83TUF6hl9jpawWB3dsw
 +xB6CAg3JLoR2v44E/Mf1PdGaGrF90fYxp+X5RNSqxVXcN54cgVx2G9lHeTIWcnp
 SjIiWo5mply50de+dxD5dNUB9mj/k+yLQaiuPfUDGo/ZOjFyBnsP5VlD+ySbhkIa
 Rwdw6olLqXLcX5D5RsPIuePm/XdmAQXr6GXxJjdhtV1oWTP3Bejev3upQ/kxHQ50
 DQa+aSTqNx9bNlwphUafCmVo1OZap4mViOSWP7r96HhFwehLGGmkjEaU9eFuUl0P
 kG+qGq28U+Nnz0r6/pEkwic1B6wbq2x1XRbtJqxXnBcQvMxMgDWNrTIj1ytDcSBb
 3Qo0shRrtjH7DN1ly8IBllLQ0wXXI5O6GwjI7absEyEjpdoxFyMsHpaFONlTWRdi
 NgR2+5MWTxExeWaDRPAJM+THzwucfWVTeZVXJFMRfQnNIBj7TpO3X3Y4xzP9Vl/Y
 2HEz8voSDZUVN6Ejxx/am7kb68WpWw46xmj59wWT7nf9SVEEm+R4Pfe3O9+0yvQV
 V4l6tN4yfEU=
 =RknP
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Prevent cond_resched() based preemption when interrupts are disabled,
  on PREEMPT_NONE and PREEMPT_VOLUNTARY kernels"

* tag 'sched-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Prevent rescheduling when interrupts are disabled
2025-02-28 17:00:16 -08:00
Thomas Gleixner
82c387ef75 sched/core: Prevent rescheduling when interrupts are disabled
David reported a warning observed while loop testing kexec jump:

  Interrupts enabled after irqrouter_resume+0x0/0x50
  WARNING: CPU: 0 PID: 560 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
   kernel_kexec+0xf6/0x180
   __do_sys_reboot+0x206/0x250
   do_syscall_64+0x95/0x180

The corresponding interrupt flag trace:

  hardirqs last  enabled at (15573): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90
  hardirqs last disabled at (15580): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90

That means __up_console_sem() was invoked with interrupts enabled. Further
instrumentation revealed that in the interrupt disabled section of kexec
jump one of the syscore_suspend() callbacks woke up a task, which set the
NEED_RESCHED flag. A later callback in the resume path invoked
cond_resched() which in turn led to the invocation of the scheduler:

  __cond_resched+0x21/0x60
  down_timeout+0x18/0x60
  acpi_os_wait_semaphore+0x4c/0x80
  acpi_ut_acquire_mutex+0x3d/0x100
  acpi_ns_get_node+0x27/0x60
  acpi_ns_evaluate+0x1cb/0x2d0
  acpi_rs_set_srs_method_data+0x156/0x190
  acpi_pci_link_set+0x11c/0x290
  irqrouter_resume+0x54/0x60
  syscore_resume+0x6a/0x200
  kernel_kexec+0x145/0x1c0
  __do_sys_reboot+0xeb/0x240
  do_syscall_64+0x95/0x180

This is a long standing problem, which probably got more visible with
the recent printk changes. Something does a task wakeup and the
scheduler sets the NEED_RESCHED flag. cond_resched() sees it set and
invokes schedule() from a completely bogus context. The scheduler
enables interrupts after context switching, which causes the above
warning at the end.

Quite some of the code paths in syscore_suspend()/resume() can result in
triggering a wakeup with the exactly same consequences. They might not
have done so yet, but as they share a lot of code with normal operations
it's just a question of time.

The problem only affects the PREEMPT_NONE and PREEMPT_VOLUNTARY scheduling
models. Full preemption is not affected as cond_resched() is disabled and
the preemption check preemptible() takes the interrupt disabled flag into
account.

Cure the problem by adding a corresponding check into cond_resched().
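
A sketch of the added guard (the RCU-related parts of __cond_resched() are
omitted here):

  int __sched __cond_resched(void)
  {
          if (should_resched(0) && !irqs_disabled()) {
                  preempt_schedule_common();
                  return 1;
          }
          return 0;
  }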

Reported-by: David Woodhouse <dwmw@amazon.co.uk>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/all/7717fe2ac0ce5f0a2c43fdab8b11f4483d54a2a4.camel@infradead.org
2025-02-27 21:13:57 +01:00
Linus Torvalds
e6d3c4e535 sched_ext: A fix for v6.14-rc4
pick_task_scx() has a workaround to avoid stalling when the fair class's
 balance() says yes but pick_task() says no. The workaround was incorrectly
 deciding to keep the prev task running if the task is on SCX even when the
 task is in a sleeping state, which can lead to several confusing failure
 modes. Fix it by testing whether the prev task is currently queued on SCX
 instead.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ79/Ww4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGdhUAQDM1AcK7pJUHzayuQCecCxNspGty8nR9T4KeVly
 51pA2gEA4sbs6Fj4doVKVyaCunsvFoZ8Tb/utCX716fVnpjMMgY=
 =yDMV
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fix from Tejun Heo:
 "pick_task_scx() has a workaround to avoid stalling when the fair
  class's balance() says yes but pick_task() says no.

  The workaround was incorrectly deciding to keep the prev task running
  if the task is on SCX even when the task is in a sleeping state, which
  can lead to several confusing failure modes.

  Fix it by testing whether the prev task is currently queued on SCX
  instead"

* tag 'sched_ext-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()
2025-02-26 14:13:11 -08:00
Andrea Righi
fde7d64766 sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behavior
When %SCX_PICK_IDLE_IN_NODE is specified, scx_bpf_pick_any_cpu_node()
should always return a CPU from the specified node, regardless of its
idle state.

Also clarify this logic in the function documentation.

Fixes: 01059219b0 ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-25 08:35:53 -10:00
Tejun Heo
8fef0a3b17 sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()
a6250aa251 ("sched_ext: Handle cases where pick_task_scx() is called
without preceding balance_scx()") added a workaround to handle the cases
where pick_task_scx() is called without a preceding balance_scx(), which is
due to a fair class bug where pick_task_fair() may return NULL after a true
return from balance_fair().

The workaround detects when pick_task_scx() is called without preceding
balance_scx() and emulates SCX_RQ_BAL_KEEP and triggers kicking to avoid
stalling. Unfortunately, the workaround code was testing whether @prev was
on SCX to decide whether to keep the task running. This is incorrect as the
task may be on SCX but no longer runnable.

This could lead to a non-runnable task being returned from pick_task_scx(),
which causes interesting confusions and failures. e.g. a common failure mode
is the task ending up in the (!on_rq && on_cpu) state, which can cause
potential wakers to busy loop, which can easily lead to deadlocks.

Fix it by testing whether @prev has SCX_TASK_QUEUED set. This makes
@prev_on_scx only used in one place. Open code the usage and improve the
comment while at it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Pat Cody <patcody@meta.com>
Fixes: a6250aa251 ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-25 08:28:52 -10:00
Andrea Righi
0e9b4c10e8 sched_ext: idle: Introduce scx_bpf_nr_node_ids()
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc
scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the
system.

BPF schedulers can use this information together with the new node-aware
kfuncs, for example to create per-node DSQs, validate node IDs, etc.
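
For example, a BPF scheduler could create one DSQ per NUMA node at init
time, roughly as below (the ops name is made up and the loop bound may need
clamping to satisfy the BPF verifier):

  s32 BPF_STRUCT_OPS_SLEEPABLE(foo_init)
  {
          int node;

          /* one DSQ per NUMA node, using the node id as the DSQ id */
          for (node = 0; node < scx_bpf_nr_node_ids(); node++) {
                  s32 ret = scx_bpf_create_dsq(node, node);

                  if (ret)
                          return ret;
          }
          return 0;
  }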

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-24 07:37:42 -10:00
Colin Ian King
1a5d3492f8 sched: Add unlikely branch hints to several system calls
Adding an unlikely() hint on early error return paths improves the
run-time performance of several sched related system calls.

Benchmarking on an i9-12900 shows the following per system call
performance improvements:

		       before     after     improvement
sched_getattr          182.4ns    170.6ns      ~6.5%
sched_setattr          284.3ns    267.6ns      ~5.9%
sched_getparam         161.6ns    148.1ns      ~8.4%
sched_setparam        1265.4ns   1227.6ns      ~3.0%
sched_getscheduler     129.4ns    118.2ns      ~8.7%
sched_setscheduler    1237.3ns   1216.7ns      ~1.7%

Results are based on running 20 tests with turbo disabled (to reduce
clock freq turbo changes), with 10 second run per test based on the
number of system calls per second. The % standard deviation of the
measurements for the 20 tests was 0.05% to 0.40%, so the results are
reliable.

Tested on kernel build with gcc 14.2.1
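
The pattern is simply tagging the early parameter-validation returns, e.g.
(a representative fragment rather than the exact diff):

  if (unlikely(!param || pid < 0))
          return -EINVAL;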

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250219142423.45516-1-colin.i.king@gmail.com
2025-02-21 20:19:13 +01:00
Thorsten Blum
b796ea8489 sched/core: Remove duplicate included header file stats.h
The header file stats.h is included twice. Remove the redundant include
and the following make includecheck warning:

  stats.h is included more than once

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250219111756.3070-2-thorsten.blum@linux.dev
2025-02-21 20:19:13 +01:00
Andrea Righi
01059219b0 sched_ext: idle: Introduce node-aware idle cpu kfunc helpers
Introduce a new kfunc to retrieve the node associated to a CPU:

 int scx_bpf_cpu_node(s32 cpu)

Add the following kfuncs to provide BPF schedulers direct access to
per-node idle cpumasks information:

 const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
 const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
 s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed,
 				int node, u64 flags)
 s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed,
 			       int node, u64 flags)

Moreover, trigger an scx error when any of the non-node aware idle CPU
kfuncs are used when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.
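
A minimal usage sketch from a BPF scheduler's ops.select_cpu() (the ops name
and fallback policy are made up):

  s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
                     s32 prev_cpu, u64 wake_flags)
  {
          int node = scx_bpf_cpu_node(prev_cpu);
          s32 cpu;

          /* try to stay within prev_cpu's NUMA node */
          cpu = scx_bpf_pick_idle_cpu_node(p->cpus_ptr, node,
                                           SCX_PICK_IDLE_IN_NODE);
          if (cpu >= 0)
                  return cpu;

          return prev_cpu;
  }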

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-18 09:10:24 -10:00
Nam Cao
ee13da875b sched: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.

Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
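
For example, using the scheduler's hrtick timer as an illustration (the
actual series converts each such site; names are taken from kernel/sched
and may differ slightly):

  /* before */
  hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
  rq->hrtick_timer.function = hrtick;

  /* after */
  hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC,
                HRTIMER_MODE_REL_HARD);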

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/a55e849cba3c41b4c5708be6ea6be6f337d1a8fb.1738746821.git.namcao@linutronix.de
2025-02-18 10:32:32 +01:00
Mathieu Desnoyers
02d954c0fd sched: Compact RSEQ concurrency IDs with reduced threads and affinity
When a process reduces its number of threads or clears bits in its CPU
affinity mask, the mm_cid allocation should eventually converge towards
smaller values.

However, the change introduced by:

commit 7e019dcc47 ("sched: Improve cache locality of RSEQ concurrency
IDs for intermittent workloads")

adds a per-mm/CPU recent_cid which is never unset unless a thread
migrates.

This is a tradeoff between:

A) Preserving cache locality after a transition from many threads to few
   threads, or after reducing the hamming weight of the allowed CPU mask.

B) Making the mm_cid upper bounds wrt nr threads and allowed CPU mask
   easy to document and understand.

C) Allowing applications to eventually react to mm_cid compaction after
   reduction of the nr threads or allowed CPU mask, making the tracking
   of mm_cid compaction easier by shrinking it back towards 0 or not.

D) Making sure applications that periodically reduce and then increase
   again the nr threads or allowed CPU mask still benefit from good
   cache locality with mm_cid.

Introduce the following changes:

* After shrinking the number of threads or reducing the number of
  allowed CPUs, reduce the value of max_nr_cid so expansion of CID
  allocation will preserve cache locality if the number of threads or
  allowed CPUs increase again.

* Only re-use a recent_cid if it is within the max_nr_cid upper bound,
  else find the first available CID.

Fixes: 7e019dcc47 ("sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lkml.kernel.org/r/20250210153253.460471-2-gmonaco@redhat.com
2025-02-18 08:50:36 +01:00
Linus Torvalds
ff3b373ecc - Clarify what happens when a task is woken up from the wake queue and make
clear its removal from that queue is atomic
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmexq1MACgkQEsHwGGHe
 VUq6ghAAoD872KGmQ3YioDZs+FKLpLWvo+6lC2rY+GFQ3oCu4TJfmlsiTGLCyzjv
 aYwL52diIORD9/Yfn6Sq/ZWkfncoVmwnht+tgVjXeRr3Pb4EnttgWxPRx/xYQizr
 jgRASpNsRUTr6zNzqEeeQYodIJaInOF5+r26oqYArcN5V9XB9Qaj0+f14UiyB6u+
 53qpEQqQopeLPyG4t59iUfefsaWm2ZIW3EnoWeyI8sRuaapGY/0LHhUAn6vcA++N
 kuUkliVsk+f/uTNZeJ4zv2uy8DpBXO4kTjmwPVwFz46sJ8RL8P7MOLax7e7fNssw
 tylwHt4qoLEoB2vg1yMvlUNFMeH85gj8hTyJMsgGtFTbwCH0kLFEoXUz0lfKMS7U
 A271E1Bumu3OT7FrAnxQahViv02YWG5fcg6R3OidQdSmgoQBIMJwDA2pyKLiq9FL
 7mWoNfEqqBWn4O/1qBlf3jCvfFlzXRSSgVzEruoNB93cgzTaQaN5yVgeekMzwJEj
 NDowmIZRxEN8+lJyMxIGSOGa44aTXu0/+dtehEDeSpsDOXULFc5fpYt4SSa0Jt/F
 LlgnPGkM1vF0ddG5vDJpGw6B9Dhb82i1oYy2IVwOCkLOXSp2kvLDEI3sqgk5mmlH
 zFtAV21Zm4jqBke/aF7r+RCYRvDyQkuW0d1+H9okGWLSk7jKauM=
 =J43o
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:

 - Clarify what happens when a task is woken up from the wake queue and
   make clear its removal from that queue is atomic

* tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Clarify wake_up_q()'s write to task->wake_q.next
2025-02-16 10:38:24 -08:00
Andrea Righi
48849271e6 sched_ext: idle: Per-node idle cpumasks
Using a single global idle mask can lead to inefficiencies and a lot of
stress on the cache coherency protocol on large systems with multiple
NUMA nodes, since all the CPUs can create a really intense read/write
activity on the single global cpumask.

Therefore, split the global cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on large systems.

The concept is that each cpumask will track only the idle CPUs within
its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
In this way concurrent access to the idle cpumask will be restricted
within each NUMA node.

The split of multiple per-node idle cpumasks can be controlled using the
SCX_OPS_BUILTIN_IDLE_PER_NODE flag.

By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global
host-wide idle cpumask is used, maintaining the previous behavior.

NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via
SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will
trigger an scx error, since there are no system-wide cpumasks.

= Test =

Hardware:
 - System: DGX B200
    - CPUs: 224 SMT threads (112 physical cores)
    - Processor: INTEL(R) XEON(R) PLATINUM 8570
    - 2 NUMA nodes

Scheduler:
 - scx_simple [1] (so that we can focus on the built-in idle selection
   policy and not on the scheduling policy itself)

Test:
 - Run a parallel kernel build `make -j $(nproc)` and measure the average
   elapsed time over 10 runs:

          avg time | stdev
          ---------+------
 before:   52.431s | 2.895
  after:   50.342s | 2.895

= Conclusion =

Splitting the global cpumask into multiple per-NUMA cpumasks helped to
achieve a speedup of approximately +4% with this particular architecture
and test case.

The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @
2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.

On smaller systems, I haven't noticed any measurable regressions or
improvements with the same test (parallel kernel build) and scheduler
(scx_simple).

Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs
I observed an additional +2-2.5% performance improvement with the same
test.

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:20 -10:00
Andrea Righi
0aaaf89df8 sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows
BPF schedulers to select between using a global flat idle cpumask or
multiple per-node cpumasks.

This only introduces the flag and the mechanism to enable/disable this
feature without affecting any scheduling behavior.
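
A BPF scheduler opts in through its struct_ops flags, roughly as below
(the scheduler name and the other callbacks are placeholders):

  SEC(".struct_ops.link")
  struct sched_ext_ops foo_ops = {
          .select_cpu     = (void *)foo_select_cpu,
          .enqueue        = (void *)foo_enqueue,
          .flags          = SCX_OPS_BUILTIN_IDLE_PER_NODE,
          .name           = "foo",
  };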

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:20 -10:00
Andrea Righi
d73249f887 sched_ext: idle: Make idle static keys private
Make all the static keys used by the idle CPU selection policy private
to ext_idle.c. This avoids unnecessary exposure in headers and improves
code encapsulation.

Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16 06:52:20 -10:00
Linus Torvalds
04f41cbf03 sched_ext: Fixes for v6.14-rc2
- Fix lock imbalance in a corner case of dispatch_to_local_dsq().
 
- Migration disabled tasks were confusing some BPF schedulers and their
   handling had a bug. Fix it and simplify the default behavior by
   dispatching them automatically.
 
 - ops.tick(), ops.disable() and ops.exit_task() were incorrectly disallowing
   kfuncs that require the task argument to be the task the rq operation is
   currently operating on and thus is rq-locked. Allow them.
 
 - Fix autogroup migration handling bug which was occasionally triggering a
   warning in the cgroup migration path.
 
 - tools/sched_ext, selftest and other misc updates.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ695uA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGeCfAQDmUixMNJCIrRphYsWcYUzlGLZyyRpQEEYFtRMO
 UC266gD+PUV2UvuO5sAVH8HVnGdOqkXaE/IRG+TC7fQH3ruPlgI=
 =LFd1
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix lock imbalance in a corner case of dispatch_to_local_dsq()

 - Migration disabled tasks were confusing some BPF schedulers and their
   handling had a bug. Fix it and simplify the default behavior by
   dispatching them automatically

 - ops.tick(), ops.disable() and ops.exit_task() were incorrectly
   disallowing kfuncs that require the task argument to be the task the
   rq operation is currently operating on and thus is rq-locked.
   Allow them.

 - Fix autogroup migration handling bug which was occasionally
   triggering a warning in the cgroup migration path

 - tools/sched_ext, selftest and other misc updates

* tag 'sched_ext-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx
  sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h.
  sched_ext: selftests: Fix grammar in tests description
  sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()
  sched_ext: Fix migration disabled handling in targeted dispatches
  sched_ext: Implement auto local dispatching of migration disabled tasks
  sched_ext: Fix incorrect time delta calculation in time_delta()
  sched_ext: Fix lock imbalance in dispatch_to_local_dsq()
  sched_ext: selftests/dsp_local_on: Fix selftest on UP systems
  tools/sched_ext: Add helper to check task migration state
  sched_ext: Fix incorrect autogroup migration detection
  sched_ext: selftests/dsp_local_on: Fix sporadic failures
  selftests/sched_ext: Fix enum resolution
  sched_ext: Include task weight in the error state dump
  sched_ext: Fixes typos in comments
2025-02-14 11:14:24 -08:00
Changwoo Min
ad3b301aa0 sched_ext: Provides a sysfs 'events' to expose core event counters
Add a sysfs entry at /sys/kernel/sched_ext/root/events to expose core
event counters through the file system interface. Each line of the file
shows the event name and its counter value.

In addition, the format of scx_dump_event() is adjusted as the event name
gets longer.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14 07:15:41 -10:00
I Hsin Cheng
d34e798094 sched/fair: Refactor can_migrate_task() to eliminate looping
The function "can_migrate_task()" utilize "for_each_cpu_and" with a
"if" statement inside to find the destination cpu. It's the same logic
to find the first set bit of the result of the bitwise-AND of
"env->dst_grpmask", "env->cpus" and "p->cpus_ptr".

Refactor it by using "cpumask_first_and_and()" to perform bitwise-AND
for "env->dst_grpmask", "env->cpus" and "p->cpus_ptr" and pick the
first cpu within the intersection as the destination cpu, so we can
elimate the need of looping and multiple times of branch.
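
Roughly, the change looks like this (a sketch based on the description
above; the surrounding flag handling in can_migrate_task() is omitted):

  /* before: loop over the two-mask intersection, filter by p->cpus_ptr */
  for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
          if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
                  env->new_dst_cpu = cpu;
                  break;
          }
  }

  /* after: a single three-way AND, then take the first set bit */
  cpu = cpumask_first_and_and(env->dst_grpmask, env->cpus, p->cpus_ptr);
  if (cpu < nr_cpu_ids)
          env->new_dst_cpu = cpu;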

After the refactoring, this part of the code speeds up from ~115ns
to ~54ns, according to the test below.

The test was run 5 times and the results are shown in the following
table; the test script is pasted in the next section.

  -------------------------------------------------------
  |Old method|  130|  118|  115|  109|  106|  avg ~115ns|
  -------------------------------------------------------
  |New method|   58|   55|   54|   48|   55|  avg  ~54ns|
  -------------------------------------------------------

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250210103019.283824-1-richard120310@gmail.com
2025-02-14 10:32:01 +01:00
Tianchen Ding
563bc2161b sched/eevdf: Force propagating min_slice of cfs_rq when {en,de}queue tasks
When a task is enqueued and its parent cgroup se is already on_rq, this
parent cgroup se will not be enqueued again, and hence the root->min_slice
remains unchanged. The same issue happens when a task is dequeued and its
parent cgroup se has other runnable entities, and the parent cgroup se
will not be dequeued.

Force propagating min_slice when the se doesn't need to be enqueued or
dequeued. Ensure the se hierarchy always gets the latest min_slice.

Fixes: aef6987d89 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250211063659.7180-1-dtcccc@linux.alibaba.com
2025-02-14 10:32:01 +01:00
Yafang Shao
b9f2b29b94 sched: Don't define sched_clock_irqtime as static key
The sched_clock_irqtime was defined as a static key in commit 8722903cbb
('sched: Define sched_clock_irqtime as static key'). However, this change
introduces a 'sleeping in atomic context' warning, as shown below:

	arch/x86/kernel/tsc.c:1214 mark_tsc_unstable()
	warn: sleeping in atomic context

As analyzed by Dan, the affected code path is as follows:

vcpu_load() <- disables preempt
-> kvm_arch_vcpu_load()
   -> mark_tsc_unstable() <- sleeps

virt/kvm/kvm_main.c
   166  void vcpu_load(struct kvm_vcpu *vcpu)
   167  {
   168          int cpu = get_cpu();
                          ^^^^^^^^^^
This get_cpu() disables preemption.

   169
   170          __this_cpu_write(kvm_running_vcpu, vcpu);
   171          preempt_notifier_register(&vcpu->preempt_notifier);
   172          kvm_arch_vcpu_load(vcpu, cpu);
   173          put_cpu();
   174  }

arch/x86/kvm/x86.c
  4979          if (unlikely(vcpu->cpu != cpu) || kvm_check_tsc_unstable()) {
  4980                  s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
  4981                                  rdtsc() - vcpu->arch.last_host_tsc;
  4982                  if (tsc_delta < 0)
  4983                          mark_tsc_unstable("KVM discovered backwards TSC");

arch/x86/kernel/tsc.c
    1206 void mark_tsc_unstable(char *reason)
    1207 {
    1208         if (tsc_unstable)
    1209                 return;
    1210
    1211         tsc_unstable = 1;
    1212         if (using_native_sched_clock())
    1213                 clear_sched_clock_stable();
--> 1214         disable_sched_clock_irqtime();
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
kernel/jump_label.c
   245  void static_key_disable(struct static_key *key)
   246  {
   247          cpus_read_lock();
                ^^^^^^^^^^^^^^^^
This lock has a might_sleep() in it which triggers the static checker
warning.

   248          static_key_disable_cpuslocked(key);
   249          cpus_read_unlock();
   250  }

Let's revert this change for now, as {disable,enable}_sched_clock_irqtime
are used in many places, as pointed out by Sean, including the following:

The code path in clocksource_watchdog():

  clocksource_watchdog()
  |
  -> spin_lock(&watchdog_lock);
     |
     -> __clocksource_unstable()
        |
        -> clocksource.mark_unstable() == tsc_cs_mark_unstable()
           |
           -> disable_sched_clock_irqtime()

And the code path in sched_clock_register():

	/* Cannot register a sched_clock with interrupts on */
	local_irq_save(flags);

	...

	/* Enable IRQ time accounting if we have a fast enough sched_clock() */
	if (irqtime > 0 || (irqtime == -1 && rate >= 1000000))
		enable_sched_clock_irqtime();

	local_irq_restore(flags);

[lkp@intel.com: reported a build error in the prev version]

Closes: https://lore.kernel.org/kvm/37a79ba3-9ce0-479c-a5b0-2bd75d573ed3@stanley.mountain/
Fixes: 8722903cbb ("sched: Define sched_clock_irqtime as static key")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Debugged-by: Dan Carpenter <dan.carpenter@linaro.org>
Debugged-by: Sean Christopherson <seanjc@google.com>
Debugged-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250205032438.14668-1-laoar.shao@gmail.com
2025-02-14 10:32:00 +01:00
zihan zhou
2ae891b826 sched: Reduce the default slice to avoid tasks getting an extra tick
The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which
means that we have a default slice of:

  0.75 for 1 cpu
  1.50 up to 3 cpus
  2.25 up to 7 cpus
  3.00 for 8 cpus and above.

For HZ=250 and HZ=100, because of the tick accuracy, the runtime of
tasks is far higher than their slice.

For HZ=1000 with 8 cpus or more, the accuracy of tick is already
satisfactory, but there is still an issue that tasks will get an extra
tick because the tick often arrives a little faster than expected. In
this case, the task can only wait until the next tick to consider that it
has reached its deadline, and will run 1ms longer.

vruntime + sysctl_sched_base_slice =     deadline
        |-----------|-----------|-----------|-----------|
             1ms          1ms         1ms         1ms
                   ^           ^           ^           ^
                 tick1       tick2       tick3       tick4(nearly 4ms)

There are two reasons for tick error: clockevent precision and
CONFIG_IRQ_TIME_ACCOUNTING/CONFIG_PARAVIRT_TIME_ACCOUNTING. With
CONFIG_IRQ_TIME_ACCOUNTING every tick will be less than 1ms, but even
without it, because of clockevent precision, the tick is still often less
than 1ms.

In order to make scheduling more precise, we changed 0.75 to 0.70.
Using 0.70 instead of 0.75 should not change much for other configs
and would fix this issue:

  0.70 for 1 cpu
  1.40 up to 3 cpus
  2.10 up to 7 cpus
  2.80 for 8 cpus and above.

This does not guarantee that tasks can run the slice time accurately
every time, but occasionally running an extra tick has little impact.

Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250208075322.13139-1-15645113830zzh@gmail.com
2025-02-14 10:32:00 +01:00
zihan zhou
f553741ac8 sched: Cancel the slice protection of the idle entity
A waking non-idle entity should preempt an idle entity at any time,
but because of the slice protection of the idle entity, the non-idle
entity has to wait, so just cancel that protection.

This patch is aimed at minimizing the impact of SCHED_IDLE on
SCHED_NORMAL. For example, a task with SCHED_IDLE policy that sleeps for
1s and then runs for 3 ms, running cyclictest on the same cpu, has a
maximum latency of 3 ms, which is caused by the slice protection of the
idle entity. This is unreasonable. With this patch, the cyclictest latency
under the same conditions is basically the same on a cpu with idle
processes and on an empty cpu.

[peterz: add helpers]
Fixes: 63304558ba ("sched/eevdf: Curb wakeup-preemption")
Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250208080850.16300-1-15645113830zzh@gmail.com
2025-02-14 10:32:00 +01:00
Tejun Heo
3539c6411a sched_ext: Implement SCX_OPS_ALLOW_QUEUED_WAKEUP
A task wakeup can be either processed on the waker's CPU or bounced to the
wakee's previous CPU using an IPI (ttwu_queue). Bouncing to the wakee's CPU
avoids the waker's CPU locking and accessing the wakee's rq which can be
expensive across cache and node boundaries.

When the ttwu_queue path is taken, select_task_rq() and thus ops.select_cpu()
may be skipped in some cases (racing against the wakee switching out).
Because this confused some BPF schedulers, there wasn't a good way for a BPF
scheduler to tell whether idle CPU selection had been skipped, ops.enqueue()
couldn't insert tasks into foreign local DSQs, and the performance
difference on machines with simple topologies was minimal, sched_ext
disabled ttwu_queue.

However, this optimization makes a noticeable difference on more complex
topologies and a BPF scheduler now has an easy way to tell whether
ops.select_cpu() was skipped since 9b671793c7 ("sched_ext, scx_qmap: Add
and use SCX_ENQ_CPU_SELECTED") and can insert tasks into foreign local DSQs
since 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct
dispatches").

Implement SCX_OPS_ALLOW_QUEUED_WAKEUP which allows BPF schedulers to choose
to enable ttwu_queue optimization.

v2: Update the patch description and comment re. ops.select_cpu() being
    skipped in some cases as opposed to always as per Neel.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Neel Natu <neelnatu@google.com>
Reported-by: Barret Rhoden <brho@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-13 07:30:46 -10:00
Chuyi Zhou
f5717c93a1 sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx
Now when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup of
the current task, the following error will occur:

scx_foo[3795244] triggered exit kind 1024:
  runtime error (called on a task not being operated on)

The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK()
when calling ops.tick(), which triggers the error during the subsequent
scx_kf_allowed_on_arg_tasks() check.

SCX_CALL_OP_TASK() was first introduced in commit 36454023f5 ("sched_ext:
Track tasks that are subjects of the in-flight SCX operation") to ensure
task's rq lock is held when accessing task's sched_group. Since ops.tick()
is marked as SCX_KF_TERMINAL and task_tick_scx() is protected by the rq
lock, we can use SCX_CALL_OP_TASK() to avoid the above issue. Similarly,
the same changes should be made for ops.disable() and ops.exit_task(), as
they are also protected by task_rq_lock() and it's safe to access the
task's task_group.

Fixes: 36454023f5 ("sched_ext: Track tasks that are subjects of the in-flight SCX operation")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-13 06:57:33 -10:00
Tejun Heo
78e4690de4 Merge branch 'for-6.14-fixes' into for-6.15
Pull to receive f3f08c3acf ("sched_ext: Fix incorrect assumption about
migration disabled tasks in task_can_run_on_remote_rq()") which conflicts
with 26176116d9 ("sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in
the right spot") in for-6.15.
2025-02-10 10:45:43 -10:00
Tejun Heo
f3f08c3acf sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()
While fixing migration disabled task handling, 3296682157 ("sched_ext: Fix
migration disabled handling in targeted dispatches") assumed that a
migration disabled task's ->cpus_ptr would only have the pinned CPU. While
this is eventually true for migration disabled tasks that are switched out,
->cpus_ptr update is performed by migrate_disable_switch() which is called
right before context_switch() in __schedule(). However, the task is
enqueued earlier during pick_next_task() via put_prev_task_scx(), so there
is a race window where another CPU can see the task on a DSQ.

If the CPU tries to dispatch the migration disabled task while in that
window, task_allowed_on_cpu() will succeed and task_can_run_on_remote_rq()
will subsequently trigger SCHED_WARN(is_migration_disabled()).

  WARNING: CPU: 8 PID: 1837 at kernel/sched/ext.c:2466 task_can_run_on_remote_rq+0x12e/0x140
  Sched_ext: layered (enabled+all), task: runnable_at=-10ms
  RIP: 0010:task_can_run_on_remote_rq+0x12e/0x140
  ...
   <TASK>
   consume_dispatch_q+0xab/0x220
   scx_bpf_dsq_move_to_local+0x58/0xd0
   bpf_prog_84dd17b0654b6cf0_layered_dispatch+0x290/0x1cfa
   bpf__sched_ext_ops_dispatch+0x4b/0xab
   balance_one+0x1fe/0x3b0
   balance_scx+0x61/0x1d0
   prev_balance+0x46/0xc0
   __pick_next_task+0x73/0x1c0
   __schedule+0x206/0x1730
   schedule+0x3a/0x160
   __do_sys_sched_yield+0xe/0x20
   do_syscall_64+0xbb/0x1e0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fix it by converting the SCHED_WARN() back to a regular failure path. Also,
perform the migration disabled test before task_allowed_on_cpu() test so
that BPF schedulers which fail to handle migration disabled tasks can be
noticed easily.

While at it, adjust scx_ops_error() message for !task_allowed_on_cpu() case
for brevity and consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")
Acked-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Jake Hillion <jakehillion@meta.com>
2025-02-10 10:40:47 -10:00
Li RongQing
1a4e0d8682 sched_ext: Take NUMA node into account when allocating per-CPU cpumasks
per-CPU cpumasks are dominantly accessed from their own local CPUs,
so allocate them node-local to improve performance.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-10 07:20:13 -10:00
Tejun Heo
eace54dff0 sched_ext: Add SCX_EV_ENQ_SKIP_MIGRATION_DISABLED
Count the number of times a migration disabled task is automatically
dispatched to its local DSQ.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08 20:39:11 -10:00
Tejun Heo
26176116d9 sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in the right spot
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE wasn't quite right in two aspects:

- It counted both migration disabled and offline events.

- It didn't count events from scx_bpf_dsq_move() path.

Fix it by moving the counting into task_can_run_on_remote_rq() which is
shared by both paths and can distinguish the different rejection conditions.
The argument @trigger_error is renamed to @enforce as it now does more than
just triggering error.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08 20:39:11 -10:00
Tejun Heo
29ef4a2fcf Merge branch 'for-6.14-fixes' into for-6.15
Pull to receive:

- 2fa0fbeb69 ("sched_ext: Implement auto local dispatching of migration disabled tasks")
- 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")

as planned for-6.15 changes depend on them (e.g. adding event counter for
implicit migration disabled task handling).
2025-02-08 20:34:43 -10:00
Tejun Heo
3296682157 sched_ext: Fix migration disabled handling in targeted dispatches
A dispatch operation that can target a specific local DSQ -
scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task
can be migrated to the target CPU using task_can_run_on_remote_rq(). If the
task can't be migrated to the targeted CPU, it is bounced through a global
DSQ.

task_can_run_on_remote_rq() assumes that the task is on a CPU that's
different from the targeted CPU but the callers don't uphold the
assumption and may call the function when the task is already on the target
CPU. When such a task has migration disabled, task_can_run_on_remote_rq()
ends up incorrectly returning %false, unnecessarily bouncing the task to a
global DSQ.

Fix it by updating the callers to only call task_can_run_on_remote_rq() when
the task is on a different CPU than the target CPU. As this is a bit subtle,
for clarity and documentation:

- Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on
  the same CPU as the target CPU.

- is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger
  if the task is on a different CPU than the target CPU as the preceding
  task_allowed_on_cpu() test should fail beforehand. Convert the test into
  SCHED_WARN_ON().

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Fixes: 0366017e09 ("sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()")
Cc: stable@vger.kernel.org # v6.12+
2025-02-08 20:33:25 -10:00
Tejun Heo
2fa0fbeb69 sched_ext: Implement auto local dispatching of migration disabled tasks
Migration disabled tasks are special and pinned to their previous CPUs. They
tripped up some unsuspecting BPF schedulers as their ->nr_cpus_allowed may
not agree with the bits set in ->cpus_ptr. Make it easier for BPF schedulers
by automatically dispatching them to the pinned local DSQs by default. If a
BPF scheduler wants to handle migration disabled tasks explicitly, it can
set SCX_OPS_ENQ_MIGRATION_DISABLED.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-08 20:32:54 -10:00
Linus Torvalds
c7b92e8969 Fix a cfs_rq->h_nr_runnable accounting bug that trips up a
defensive SCHED_WARN_ON() on certain workloads. The bug is
 believed to be (accidentally) self-correcting, hence no
 behavioral side effects are expected.
 
 Also print se.slice in debug output, since this value can now be
 set via the syscall ABI and can be useful to track.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmenIcsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iyZA//ZpzGiiZ8coXBk4PQ77c0+BOgSGdkbWeR
 zicKvqWd+j9skOTVrIk8MPs7D3C5uNuDuVAeNWYalBHRO7ndLfCg36pHqR8tQ3xN
 2GwziJzKpi1r4WXBSCtFe/abKrMhIsMYiqKDQz3Ry2Zfn+TuY4t4bzEYgfm481mO
 FkLcXs97RI5sFsCBI0uhRu76A9jrNRZKW2zczHTb2MS4zjwGq1yL3vkx2M11d0Su
 BqrcPjz6mZFuzCrA6pW0QeSP4Mn5xL8c2seoMYnvFrPYzq47Thxm4nOWPURCN9uX
 6MmqNXisJjTe00noswhbcUia1n2cr37uu5d21tXhn7Rf2XTO4+WRE7zjzvxLKQCr
 DtcH1XqvnChPGPFTsSlso4vbygfEZIk14DWZ71mXQJto6t06A/yjFiAR82ZIKzLA
 ZlK6hnX2rQ8WST3XSixiRPU/5xDu3ptJ0aDRV/dScCMzb7c0Y2YSeqA4xBMcZDPq
 A1TrNUsnsvH0TJXXqeslDSxxEFUKHNiOI2itzMCxAteEZXPysa8Rvcl4zFOZ5A6F
 nb8hDN94dydNnPpf4g+ZIVEHs5ES3Yeexh1dGwCjQSp72qAJK3wkHQ6KfPN/T3Gb
 PbDVw93cYJI/fhB2BX9DcnN+uNLTrqr1kfY7uFBE1feMdnQNBHILdjbByILfQAQ+
 0r7+aYwn0F0=
 =5mot
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Fix a cfs_rq->h_nr_runnable accounting bug that trips up a defensive
  SCHED_WARN_ON() on certain workloads. The bug is believed to be
  (accidentally) self-correcting, hence no behavioral side effects are
  expected.

  Also print se.slice in debug output, since this value can now be set
  via the syscall ABI and can be useful to track"

* tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Provide slice length for fair tasks
  sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
2025-02-08 11:16:22 -08:00
Jann Horn
bcc6244e13 sched: Clarify wake_up_q()'s write to task->wake_q.next
Clarify that wake_up_q() does an atomic write to task->wake_q.next, after
which a concurrent __wake_q_add() can immediately overwrite
task->wake_q.next again.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250129-sched-wakeup-prettier-v1-1-2f51f5f663fa@google.com
2025-02-08 15:43:12 +01:00
Changwoo Min
6d3f8fb4b2 sched_ext: Add an event, SCX_EV_ENQ_SLICE_DFL
Add a core event, SCX_EV_ENQ_SLICE_DFL, which represents how many
tasks have been enqueued (or pick_task-ed or select_cpu-ed) with
a default time slice (SCX_SLICE_DFL).

Unintentionally scheduling a task with SCX_SLICE_DFL can be a source
of latency spikes because SCX_SLICE_DFL is relatively long (20 msec).
Thus, a soaring SCX_EV_ENQ_SLICE_DFL value can be a sign of BPF
scheduler bugs that cause latency spikes, especially when ops.select_cpu()
is provided.

__scx_add_event() is used since the caller holds an rq lock or p->pi_lock,
so preemption is already disabled.
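
A compact user-space model of what the counter tracks (SCX_SLICE_DFL and the
event name come from this commit; the refill helper is illustrative):

    #include <stdio.h>

    #define SCX_SLICE_DFL   (20ULL * 1000 * 1000)   /* 20 msec in nsecs */

    static unsigned long ev_enq_slice_dfl;

    struct task { unsigned long long slice; };

    /* Model: whenever a task is queued without a slice, fall back to the
     * default and count it, so a soaring counter points at scheduler bugs. */
    static void refill_slice_if_needed(struct task *p)
    {
            if (!p->slice) {
                    p->slice = SCX_SLICE_DFL;
                    ev_enq_slice_dfl++;     /* __scx_add_event(..., 1) in the kernel */
            }
    }

    int main(void)
    {
            struct task p = { .slice = 0 };

            refill_slice_if_needed(&p);
            printf("SCX_EV_ENQ_SLICE_DFL: %lu\n", ev_enq_slice_dfl);
            return 0;
    }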

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-07 11:24:59 -10:00
Ankur Arora
2c00e1199c sched: update __cond_resched comment about RCU quiescent states
Update comment in __cond_resched() clarifying how urgently needed
quiescent state are provided.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-02-05 07:01:55 -08:00
Changwoo Min
d46457c31c sched_ext: Add an event, SCX_EV_BYPASS_DURATION
Add a core event, SCX_EV_BYPASS_DURATION, which represents the
total duration of bypass modes in nanoseconds.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
5c605cd33c sched_ext: Add an event, SCX_EV_BYPASS_DISPATCH
Add a core event, SCX_EV_BYPASS_DISPATCH, which represents how many
tasks have been dispatched in the bypass mode.

__scx_add_event() is used since the caller holds an rq lock or
p->pi_lock, so preemption is already disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
4f7a38c7c9 sched_ext: Add an event, SCX_EV_BYPASS_ACTIVATE
Add a core event, SCX_EV_BYPASS_ACTIVATE, which represents how many
times the bypass mode has been triggered.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
824d4f2dce sched_ext: Add an event, SCX_EV_ENQ_SKIP_EXITING
Add a core event, SCX_EV_ENQ_SKIP_EXITING, which represents how many
times an exiting task is enqueued to a local DSQ because
SCX_OPS_ENQ_EXITING is not set.

__scx_add_event() is used since the caller holds an rq lock,
so preemption is already disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04 10:36:47 -10:00
Changwoo Min
7125660bc1 sched_ext: Add an event, SCX_EV_DISPATCH_KEEP_LAST
Add a core event, SCX_EV_DISPATCH_KEEP_LAST, which represents how many
times a task keeps running without going through ops.enqueue() when
SCX_OPS_ENQ_LAST is not set.

__scx_add_event() is used since the caller holds an rq lock,
so preemption is already disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:23:18 -10:00
Changwoo Min
9be0a1b0c8 sched_ext: Add an event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE
Add a core event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which represents how
many times a BPF scheduler tries to dispatch to an offlined local DSQ.

__scx_add_event() is used since the caller holds an rq lock,
so preemption is already disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:23:18 -10:00
Changwoo Min
f7f6142107 sched_ext: Add an event, SCX_EV_SELECT_CPU_FALLBACK
Add a core event, SCX_EV_SELECT_CPU_FALLBACK, which represents how many times
ops.select_cpu() returns a CPU that the task can't use.

__scx_add_event() is used since the caller holds an rq lock,
so preemption is already disabled.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:23:18 -10:00
Changwoo Min
17103b8504 sched_ext: Implement event counter infrastructure
Collect the statistics of specific types of behavior in the sched_ext core,
which are not easily visible but still interesting to an scx scheduler.

An event type is defined in 'struct scx_event_stats.' When an event occurs,
its counter is accumulated using 'scx_add_event()' and '__scx_add_event()'
to per-CPU 'struct scx_event_stats' for efficiency. 'scx_bpf_events()'
aggregates all the per-CPU counters and exposes system-wide counters.

For convenience and readability of the code, 'scx_agg_event()' and
'scx_dump_event()' are provided.

The collected events can be observed after a BPF scheduler is unloaded and
before a new BPF scheduler is loaded, at which point the per-CPU
'struct scx_event_stats' are reset.
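
A minimal user-space model of the per-CPU accumulation and aggregation
described above (the struct name echoes this commit; fields and helper
names are illustrative stand-ins, not the kernel implementation):

    #include <stdio.h>

    #define NR_CPUS 4

    /* Loosely modeled after 'struct scx_event_stats'; fields are illustrative. */
    struct scx_event_stats {
            unsigned long select_cpu_fallback;
            unsigned long enq_slice_dfl;
    };

    static struct scx_event_stats pcpu_events[NR_CPUS];

    /* Stand-in for __scx_add_event(): bump one CPU's counter, no locking. */
    #define scx_add_event_on(cpu, field, cnt) (pcpu_events[cpu].field += (cnt))

    /* Stand-in for the scx_bpf_events() aggregation: fold per-CPU counters. */
    static void aggregate_events(struct scx_event_stats *out)
    {
            *out = (struct scx_event_stats){ 0 };
            for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                    out->select_cpu_fallback += pcpu_events[cpu].select_cpu_fallback;
                    out->enq_slice_dfl += pcpu_events[cpu].enq_slice_dfl;
            }
    }

    int main(void)
    {
            struct scx_event_stats sum;

            scx_add_event_on(0, enq_slice_dfl, 1);
            scx_add_event_on(2, enq_slice_dfl, 3);

            aggregate_events(&sum);
            printf("ENQ_SLICE_DFL system-wide: %lu\n", sum.enq_slice_dfl);
            return 0;
    }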

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02 07:23:18 -10:00
Christian Loehle
9065ce6975 sched/debug: Provide slice length for fair tasks
Since commit:

  857b158dc5 ("sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion")

... we have the userspace per-task tunable slice length, which is
a key parameter that is otherwise difficult to obtain, so provide
it in /proc/$PID/sched.
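
For example, a trivial user-space reader can pull the slice-related lines
out of /proc/self/sched (the exact field naming may vary by kernel version):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/self/sched", "r");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            while (fgets(line, sizeof(line), f))
                    if (strstr(line, "slice"))      /* e.g. the new se.slice field */
                            fputs(line, stdout);
            fclose(f);
            return 0;
    }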

[ mingo: Clarified the changelog. ]

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/453349b1-1637-42f5-a7b2-2385392b5956@arm.com
2025-01-31 10:45:33 +01:00
Linus Torvalds
f55b0671e3 More power management updates for 6.14-rc1
- Add missing error handling for syscore_suspend() to the hibernation
    core code (Wentao Liang).
 
  - Revert a commit that added unused macros (Andy Shevchenko).
 
  - Synchronize the runtime PM status of devices that were runtime-
    suspended before a system-wide suspend and need to be resumed during
    the subsequent system-wide resume transition (Rafael Wysocki).
 
  - Clean up the teo cpuidle governor and make the handling of short idle
    intervals in it consistent regardless of the properties of idle
    states supplied by the cpuidle driver (Rafael Wysocki).
 
  - Fix some boost-related issues in cpufreq (Lifeng Zheng).
 
  - Fix build issues in the s3c64xx and airoha cpufreq drivers (Viresh
    Kumar).
 
  - Remove unconditional binding of schedutil governor kthreads to the
    affected CPUs if the cpufreq driver indicates that updates can happen
    from any CPU (Christian Loehle).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmeb5xYSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxcFsP/2FIoEI2G6J7pk8zChWT225qkkaieh5P
 tHIkcFINlgzyjLnqmyWELUdt+sB7re6/dMmoLor+abudHimvBvUfAj6Oiz1F3p2F
 utE9TpfhOkXi1ci5zBl9h6+iDj2Z5op3Qe/qw/W3DTlcManAD+6r60A2tOEy0jhi
 GTbp2SEEU28+LU/2J59IfxEuRTTH4pbQGXi+iKv/k9bmtLvQofa1saXyQCBSZrvO
 z3MBdqnAxLeZCg/qILmEGsBvbv1wpugvp3yoMLVwGNyul12Augcs8PreQz7e5tFq
 spEuCfpBJwyJLAGlOnjOYgsPbJBXWRkIBeLH7JealfZr9TX0y4LZSAHi/xe0Asd3
 BBZLLDojxhYMLzmqSkuafHlQd5J7jKl++RS1A9Qm6aqglKjeSOC9Ca9fmrc9Ub9P
 Jpf1SVJ3kJsv1Z7wuUcaj6oLxD8wlAgCo5pigNgWTP2HllhP2bmc22M8JWxpAz+m
 nMeW8nr8bAu4XViZeb74YKGgUDngO/uKRwBpthSkE2fqM7q0Wr5E7G0u9M91mzG6
 nd/XDwta5TeznMQpSy339NgT61i4HHfyc/SDIdpkBxI0C5l6jknNawq79i9gy7/E
 In4MyooOlls/iX/JR0uxx2hXEByltF0IHqwsRYeJ6dmIajYgASXR1Hhh5Iy9fPJ/
 JTJ7vR5oZPB/
 =EgsM
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "These are mostly fixes on top of the previously merged power
  management material with the addition of some teo cpuidle governor
  updates, some of which may also be regarded as fixes:

   - Add missing error handling for syscore_suspend() to the hibernation
     core code (Wentao Liang)

   - Revert a commit that added unused macros (Andy Shevchenko)

   - Synchronize the runtime PM status of devices that were runtime-
     suspended before a system-wide suspend and need to be resumed
     during the subsequent system-wide resume transition (Rafael
     Wysocki)

   - Clean up the teo cpuidle governor and make the handling of short
     idle intervals in it consistent regardless of the properties of
     idle states supplied by the cpuidle driver (Rafael Wysocki)

   - Fix some boost-related issues in cpufreq (Lifeng Zheng)

   - Fix build issues in the s3c64xx and airoha cpufreq drivers (Viresh
     Kumar)

   - Remove unconditional binding of schedutil governor kthreads to the
     affected CPUs if the cpufreq driver indicates that updates can
     happen from any CPU (Christian Loehle)"

* tag 'pm-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: sleep: core: Synchronize runtime PM status of parents and children
  cpufreq: airoha: Depends on OF
  PM: Revert "Add EXPORT macros for exporting PM functions"
  PM: hibernate: Add error handling for syscore_suspend()
  cpufreq/schedutil: Only bind threads if needed
  cpufreq: ACPI: Remove set_boost in acpi_cpufreq_cpu_init()
  cpufreq: CPPC: Fix wrong max_freq in policy initialization
  cpufreq: Introduce a more generic way to set default per-policy boost flag
  cpufreq: Fix re-boost issue after hotplugging a CPU
  cpufreq: s3c64xx: Fix compilation warning
  cpuidle: teo: Skip sleep length computation for low latency constraints
  cpuidle: teo: Replace time_span_ns with a flag
  cpuidle: teo: Simplify handling of total events count
  cpuidle: teo: Skip getting the sleep length if wakeups are very frequent
  cpuidle: teo: Simplify counting events used for tick management
  cpuidle: teo: Clarify two code comments
  cpuidle: teo: Drop local variable prev_intercept_idx
  cpuidle: teo: Combine candidate state index checks against 0
  cpuidle: teo: Reorder candidate state index checks
  cpuidle: teo: Rearrange idle state lookup code
2025-01-30 15:10:34 -08:00
Joel Granados
1751f872cc treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.

Created this by running an spatch followed by a sed command:
Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const
    struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
      -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
      kernel/utsname_sysctl.c
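
For reference, the resulting pattern looks roughly like this hedged example
(not taken from the tree; 'example_knob' and the registration path are made
up for illustration):

    #include <linux/sysctl.h>

    static int example_knob;

    /* With const proc_handlers (78eb4ea25c) and a const array, the whole
     * table can live in .rodata.  Registration is unchanged, e.g.
     * register_sysctl("kernel", example_table). */
    static const struct ctl_table example_table[] = {
            {
                    .procname       = "example_knob",
                    .data           = &example_knob,
                    .maxlen         = sizeof(int),
                    .mode           = 0644,
                    .proc_handler   = proc_dointvec,
            },
    };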

Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-28 13:48:37 +01:00
Andrea Righi
337d1b354a sched_ext: Move built-in idle CPU selection policy to a separate file
As ext.c is becoming quite large, move the idle CPU selection policy to
separate files (ext_idle.c / ext_idle.h) for better code readability.

Moreover, group together all the idle CPU selection kfuncs into the same
btf_kfunc_id_set block.

No functional changes, this is purely code reorganization.

Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 12:43:43 -10:00
Andrea Righi
1626e5ef0b sched_ext: Fix lock imbalance in dispatch_to_local_dsq()
While performing the rq locking dance in dispatch_to_local_dsq(), we may
trigger the following lock imbalance condition, in particular when
multiple tasks are rapidly changing CPU affinity (i.e., running a
`stress-ng --race-sched 0`):

[   13.413579] =====================================
[   13.413660] WARNING: bad unlock balance detected!
[   13.413729] 6.13.0-virtme #15 Not tainted
[   13.413792] -------------------------------------
[   13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at:
[   13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0
[   13.414111] but there are no more locks to release!
[   13.414176]
[   13.414176] other info that might help us debug this:
[   13.414258] 1 lock held by kworker/1:1/80:
[   13.414318]  #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[   13.414612]
[   13.414612] stack backtrace:
[   13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
[   13.415505] Workqueue:  0x0 (events)
[   13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
[   13.415570] Call Trace:
[   13.415700]  <TASK>
[   13.415744]  dump_stack_lvl+0x78/0xe0
[   13.415806]  ? dispatch_to_local_dsq+0x108/0x1a0
[   13.415884]  print_unlock_imbalance_bug+0x11b/0x130
[   13.415965]  ? dispatch_to_local_dsq+0x108/0x1a0
[   13.416226]  lock_release+0x231/0x2c0
[   13.416326]  _raw_spin_unlock+0x1b/0x40
[   13.416422]  dispatch_to_local_dsq+0x108/0x1a0
[   13.416554]  flush_dispatch_buf+0x199/0x1d0
[   13.416652]  balance_one+0x194/0x370
[   13.416751]  balance_scx+0x61/0x1e0
[   13.416848]  prev_balance+0x43/0xb0
[   13.416947]  __pick_next_task+0x6b/0x1b0
[   13.417052]  __schedule+0x20d/0x1740

This happens because dispatch_to_local_dsq() is racing with
dispatch_dequeue() and, when the latter wins, we incorrectly assume that
the task has been moved to dst_rq.

Fix by properly tracking the currently locked rq.
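
The shape of the fix, as a small self-contained pthread model (illustrative
only; the kernel uses raw rq spinlocks and its own lock-dance helpers, not
mutexes):

    #include <pthread.h>
    #include <stdio.h>

    struct rq { pthread_mutex_t lock; };

    /* Toy model: remember which rq lock is actually held so the exit path
     * unlocks that one, even if a racing dequeue changed our plans. */
    static void dispatch_to_local_dsq(struct rq *src, struct rq *dst, int moved_to_dst)
    {
            struct rq *locked;

            pthread_mutex_lock(&src->lock);
            locked = src;

            if (moved_to_dst) {                     /* the rq "lock dance" */
                    pthread_mutex_unlock(&src->lock);
                    pthread_mutex_lock(&dst->lock);
                    locked = dst;
            }

            /* ... dispatch work under 'locked' ... */

            pthread_mutex_unlock(&locked->lock);    /* never unlock an rq we don't hold */
    }

    int main(void)
    {
            struct rq a = { .lock = PTHREAD_MUTEX_INITIALIZER };
            struct rq b = { .lock = PTHREAD_MUTEX_INITIALIZER };

            dispatch_to_local_dsq(&a, &b, 1);
            dispatch_to_local_dsq(&a, &b, 0);
            puts("balanced");
            return 0;
    }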

Fixes: 4d3ca89bdd ("sched_ext: Refactor consume_remote_task()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 12:41:12 -10:00
Tejun Heo
d6f3e7d564 sched_ext: Fix incorrect autogroup migration detection
scx_move_task() is called from sched_move_task() and tells the BPF scheduler
that cgroup migration is being committed. sched_move_task() is used by both
cgroup and autogroup migrations and scx_move_task() tried to filter out
autogroup migrations by testing the destination cgroup and PF_EXITING but
this is not enough. In fact, without explicitly tagging the thread which is
doing the cgroup migration, there is no good way to tell apart
scx_move_task() invocations for racing migration to the root cgroup and an
autogroup migration.

This led to scx_move_task() incorrectly ignoring a migration from non-root
cgroup to an autogroup of the root cgroup triggering the following warning:

  WARNING: CPU: 7 PID: 1 at kernel/sched/ext.c:3725 scx_cgroup_can_attach+0x196/0x340
  ...
  Call Trace:
  <TASK>
    cgroup_migrate_execute+0x5b1/0x700
    cgroup_attach_task+0x296/0x400
    __cgroup_procs_write+0x128/0x140
    cgroup_procs_write+0x17/0x30
    kernfs_fop_write_iter+0x141/0x1f0
    vfs_write+0x31d/0x4a0
    __x64_sys_write+0x72/0xf0
    do_syscall_64+0x82/0x160
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix it by adding an argument to sched_move_task() that indicates whether the
moving is for a cgroup or autogroup migration. After the change,
scx_move_task() is called only for cgroup migrations and renamed to
scx_cgroup_move_task().
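
A toy model of the disambiguation (only sched_move_task() and
scx_cgroup_move_task() are real names here; the argument name and the
callers are illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    static void scx_cgroup_move_task(void)
    {
            printf("BPF scheduler notified of a cgroup migration\n");
    }

    /* The caller now states whether this is an autogroup move instead of
     * scx trying to guess it from the destination cgroup and PF_EXITING. */
    static void sched_move_task(bool for_autogroup)
    {
            /* ... common cgroup/autogroup accounting ... */
            if (!for_autogroup)
                    scx_cgroup_move_task();
    }

    int main(void)
    {
            sched_move_task(true);   /* autogroup path: scx stays out of it */
            sched_move_task(false);  /* cgroup path: scx gets told */
            return 0;
    }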

Link: https://github.com/sched-ext/scx/issues/370
Fixes: 8195136669 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-27 08:31:50 -10:00
Linus Torvalds
9c5968db9e The various patchsets are summarized below. Plus of course many
individual patches which are described in their changelogs.
 
 - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the
   page allocator so we end up with the ability to allocate and free
   zero-refcount pages.  So that callers (ie, slab) can avoid a refcount
   inc & dec.
 
 - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use
   large folios other than PMD-sized ones.
 
 - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and
   fixes for this small built-in kernel selftest.
 
 - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of
   the mapletree code.
 
 - "mm: fix format issues and param types" from Keren Sun implements a
   few minor code cleanups.
 
 - "simplify split calculation" from Wei Yang provides a few fixes and a
   test for the mapletree code.
 
 - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes
   continues the work of moving vma-related code into the (relatively) new
   mm/vma.c.
 
 - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
   Hildenbrand cleans up and rationalizes handling of gfp flags in the page
   allocator.
 
 - "readahead: Reintroduce fix for improper RA window sizing" from Jan
   Kara is a second attempt at fixing a readahead window sizing issue.  It
   should reduce the amount of unnecessary reading.
 
 - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
   addresses an issue where "huge" amounts of pte pagetables are
   accumulated
   (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/).
   Qi's series addresses this windup by synchronously freeing PTE memory
   within the context of madvise(MADV_DONTNEED).
 
 - "selftest/mm: Remove warnings found by adding compiler flags" from
   Muhammad Usama Anjum fixes some build warnings in the selftests code
   when optional compiler warnings are enabled.
 
 - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David
   Hildenbrand tightens the allocator's observance of __GFP_HARDWALL.
 
 - "pkeys kselftests improvements" from Kevin Brodsky implements various
   fixes and cleanups in the MM selftests code, mainly pertaining to the
   pkeys tests.
 
 - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
   estimate application working set size.
 
 - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
   provides some cleanups to memcg's hugetlb charging logic.
 
 - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
   removes the global swap cgroup lock.  A speedup of 10% for a tmpfs-based
   kernel build was demonstrated.
 
 - "zram: split page type read/write handling" from Sergey Senozhatsky
   has several fixes and cleanups for zram in the area of zram_write_page().
   A watchdog softlockup warning was eliminated.
 
 - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky
   cleans up the pagetable destructor implementations.  A rare
   use-after-free race is fixed.
 
 - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
   simplifies and cleans up the debugging code in the VMA merging logic.
 
 - "Account page tables at all levels" from Kevin Brodsky cleans up and
   regularizes the pagetable ctor/dtor handling.  This results in
   improvements in accounting accuracy.
 
 - "mm/damon: replace most damon_callback usages in sysfs with new core
   functions" from SeongJae Park cleans up and generalizes DAMON's sysfs
   file interface logic.
 
 - "mm/damon: enable page level properties based monitoring" from
   SeongJae Park increases the amount of information which is presented in
   response to DAMOS actions.
 
 - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes
   DAMON's long-deprecated debugfs interfaces.  Thus the migration to sysfs
   is completed.
 
 - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter
   Xu cleans up and generalizes the hugetlb reservation accounting.
 
 - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
   removes a never-used feature of the alloc_pages_bulk() interface.
 
 - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
   extends DAMOS filters to support not only exclusion (rejecting), but
   also inclusion (allowing) behavior.
 
 - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
   "introduces a new memory descriptor for zswap.zpool that currently
   overlaps with struct page for now.  This is part of the effort to reduce
   the size of struct page and to enable dynamic allocation of memory
   descriptors."
 
 - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and
   simplifies the swap allocator locking.  A speedup of 400% was
   demonstrated for one workload.  As was a 35% reduction for kernel build
   time with swap-on-zram.
 
 - "mm: update mips to use do_mmap(), make mmap_region() internal" from
   Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
   mmap_region() can be made MM-internal.
 
 - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU
   regressions and otherwise improves MGLRU performance.
 
 - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park
   updates DAMON documentation.
 
 - "Cleanup for memfd_create()" from Isaac Manjarres does that thing.
 
 - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand
   provides various cleanups in the areas of hugetlb folios, THP folios and
   migration.
 
 - "Uncached buffered IO" from Jens Axboe implements the new
   RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache
   reading and writing.  To permit userspace to address issues with
   massive buildup of useless pagecache when reading/writing fast devices.
 
 - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
   Weißschuh fixes and optimizes some of the MM selftests.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA
 jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz
 uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE=
 =Ib2s
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "The various patchsets are summarized below. Plus of course many
  individual patches which are described in their changelogs.

   - "Allocate and free frozen pages" from Matthew Wilcox reorganizes
     the page allocator so we end up with the ability to allocate and
     free zero-refcount pages. So that callers (ie, slab) can avoid a
     refcount inc & dec

   - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
     use large folios other than PMD-sized ones

   - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
     and fixes for this small built-in kernel selftest

   - "mas_anode_descend() related cleanup" from Wei Yang tidies up part
     of the mapletree code

   - "mm: fix format issues and param types" from Keren Sun implements a
     few minor code cleanups

   - "simplify split calculation" from Wei Yang provides a few fixes and
     a test for the mapletree code

   - "mm/vma: make more mmap logic userland testable" from Lorenzo
     Stoakes continues the work of moving vma-related code into the
     (relatively) new mm/vma.c

   - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
     Hildenbrand cleans up and rationalizes handling of gfp flags in the
     page allocator

   - "readahead: Reintroduce fix for improper RA window sizing" from Jan
     Kara is a second attempt at fixing a readahead window sizing issue.
     It should reduce the amount of unnecessary reading

   - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
     addresses an issue where "huge" amounts of pte pagetables are
     accumulated:

       https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/

     Qi's series addresses this windup by synchronously freeing PTE
     memory within the context of madvise(MADV_DONTNEED)

   - "selftest/mm: Remove warnings found by adding compiler flags" from
     Muhammad Usama Anjum fixes some build warnings in the selftests
     code when optional compiler warnings are enabled

   - "mm: don't use __GFP_HARDWALL when migrating remote pages" from
     David Hildenbrand tightens the allocator's observance of
     __GFP_HARDWALL

   - "pkeys kselftests improvements" from Kevin Brodsky implements
     various fixes and cleanups in the MM selftests code, mainly
     pertaining to the pkeys tests

   - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
     estimate application working set size

   - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
     provides some cleanups to memcg's hugetlb charging logic

   - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
     removes the global swap cgroup lock. A speedup of 10% for a
     tmpfs-based kernel build was demonstrated

   - "zram: split page type read/write handling" from Sergey Senozhatsky
     has several fixes and cleanups for zram in the area of
     zram_write_page(). A watchdog softlockup warning was eliminated

   - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
     Brodsky cleans up the pagetable destructor implementations. A rare
     use-after-free race is fixed

   - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
     simplifies and cleans up the debugging code in the VMA merging
     logic

   - "Account page tables at all levels" from Kevin Brodsky cleans up
     and regularizes the pagetable ctor/dtor handling. This results in
     improvements in accounting accuracy

   - "mm/damon: replace most damon_callback usages in sysfs with new
     core functions" from SeongJae Park cleans up and generalizes
     DAMON's sysfs file interface logic

   - "mm/damon: enable page level properties based monitoring" from
     SeongJae Park increases the amount of information which is
     presented in response to DAMOS actions

   - "mm/damon: remove DAMON debugfs interface" from SeongJae Park
     removes DAMON's long-deprecated debugfs interfaces. Thus the
     migration to sysfs is completed

   - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
     Peter Xu cleans up and generalizes the hugetlb reservation
     accounting

   - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
     removes a never-used feature of the alloc_pages_bulk() interface

   - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
     extends DAMOS filters to support not only exclusion (rejecting),
     but also inclusion (allowing) behavior

   - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
     introduces a new memory descriptor for zswap.zpool that currently
     overlaps with struct page for now. This is part of the effort to
     reduce the size of struct page and to enable dynamic allocation of
     memory descriptors

   - "mm, swap: rework of swap allocator locks" from Kairui Song redoes
     and simplifies the swap allocator locking. A speedup of 400% was
     demonstrated for one workload. As was a 35% reduction for kernel
     build time with swap-on-zram

   - "mm: update mips to use do_mmap(), make mmap_region() internal"
     from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
     mmap_region() can be made MM-internal

   - "mm/mglru: performance optimizations" from Yu Zhao fixes a few
     MGLRU regressions and otherwise improves MGLRU performance

   - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
     Park updates DAMON documentation

   - "Cleanup for memfd_create()" from Isaac Manjarres does that thing

   - "mm: hugetlb+THP folio and migration cleanups" from David
     Hildenbrand provides various cleanups in the areas of hugetlb
     folios, THP folios and migration

   - "Uncached buffered IO" from Jens Axboe implements the new
     RWF_DONTCACHE flag which provides synchronous dropbehind for
     pagecache reading and writing. To permit userspace to address
     issues with massive buildup of useless pagecache when
     reading/writing fast devices

   - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
     Weißschuh fixes and optimizes some of the MM selftests"

* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm/compaction: fix UBSAN shift-out-of-bounds warning
  s390/mm: add missing ctor/dtor on page table upgrade
  kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
  tools: add VM_WARN_ON_VMG definition
  mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
  seqlock: add missing parameter documentation for raw_seqcount_try_begin()
  mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
  mm/page_alloc: remove the incorrect and misleading comment
  zram: remove zcomp_stream_put() from write_incompressible_page()
  mm: separate move/undo parts from migrate_pages_batch()
  mm/kfence: use str_write_read() helper in get_access_type()
  selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
  kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
  selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
  selftests/mm: vm_util: split up /proc/self/smaps parsing
  selftests/mm: virtual_address_range: unmap chunks after validation
  selftests/mm: virtual_address_range: mmap() without PROT_WRITE
  selftests/memfd/memfd_test: fix possible NULL pointer dereference
  mm: add FGP_DONTCACHE folio creation flag
  mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
  ...
2025-01-26 18:36:23 -08:00
Linus Torvalds
c159dfbdd4 Mainly individually changelogged singleton patches. The patch series in
this pull are:
 
 - "lib min_heap: Improve min_heap safety, testing, and documentation"
   from Kuan-Wei Chiu provides various tightenings to the min_heap library
   code.
 
 - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein performs some
   cleanup and Rust preparation in the xarray library code.
 
 - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes
   pathnames in some code comments.
 
 - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the
   new secs_to_jiffies() in various places where that is appropriate.
 
 - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
   switches two filesystems to the new mount API.
 
 - "Convert ocfs2 to use folios" from Matthew Wilcox does that.
 
 - "Remove get_task_comm() and print task comm directly" from Yafang Shao
   removes now-unneeded calls to get_task_comm() in various places.
 
 - "squashfs: reduce memory usage and update docs" from Phillip Lougher
   implements some memory savings in squashfs and performs some
   maintainability work.
 
 - "lib: clarify comparison function requirements" from Kuan-Wei Chiu
   tightens the sort code's behaviour and adds some maintenance work.
 
 - "nilfs2: protect busy buffer heads from being force-cleared" from
   Ryusuke Konishi fixes an issue in nilfs when the fs is presented with a
   corrupted image.
 
 - "nilfs2: fix kernel-doc comments for function return values" from
   Ryusuke Konishi fixes some nilfs kerneldoc.
 
 - "nilfs2: fix issues with rename operations" from Ryusuke Konishi
   addresses some nilfs BUG_ONs which syzbot was able to trigger.
 
 - "minmax.h: Cleanups and minor optimisations" from David Laight
   does some maintenance work on the min/max library code.
 
 - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work
   on the xarray library code.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5SP5QAKCRDdBJ7gKXxA
 jqN7AQChvwXGG43n4d5SDiA/rH7ddvowQcDqhC9cAMJ1ReR7qwEA8/LIWDE4PdMX
 mJnaZ1/ibpEpearrChCViApQtcyEGQI=
 =ti4E
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Mainly individually changelogged singleton patches. The patch series
  in this pull are:

   - "lib min_heap: Improve min_heap safety, testing, and documentation"
     from Kuan-Wei Chiu provides various tightenings to the min_heap
     library code

    - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein performs
     some cleanup and Rust preparation in the xarray library code

   - "Update reference to include/asm-<arch>" from Geert Uytterhoeven
     fixes pathnames in some code comments

   - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses
     the new secs_to_jiffies() in various places where that is
     appropriate

   - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
     switches two filesystems to the new mount API

   - "Convert ocfs2 to use folios" from Matthew Wilcox does that

   - "Remove get_task_comm() and print task comm directly" from Yafang
     Shao removes now-unneeded calls to get_task_comm() in various
     places

   - "squashfs: reduce memory usage and update docs" from Phillip
     Lougher implements some memory savings in squashfs and performs
     some maintainability work

   - "lib: clarify comparison function requirements" from Kuan-Wei Chiu
     tightens the sort code's behaviour and adds some maintenance work

   - "nilfs2: protect busy buffer heads from being force-cleared" from
      Ryusuke Konishi fixes an issue in nilfs when the fs is presented
     with a corrupted image

   - "nilfs2: fix kernel-doc comments for function return values" from
     Ryusuke Konishi fixes some nilfs kerneldoc

   - "nilfs2: fix issues with rename operations" from Ryusuke Konishi
     addresses some nilfs BUG_ONs which syzbot was able to trigger

   - "minmax.h: Cleanups and minor optimisations" from David Laight does
     some maintenance work on the min/max library code

   - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance
     work on the xarray library code"

* tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits)
  ocfs2: use str_yes_no() and str_no_yes() helper functions
  include/linux/lz4.h: add some missing macros
  Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent
  Xarray: remove repeat check in xas_squash_marks()
  Xarray: distinguish large entries correctly in xas_split_alloc()
  Xarray: move forward index correctly in xas_pause()
  Xarray: do not return sibling entries from xas_find_marked()
  ipc/util.c: complete the kernel-doc function descriptions
  gcov: clang: use correct function param names
  latencytop: use correct kernel-doc format for func params
  minmax.h: remove some #defines that are only expanded once
  minmax.h: simplify the variants of clamp()
  minmax.h: move all the clamp() definitions after the min/max() ones
  minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
  minmax.h: reduce the #define expansion of min(), max() and clamp()
  minmax.h: update some comments
  minmax.h: add whitespace around operators and after commas
  nilfs2: do not update mtime of renamed directory that is not moved
  nilfs2: handle errors that nilfs_prepare_chunk() may return
  CREDITS: fix spelling mistake
  ...
2025-01-26 17:50:53 -08:00
Oxana Kharitonova
65ef17aa07 hung_task: add task->flags, blocked by coredump to log
Resending this patch as I haven't received feedback on my initial
submission https://lore.kernel.org/all/20241204182953.10854-1-oxana@cloudflare.com/

For processes which are terminated abnormally, the kernel can provide
a coredump if enabled. When the coredump is performed, the process and
all its threads are put into the D state
(TASK_UNINTERRUPTIBLE | TASK_FREEZABLE).

On the other hand, we have the kernel thread khungtaskd which monitors
processes in the D state. If a task is stuck in the D state for more than
kernel.hung_task_timeout_secs, a hung_task alert appears in the kernel
log.

The higher the memory usage of a process, the longer it takes to create the
coredump and the longer its tasks stay in the D state. We see hung_task
alerts for processes with memory usage above 10 GB, although our
kernel.hung_task_timeout_secs is 10 sec while the default is 120 sec.

Adding additional information to the log that the task is blocked by
coredump will help with monitoring. Another approach might be to
completely filter out alerts for such tasks, but in that case we would
lose transparency about what is putting pressure on some system
resources, e.g. we saw an increase in I/O when a coredump occurs due to its
writing to disk.

Additionally, it would be helpful to have task_struct->flags in the log
from the function sched_show_task(). Currently it prints
task_struct->thread_info->flags, which seems misleading as the line
starts with "task:xxxx".

[akpm@linux-foundation.org: fix printk control string]
Link: https://lkml.kernel.org/r/20250110160328.64947-1-oxana@cloudflare.com
Signed-off-by: Oxana Kharitonova <oxana@cloudflare.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Segall <bsegall@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-24 22:47:24 -08:00
Andrea Righi
2279563e3a sched_ext: Include task weight in the error state dump
Report the task weight when dumping the task state during an error exit.
Moreover, adjust the output format to display dsq_vtime, slice, and
weight on the same line.

This can help identify whether certain tasks were excessively
prioritized or de-prioritized due to large niceness gaps.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-24 08:07:42 -10:00
Atul Kumar Pant
be8ee18152 sched_ext: Fixes typos in comments
Fixes some spelling errors in the comments.

Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-24 08:06:05 -10:00
Linus Torvalds
bc8198dc7e sched_ext: Changes for v6.14
- scx_bpf_now() added so that BPF scheduler can access the cached timestamp
   in struct rq to avoid reading TSC multiple times within a locked
   scheduling operation.
 
 - Minor updates to the built-in idle CPU selection logic.
 
 - tool/sched_ext updates and other misc changes.
 
 Pulling sched_ext/for-6.14 into master causes a merge conflict between the
 following two commits (first commit in master, second in for-6.14):
 
   a2a3374c47 sched_ext: idle: Refresh idle masks during idle-to-idle transitions
   9cf9aceed2 sched_ext: idle: use assign_cpu() to update the idle cpumask
 
   static void update_builtin_idle(int cpu, bool idle)
   {
   <<<<<<< HEAD
           if (idle)
                   cpumask_set_cpu(cpu, idle_masks.cpu);
           else
                   cpumask_clear_cpu(cpu, idle_masks.cpu);
   =======
           int cpu = cpu_of(rq);
 
           if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
                   SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
                   if (!static_branch_unlikely(&scx_builtin_idle_enabled))
                           return;
           }
 
           assign_cpu(cpu, idle_masks.cpu, idle);
   >>>>>>> 987ce79b52
 
 The first commit factored out update_builtin_idle() and the second replaced
 cpumask_set/clear_cpu() calls with assign_cpu(). The conflict can be
 resolved by taking the code from the first and then replacing the
 cpumask_set/clear_cpu() calls with assign_cpu():
 
   static void update_builtin_idle(int cpu, bool idle)
   {
           assign_cpu(cpu, idle_masks.cpu, idle);
   }
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ4/yjw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGc/3AQChJ2zxBLMtQOfNVWgqhAyatCnFquMncl/IeGAs
 Sf4eawD/QHNWYBPx0nOFQS8RZNmUcXuEDeHMy+1MvTUaIDL5MQM=
 =6t0t
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

 - scx_bpf_now() added so that BPF scheduler can access the cached
   timestamp in struct rq to avoid reading TSC multiple times within a
   locked scheduling operation.

 - Minor updates to the built-in idle CPU selection logic.

 - tool/sched_ext updates and other misc changes.

* tag 'sched_ext-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: fix kernel-doc warnings
  sched_ext: Use time helpers in BPF schedulers
  sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now()
  sched_ext: Add time helpers for BPF schedulers
  sched_ext: Add scx_bpf_now() for BPF scheduler
  sched_ext: Implement scx_bpf_now()
  sched_ext: Relocate scx_enabled() related code
  sched_ext: Add option -l in selftest runner to list all available tests
  sched_ext: Include remaining task time slice in error state dump
  sched_ext: update scx_bpf_dsq_insert() doc for SCX_DSQ_LOCAL_ON
  sched_ext: idle: small CPU iteration refactoring
  sched_ext: idle: introduce check_builtin_idle_enabled() helper
  sched_ext: idle: clarify comments
  sched_ext: idle: use assign_cpu() to update the idle cpumask
  sched_ext: Use str_enabled_disabled() helper in update_selcpu_topology()
  sched_ext: Use sizeof_field for key_len in dsq_hash_params
  tools/sched_ext: Receive updates from SCX repo
  sched_ext: Use the NUMA scheduling domain for NUMA optimizations
2025-01-23 18:49:43 -08:00
Christian Loehle
93940fbdc4 cpufreq/schedutil: Only bind threads if needed
Remove the unconditional binding of sugov kthreads to the affected CPUs
if the cpufreq driver indicates that updates can happen from any CPU.
This allows userspace to set affinities to either save power (waking up
bigger CPUs on HMP can be expensive) or increase performance (by
letting the utilized CPUs run without preemption of the sugov kthread).

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/5a8deed4-7764-4729-a9d4-9520c25fa7e8@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-23 21:09:25 +01:00
Linus Torvalds
f4b9d3bf44 Power management updates for 6.14-rc1
- Use str_enable_disable()-like helpers in cpufreq (Krzysztof
    Kozlowski).
 
  - Extend the Apple cpufreq driver to support more SoCs (Hector Martin,
    Nick Chan).
 
  - Add new cpufreq driver for Airoha SoCs (Christian Marangi).
 
  - Fix using cpufreq-dt as module (Andreas Kemnade).
 
  - Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter
    Edwards, Sibi Sankar, Manivannan Sadhasivam).
 
  - Fix the maximum supported frequency computation in the ACPI cpufreq
    driver to avoid relying on unfounded assumptions (Gautham Shenoy).
 
  - Fix an amd-pstate driver regression with preferred core rankings not
    being used (Mario Limonciello).
 
  - Fix a precision issue with frequency calculation in the amd-pstate
    driver (Naresh Solanki).
 
  - Add ftrace event to the amd-pstate driver for active mode (Mario
    Limonciello).
 
  - Set default EPP policy on Ryzen processors in amd-pstate (Mario
    Limonciello).
 
  - Clean up the amd-pstate cpufreq driver and optimize it to increase
    code reuse (Mario Limonciello, Dhananjay Ugwekar).
 
  - Use CPPC to get scaling factors between HWP performance levels and
    frequency in the intel_pstate driver and make it stop using a
    built-in scaling factor for Arrow Lake processors (Rafael Wysocki).
 
  - Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN for
    consistency with CPU offline (Christian Loehle).
 
  - Fix superfluous updates caused by need_freq_update in the schedutil
    cpufreq governor (Sultan Alsawaf).
 
  - Allow configuring the system suspend-resume (DPM) watchdog to warn
    earlier than panic (Douglas Anderson).
 
  - Implement devm_device_init_wakeup() helper and introduce a device-
    managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan).
 
  - Remove direct inclusions of 'pm_wakeup.h' which should be only
    included via 'device.h' (Wolfram Sang).
 
  - Clean up two comments in the core system-wide PM code (Rafael
    Wysocki, Randy Dunlap).
 
  - Add Clearwater Forest processor support to the intel_idle cpuidle
    driver (Artem Bityutskiy).
 
  - Clean up the Exynos devfreq driver and devfreq core (Markus Elfring,
    Jeongjun Park).
 
  - Minor cleanups and fixes for OPP (Dan Carpenter, Neil Armstrong, Joe
    Hattori).
 
  - Implement dev_pm_opp_get_bw() (Neil Armstrong).
 
  - Expose OPP reference counting helpers for Rust (Viresh Kumar).
 
  - Fix TSC MHz calculation in cpupower (He Rongguang).
 
  - Add install and uninstall options to bindings Makefile and add header
    changes for cpufreq.h to SWIG bindings in cpupower (John B. Wyatt IV).
 
  - Add missing residency header changes in cpuidle.h to SWIG bindings in
    cpupower (John B. Wyatt IV).
 
  - Add output files to .gitignore and clean them up in "make clean" in
    selftests/cpufreq (Li Zhijian).
 
  - Fix cross-compilation in cpupower Makefile (Peng Fan).
 
  - Revise the is_valid flag handling for idle_monitor in the cpupower
    utility (wangfushuai).
 
  - Extend and clean up AMD processors support in cpupower (Mario
    Limonciello).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmeOthsSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxqQsP/ivDt8nqDnxdKB7cKFQIsEK+tl0RnFVD
 o5regvYeRcGWpUXuMaqBtTmCMjsB8bUkcj2yLquM54ubjHAGF6zJuw9ZytMPHVcC
 b2xk3RCFlXSBFXVK8eOh3XRviA9nGhuY97ZnPsQOlvoECrxT2xyeL+mWo7s+t+q9
 2NUH+yfRoi5FM+nqqDhsm0xXxJuPaNg6eAjIASuMjXap48rNk3L5kW6W/6nw7i0I
 xQWd/pKLHaI5e7DRF/QdMKu8+Fm4BbN0jMqLblKPOmTe9KggvBkck5q1Um20sYkJ
 vdKMAT02ClGavIC7DtY092Xik84NZfID4ZUchS6e2hJIQ3Uaw/eDvAo/jlT8gIzq
 fnXPdApRIzQGDvMxFaAsKaGlwxiVlAGHPDSTH6MVWzsp+1DSkbloSwVPAfeYIn44
 Jhov+6Ydux3597sSjo+YmD58acimXl7urVuk8P6m3U5+gb8/jlgbxpIn+vbxH3Ka
 o44Vt7axD63gezOQY134sj5gic5JL0GuZovOlvzrF6+FsjvVqcax6FZ4n3uIXu7P
 C1nwai+Wdzo7wvuz7RfO0g15Y15wYLQLYsRq/osRlf+sOmGVv7nA9tSzZ0LUdD5D
 Pp6PxppF6anM0Kjen8Ppuu+Bcr11JfVvhnVTJqhs6u71XdAy4TnG1JjL4lPWYJ4D
 Gfz2hyPNjiQX
 =AoMC
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "The majority of changes here are cpufreq updates which are dominated
  by amd-pstate driver changes, like in the previous cycle. Moreover,
  changes related to amd-pstate are also the majority of cpupower
  utility updates.

  Included are some pieces of new hardware support, like the addition of
  Clearwater Forest processors support to intel_idle, new cpufreq driver
  for Airoha SoCs, and Apple cpufreq driver extensions to support more
  SoCs. The intel_pstate driver is also extended to be able to support
  new platforms by using ACPI CPPC to compute scaling factors between
  HWP performance states and frequency.

  The rest is mostly fixes and cleanups in assorted pieces of power
  management code.

  Specifics:

   - Use str_enable_disable()-like helpers in cpufreq (Krzysztof
     Kozlowski)

   - Extend the Apple cpufreq driver to support more SoCs (Hector
     Martin, Nick Chan)

   - Add new cpufreq driver for Airoha SoCs (Christian Marangi)

   - Fix using cpufreq-dt as module (Andreas Kemnade)

   - Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter
     Edwards, Sibi Sankar, Manivannan Sadhasivam)

   - Fix the maximum supported frequency computation in the ACPI cpufreq
     driver to avoid relying on unfounded assumptions (Gautham Shenoy)

   - Fix an amd-pstate driver regression with preferred core rankings
     not being used (Mario Limonciello)

   - Fix a precision issue with frequency calculation in the amd-pstate
     driver (Naresh Solanki)

   - Add ftrace event to the amd-pstate driver for active mode (Mario
     Limonciello)

   - Set default EPP policy on Ryzen processors in amd-pstate (Mario
     Limonciello)

   - Clean up the amd-pstate cpufreq driver and optimize it to increase
     code reuse (Mario Limonciello, Dhananjay Ugwekar)

   - Use CPPC to get scaling factors between HWP performance levels and
     frequency in the intel_pstate driver and make it stop using a
     built-in scaling factor for Arrow Lake processors (Rafael Wysocki)

   - Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN
     for consistency with CPU offline (Christian Loehle)

   - Fix superfluous updates caused by need_freq_update in the schedutil
     cpufreq governor (Sultan Alsawaf)

   - Allow configuring the system suspend-resume (DPM) watchdog to warn
     earlier than panic (Douglas Anderson)

   - Implement devm_device_init_wakeup() helper and introduce a device-
     managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan)

   - Remove direct inclusions of 'pm_wakeup.h' which should be only
     included via 'device.h' (Wolfram Sang)

   - Clean up two comments in the core system-wide PM code (Rafael
     Wysocki, Randy Dunlap)

   - Add Clearwater Forest processor support to the intel_idle cpuidle
     driver (Artem Bityutskiy)

   - Clean up the Exynos devfreq driver and devfreq core (Markus
     Elfring, Jeongjun Park)

   - Minor cleanups and fixes for OPP (Dan Carpenter, Neil Armstrong,
     Joe Hattori)

   - Implement dev_pm_opp_get_bw() (Neil Armstrong)

   - Expose OPP reference counting helpers for Rust (Viresh Kumar)

   - Fix TSC MHz calculation in cpupower (He Rongguang)

   - Add install and uninstall options to bindings Makefile and add
     header changes for cpufreq.h to SWIG bindings in cpupower (John B.
     Wyatt IV)

   - Add missing residency header changes in cpuidle.h to SWIG bindings
     in cpupower (John B. Wyatt IV)

   - Add output files to .gitignore and clean them up in "make clean" in
     selftests/cpufreq (Li Zhijian)

   - Fix cross-compilation in cpupower Makefile (Peng Fan)

   - Revise the is_valid flag handling for idle_monitor in the cpupower
     utility (wangfushuai)

   - Extend and clean up AMD processors support in cpupower (Mario
     Limonciello)"

* tag 'pm-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (67 commits)
  PM / OPP: Add reference counting helpers for Rust implementation
  PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq()
  cpufreq: Use str_enable_disable()-like helpers
  cpufreq: airoha: Add EN7581 CPUFreq SMCCC driver
  PM: sleep: Allow configuring the DPM watchdog to warn earlier than panic
  PM: sleep: convert comment from kernel-doc to plain comment
  cpufreq: ACPI: Fix max-frequency computation
  pm: cpupower: Add missing residency header changes in cpuidle.h to SWIG
  PM / devfreq: exynos: remove unused function parameter
  OPP: OF: Fix an OF node leak in _opp_add_static_v2()
  cpufreq/amd-pstate: Refactor max frequency calculation
  cpufreq/amd-pstate: Fix prefcore rankings
  pm: cpupower: Add header changes for cpufreq.h to SWIG bindings
  cpufreq: sparc: change kzalloc to kcalloc
  cpufreq: qcom: Implement clk_ops::determine_rate() for qcom_cpufreq* clocks
  cpufreq: qcom: Fix qcom_cpufreq_hw_recalc_rate() to query LUT if LMh IRQ is not available
  cpufreq: apple-soc: Add Apple A7-A8X SoC cpufreq support
  cpufreq: apple-soc: Set fallback transition latency to APPLE_DVFS_TRANSITION_TIMEOUT
  cpufreq: apple-soc: Increase cluster switch timeout to 400us
  cpufreq: apple-soc: Use 32-bit read for status register
  ...
2025-01-22 11:16:14 -08:00
Linus Torvalds
1d6d399223 Kthreads affinity follow either of 4 existing different patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never execute
    relevant code on any other CPU. This is currently handled by smpboot
    code which takes care of CPU-hotplug operations. Affinity here is
    a correctness constraint.
 
 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't
    run anywhere else. The affinity is set through kthread_bind_mask()
    and the subsystem takes care by itself to handle CPU-hotplug
    operations. Affinity here is assumed to be a correctness constraint.
 
 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This
    is not a correctness constraint but merely a preference in terms of
    memory locality. kswapd and kcompactd both fall into this category.
    The affinity is set manually like for any other task and CPU-hotplug
    is supposed to be handled by the relevant subsystem so that the task
    is properly reaffined whenever a given CPU from the node comes up.
    Also care should be taken so that the node affinity doesn't cross
    isolated (nohz_full) cpumask boundaries.
 
 4) Similar to the previous point except kthreads have a _preferred_
    affinity different than a node. Both RCU boost kthreads and RCU
    exp kworkers fall into this category as they refer to "RCU nodes"
    from a distinctly distributed tree.
 
 Currently the preferred affinity patterns (3 and 4) have at least 4
 identified users, with more or less success when it comes to handling
 CPU-hotplug operations and CPU isolation, each of which does it in its own
 ad-hoc way.
 
 This is an infrastructure proposal to handle this with the following API
 changes:
 
 - kthread_create_on_node() automatically affines the created kthread to
   its target node unless it has been set as per-cpu or bound with
   kthread_bind[_mask]() before the first wake-up.
 
 - kthread_affine_preferred() is a new function that can be called right
   after kthread_create_on_node() to specify a preferred affinity
   different than the specified node.
 
 When the preferred affinity can't be applied because the possible
 targets are offline or isolated (nohz_full), the kthread is affine
 to the housekeeping CPUs (which means to all online CPUs most of the
 time or only the non-nohz_full CPUs when nohz_full= is set).
 
 kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
 converted, along with a few old drivers.
 
 Summary of the changes:
 
 * Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu()
 
 * Introduce task_cpu_fallback_mask() that defines the default last
   resort affinity of a task to become nohz_full aware
 
 * Add some correctness checks to ensure kthread_bind() is always called
   before the first kthread wake-up.
 
 * Default affine kthread to its preferred node.
 
 * Convert kswapd / kcompactd and remove their halfway working ad-hoc
   affinity implementation
 
 * Implement kthreads preferred affinity
 
 * Unify kthread worker and kthread API's style
 
 * Convert RCU kthreads to the new API and remove the ad-hoc affinity
   implementation.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmeNf8gACgkQhSRUR1CO
 jHedQQ/+IxTjjqQiItzrq41TES2S0desHDq8lNJFb7rsR/DtKFyLx3s67cOYV+cM
 Yx54QHg2m/Fz4nXMQ7Po5ygOtJGCKBc5C5QQy7y0lVKeTQK+daDfEtBSa3oG7j3C
 u+E3tTY6qxkbCzymUyaKkHN4/ay2vLvjFS50luV7KMyI3x47Aji+t7VdCX4LCPP2
 eAwOALWD0+7qLJ/VF6gsmQLKA4Qx7PQAzBa3KSBmUN9UcN8Gk1bQHCTIQKDHP9LQ
 v8BXrNZtYX1o2+snNYpX2z6/ECjxkdwriOgqqZY5306hd9RAQ1u46Dx3byrIqjGn
 ULG/XQ2istPyhTqb/h+RbrobdOcwEUIeqk8hRRbBXE8bPpqUz9EMuaCMxWDbQjgH
 NTuKG4ifKJ/IqstkkuDkdOiByE/ysMmwqrTXgSnu2ITNL9yY3BEgFbvA95hgo42s
 f7QCxEfZb1MHcNEMENSMwM3xw5lLMGMpxVZcMQ3gLwyotMBRrhFZm1qZJG7TITYW
 IDIeCbH4JOMdQwLs3CcWTXio0N5/85NhRNFV+IDn96OrgxObgnMtV8QwNgjXBAJ5
 wGeJWt8s34W1Zo3qS9gEuVzEhW4XaxISQQMkHe8faKkK6iHmIB/VjSQikDwwUNQ/
 AspYj82RyWBCDZsqhiYh71kpxjvS6Xp0bj39Ce1sNsOnuksxKkQ=
 =g8In
 -----END PGP SIGNATURE-----

Merge tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks

Pull kthread updates from Frederic Weisbecker:
 "Kthreads affinity follow either of 4 existing different patterns:

   1) Per-CPU kthreads must stay affine to a single CPU and never
      execute relevant code on any other CPU. This is currently handled
      by smpboot code which takes care of CPU-hotplug operations.
      Affinity here is a correctness constraint.

   2) Some kthreads _have_ to be affine to a specific set of CPUs and
      can't run anywhere else. The affinity is set through
      kthread_bind_mask() and the subsystem takes care by itself to
      handle CPU-hotplug operations. Affinity here is assumed to be a
      correctness constraint.

   3) Per-node kthreads _prefer_ to be affine to a specific NUMA node.
      This is not a correctness constraint but merely a preference in
      terms of memory locality. kswapd and kcompactd both fall into this
      category. The affinity is set manually like for any other task and
      CPU-hotplug is supposed to be handled by the relevant subsystem so
      that the task is properly reaffined whenever a given CPU from the
      node comes up. Also care should be taken so that the node affinity
      doesn't cross isolated (nohz_full) cpumask boundaries.

   4) Similar to the previous point except kthreads have a _preferred_
      affinity different than a node. Both RCU boost kthreads and RCU
      exp kworkers fall into this category as they refer to "RCU nodes"
      from a distinctly distributed tree.

  Currently the preferred affinity patterns (3 and 4) have at least 4
  identified users, with more or less success when it comes to handling
  CPU-hotplug operations and CPU isolation, each of which does it in its
  own ad-hoc way.

  This is an infrastructure proposal to handle this with the following
  API changes:

   - kthread_create_on_node() automatically affines the created kthread
     to its target node unless it has been set as per-cpu or bound with
     kthread_bind[_mask]() before the first wake-up.

   - kthread_affine_preferred() is a new function that can be called
     right after kthread_create_on_node() to specify a preferred
     affinity different than the specified node.

  When the preferred affinity can't be applied because the possible
  targets are offline or isolated (nohz_full), the kthread is affine to
  the housekeeping CPUs (which means to all online CPUs most of the time
  or only the non-nohz_full CPUs when nohz_full= is set).

  kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
  converted, along with a few old drivers.

  Summary of the changes:

   - Consolidate a bunch of ad-hoc implementations of
     kthread_run_on_cpu()

   - Introduce task_cpu_fallback_mask() that defines the default last
     resort affinity of a task to become nohz_full aware

   - Add some correctness checks to ensure kthread_bind() is always
     called before the first kthread wake-up.

   - Default affine kthread to its preferred node.

   - Convert kswapd / kcompactd and remove their halfway working ad-hoc
     affinity implementation

   - Implement kthreads preferred affinity

   - Unify kthread worker and kthread API's style

   - Convert RCU kthreads to the new API and remove the ad-hoc affinity
     implementation"

* tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
  kthread: modify kernel-doc function name to match code
  rcu: Use kthread preferred affinity for RCU exp kworkers
  treewide: Introduce kthread_run_worker[_on_cpu]()
  kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format
  rcu: Use kthread preferred affinity for RCU boost
  kthread: Implement preferred affinity
  mm: Create/affine kswapd to its preferred node
  mm: Create/affine kcompactd to its preferred node
  kthread: Default affine kthread to its preferred NUMA node
  kthread: Make sure kthread hasn't started while binding it
  sched,arm64: Handle CPU isolation on last resort fallback rq selection
  arm64: Exclude nohz_full CPUs from 32bits el0 support
  lib: test_objpool: Use kthread_run_on_cpu()
  kallsyms: Use kthread_run_on_cpu()
  soc/qman: test: Use kthread_run_on_cpu()
  arm/bL_switcher: Use kthread_run_on_cpu()
2025-01-21 17:10:05 -08:00
Linus Torvalds
62de6e1685 Scheduler enhancements for v6.14:
- Fair scheduler (SCHED_FAIR) enhancements:
 
    - Behavioral improvements:
      - Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
 
    - Delayed-dequeue enhancements & fixes: (Vincent Guittot)
 
      - Rename h_nr_running into h_nr_queued
      - Add new cfs_rq.h_nr_runnable
      - Use the new cfs_rq.h_nr_runnable
      - Remove unused cfs_rq.h_nr_delayed
      - Rename cfs_rq.idle_h_nr_running into h_nr_idle
      - Remove unused cfs_rq.idle_nr_running
      - Rename cfs_rq.nr_running into nr_queued
      - Do not try to migrate delayed dequeue task
      - Fix variable declaration position
      - Encapsulate set custom slice in a __setparam_fair() function
 
    - Fixes:
      - Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
      - Fix CPU bandwidth limit bypass during CPU hotplug (Vishal Chourasia)
 
    - Cleanups:
      - Clean up in migrate_degrades_locality() to improve
        readability (Peter Zijlstra)
      - Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
      - Update comments after sched_tick() rename (Sebastian Andrzej Siewior)
      - Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
        (Valentin Schneider)
 
  - Deadline scheduler (SCHED_DL) enhancements:
 
    - Restore dl_server bandwidth on non-destructive root domain
      changes (Juri Lelli)
 
    - Correctly account for allocated bandwidth during
      hotplug (Juri Lelli)
 
    - Check bandwidth overflow earlier for hotplug (Juri Lelli)
 
    - Clean up goto label in pick_earliest_pushable_dl_task()
      (John Stultz)
 
    - Consolidate timer cancellation (Wander Lairson Costa)
 
  - Load-balancer enhancements:
 
    - Improve performance by prioritizing migrating eligible
      tasks in sched_balance_rq() (Hao Jia)
 
    - Do not compute NUMA Balancing stats unnecessarily during
      load-balancing (K Prateek Nayak)
 
    - Do not compute overloaded status unnecessarily during
      load-balancing (K Prateek Nayak)
 
  - Generic scheduling code enhancements:
 
    - Use READ_ONCE() in task_on_rq_queued(), to consistently use
      the WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)
 
  - Isolated CPUs support enhancements: (Waiman Long)
 
    - Make "isolcpus=nohz" equivalent to "nohz_full"
    - Consolidate housekeeping cpumasks that are always identical
    - Remove HK_TYPE_SCHED
    - Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
 
  - RSEQ enhancements:
 
    - Validate read-only fields under DEBUG_RSEQ config
      (Mathieu Desnoyers)
 
  - PSI enhancements:
 
    - Fix race when task wakes up before psi_sched_switch()
      adjusts flags (Chengming Zhou)
 
  - IRQ time accounting performance enhancements: (Yafang Shao)
 
    - Define sched_clock_irqtime as static key
    - Don't account irq time if sched_clock_irqtime is disabled
 
  - Virtual machine scheduling enhancements:
 
    - Don't try to catch up excess steal time (Suleiman Souhlal)
 
  - Heterogeneous x86 CPU scheduling enhancements: (K Prateek Nayak)
 
    - Convert "sysctl_sched_itmt_enabled" to boolean
    - Use guard() for itmt_update_mutex
    - Move the "sched_itmt_enabled" sysctl to debugfs
    - Remove x86_smt_flags and use cpu_smt_flags directly
    - Use x86_sched_itmt_flags for PKG domain unconditionally
 
  - Debugging code & instrumentation enhancements:
 
    - Change need_resched warnings to pr_err() (David Rientjes)
    - Print domain name in /proc/schedstat (K Prateek Nayak)
    - Fix value reported by hot tasks pulled in /proc/schedstat (Peter Zijlstra)
    - Report the different kinds of imbalances in /proc/schedstat (Swapnil Sapkal)
    - Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
    - Update Schedstat version to 17 (Swapnil Sapkal)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmePSRcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hrdBAAjYiLl5Q8SHM0xnl+kbvuUkCTgEB/gSgA
 mfrZtHRUgRZuA89NZ9NljlCkQSlsLTOjnpNuaeFzs529GMg9iemc99dbnz3BP5F3
 V5qpYvWe7yIkJ3hd0TOGLmYEPMNQaAW57YBOrxcPjWNLJ4cr9iMdccVA1OQtcmqD
 ZUh3nibv81QI8HDmT2G+figxEIqH3yBV1+SmEIxbrdkQpIJ5702Ng6+0KQK5TShN
 xwjFELWZUl2TfkoCc4nkIpkImV6cI1DvXSw1xK6gbb1xEVOrsmFW3TYFw4trKHBu
 2RBG4wtmzNjh+12GmSdIBJHogPNcay+JIJW9EG/unT7jirqzkkeP1X2eJEbh+X1L
 CMa7GsD9Vy72jCzeJDMuiy7bKfG/MiKUtDXrAZQDo2atbw7H88QOzMuTE5a5WSV+
 tRxXGI/dgFVOk+JQUfctfJbYeXjmG8GAflawvXtGDAfDZsja6M+65fH8p0AOgW1E
 HHmXUzAe2E2xQBiSok/DYHPQeCDBAjoJvU93YhGiXv8UScb2UaD4BAfzfmc8P+Zs
 Eox6444ah5U0jiXmZ3HU707n1zO+Ql4qKoyyMJzSyP+oYHE/Do7NYTElw2QovVdN
 FX/9Uae8T4ttA/5lFe7FNoXgKvSxXDKYyKLZcysjVrWJF866Ui/TWtmxA6w8Osn7
 sfucuLawLPM=
 =5ZNW
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Fair scheduler (SCHED_FAIR) enhancements:

   - Behavioral improvements:
      - Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)

   - Delayed-dequeue enhancements & fixes: (Vincent Guittot)
      - Rename h_nr_running into h_nr_queued
      - Add new cfs_rq.h_nr_runnable
      - Use the new cfs_rq.h_nr_runnable
       - Remove unused cfs_rq.h_nr_delayed
      - Rename cfs_rq.idle_h_nr_running into h_nr_idle
      - Remove unused cfs_rq.idle_nr_running
      - Rename cfs_rq.nr_running into nr_queued
      - Do not try to migrate delayed dequeue task
      - Fix variable declaration position
      - Encapsulate set custom slice in a __setparam_fair() function

   - Fixes:
      - Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
      - Fix CPU bandwidth limit bypass during CPU hotplug (Vishal
        Chourasia)

   - Cleanups:
      - Clean up in migrate_degrades_locality() to improve readability
        (Peter Zijlstra)
      - Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
      - Update comments after sched_tick() rename (Sebastian Andrzej
        Siewior)
      - Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
        (Valentin Schneider)

  Deadline scheduler (SCHED_DL) enhancements:

   - Restore dl_server bandwidth on non-destructive root domain changes
     (Juri Lelli)

   - Correctly account for allocated bandwidth during hotplug (Juri
     Lelli)

   - Check bandwidth overflow earlier for hotplug (Juri Lelli)

   - Clean up goto label in pick_earliest_pushable_dl_task() (John
     Stultz)

   - Consolidate timer cancellation (Wander Lairson Costa)

  Load-balancer enhancements:

   - Improve performance by prioritizing migrating eligible tasks in
     sched_balance_rq() (Hao Jia)

   - Do not compute NUMA Balancing stats unnecessarily during
     load-balancing (K Prateek Nayak)

   - Do not compute overloaded status unnecessarily during
     load-balancing (K Prateek Nayak)

  Generic scheduling code enhancements:

   - Use READ_ONCE() in task_on_rq_queued(), to consistently use the
     WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)

  Isolated CPUs support enhancements: (Waiman Long)

   - Make "isolcpus=nohz" equivalent to "nohz_full"
   - Consolidate housekeeping cpumasks that are always identical
   - Remove HK_TYPE_SCHED
   - Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE

  RSEQ enhancements:

   - Validate read-only fields under DEBUG_RSEQ config (Mathieu
     Desnoyers)

  PSI enhancements:

   - Fix race when task wakes up before psi_sched_switch() adjusts flags
     (Chengming Zhou)

  IRQ time accounting performance enhancements: (Yafang Shao)

   - Define sched_clock_irqtime as static key
   - Don't account irq time if sched_clock_irqtime is disabled

  Virtual machine scheduling enhancements:

   - Don't try to catch up excess steal time (Suleiman Souhlal)

  Heterogeneous x86 CPU scheduling enhancements: (K Prateek Nayak)

   - Convert "sysctl_sched_itmt_enabled" to boolean
   - Use guard() for itmt_update_mutex
   - Move the "sched_itmt_enabled" sysctl to debugfs
   - Remove x86_smt_flags and use cpu_smt_flags directly
   - Use x86_sched_itmt_flags for PKG domain unconditionally

  Debugging code & instrumentation enhancements:

   - Change need_resched warnings to pr_err() (David Rientjes)
   - Print domain name in /proc/schedstat (K Prateek Nayak)
   - Fix value reported by hot tasks pulled in /proc/schedstat (Peter
     Zijlstra)
   - Report the different kinds of imbalances in /proc/schedstat
     (Swapnil Sapkal)
   - Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
   - Update Schedstat version to 17 (Swapnil Sapkal)"

* tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
  rseq: Fix rseq unregistration regression
  psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
  sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  sched: Don't account irq time if sched_clock_irqtime is disabled
  sched: Define sched_clock_irqtime as static key
  sched/fair: Do not compute overloaded status unnecessarily during lb
  sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
  x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
  x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
  x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
  x86/itmt: Use guard() for itmt_update_mutex
  x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
  sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
  sched/debug: Change need_resched warnings to pr_err
  sched/fair: Encapsulate set custom slice in a __setparam_fair() function
  sched: Fix race between yield_to() and try_to_wake_up()
  docs: Update Schedstat version to 17
  sched/stats: Print domain name in /proc/schedstat
  sched: Move sched domain name out of CONFIG_SCHED_DEBUG
  sched: Report the different kinds of imbalances in /proc/schedstat
  ...
2025-01-21 11:32:36 -08:00
K Prateek Nayak
3429dd57f0 sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
entity is delayed irrespective of whether the entity corresponds to a
task or a cfs_rq.

Consider the following scenario:

	root
       /    \
      A	     B (*) delayed since B is no longer eligible on root
      |	     |
    Task0  Task1 <--- dequeue_task_fair() - task blocks

When Task1 blocks (dequeue_entity() for task's se returns true),
dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
hierarchy of Task1. However, when the sched_entity corresponding to
cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
hierarchy too leading to both dequeue_entity() and set_delayed()
decrementing h_nr_runnable for the dequeue of the same task.

A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
dequeue_entities() like below:

    cfs_rq->h_nr_runnable -= h_nr_runnable;
    SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);

is consistently tripped when running wakeup intensive workloads like
hackbench in a cgroup.

This error is self-correcting since cfs_rqs are per-CPU and cannot
migrate. The entity is either picked for a full dequeue or is requeued
when a task wakes up below it. Both of those paths call clear_delayed()
which again increments h_nr_runnable of the hierarchy without
considering if the entity corresponds to a task or not.

h_nr_runnable will eventually reflect the correct value; however, in the
interim, the incorrect values can still influence PELT calculations which
use se->runnable_weight or cfs_rq->h_nr_runnable.

Since only delayed tasks take the early return path in
dequeue_entities() and enqueue_task_fair(), adjust the
h_nr_runnable in {set,clear}_delayed() only when a task is delayed as
this path skips the h_nr_* update loops and returns early.

For entities corresponding to cfs_rq, the h_nr_* update loop in the
caller will do the right thing.
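
A sketch of the adjustment described above (simplified; it follows the
mainline fair.c naming but is not the exact upstream diff, and
clear_delayed() gets the symmetric change):

    static void set_delayed(struct sched_entity *se)
    {
            se->sched_delayed = 1;

            /* Only task entities take the early-return path in
             * dequeue_entities()/enqueue_task_fair(); for cfs_rq entities
             * the caller's h_nr_* update loop already does the accounting. */
            if (!entity_is_task(se))
                    return;

            for_each_sched_entity(se) {
                    struct cfs_rq *cfs_rq = cfs_rq_of(se);

                    cfs_rq->h_nr_runnable--;
                    if (cfs_rq_throttled(cfs_rq))
                            break;
            }
    }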

Fixes: 76f2f78329 ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Link: https://lkml.kernel.org/r/20250117105852.23908-1-kprateek.nayak@amd.com
2025-01-21 13:13:36 +01:00
Rafael J. Wysocki
a5c16f29a8 Merge branch 'pm-cpufreq'
Merge cpufreq updates for 6.14:

 - Use str_enable_disable()-like helpers in cpufreq (Krzysztof
   Kozlowski).

 - Extend the Apple cpufreq driver to support more SoCs (Hector Martin,
   Nick Chan).

 - Add new cpufreq driver for Airoha SoCs (Christian Marangi).

 - Fix using cpufreq-dt as module (Andreas Kemnade).

 - Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter
   Edwards, Sibi Sankar, Manivannan Sadhasivam).

 - Fix the maximum supported frequency computation in the ACPI cpufreq
   driver to avoid relying on unfounded assumptions (Gautham Shenoy).

 - Fix an amd-pstate driver regression with preferred core rankings not
   being used (Mario Limonciello).

 - Fix a precision issue with frequency calculation in the amd-pstate
   driver (Naresh Solanki).

 - Add ftrace event to the amd-pstate driver for active mode (Mario
   Limonciello).

 - Set default EPP policy on Ryzen processors in amd-pstate (Mario
   Limonciello).

 - Clean up the amd-pstate cpufreq driver and optimize it to increase
   code reuse (Mario Limonciello, Dhananjay Ugwekar).

 - Use CPPC to get scaling factors between HWP performance levels and
   frequency in the intel_pstate driver and make it stop using a
   built-in scaling factor for the Arrow Lake processor (Rafael Wysocki).

 - Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN for
   consistency with CPU offline (Christian Loehle).

 - Fix superfluous updates caused by need_freq_update in the schedutil
   cpufreq governor (Sultan Alsawaf).

* pm-cpufreq: (40 commits)
  cpufreq: Use str_enable_disable()-like helpers
  cpufreq: airoha: Add EN7581 CPUFreq SMCCC driver
  cpufreq: ACPI: Fix max-frequency computation
  cpufreq/amd-pstate: Refactor max frequency calculation
  cpufreq/amd-pstate: Fix prefcore rankings
  cpufreq: sparc: change kzalloc to kcalloc
  cpufreq: qcom: Implement clk_ops::determine_rate() for qcom_cpufreq* clocks
  cpufreq: qcom: Fix qcom_cpufreq_hw_recalc_rate() to query LUT if LMh IRQ is not available
  cpufreq: apple-soc: Add Apple A7-A8X SoC cpufreq support
  cpufreq: apple-soc: Set fallback transition latency to APPLE_DVFS_TRANSITION_TIMEOUT
  cpufreq: apple-soc: Increase cluster switch timeout to 400us
  cpufreq: apple-soc: Use 32-bit read for status register
  cpufreq: apple-soc: Allow per-SoC configuration of APPLE_DVFS_CMD_PS1
  cpufreq: apple-soc: Drop setting the PS2 field on M2+
  dt-bindings: cpufreq: apple,cluster-cpufreq: Add A7-A11, T2 compatibles
  dt-bindings: cpufreq: Document support for Airoha EN7581 CPUFreq
  cpufreq: fix using cpufreq-dt as module
  cpufreq: scmi: Register for limit change notifications
  cpufreq: schedutil: Fix superfluous updates caused by need_freq_update
  cpufreq: intel_pstate: Use CPUFREQ_POLICY_UNKNOWN
  ...
2025-01-20 20:34:50 +01:00
Rafael J. Wysocki
1225bb42b8 Merge branches 'pm-sleep', 'pm-cpuidle' and 'pm-em'
Merge updates related to system sleep, a cpuidle update and an Energy
Model handling code update for 6.14-rc1:

 - Allow configuring the system suspend-resume (DPM) watchdog to warn
   earlier than panic (Douglas Anderson).

 - Implement devm_device_init_wakeup() helper and introduce a device-
   managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan).

 - Remove direct inclusions of 'pm_wakeup.h' which should be only
   included via 'device.h' (Wolfram Sang).

 - Clean up two comments in the core system-wide PM code (Rafael
   Wysocki, Randy Dunlap).

 - Add Clearwater Forest processor support to the intel_idle cpuidle
   driver (Artem Bityutskiy).

 - Move sched domains rebuild function from the schedutil cpufreq
   governor to the Energy Model handling code (Rafael Wysocki).

* pm-sleep:
  PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq()
  PM: sleep: Allow configuring the DPM watchdog to warn earlier than panic
  PM: sleep: convert comment from kernel-doc to plain comment
  PM: wakeup: implement devm_device_init_wakeup() helper
  PM: sleep: sysfs: don't include 'pm_wakeup.h' directly
  PM: sleep: autosleep: don't include 'pm_wakeup.h' directly
  PM: sleep: Update stale comment in device_resume()

* pm-cpuidle:
  intel_idle: add Clearwater Forest SoC support

* pm-em:
  PM: EM: Move sched domains rebuild function from schedutil to EM
2025-01-20 19:14:15 +01:00
Linus Torvalds
8ff6d472ab - Do not adjust the weight of empty group entities and avoid scheduling
artifacts
 
 - Avoid scheduling lag by computing lag properly and thus address an EEVDF
   entity placement issue
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmeM2+8ACgkQEsHwGGHe
 VUrqHA/+PyPsILNfDFnaXD/Shv7kCgk3hH2YVrCnyIQn8fhTCL+K0luXRRmAPhl1
 pEbRyEtVRXZM9WyHJ1r7VmnIHVanUKWcuZzAx8HT9b7Tkrqy/xSriH/jAV7FYG9h
 gIhWA4vXT34HG2MQHTecJ/4gvkeI5tV93nkZ6vwEwwN4pGxSvijPR3yr3yl286Nt
 BlHP9cXBWWsxTq4CYb2j5q9nmymKuiDmhD8jc1fVjDKxjmrNHEwX5mtvtRLGI1dR
 O4pjUEqYQ+TWIkX8ZIRBPzLd9b6750ncO/9yb24cY7Z3RMKov4wz8b819Dt77cTp
 XF+bT+8fsaKNRjkzPEseZU4OL16ZO+33Kcn+JoNPWvbONgFKusZ4qkCCwvWpiQlV
 0Eddh1XjKBoXA1tR6VrREcUKQ2zWoXrn8tlHUTMfPuPq1jlbQNtzN7OWcR+WBy7L
 FiClWzWLv0fCeTXaAEcO9+/MN5z2lDQsJRiLtAlGh0JJU6/1H6IVX2tSZllkpR/R
 5pyHCpYsh5cucx6cQPk+rp9O6PDYu2frMuJ8itWlQLHeSzjzOoeE6EwwZcMQG4xb
 UG/azMbmqWrtmZqCgNCBtq7RAkVRe0IuxWuCtV1VkatU+2RXdtV85tVKn/JG2KLW
 05c8GTdQZgnoY65/TZHwudhkzytclynKrrvhfKFllAWnb8D8gyQ=
 =fZWm
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Do not adjust the weight of empty group entities and avoid
   scheduling artifacts

 - Avoid scheduling lag by computing lag properly and thus address
   an EEVDF entity placement issue

* tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE
  sched/fair: Fix EEVDF entity placement bug causing scheduling lag
2025-01-19 09:01:17 -08:00
Rafael J. Wysocki
423124ab97 Merge back earlier cpufreq material for 6.14 2025-01-16 20:20:20 +01:00
Nicholas Piggin
21641bd9a7 lazy tlb: fix hotplug exit race with MMU_LAZY_TLB_SHOOTDOWN
CPU unplug first calls __cpu_disable(), and that's where powerpc calls
cleanup_cpu_mmu_context(), which clears this CPU from mm_cpumask() of all
mms in the system.

However, this CPU may still be using a lazy tlb mm, and this CPU's bit
will have been cleared from that mm's mm_cpumask.  The CPU does not switch
away from the lazy tlb mm until arch_cpu_idle_dead() calls idle_task_exit().

If that user mm exits in this window, it will not be subject to the lazy
tlb mm shootdown and may be freed while in use as a lazy mm by the CPU
that is being unplugged.

cleanup_cpu_mmu_context() could be moved later, but it looks better to
move the lazy tlb mm switching earlier.  The problem with doing the lazy
mm switching in idle_task_exit() is explained in commit bf2c59fce4
("sched/core: Fix illegal RCU from offline CPUs"), which added a wart to
switch away from the mm but leave it set in active_mm to be cleaned up
later.

So instead, switch away from the lazy tlb mm at sched_cpu_wait_empty(),
which is the last hotplug state before teardown
(CPUHP_AP_SCHED_WAIT_EMPTY).  This CPU will never switch to a user thread
from this point, so it has no chance to pick up a new lazy tlb mm.  This
removes the lazy tlb mm handling wart in CPU unplug.

With this, idle_task_exit() is not needed anymore and can be cleaned up.
This patch leaves the prototype alone, to be cleaned up after this change.
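
A sketch of the idea (the helper name is hypothetical and the actual
patch may structure this differently; it mirrors what idle_task_exit()
does today, just earlier in the hotplug sequence):

    /* Hypothetical helper called from sched_cpu_wait_empty(): switch this
     * CPU off any lazy tlb mm and onto init_mm before teardown, so the mm
     * can be freed safely once its last real user exits. */
    static void sched_switch_away_from_lazy_mm(void)
    {
            struct mm_struct *mm = current->active_mm;

            if (mm == &init_mm)
                    return;

            mmgrab_lazy_tlb(&init_mm);
            current->active_mm = &init_mm;
            switch_mm_irqs_off(mm, &init_mm, current);
            mmdrop_lazy_tlb(mm);
    }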

herton: took the suggestions from https://lore.kernel.org/all/87jzvyprsw.ffs@tglx/
and made adjustments on the initial patch proposed by Nicholas.

Link: https://lkml.kernel.org/r/20230524060455.147699-1-npiggin@gmail.com
Link: https://lore.kernel.org/all/20230525205253.E2FAEC433EF@smtp.kernel.org/
Link: https://lkml.kernel.org/r/20241104142318.3295663-1-herton@redhat.com
Fixes: 2655421ae6 ("lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13 22:40:42 -08:00
Peter Zijlstra
d40797d672 kasan: make kasan_record_aux_stack_noalloc() the default behaviour
kasan_record_aux_stack_noalloc() was introduced to record a stack trace
without allocating memory in the process.  It has been added to callers
which were invoked while a raw_spinlock_t was held.  More and more callers
were identified and changed over time.  Is it a good thing to have this
while functions try their best to do a lockless setup?  The only
downside of having kasan_record_aux_stack() not allocate any memory is
that we end up without a stacktrace if stackdepot runs out of memory and
the same stacktrace was not recorded before.  To quote Marco Elver from
https://lore.kernel.org/all/CANpmjNPmQYJ7pv1N3cuU8cP18u7PP_uoZD8YxwZd4jtbof9nVQ@mail.gmail.com/

| I'd be in favor, it simplifies things. And stack depot should be
| able to replenish its pool sufficiently in the "non-aux" cases
| i.e. regular allocations. Worst case we fail to record some
| aux stacks, but I think that's only really bad if there's a bug
| around one of these allocations. In general the probabilities
| of this being a regression are extremely small [...]

Make the kasan_record_aux_stack_noalloc() behaviour default as
kasan_record_aux_stack().
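
A minimal illustration of why this matters for callers (the lock, flags
and object below belong to an assumed caller, not to this patch):

    /* With the no-alloc behaviour now the default, code that records an
     * aux stack while holding a raw_spinlock_t can simply call
     * kasan_record_aux_stack() without risking an allocation. */
    raw_spin_lock_irqsave(&lock, flags);
    kasan_record_aux_stack(object);
    raw_spin_unlock_irqrestore(&lock, flags);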

[bigeasy@linutronix.de: dressed the diff as patch]
Link: https://lkml.kernel.org/r/20241122155451.Mb2pmeyJ@linutronix.de
Fixes: 7cb3007ce2 ("kasan: generic: introduce kasan_record_aux_stack_noalloc()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reported-by: syzbot+39f85d612b7c20d8db48@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67275485.050a0220.3c8d68.0a37.GAE@google.com
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: syzkaller-bugs@googlegroups.com
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13 22:40:36 -08:00
Randy Dunlap
987ce79b52 sched_ext: fix kernel-doc warnings
Use the correct function parameter names and function names.
Use the correct kernel-doc comment format for struct sched_ext_ops
to eliminate a bunch of warnings.
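
For reference, the kernel-doc form these warnings expect for struct
members looks like this (illustrative excerpt, not the actual ext.c text):

    /**
     * struct sched_ext_ops - Operation table for BPF scheduler implementation
     * @select_cpu: Pick the target CPU for a task which is being woken up
     * @enqueue: Enqueue a task on the BPF scheduler
     */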

ext.c:1418: warning: Excess function parameter 'include_dead' description in 'scx_task_iter_next_locked'
ext.c:7261: warning: expecting prototype for scx_bpf_dump(). Prototype was for scx_bpf_dump_bstr() instead
ext.c:7352: warning: Excess function parameter 'flags' description in 'scx_bpf_cpuperf_set'

ext.c:3150: warning: Function parameter or struct member 'in_fi' not described in 'scx_prio_less'
ext.c:4711: warning: Function parameter or struct member 'dur_s' not described in 'scx_softlockup'
ext.c:4775: warning: Function parameter or struct member 'bypass' not described in 'scx_ops_bypass'
ext.c:7453: warning: Function parameter or struct member 'idle_mask' not described in 'scx_bpf_put_idle_cpumask'

ext.c:209: warning: Incorrect use of kernel-doc format:          * select_cpu - Pick the target CPU for a task which is being woken up
ext.c:236: warning: Incorrect use of kernel-doc format:          * enqueue - Enqueue a task on the BPF scheduler
ext.c:251: warning: Incorrect use of kernel-doc format:          * dequeue - Remove a task from the BPF scheduler
ext.c:267: warning: Incorrect use of kernel-doc format:          * dispatch - Dispatch tasks from the BPF scheduler and/or user DSQs
ext.c:290: warning: Incorrect use of kernel-doc format:          * tick - Periodic tick
ext.c:300: warning: Incorrect use of kernel-doc format:          * runnable - A task is becoming runnable on its associated CPU
ext.c:327: warning: Incorrect use of kernel-doc format:          * running - A task is starting to run on its associated CPU
ext.c:335: warning: Incorrect use of kernel-doc format:          * stopping - A task is stopping execution
ext.c:346: warning: Incorrect use of kernel-doc format:          * quiescent - A task is becoming not runnable on its associated CPU
ext.c:366: warning: Incorrect use of kernel-doc format:          * yield - Yield CPU
ext.c:381: warning: Incorrect use of kernel-doc format:          * core_sched_before - Task ordering for core-sched
ext.c:399: warning: Incorrect use of kernel-doc format:          * set_weight - Set task weight
ext.c:408: warning: Incorrect use of kernel-doc format:          * set_cpumask - Set CPU affinity
ext.c:418: warning: Incorrect use of kernel-doc format:          * update_idle - Update the idle state of a CPU
ext.c:439: warning: Incorrect use of kernel-doc format:          * cpu_acquire - A CPU is becoming available to the BPF scheduler
ext.c:449: warning: Incorrect use of kernel-doc format:          * cpu_release - A CPU is taken away from the BPF scheduler
ext.c:461: warning: Incorrect use of kernel-doc format:          * init_task - Initialize a task to run in a BPF scheduler
ext.c:476: warning: Incorrect use of kernel-doc format:          * exit_task - Exit a previously-running task from the system
ext.c:485: warning: Incorrect use of kernel-doc format:          * enable - Enable BPF scheduling for a task
ext.c:494: warning: Incorrect use of kernel-doc format:          * disable - Disable BPF scheduling for a task
ext.c:504: warning: Incorrect use of kernel-doc format:          * dump - Dump BPF scheduler state on error
ext.c:512: warning: Incorrect use of kernel-doc format:          * dump_cpu - Dump BPF scheduler state for a CPU on error
ext.c:524: warning: Incorrect use of kernel-doc format:          * dump_task - Dump BPF scheduler state for a runnable task on error
ext.c:535: warning: Incorrect use of kernel-doc format:          * cgroup_init - Initialize a cgroup
ext.c:550: warning: Incorrect use of kernel-doc format:          * cgroup_exit - Exit a cgroup
ext.c:559: warning: Incorrect use of kernel-doc format:          * cgroup_prep_move - Prepare a task to be moved to a different cgroup
ext.c:574: warning: Incorrect use of kernel-doc format:          * cgroup_move - Commit cgroup move
ext.c:585: warning: Incorrect use of kernel-doc format:          * cgroup_cancel_move - Cancel cgroup move
ext.c:597: warning: Incorrect use of kernel-doc format:          * cgroup_set_weight - A cgroup's weight is being changed
ext.c:611: warning: Incorrect use of kernel-doc format:          * cpu_online - A CPU became online
ext.c:620: warning: Incorrect use of kernel-doc format:          * cpu_offline - A CPU is going offline
ext.c:633: warning: Incorrect use of kernel-doc format:          * init - Initialize the BPF scheduler
ext.c:638: warning: Incorrect use of kernel-doc format:          * exit - Clean up after the BPF scheduler
ext.c:648: warning: Incorrect use of kernel-doc format:          * dispatch_max_batch - Max nr of tasks that dispatch() can dispatch
ext.c:653: warning: Incorrect use of kernel-doc format:          * flags - %SCX_OPS_* flags
ext.c:658: warning: Incorrect use of kernel-doc format:          * timeout_ms - The maximum amount of time, in milliseconds, that a
ext.c:667: warning: Incorrect use of kernel-doc format:          * exit_dump_len - scx_exit_info.dump buffer length. If 0, the default
ext.c:673: warning: Incorrect use of kernel-doc format:          * hotplug_seq - A sequence number that may be set by the scheduler to
ext.c:682: warning: Incorrect use of kernel-doc format:          * name - BPF scheduler's name

ext.c:689: warning: Function parameter or struct member 'select_cpu' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'enqueue' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dequeue' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dispatch' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'tick' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'runnable' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'running' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'stopping' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'quiescent' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'yield' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'core_sched_before' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'set_weight' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'set_cpumask' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'update_idle' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cpu_acquire' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cpu_release' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'init_task' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'exit_task' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'enable' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'disable' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dump' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dump_cpu' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dump_task' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_init' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_exit' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_prep_move' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_move' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_cancel_move' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_set_weight' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cpu_online' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cpu_offline' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'init' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'exit' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dispatch_max_batch' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'flags' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'timeout_ms' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'exit_dump_len' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'hotplug_seq' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'name' not described in 'sched_ext_ops'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <changwoo@igalia.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-13 08:23:29 -10:00
Chengming Zhou
7d9da04057 psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
When running hackbench in a cgroup with bandwidth throttling enabled,
following PSI splat was observed:

    psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4

When investigating the series of events leading up to the splat,
following sequence was observed:

    [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
        ...
    [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
    [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
    # CPU8 goes into newidle balance and releases the rq lock
        ...
    # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
    [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
    [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
    [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...

psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
for the blocked entity; however, with the introduction of DELAY_DEQUEUE,
the blocked task can wake up when newidle balance drops the runqueue lock
during __schedule().

If a task wakes before psi_sched_switch() adjusts the PSI flags, skip
any modifications in psi_enqueue() which would still see the flags of a
running task and not a blocked one. Instead, rely on psi_sched_switch()
to do the right thing.

Since the status returned by try_to_block_task() may no longer be true
by the time schedule reaches psi_sched_switch(), check if the task is
blocked or not using a combination of task_on_rq_queued() and
p->se.sched_delayed checks.
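
A sketch of that check as it would look at the psi_sched_switch() call
site in __schedule() (simplified, not necessarily the exact diff):

    /* Decide "blocked" at context-switch time instead of trusting the
     * earlier try_to_block_task() result, since the task may have been
     * woken while the rq lock was dropped for newidle balancing. */
    bool block = !task_on_rq_queued(prev) || prev->se.sched_delayed;

    psi_sched_switch(prev, next, block);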

[ prateek: Commit message, testing, early bailout in psi_enqueue() ]

Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue") # 1a6151017e
Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Link: https://lore.kernel.org/r/20241227061941.2315-1-kprateek.nayak@amd.com
2025-01-13 14:10:26 +01:00
Yafang Shao
a6fd16148f sched, psi: Don't account irq time if sched_clock_irqtime is disabled
sched_clock_irqtime may be disabled due to the clock source. When disabled,
irq_time_read() won't change over time, so there is nothing to account. We
can save iterating the whole hierarchy on every tick and context switch.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20250103022409.2544-4-laoar.shao@gmail.com
2025-01-13 14:10:26 +01:00
Yafang Shao
763a744e24 sched: Don't account irq time if sched_clock_irqtime is disabled
sched_clock_irqtime may be disabled due to the clock source, in which case
IRQ time should not be accounted. Let's add a conditional check to avoid
unnecessary logic.
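
A sketch of the kind of early bailout described (the irqtime_enabled()
helper name is an assumption tied to the static-key change below):

    /* Skip all IRQ time accounting when the clock source cannot provide
     * per-IRQ time. */
    void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
    {
            if (!irqtime_enabled())
                    return;

            /* ... existing accounting ... */
    }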

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250103022409.2544-3-laoar.shao@gmail.com
2025-01-13 14:10:25 +01:00
Yafang Shao
8722903cbb sched: Define sched_clock_irqtime as static key
Since CPU time accounting is a performance-critical path, let's define
sched_clock_irqtime as a static key to minimize potential overhead.
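
A sketch of the static-key form (simplified; helper names may not match
the final code exactly):

    DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime);

    static inline bool irqtime_enabled(void)
    {
            return static_branch_likely(&sched_clock_irqtime);
    }

    void enable_sched_clock_irqtime(void)
    {
            static_branch_enable(&sched_clock_irqtime);
    }

    void disable_sched_clock_irqtime(void)
    {
            static_branch_disable(&sched_clock_irqtime);
    }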

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250103022409.2544-2-laoar.shao@gmail.com
2025-01-13 14:10:25 +01:00
K Prateek Nayak
3229adbe78 sched/fair: Do not compute overloaded status unnecessarily during lb
Only set sg_overloaded when computing sg_lb_stats() at the highest sched
domain since rd->overloaded status is updated only when load balancing
at the highest domain. While at it, move setting of sg_overloaded below
idle_cpu() check since an idle CPU can never be overloaded.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20241223043407.1611-8-kprateek.nayak@amd.com
2025-01-13 14:10:25 +01:00
K Prateek Nayak
0ac1ee9ebf sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
Aggregate nr_numa_running and nr_preferred_running when load balancing
at NUMA domains only. While at it, also move the aggregation below the
idle_cpu() check since an idle CPU cannot have any preferred tasks.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20241223043407.1611-7-kprateek.nayak@amd.com
2025-01-13 14:10:25 +01:00
Hao Jia
873199d27b sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
When the PLACE_LAG scheduling feature is enabled and
dst_cfs_rq->nr_queued is greater than 1, a task that is
ineligible (lag < 0) on the source CPU runqueue will also
be ineligible after it is migrated to the destination CPU
runqueue, because place_entity() keeps the task's original
equivalent lag. So if the task was ineligible before, it
will still be ineligible after migration.

So in sched_balance_rq(), we prioritize migrating eligible
tasks and soft-limit ineligible tasks, allowing them to
migrate only when nr_balance_failed is non-zero, to avoid
load balancing trying very hard to balance the load.
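
A sketch of that policy (not the exact upstream code; placement inside
the detach_tasks() scan is assumed for illustration):

    /* Skip tasks that are currently ineligible (negative lag) unless load
     * balancing has already been failing repeatedly. */
    if (!entity_eligible(cfs_rq_of(&p->se), &p->se) &&
        !env->sd->nr_balance_failed)
            goto next;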

Below are some benchmark test results. From my test results,
this patch shows a slight improvement on hackbench.

Benchmark
=========

All of the benchmarks are done inside a normal cpu cgroup in a
clean environment with cpu turbo disabled, and test machine is:

Single NUMA machine model is 13th Gen Intel(R) Core(TM)
i7-13700, 12 Core/24 HT.

Based on master b86545e02e.

Results
=======

hackbench-process-pipes
                      vanilla                  patched
Amean     1       0.5837 (   0.00%)      0.5733 (   1.77%)
Amean     4       1.4423 (   0.00%)      1.4503 (  -0.55%)
Amean     7       2.5147 (   0.00%)      2.4773 (   1.48%)
Amean     12      3.9347 (   0.00%)      3.8880 (   1.19%)
Amean     21      5.3943 (   0.00%)      5.3873 (   0.13%)
Amean     30      6.7840 (   0.00%)      6.6660 (   1.74%)
Amean     48      9.8313 (   0.00%)      9.6100 (   2.25%)
Amean     79     15.4403 (   0.00%)     14.9580 (   3.12%)
Amean     96     18.4970 (   0.00%)     17.9533 (   2.94%)

hackbench-process-sockets
                      vanilla                  patched
Amean     1       0.6297 (   0.00%)      0.6223 (   1.16%)
Amean     4       2.1517 (   0.00%)      2.0887 (   2.93%)
Amean     7       3.6377 (   0.00%)      3.5670 (   1.94%)
Amean     12      6.1277 (   0.00%)      5.9290 (   3.24%)
Amean     21     10.0380 (   0.00%)      9.7623 (   2.75%)
Amean     30     14.1517 (   0.00%)     13.7513 (   2.83%)
Amean     48     24.7253 (   0.00%)     24.2287 (   2.01%)
Amean     79     43.9523 (   0.00%)     43.2330 (   1.64%)
Amean     96     54.5310 (   0.00%)     53.7650 (   1.40%)

tbench4 Throughput
                      vanilla                  patched
Hmean     1       255.97 (   0.00%)      275.01 (   7.44%)
Hmean     2       511.60 (   0.00%)      544.27 (   6.39%)
Hmean     4       996.70 (   0.00%)     1006.57 (   0.99%)
Hmean     8      1646.46 (   0.00%)     1649.15 (   0.16%)
Hmean     16     2259.42 (   0.00%)     2274.35 (   0.66%)
Hmean     32     4725.48 (   0.00%)     4735.57 (   0.21%)
Hmean     64     4411.47 (   0.00%)     4400.05 (  -0.26%)
Hmean     96     4284.31 (   0.00%)     4267.39 (  -0.39%)

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241223091446.90208-1-jiahao.kernel@gmail.com
2025-01-13 14:10:23 +01:00
David Rientjes
8061b9f5e1 sched/debug: Change need_resched warnings to pr_err
need_resched warnings, if enabled, are treated as WARNINGs.  If
kernel.panic_on_warn is enabled, then this causes a kernel panic.

It's highly unlikely that a panic is desired for these warnings, only a
stack trace is normally required to debug and resolve.

Thus, switch need_resched warnings to simply be a printk with an
associated stack trace so they are no longer in scope for panic_on_warn.
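
An illustrative replacement pattern (the message text is made up, not the
actual patch):

    /* A plain pr_err() plus dump_stack() keeps the diagnostic visible but
     * is no longer a WARN() that panic_on_warn would turn into a panic. */
    pr_err("sched: need_resched set for too long on CPU%d\n",
           smp_processor_id());
    dump_stack();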

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Acked-by: Josh Don <joshdon@google.com>
Link: https://lkml.kernel.org/r/e8d52023-5291-26bd-5299-8bb9eb604929@google.com
2025-01-13 14:10:23 +01:00
Vincent Guittot
2cf9ac4007 sched/fair: Encapsulate set custom slice in a __setparam_fair() function
Similarly to dl, create a __setparam_fair() function to set parameters
related to fair class and move it in the fair.c file.

No functional changes expected

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20250110144656.484601-1-vincent.guittot@linaro.org
2025-01-13 14:10:22 +01:00
Tianchen Ding
5d808c78d9 sched: Fix race between yield_to() and try_to_wake_up()
We met a SCHED_WARN in set_next_buddy():
  __warn_printk
  set_next_buddy
  yield_to_task_fair
  yield_to
  kvm_vcpu_yield_to [kvm]
  ...

After a short dig, we found that the rq lock held by yield_to() may not
be the lock of the rq that the target task actually belongs to. There is
a race window against try_to_wake_up().

         CPU0                             target_task

                                        blocking on CPU1
   lock rq0 & rq1
   double check task_rq == p_rq, ok
                                        woken to CPU2 (lock task_pi & rq2)
                                        task_rq = rq2
   yield_to_task_fair (w/o lock rq2)

In this race window, yield_to() is operating on the task without holding
the correct lock. Fix this by taking the task's pi_lock first.
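
A sketch of the described fix (simplified):

    /* Take the target's pi_lock before the runqueue locks so a concurrent
     * try_to_wake_up() cannot move the task to another rq underneath us. */
    raw_spin_lock_irqsave(&p->pi_lock, flags);
    rq = this_rq();
    p_rq = task_rq(p);
    double_rq_lock(rq, p_rq);

    /* ... re-check task_rq(p) == p_rq and call yield_to_task_fair() ... */

    double_rq_unlock(rq, p_rq);
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);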

Fixes: d95f412200 ("sched: Add yield_to(task, preempt) functionality")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241231055020.6521-1-dtcccc@linux.alibaba.com
2025-01-13 14:10:22 +01:00
Peter Zijlstra
66951e4860 sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE
Normally dequeue_entities() will continue to dequeue an empty group entity;
except DELAY_DEQUEUE changes things -- it retains empty entities such that they
might continue to compete and burn off some lag.

However, doing this results in update_cfs_group() re-computing the cgroup
weight 'slice' for an empty group, which it (rightly) figures isn't much at
all. This in turn means that the delayed entity is not competing at the
expected weight. Worse, the very low weight causes its lag to be inflated,
which combined with avg_vruntime() using scale_load_down(), leads to artifacts.

As such, don't adjust the weight for empty group entities and let them compete
at their original weight.
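
A sketch of the change in update_cfs_group() (may not match the exact
upstream diff):

    /* Skip the reweight entirely when the group cfs_rq is empty, so a
     * delayed-dequeue group entity keeps competing at its original
     * weight. */
    if (!gcfs_rq->load.weight)
            return;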

Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250110115720.GA17405@noisy.programming.kicks-ass.net
2025-01-13 13:50:56 +01:00
Wang Yaxin
f65c64f311 delayacct: add delay min to record delay peak
Delay accounting can now calculate the average delay of processes,
detect the overall system load, and record 'delay max' to identify
potential abnormal delays.  However, 'delay min' gives us another
useful reference point: by comparing the difference between 'delay
max' and 'delay min', we can see how much room there is to optimize
latency, providing a reference for latency tuning.

Use case
=========
bash-4.4# ./getdelays -d -t 242
print delayacct stats ON
TGID    242
CPU         count     real total  virtual total    delay total  delay average      delay max      delay min
               39      156000000      156576579        2111069          0.054ms     0.212296ms     0.031307ms
IO          count    delay total  delay average      delay max      delay min
                0              0          0.000ms     0.000000ms     0.000000ms
SWAP        count    delay total  delay average      delay max      delay min
                0              0          0.000ms     0.000000ms     0.000000ms
RECLAIM     count    delay total  delay average      delay max      delay min
                0              0          0.000ms     0.000000ms     0.000000ms
THRASHING   count    delay total  delay average      delay max      delay min
                0              0          0.000ms     0.000000ms     0.000000ms
COMPACT     count    delay total  delay average      delay max      delay min
                0              0          0.000ms     0.000000ms     0.000000ms
WPCOPY      count    delay total  delay average      delay max      delay min
              156       11215873          0.072ms     0.207403ms     0.033913ms
IRQ         count    delay total  delay average      delay max      delay min
                0              0          0.000ms     0.000000ms     0.000000ms

Link: https://lkml.kernel.org/r/20241220173105906EOdsPhzjMLYNJJBqgz1ga@zte.com.cn
Co-developed-by: Wang Yong <wang.yong12@zte.com.cn>
Signed-off-by: Wang Yong <wang.yong12@zte.com.cn>
Co-developed-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Co-developed-by: Kun Jiang <jiang.kun2@zte.com.cn>
Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Peilin He <he.peilin@zte.com.cn>
Cc: tuqiang <tu.qiang35@zte.com.cn>
Cc: ye xingchen <ye.xingchen@zte.com.cn>
Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12 20:21:16 -08:00
Wang Yaxin
658eb5ab91 delayacct: add delay max to record delay peak
Introduce the use cases of delay max, which can help quickly detect
potential abnormal delays in the system and record the types and specific
details of delay spikes.

Problem
========
Delay accounting can track the average delay of processes to show
system workload. However, when a process experiences a significant
delay spike that adversely affects performance, getdelays can only
display the average system delay over a period of time, and the
average delay is unhelpful for diagnosing a delay peak. It is not
even possible to determine which type of delay has spiked, as this
information can be masked by the average delay.

Solution
=========
The 'delay max' field displays the delay peak since the system's
startup, recording potential abnormal delays over time, including
the type of delay and the maximum delay observed. This is helpful
for quickly identifying crashes caused by delays.

Use case
=========
bash# ./getdelays -d -p 244
print delayacct stats ON
PID     244

CPU             count     real total  virtual total    delay total  delay average      delay max
                   68      192000000      213676651         705643          0.010ms     0.306381ms
IO              count    delay total  delay average      delay max
                    0              0          0.000ms     0.000000ms
SWAP            count    delay total  delay average      delay max
                    0              0          0.000ms     0.000000ms
RECLAIM         count    delay total  delay average      delay max
                    0              0          0.000ms     0.000000ms
THRASHING       count    delay total  delay average      delay max
                    0              0          0.000ms     0.000000ms
COMPACT         count    delay total  delay average      delay max
                    0              0          0.000ms     0.000000ms
WPCOPY          count    delay total  delay average      delay max
                  235       15648284          0.067ms     0.263842ms
IRQ             count    delay total  delay average      delay max
                    0              0          0.000ms     0.000000ms
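
As a rough illustration of the bookkeeping this feature implies (a minimal
sketch with hypothetical helper and field names, not the actual delayacct
code), recording the peak alongside the running total could look like:

  /* Hypothetical sketch: track total, count and peak for one delay type. */
  static void record_delay_sketch(u64 *total_ns, u64 *max_ns, u32 *count,
                                  u64 delay_ns)
  {
          *total_ns += delay_ns;          /* feeds "delay total" / "delay average" */
          *count += 1;
          if (delay_ns > *max_ns)
                  *max_ns = delay_ns;     /* "delay max": peak since startup */
  }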

[wang.yaxin@zte.com.cn: update docs and fix some spelling errors]
  Link: https://lkml.kernel.org/r/20241213192700771XKZ8H30OtHSeziGqRVMs0@zte.com.cn
Link: https://lkml.kernel.org/r/20241203164848805CS62CQPQWG9GLdQj2_BxS@zte.com.cn
Co-developed-by: Wang Yong <wang.yong12@zte.com.cn>
Signed-off-by: Wang Yong <wang.yong12@zte.com.cn>
Co-developed-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Co-developed-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Peilin He <he.peilin@zte.com.cn>
Cc: tuqiang <tu.qiang35@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: ye xingchen <ye.xingchen@zte.com.cn>
Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12 20:20:59 -08:00
Linus Torvalds
2e3f3090bd sched_ext: Fixes for v6.13-rc6
- Fix corner case bug where ops.dispatch() couldn't extend the execution of
   the current task if SCX_OPS_ENQ_LAST is set.
 
 - Fix ops.cpu_release() not being called when a SCX task is preempted by a
   higher priority sched class task.
 
 - Fix builtin idle mask being incorrectly left as busy after an idle CPU is
   picked and kicked.
 
 - scx_ops_bypass() was unnecessarily using rq_lock(), which comes with
   rq-pinning related sanity checks that could trigger spuriously. Switch
   to raw_spin_rq_lock().
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ4Gmpw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGVntAP0b4i4PEIkupj9+i8ZzlwqvYX3gFJ7E4v3wmjDp
 1VYdrAD/ZetrhrM+9RyyKpMIDFnN+xE6YbslBSlAzGzgfdsbXA0=
 =zGXi
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix corner case bug where ops.dispatch() couldn't extend the
   execution of the current task if SCX_OPS_ENQ_LAST is set.

 - Fix ops.cpu_release() not being called when a SCX task is preempted
   by a higher priority sched class task.

 - Fix builtin idle mask being incorrectly left as busy after an idle CPU
   is picked and kicked.

 - scx_ops_bypass() was unnecessarily using rq_lock(), which comes with
   rq-pinning related sanity checks that could trigger spuriously.
   Switch to raw_spin_rq_lock().

* tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: idle: Refresh idle masks during idle-to-idle transitions
  sched_ext: switch class when preempted by higher priority scheduler
  sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass()
  sched_ext: keep running prev when prev->scx.slice != 0
2025-01-10 15:11:58 -08:00
Andrea Righi
a2a3374c47 sched_ext: idle: Refresh idle masks during idle-to-idle transitions
With the consolidation of put_prev_task/set_next_task(), see
commit 436f3eed5c ("sched: Combine the last put_prev_task() and the
first set_next_task()"), we are now skipping the transition between
these two functions when the previous and the next tasks are the same.

As a result, the scx idle state of a CPU is updated only when
transitioning to or from the idle thread. While this is generally
correct, it can lead to uneven and inefficient core utilization in
certain scenarios [1].

A typical scenario involves proactive wake-ups: scx_bpf_pick_idle_cpu()
selects and marks an idle CPU as busy, followed by a wake-up via
scx_bpf_kick_cpu(), without dispatching any tasks. In this case, the CPU
continues running the idle thread, returns to idle, but remains marked
as busy, preventing it from being selected again as an idle CPU (until a
task eventually runs on it and releases the CPU).

For example, running a workload that uses 20% of each CPU, combined with
an scx scheduler using proactive wake-ups, results in the following core
utilization:

 CPU 0: 25.7%
 CPU 1: 29.3%
 CPU 2: 26.5%
 CPU 3: 25.5%
 CPU 4:  0.0%
 CPU 5: 25.5%
 CPU 6:  0.0%
 CPU 7: 10.5%

To address this, refresh the idle state also in pick_task_idle(), during
idle-to-idle transitions, but only trigger ops.update_idle() on actual
state changes to prevent unnecessary updates to the scx scheduler and
maintain balanced state transitions.

With this change in place, the core utilization in the previous example
becomes the following:

 CPU 0: 18.8%
 CPU 1: 19.4%
 CPU 2: 18.0%
 CPU 3: 18.7%
 CPU 4: 19.3%
 CPU 5: 18.9%
 CPU 6: 18.7%
 CPU 7: 19.3%

[1] https://github.com/sched-ext/scx/pull/1139
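
A minimal sketch of the approach (helper and mask names are illustrative,
not the actual sched_ext idle code): refresh the idle mask on idle-to-idle
transitions, but only report genuine state changes to the BPF scheduler.

  static void refresh_idle_sketch(int cpu, bool idle, struct cpumask *idle_mask)
  {
          bool was_idle = cpumask_test_cpu(cpu, idle_mask);

          if (idle)
                  cpumask_set_cpu(cpu, idle_mask);
          else
                  cpumask_clear_cpu(cpu, idle_mask);

          /* Invoke ops.update_idle() only on actual state changes to avoid
           * redundant callbacks into the BPF scheduler.
           */
          if (was_idle != idle) {
                  /* ... call ops.update_idle(cpu, idle) here ... */
          }
  }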

Fixes: 7c65ae81ea ("sched_ext: Don't call put_prev_task_scx() before picking the next task")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10 12:40:42 -10:00
Changwoo Min
3a9910b590 sched_ext: Implement scx_bpf_now()
Returns a high-performance monotonically non-decreasing clock for the current
CPU. The clock returned is in nanoseconds.

It provides the following properties:

1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently
 to account for execution time and track tasks' runtime properties.
 Unfortunately, in some hardware platforms, bpf_ktime_get_ns() -- which
 eventually reads a hardware timestamp counter -- is neither performant nor
 scalable. scx_bpf_now() aims to provide a high-performance clock by
 using the rq clock in the scheduler core whenever possible.

2) High enough resolution for the BPF scheduler use cases: In most BPF
 scheduler use cases, the required clock resolution is lower than the most
 accurate hardware clock (e.g., rdtsc in x86). scx_bpf_now() basically
 uses the rq clock in the scheduler core whenever it is valid. It considers
 that the rq clock is valid from the time the rq clock is updated
 (update_rq_clock) until the rq is unlocked (rq_unpin_lock).

3) Monotonically non-decreasing clock on the same CPU: scx_bpf_now()
 guarantees the clock never goes backward when compared on the same CPU.
 On the other hand, when comparing clocks across different CPUs, there
 is no such guarantee -- the clock can go backward. Because the clock is
 monotonically *non-decreasing*, two scx_bpf_now() calls on the same CPU
 return the same value for the duration in which the rq clock is valid.

An rq clock becomes valid when it is updated using update_rq_clock()
and invalidated when the rq is unlocked using rq_unpin_lock().

Let's suppose the following timeline in the scheduler core:

   T1. rq_lock(rq)
   T2. update_rq_clock(rq)
   T3. a sched_ext BPF operation
   T4. rq_unlock(rq)
   T5. a sched_ext BPF operation
   T6. rq_lock(rq)
   T7. update_rq_clock(rq)

For [T2, T4), we consider that rq clock is valid (SCX_RQ_CLK_VALID is
set), so scx_bpf_now() calls during [T2, T4) (including T3) will
return the rq clock updated at T2. For duration [T4, T7), when a BPF
scheduler can still call scx_bpf_now() (T5), we consider the rq clock
invalid (SCX_RQ_CLK_VALID is unset at T4). So when calling
scx_bpf_now() at T5, we will return a fresh clock value by calling
sched_clock_cpu() internally. Also, to prevent getting outdated rq clocks
from a previous scx scheduler, invalidate all the rq clocks when unloading
a BPF scheduler.

One example of calling scx_bpf_now(), when the rq clock is invalid
(like T5), is in scx_central [1]. The scx_central scheduler uses a BPF
timer for preemptive scheduling. In every msec, the timer callback checks
if the currently running tasks exceed their timeslice. At the beginning of
the BPF timer callback (central_timerfn in scx_central.bpf.c), scx_central
gets the current time. When the BPF timer callback runs, the rq clock could
be invalid, the same as T5. In this case, scx_bpf_now() returns a fresh
clock value rather than returning the old one (T2).

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_central.bpf.c
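
A simplified sketch of the selection logic described above (field names such
as rq->scx.clock and rq->scx.prev_clock are illustrative, not the exact
kernel implementation):

  static u64 scx_bpf_now_sketch(struct rq *rq)
  {
          u64 clk;

          if (rq->scx.flags & SCX_RQ_CLK_VALID)
                  clk = rq->scx.clock;                    /* updated at T2, valid until T4 */
          else
                  clk = sched_clock_cpu(cpu_of(rq));      /* e.g. a call at T5 */

          /* Enforce a monotonically non-decreasing value per CPU. */
          if (clk < rq->scx.prev_clock)
                  clk = rq->scx.prev_clock;
          rq->scx.prev_clock = clk;

          return clk;
  }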

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10 08:04:40 -10:00
Changwoo Min
ea9b262627 sched_ext: Relocate scx_enabled() related code
scx_enabled() will be used in scx_rq_clock_update/invalidate()
in the following patch, so relocate the scx_enabled() related code
to the proper location.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10 08:04:40 -10:00
Peter Zijlstra
6d71a9c616 sched/fair: Fix EEVDF entity placement bug causing scheduling lag
I noticed this in my traces today:

       turbostat-1222    [006] d..2.   311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
                               { weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } ->
                               { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 }
       turbostat-1222    [006] d..2.   311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
                               { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } ->
                               { weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 }

Which is a weight transition: 1048576 -> 2 -> 1048576.

One would expect the lag to shoot out *AND* come back, notably:

  -1123*1048576/2 = -588775424
  -588775424*2/1048576 = -1123
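
For reference, that expected round trip is plain integer arithmetic and can
be checked in isolation (standalone userspace sketch, not kernel code):

  #include <assert.h>

  int main(void)
  {
          long long lag = -1123, w_old = 1048576, w_new = 2;
          long long scaled = lag * w_old / w_new;         /* -588775424 */

          assert(scaled == -588775424LL);
          assert(scaled * w_new / w_old == lag);          /* back to -1123 */
          return 0;
  }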

Except the trace shows it is all off. Worse, subsequent cycles shoot it
out further and further.

This made me have a very hard look at reweight_entity(), and
specifically the ->on_rq case, which is more prominent with
DELAY_DEQUEUE.

And indeed, it is all sorts of broken. While the computation of the new
lag is correct, the computation for the new vruntime, using the new lag,
is broken, for it does not consider the logic set out in place_entity().

With the below patch, I now see things like:

    migration/12-55      [012] d..3.   309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
                               { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
                               { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
    migration/14-62      [014] d..3.   309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
                               { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
                               { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }

Which isn't perfect yet, but much closer.

Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: eab03c23c2 ("sched/eevdf: Fix vruntime adjustment on reweight")
Link: https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.net
2025-01-09 12:55:27 +01:00
Frederic Weisbecker
b04e317b52 treewide: Introduce kthread_run_worker[_on_cpu]()
kthread_create() creates a kthread without running it yet. kthread_run()
creates a kthread and runs it.

On the other hand, kthread_create_worker() creates a kthread worker and
runs it.

This difference in behaviours is confusing. Also there is no way to
create a kthread worker and affine it using kthread_bind_mask() or
kthread_affine_preferred() before starting it.

Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]()
that behaves just like kthread_run(). kthread_create_worker[_on_cpu]()
will now only create a kthread worker without starting it.
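
A rough usage sketch under these semantics (assuming kthread_run_worker()
takes the same arguments as kthread_create_worker() and that a created but
not yet started worker is started by waking its task; error handling kept
minimal):

  #include <linux/kthread.h>
  #include <linux/sched.h>
  #include <linux/err.h>
  #include <linux/cpumask.h>

  static int example_workers(void)
  {
          struct kthread_worker *kw;

          /* Created but not yet running: affinity can still be set up front. */
          kw = kthread_create_worker(0, "example_worker");
          if (IS_ERR(kw))
                  return PTR_ERR(kw);
          kthread_bind_mask(kw->task, cpu_online_mask);   /* illustrative mask */
          wake_up_process(kw->task);                      /* assumption: start it explicitly */

          /* Create and immediately run, mirroring kthread_run(). */
          kw = kthread_run_worker(0, "example_worker2");
          return IS_ERR(kw) ? PTR_ERR(kw) : 0;
  }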

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
2025-01-08 18:15:03 +01:00
Frederic Weisbecker
3a5446612a sched,arm64: Handle CPU isolation on last resort fallback rq selection
When a kthread or any other task has an affinity mask that is fully
offline or unallowed, the scheduler reaffines the task to all possible
CPUs as a last resort.

This default decision doesn't mix well with nohz_full CPUs that
are part of the possible cpumask but don't want to be disturbed by
unbound kthreads or even detached pinned user tasks.

Make the fallback affinity setting aware of nohz_full.
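
A minimal sketch of the idea (the helper name is hypothetical; the real
change lives in the scheduler's fallback-affinity path): prefer housekeeping
CPUs over the full possible mask when reaffining as a last resort.

  static const struct cpumask *fallback_cpumask_sketch(void)
  {
          const struct cpumask *hk = housekeeping_cpumask(HK_TYPE_TICK);

          /* Fall back to all possible CPUs only if no housekeeping CPU exists. */
          return cpumask_empty(hk) ? cpu_possible_mask : hk;
  }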

Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-01-08 18:14:23 +01:00
Honglei Wang
68e449d849 sched_ext: switch class when preempted by higher priority scheduler
The ops.cpu_release() function, if defined, must be invoked when an SCX task
is preempted by a task from a higher priority scheduler class. This scenario
was skipped in
commit f422316d74 ("sched_ext: Remove switch_class_scx()"). Let's fix
it.
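
A minimal sketch of the intended behavior (function name illustrative; the
real check sits in the sched_ext pick/put path):

  static void switch_class_sketch(struct rq *rq, struct task_struct *next)
  {
          /* SCX is preempted whenever the next task belongs to another class. */
          if (next->sched_class != &ext_sched_class) {
                  /* ... build scx_cpu_release_args and invoke ops.cpu_release()
                   * for cpu_of(rq) if the BPF scheduler defines it ...
                   */
          }
  }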

Fixes: f422316d74 ("sched_ext: Remove switch_class_scx()")
Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08 06:51:40 -10:00
Changwoo Min
6268d5bc10 sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass()
scx_ops_bypass() iterates all CPUs to re-enqueue all the scx tasks.
For each CPU, it acquires a lock using rq_lock() regardless of whether
the CPU is offline or is currently running a task in a higher scheduler
class (e.g., deadline). rq_lock() is only supposed to be used for online
CPUs, and its use here may trigger an unnecessary warning in
rq_pin_lock(). Therefore, replace rq_lock() with raw_spin_rq_lock() in
scx_ops_bypass().

Without this change, we observe the following warning:

===== START =====
[    6.615205] rq->balance_callback && rq->balance_callback != &balance_push_callback
[    6.615208] WARNING: CPU: 2 PID: 0 at kernel/sched/sched.h:1730 __schedule+0x1130/0x1c90
=====  END  =====
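
A simplified sketch of the resulting pattern (not the full scx_ops_bypass()
body): iterate all possible CPUs and take the rq lock with
raw_spin_rq_lock(), which skips the rq-pinning sanity checks that rq_lock()
performs and that can fire for offline CPUs.

  static void bypass_requeue_sketch(void)
  {
          int cpu;

          for_each_possible_cpu(cpu) {
                  struct rq *rq = cpu_rq(cpu);

                  raw_spin_rq_lock(rq);
                  /* ... re-enqueue the scx tasks queued on this rq ... */
                  raw_spin_rq_unlock(rq);
          }
  }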

Fixes: 0e7ffff1b8 ("scx: Fix raciness in scx_ops_bypass()")
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08 06:48:53 -10:00
Henry Huang
30dd3b13f9 sched_ext: keep running prev when prev->scx.slice != 0
When %SCX_OPS_ENQ_LAST is set and prev->scx.slice != 0,
@prev will be dispatched into the local DSQ in put_prev_task_scx().
However, pick_task_scx() is executed before put_prev_task_scx(),
so it will not pick @prev.
Set %SCX_RQ_BAL_KEEP in balance_one() to ensure that pick_task_scx()
can pick @prev.
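
A simplified sketch of the fix (the helper and the enq_last flag check are
illustrative, not the actual balance_one() body):

  static void keep_prev_sketch(struct rq *rq, struct task_struct *prev,
                               bool enq_last_set)
  {
          /* @prev still has slice left, so tell the pick path to keep it
           * instead of letting it fall through to put_prev_task_scx().
           */
          if (enq_last_set && prev->scx.slice)
                  rq->scx.flags |= SCX_RQ_BAL_KEEP;
  }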

Signed-off-by: Henry Huang <henry.hj@antgroup.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08 06:48:33 -10:00
Andrea Righi
382d7efc14 sched_ext: Include remaining task time slice in error state dump
Report the remaining time slice when dumping task information during an
error exit.

This information can be useful for tracking incorrect or excessively
long time slices in schedulers that implement dynamic time slice logic.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-06 08:56:38 -10:00
Andrea Righi
e4975ac535 sched_ext: update scx_bpf_dsq_insert() doc for SCX_DSQ_LOCAL_ON
With commit 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct
dispatches"), scx_bpf_dsq_insert() can use SCX_DSQ_LOCAL_ON for direct
dispatch from ops.enqueue() to target the local DSQ of any CPU.

Update the documentation accordingly.
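
A usage sketch from a hypothetical BPF scheduler's ops.enqueue() (the target
CPU choice is illustrative):

  #include <scx/common.bpf.h>

  void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
  {
          s32 cpu = scx_bpf_task_cpu(p);  /* illustrative: target the task's CPU */

          /* Direct dispatch into the local DSQ of @cpu. */
          scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, enq_flags);
  }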

Fixes: 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-06 08:56:00 -10:00
Andrea Righi
d9071ecb31 sched_ext: idle: small CPU iteration refactoring
Replace the loop that checks whether all SMT CPUs are idle with
cpumask_subset(). This simplifies the code and slightly improves
efficiency, while preserving the original behavior.

Note that idle_masks.smt handling remains racy, which is acceptable as
it serves as an optimization and is self-correcting.
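
A minimal sketch of the simplification (the wrapper function is illustrative;
idle_masks refers to the cpu/smt masks mentioned above):

  static void update_smt_idle_sketch(int cpu)
  {
          const struct cpumask *smt = cpu_smt_mask(cpu);

          /* Mark the whole SMT group idle only when every sibling is idle. */
          if (cpumask_subset(smt, idle_masks.cpu))
                  cpumask_or(idle_masks.smt, idle_masks.smt, smt);
  }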

Suggested-and-reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-06 08:48:38 -10:00