The ftrace selftest reported a failure because writing -1 to
sched_rt_runtime_us returns -EBUSY. This happens when the possible
CPUs are different from active CPUs.
Active CPUs are part of one root domain, while remaining CPUs are part
of def_root_domain. Since active cpumask is being used, this results in
cpus=0 when a non active CPUs is used in the loop.
Fix it by looping over the online CPUs instead for validating the
bandwidth calculations.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20250306052954.452005-2-sshegde@linux.ibm.com
child_cfs_rq_on_list attempts to convert a 'prev' pointer to a cfs_rq.
This 'prev' pointer can originate from struct rq's leaf_cfs_rq_list,
making the conversion invalid and potentially leading to memory
corruption. Depending on the relative positions of leaf_cfs_rq_list and
the task group (tg) pointer within the struct, this can cause a memory
fault or access garbage data.
The issue arises in list_add_leaf_cfs_rq, where both
cfs_rq->leaf_cfs_rq_list and rq->leaf_cfs_rq_list are added to the same
leaf list. Also, rq->tmp_alone_branch can be set to rq->leaf_cfs_rq_list.
This adds a check `if (prev == &rq->leaf_cfs_rq_list)` after the main
conditional in child_cfs_rq_on_list. This ensures that the container_of
operation will convert a correct cfs_rq struct.
This check is sufficient because only cfs_rqs on the same CPU are added
to the list, so verifying the 'prev' pointer against the current rq's list
head is enough.
Fixes a potential memory corruption issue that due to current struct
layout might not be manifesting as a crash but could lead to unpredictable
behavior when the layout changes.
Fixes: fdaba61ef8 ("sched/fair: Ensure that the CFS parent is added after unthrottling")
Signed-off-by: Zecheng Li <zecheng@google.com>
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250304214031.2882646-1-zecheng@google.com
Add tracing support to track sched_ext core events
(/sched_ext/sched_ext_event). This may be useful for debugging sched_ext
schedulers that trigger a particular event.
The trace point can be used as other trace points, so it can be used in,
for example, `perf trace` and BPF programs, as follows:
======
$> sudo perf trace -e sched_ext:sched_ext_event --filter 'name == "SCX_EV_ENQ_SLICE_DFL"'
======
======
struct tp_sched_ext_event {
struct trace_entry ent;
u32 __data_loc_name;
s64 delta;
};
SEC("tracepoint/sched_ext/sched_ext_event")
int rtp_add_event(struct tp_sched_ext_event *ctx)
{
char event_name[128];
unsigned short offset = ctx->__data_loc_name & 0xFFFF;
bpf_probe_read_str((void *)event_name, 128, (char *)ctx + offset);
bpf_printk("name %s delta %lld", event_name, ctx->delta);
return 0;
}
======
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The event count could be negative in the future,
so change the event type from u64 to s64.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull for-6.14-fixes to receive:
9360dfe4cb ("sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()")
which conflicts with:
337d1b354a ("sched_ext: Move built-in idle CPU selection policy to a separate file")
Signed-off-by: Tejun Heo <tj@kernel.org>
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids
range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel
crash.
To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and
trigger an scx error if an invalid CPU is specified.
Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
on PREEMPT_NONE and PREEMPT_VOLUNTARY kernels.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfCDDMRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gf+RAAvFNXelLgrNbILZ6ckp/ikWnjCbf2QOIk
aCm6JMQm7WrFvgo1u6CM4vQQYZdEqf8+KiEjJJnoq2P4jvYzhO1/1pLfEDNaHeiH
GneosmKAwSMR8lgDlw5DXxhXsfeuYYhG5VMe2ia+kyiIA83TUF6hl9jpawWB3dsw
+xB6CAg3JLoR2v44E/Mf1PdGaGrF90fYxp+X5RNSqxVXcN54cgVx2G9lHeTIWcnp
SjIiWo5mply50de+dxD5dNUB9mj/k+yLQaiuPfUDGo/ZOjFyBnsP5VlD+ySbhkIa
Rwdw6olLqXLcX5D5RsPIuePm/XdmAQXr6GXxJjdhtV1oWTP3Bejev3upQ/kxHQ50
DQa+aSTqNx9bNlwphUafCmVo1OZap4mViOSWP7r96HhFwehLGGmkjEaU9eFuUl0P
kG+qGq28U+Nnz0r6/pEkwic1B6wbq2x1XRbtJqxXnBcQvMxMgDWNrTIj1ytDcSBb
3Qo0shRrtjH7DN1ly8IBllLQ0wXXI5O6GwjI7absEyEjpdoxFyMsHpaFONlTWRdi
NgR2+5MWTxExeWaDRPAJM+THzwucfWVTeZVXJFMRfQnNIBj7TpO3X3Y4xzP9Vl/Y
2HEz8voSDZUVN6Ejxx/am7kb68WpWw46xmj59wWT7nf9SVEEm+R4Pfe3O9+0yvQV
V4l6tN4yfEU=
=RknP
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Prevent cond_resched() based preemption when interrupts are disabled,
on PREEMPT_NONE and PREEMPT_VOLUNTARY kernels"
* tag 'sched-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Prevent rescheduling when interrupts are disabled
David reported a warning observed while loop testing kexec jump:
Interrupts enabled after irqrouter_resume+0x0/0x50
WARNING: CPU: 0 PID: 560 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
kernel_kexec+0xf6/0x180
__do_sys_reboot+0x206/0x250
do_syscall_64+0x95/0x180
The corresponding interrupt flag trace:
hardirqs last enabled at (15573): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90
hardirqs last disabled at (15580): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90
That means __up_console_sem() was invoked with interrupts enabled. Further
instrumentation revealed that in the interrupt disabled section of kexec
jump one of the syscore_suspend() callbacks woke up a task, which set the
NEED_RESCHED flag. A later callback in the resume path invoked
cond_resched() which in turn led to the invocation of the scheduler:
__cond_resched+0x21/0x60
down_timeout+0x18/0x60
acpi_os_wait_semaphore+0x4c/0x80
acpi_ut_acquire_mutex+0x3d/0x100
acpi_ns_get_node+0x27/0x60
acpi_ns_evaluate+0x1cb/0x2d0
acpi_rs_set_srs_method_data+0x156/0x190
acpi_pci_link_set+0x11c/0x290
irqrouter_resume+0x54/0x60
syscore_resume+0x6a/0x200
kernel_kexec+0x145/0x1c0
__do_sys_reboot+0xeb/0x240
do_syscall_64+0x95/0x180
This is a long standing problem, which probably got more visible with
the recent printk changes. Something does a task wakeup and the
scheduler sets the NEED_RESCHED flag. cond_resched() sees it set and
invokes schedule() from a completely bogus context. The scheduler
enables interrupts after context switching, which causes the above
warning at the end.
Quite some of the code paths in syscore_suspend()/resume() can result in
triggering a wakeup with the exactly same consequences. They might not
have done so yet, but as they share a lot of code with normal operations
it's just a question of time.
The problem only affects the PREEMPT_NONE and PREEMPT_VOLUNTARY scheduling
models. Full preemption is not affected as cond_resched() is disabled and
the preemption check preemptible() takes the interrupt disabled flag into
account.
Cure the problem by adding a corresponding check into cond_resched().
Reported-by: David Woodhouse <dwmw@amazon.co.uk>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/all/7717fe2ac0ce5f0a2c43fdab8b11f4483d54a2a4.camel@infradead.org
pick_task_scx() has a workaround to avoid stalling when the fair class's
balance() says yes but pick_task() says no. The workaround was incorrectly
deciding to keep the prev taks running if the task is on SCX even when the
task is in a sleeping state, which can lead to several confusing failure
modes. Fix it by testing the prev task is currently queued on SCX instead.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ79/Ww4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGdhUAQDM1AcK7pJUHzayuQCecCxNspGty8nR9T4KeVly
51pA2gEA4sbs6Fj4doVKVyaCunsvFoZ8Tb/utCX716fVnpjMMgY=
=yDMV
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fix from Tejun Heo:
"pick_task_scx() has a workaround to avoid stalling when the fair
class's balance() says yes but pick_task() says no.
The workaround was incorrectly deciding to keep the prev taks running
if the task is on SCX even when the task is in a sleeping state, which
can lead to several confusing failure modes.
Fix it by testing the prev task is currently queued on SCX instead"
* tag 'sched_ext-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()
When %SCX_PICK_IDLE_IN_NODE is specified, scx_bpf_pick_any_cpu_node()
should always return a CPU from the specified node, regardless of its
idle state.
Also clarify this logic in the function documentation.
Fixes: 01059219b0 ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
a6250aa251 ("sched_ext: Handle cases where pick_task_scx() is called
without preceding balance_scx()") added a workaround to handle the cases
where pick_task_scx() is called without prececing balance_scx() which is due
to a fair class bug where pick_taks_fair() may return NULL after a true
return from balance_fair().
The workaround detects when pick_task_scx() is called without preceding
balance_scx() and emulates SCX_RQ_BAL_KEEP and triggers kicking to avoid
stalling. Unfortunately, the workaround code was testing whether @prev was
on SCX to decide whether to keep the task running. This is incorrect as the
task may be on SCX but no longer runnable.
This could lead to a non-runnable task to be returned from pick_task_scx()
which cause interesting confusions and failures. e.g. A common failure mode
is the task ending up with (!on_rq && on_cpu) state which can cause
potential wakers to busy loop, which can easily lead to deadlocks.
Fix it by testing whether @prev has SCX_TASK_QUEUED set. This makes
@prev_on_scx only used in one place. Open code the usage and improve the
comment while at it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Pat Cody <patcody@meta.com>
Fixes: a6250aa251 ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc
scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the
system.
BPF schedulers can use this information together with the new node-aware
kfuncs, for example to create per-node DSQs, validate node IDs, etc.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Adding an unlikely() hint on early error return paths improves the
run-time performance of several sched related system calls.
Benchmarking on an i9-12900 shows the following per system call
performance improvements:
before after improvement
sched_getattr 182.4ns 170.6ns ~6.5%
sched_setattr 284.3ns 267.6ns ~5.9%
sched_getparam 161.6ns 148.1ns ~8.4%
sched_setparam 1265.4ns 1227.6ns ~3.0%
sched_getscheduler 129.4ns 118.2ns ~8.7%
sched_setscheduler 1237.3ns 1216.7ns ~1.7%
Results are based on running 20 tests with turbo disabled (to reduce
clock freq turbo changes), with 10 second run per test based on the
number of system calls per second. The % standard deviation of the
measurements for the 20 tests was 0.05% to 0.40%, so the results are
reliable.
Tested on kernel build with gcc 14.2.1
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250219142423.45516-1-colin.i.king@gmail.com
The header file stats.h is included twice. Remove the redundant include
and the following make includecheck warning:
stats.h is included more than once
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250219111756.3070-2-thorsten.blum@linux.dev
Introduce a new kfunc to retrieve the node associated to a CPU:
int scx_bpf_cpu_node(s32 cpu)
Add the following kfuncs to provide BPF schedulers direct access to
per-node idle cpumasks information:
const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed,
int node, u64 flags)
s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed,
int node, u64 flags)
Moreover, trigger an scx error when any of the non-node aware idle CPU
kfuncs are used when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.
Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/a55e849cba3c41b4c5708be6ea6be6f337d1a8fb.1738746821.git.namcao@linutronix.de
When a process reduces its number of threads or clears bits in its CPU
affinity mask, the mm_cid allocation should eventually converge towards
smaller values.
However, the change introduced by:
commit 7e019dcc47 ("sched: Improve cache locality of RSEQ concurrency
IDs for intermittent workloads")
adds a per-mm/CPU recent_cid which is never unset unless a thread
migrates.
This is a tradeoff between:
A) Preserving cache locality after a transition from many threads to few
threads, or after reducing the hamming weight of the allowed CPU mask.
B) Making the mm_cid upper bounds wrt nr threads and allowed CPU mask
easy to document and understand.
C) Allowing applications to eventually react to mm_cid compaction after
reduction of the nr threads or allowed CPU mask, making the tracking
of mm_cid compaction easier by shrinking it back towards 0 or not.
D) Making sure applications that periodically reduce and then increase
again the nr threads or allowed CPU mask still benefit from good
cache locality with mm_cid.
Introduce the following changes:
* After shrinking the number of threads or reducing the number of
allowed CPUs, reduce the value of max_nr_cid so expansion of CID
allocation will preserve cache locality if the number of threads or
allowed CPUs increase again.
* Only re-use a recent_cid if it is within the max_nr_cid upper bound,
else find the first available CID.
Fixes: 7e019dcc47 ("sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lkml.kernel.org/r/20250210153253.460471-2-gmonaco@redhat.com
clear its removal from that queue is atomic
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmexq1MACgkQEsHwGGHe
VUq6ghAAoD872KGmQ3YioDZs+FKLpLWvo+6lC2rY+GFQ3oCu4TJfmlsiTGLCyzjv
aYwL52diIORD9/Yfn6Sq/ZWkfncoVmwnht+tgVjXeRr3Pb4EnttgWxPRx/xYQizr
jgRASpNsRUTr6zNzqEeeQYodIJaInOF5+r26oqYArcN5V9XB9Qaj0+f14UiyB6u+
53qpEQqQopeLPyG4t59iUfefsaWm2ZIW3EnoWeyI8sRuaapGY/0LHhUAn6vcA++N
kuUkliVsk+f/uTNZeJ4zv2uy8DpBXO4kTjmwPVwFz46sJ8RL8P7MOLax7e7fNssw
tylwHt4qoLEoB2vg1yMvlUNFMeH85gj8hTyJMsgGtFTbwCH0kLFEoXUz0lfKMS7U
A271E1Bumu3OT7FrAnxQahViv02YWG5fcg6R3OidQdSmgoQBIMJwDA2pyKLiq9FL
7mWoNfEqqBWn4O/1qBlf3jCvfFlzXRSSgVzEruoNB93cgzTaQaN5yVgeekMzwJEj
NDowmIZRxEN8+lJyMxIGSOGa44aTXu0/+dtehEDeSpsDOXULFc5fpYt4SSa0Jt/F
LlgnPGkM1vF0ddG5vDJpGw6B9Dhb82i1oYy2IVwOCkLOXSp2kvLDEI3sqgk5mmlH
zFtAV21Zm4jqBke/aF7r+RCYRvDyQkuW0d1+H9okGWLSk7jKauM=
=J43o
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
- Clarify what happens when a task is woken up from the wake queue and
make clear its removal from that queue is atomic
* tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Clarify wake_up_q()'s write to task->wake_q.next
Using a single global idle mask can lead to inefficiencies and a lot of
stress on the cache coherency protocol on large systems with multiple
NUMA nodes, since all the CPUs can create a really intense read/write
activity on the single global cpumask.
Therefore, split the global cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on large systems.
The concept is that each cpumask will track only the idle CPUs within
its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
In this way concurrent access to the idle cpumask will be restricted
within each NUMA node.
The split of multiple per-node idle cpumasks can be controlled using the
SCX_OPS_BUILTIN_IDLE_PER_NODE flag.
By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global
host-wide idle cpumask is used, maintaining the previous behavior.
NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via
SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will
trigger an scx error, since there are no system-wide cpumasks.
= Test =
Hardware:
- System: DGX B200
- CPUs: 224 SMT threads (112 physical cores)
- Processor: INTEL(R) XEON(R) PLATINUM 8570
- 2 NUMA nodes
Scheduler:
- scx_simple [1] (so that we can focus at the built-in idle selection
policy and not at the scheduling policy itself)
Test:
- Run a parallel kernel build `make -j $(nproc)` and measure the average
elapsed time over 10 runs:
avg time | stdev
---------+------
before: 52.431s | 2.895
after: 50.342s | 2.895
= Conclusion =
Splitting the global cpumask into multiple per-NUMA cpumasks helped to
achieve a speedup of approximately +4% with this particular architecture
and test case.
The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @
2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.
On smaller systems, I haven't noticed any measurable regressions or
improvements with the same test (parallel kernel build) and scheduler
(scx_simple).
Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs
I observed an additional +2-2.5% performance improvement with the same
test.
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c
Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows
BPF schedulers to select between using a global flat idle cpumask or
multiple per-node cpumasks.
This only introduces the flag and the mechanism to enable/disable this
feature without affecting any scheduling behavior.
Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Make all the static keys used by the idle CPU selection policy private
to ext_idle.c. This avoids unnecessary exposure in headers and improves
code encapsulation.
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Fix lock imbalance in a corner case of dispatch_to_local_dsq().
- Migration disabled tasks were confusing some BPF schedulers and its
handling had a bug. Fix it and simplify the default behavior by
dispatching them automatically.
- ops.tick(), ops.disable() and ops.exit_task() were incorrectly disallowing
kfuncs that require the task argument to be the rq operation is currently
operating on and thus is rq-locked. Allow them.
- Fix autogroup migration handling bug which was occasionally triggering a
warning in the cgroup migration path.
- tools/sched_ext, selftest and other misc updates.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ695uA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGeCfAQDmUixMNJCIrRphYsWcYUzlGLZyyRpQEEYFtRMO
UC266gD+PUV2UvuO5sAVH8HVnGdOqkXaE/IRG+TC7fQH3ruPlgI=
=LFd1
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix lock imbalance in a corner case of dispatch_to_local_dsq()
- Migration disabled tasks were confusing some BPF schedulers and its
handling had a bug. Fix it and simplify the default behavior by
dispatching them automatically
- ops.tick(), ops.disable() and ops.exit_task() were incorrectly
disallowing kfuncs that require the task argument to be the rq
operation is currently operating on and thus is rq-locked.
Allow them.
- Fix autogroup migration handling bug which was occasionally
triggering a warning in the cgroup migration path
- tools/sched_ext, selftest and other misc updates
* tag 'sched_ext-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx
sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h.
sched_ext: selftests: Fix grammar in tests description
sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()
sched_ext: Fix migration disabled handling in targeted dispatches
sched_ext: Implement auto local dispatching of migration disabled tasks
sched_ext: Fix incorrect time delta calculation in time_delta()
sched_ext: Fix lock imbalance in dispatch_to_local_dsq()
sched_ext: selftests/dsp_local_on: Fix selftest on UP systems
tools/sched_ext: Add helper to check task migration state
sched_ext: Fix incorrect autogroup migration detection
sched_ext: selftests/dsp_local_on: Fix sporadic failures
selftests/sched_ext: Fix enum resolution
sched_ext: Include task weight in the error state dump
sched_ext: Fixes typos in comments
Add a sysfs entry at /sys/kernel/sched_ext/root/events to expose core
event counters through the files system interface. Each line of the file
shows the event name and its counter value.
In addition, the format of scx_dump_event() is adjusted as the event name
gets longer.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The function "can_migrate_task()" utilize "for_each_cpu_and" with a
"if" statement inside to find the destination cpu. It's the same logic
to find the first set bit of the result of the bitwise-AND of
"env->dst_grpmask", "env->cpus" and "p->cpus_ptr".
Refactor it by using "cpumask_first_and_and()" to perform bitwise-AND
for "env->dst_grpmask", "env->cpus" and "p->cpus_ptr" and pick the
first cpu within the intersection as the destination cpu, so we can
elimate the need of looping and multiple times of branch.
After the refactoring this part of the code can speed up from ~115ns
to ~54ns, according to the test below.
Ran the test for 5 times and the result is showned in the following
table, and the test script is paste in next section.
-------------------------------------------------------
|Old method| 130| 118| 115| 109| 106| avg ~115ns|
-------------------------------------------------------
|New method| 58| 55| 54| 48| 55| avg ~54ns|
-------------------------------------------------------
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250210103019.283824-1-richard120310@gmail.com
When a task is enqueued and its parent cgroup se is already on_rq, this
parent cgroup se will not be enqueued again, and hence the root->min_slice
leaves unchanged. The same issue happens when a task is dequeued and its
parent cgroup se has other runnable entities, and the parent cgroup se
will not be dequeued.
Force propagating min_slice when se doesn't need to be enqueued or
dequeued. Ensure the se hierarchy always get the latest min_slice.
Fixes: aef6987d89 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250211063659.7180-1-dtcccc@linux.alibaba.com
The sched_clock_irqtime was defined as a static key in commit 8722903cbb
('sched: Define sched_clock_irqtime as static key'). However, this change
introduces a 'sleeping in atomic context' warning, as shown below:
arch/x86/kernel/tsc.c:1214 mark_tsc_unstable()
warn: sleeping in atomic context
As analyzed by Dan, the affected code path is as follows:
vcpu_load() <- disables preempt
-> kvm_arch_vcpu_load()
-> mark_tsc_unstable() <- sleeps
virt/kvm/kvm_main.c
166 void vcpu_load(struct kvm_vcpu *vcpu)
167 {
168 int cpu = get_cpu();
^^^^^^^^^^
This get_cpu() disables preemption.
169
170 __this_cpu_write(kvm_running_vcpu, vcpu);
171 preempt_notifier_register(&vcpu->preempt_notifier);
172 kvm_arch_vcpu_load(vcpu, cpu);
173 put_cpu();
174 }
arch/x86/kvm/x86.c
4979 if (unlikely(vcpu->cpu != cpu) || kvm_check_tsc_unstable()) {
4980 s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
4981 rdtsc() - vcpu->arch.last_host_tsc;
4982 if (tsc_delta < 0)
4983 mark_tsc_unstable("KVM discovered backwards TSC");
arch/x86/kernel/tsc.c
1206 void mark_tsc_unstable(char *reason)
1207 {
1208 if (tsc_unstable)
1209 return;
1210
1211 tsc_unstable = 1;
1212 if (using_native_sched_clock())
1213 clear_sched_clock_stable();
--> 1214 disable_sched_clock_irqtime();
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
kernel/jump_label.c
245 void static_key_disable(struct static_key *key)
246 {
247 cpus_read_lock();
^^^^^^^^^^^^^^^^
This lock has a might_sleep() in it which triggers the static checker
warning.
248 static_key_disable_cpuslocked(key);
249 cpus_read_unlock();
250 }
Let revert this change for now as {disable,enable}_sched_clock_irqtime
are used in many places, as pointed out by Sean, including the following:
The code path in clocksource_watchdog():
clocksource_watchdog()
|
-> spin_lock(&watchdog_lock);
|
-> __clocksource_unstable()
|
-> clocksource.mark_unstable() == tsc_cs_mark_unstable()
|
-> disable_sched_clock_irqtime()
And the code path in sched_clock_register():
/* Cannot register a sched_clock with interrupts on */
local_irq_save(flags);
...
/* Enable IRQ time accounting if we have a fast enough sched_clock() */
if (irqtime > 0 || (irqtime == -1 && rate >= 1000000))
enable_sched_clock_irqtime();
local_irq_restore(flags);
[lkp@intel.com: reported a build error in the prev version]
Closes: https://lore.kernel.org/kvm/37a79ba3-9ce0-479c-a5b0-2bd75d573ed3@stanley.mountain/
Fixes: 8722903cbb ("sched: Define sched_clock_irqtime as static key")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Debugged-by: Dan Carpenter <dan.carpenter@linaro.org>
Debugged-by: Sean Christopherson <seanjc@google.com>
Debugged-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250205032438.14668-1-laoar.shao@gmail.com
The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which
means that we have a default slice of:
0.75 for 1 cpu
1.50 up to 3 cpus
2.25 up to 7 cpus
3.00 for 8 cpus and above.
For HZ=250 and HZ=100, because of the tick accuracy, the runtime of
tasks is far higher than their slice.
For HZ=1000 with 8 cpus or more, the accuracy of tick is already
satisfactory, but there is still an issue that tasks will get an extra
tick because the tick often arrives a little faster than expected. In
this case, the task can only wait until the next tick to consider that it
has reached its deadline, and will run 1ms longer.
vruntime + sysctl_sched_base_slice = deadline
|-----------|-----------|-----------|-----------|
1ms 1ms 1ms 1ms
^ ^ ^ ^
tick1 tick2 tick3 tick4(nearly 4ms)
There are two reasons for tick error: clockevent precision and the
CONFIG_IRQ_TIME_ACCOUNTING/CONFIG_PARAVIRT_TIME_ACCOUNTING. with
CONFIG_IRQ_TIME_ACCOUNTING every tick will be less than 1ms, but even
without it, because of clockevent precision, tick still often less than
1ms.
In order to make scheduling more precise, we changed 0.75 to 0.70,
Using 0.70 instead of 0.75 should not change much for other configs
and would fix this issue:
0.70 for 1 cpu
1.40 up to 3 cpus
2.10 up to 7 cpus
2.8 for 8 cpus and above.
This does not guarantee that tasks can run the slice time accurately
every time, but occasionally running an extra tick has little impact.
Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250208075322.13139-1-15645113830zzh@gmail.com
A wakeup non-idle entity should preempt idle entity at any time,
but because of the slice protection of the idle entity, the non-idle
entity has to wait, so just cancel it.
This patch is aimed at minimizing the impact of SCHED_IDLE on
SCHED_NORMAL. For example, a task with SCHED_IDLE policy that sleeps for
1s and then runs for 3 ms, running cyclictest on the same cpu, has a
maximum latency of 3 ms, which is caused by the slice protection of the
idle entity. It is unreasonable. With this patch, the cyclictest latency
under the same conditions is basically the same on the cpu with idle
processes and on empty cpu.
[peterz: add helpers]
Fixes: 63304558ba ("sched/eevdf: Curb wakeup-preemption")
Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250208080850.16300-1-15645113830zzh@gmail.com
A task wakeup can be either processed on the waker's CPU or bounced to the
wakee's previous CPU using an IPI (ttwu_queue). Bouncing to the wakee's CPU
avoids the waker's CPU locking and accessing the wakee's rq which can be
expensive across cache and node boundaries.
When ttwu_queue path is taken, select_task_rq() and thus ops.select_cpu()
may be skipped in some cases (racing against the wakee switching out). As
this confused some BPF schedulers, there wasn't a good way for a BPF
scheduler to tell whether idle CPU selection has been skipped, ops.enqueue()
couldn't insert tasks into foreign local DSQs, and the performance
difference on machines with simple toplogies were minimal, sched_ext
disabled ttwu_queue.
However, this optimization makes noticeable difference on more complex
topologies and a BPF scheduler now has an easy way tell whether
ops.select_cpu() was skipped since 9b671793c7 ("sched_ext, scx_qmap: Add
and use SCX_ENQ_CPU_SELECTED") and can insert tasks into foreign local DSQs
since 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct
dispatches").
Implement SCX_OPS_ALLOW_QUEUED_WAKEUP which allows BPF schedulers to choose
to enable ttwu_queue optimization.
v2: Update the patch description and comment re. ops.select_cpu() being
skipped in some cases as opposed to always as per Neel.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Neel Natu <neelnatu@google.com>
Reported-by: Barret Rhoden <brho@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Now when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup of
the current task, the following error will occur:
scx_foo[3795244] triggered exit kind 1024:
runtime error (called on a task not being operated on)
The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK()
when calling ops.tick(), which triggers the error during the subsequent
scx_kf_allowed_on_arg_tasks() check.
SCX_CALL_OP_TASK() was first introduced in commit 36454023f5 ("sched_ext:
Track tasks that are subjects of the in-flight SCX operation") to ensure
task's rq lock is held when accessing task's sched_group. Since ops.tick()
is marked as SCX_KF_TERMINAL and task_tick_scx() is protected by the rq
lock, we can use SCX_CALL_OP_TASK() to avoid the above issue. Similarly,
the same changes should be made for ops.disable() and ops.exit_task(), as
they are also protected by task_rq_lock() and it's safe to access the
task's task_group.
Fixes: 36454023f5 ("sched_ext: Track tasks that are subjects of the in-flight SCX operation")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull to receive f3f08c3acf ("sched_ext: Fix incorrect assumption about
migration disabled tasks in task_can_run_on_remote_rq()") which conflicts
with 26176116d9 ("sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in
the right spot") in for-6.15.
While fixing migration disabled task handling, 3296682157 ("sched_ext: Fix
migration disabled handling in targeted dispatches") assumed that a
migration disabled task's ->cpus_ptr would only have the pinned CPU. While
this is eventually true for migration disabled tasks that are switched out,
->cpus_ptr update is performed by migrate_disable_switch() which is called
right before context_switch() in __scheduler(). However, the task is
enqueued earlier during pick_next_task() via put_prev_task_scx(), so there
is a race window where another CPU can see the task on a DSQ.
If the CPU tries to dispatch the migration disabled task while in that
window, task_allowed_on_cpu() will succeed and task_can_run_on_remote_rq()
will subsequently trigger SCHED_WARN(is_migration_disabled()).
WARNING: CPU: 8 PID: 1837 at kernel/sched/ext.c:2466 task_can_run_on_remote_rq+0x12e/0x140
Sched_ext: layered (enabled+all), task: runnable_at=-10ms
RIP: 0010:task_can_run_on_remote_rq+0x12e/0x140
...
<TASK>
consume_dispatch_q+0xab/0x220
scx_bpf_dsq_move_to_local+0x58/0xd0
bpf_prog_84dd17b0654b6cf0_layered_dispatch+0x290/0x1cfa
bpf__sched_ext_ops_dispatch+0x4b/0xab
balance_one+0x1fe/0x3b0
balance_scx+0x61/0x1d0
prev_balance+0x46/0xc0
__pick_next_task+0x73/0x1c0
__schedule+0x206/0x1730
schedule+0x3a/0x160
__do_sys_sched_yield+0xe/0x20
do_syscall_64+0xbb/0x1e0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fix it by converting the SCHED_WARN() back to a regular failure path. Also,
perform the migration disabled test before task_allowed_on_cpu() test so
that BPF schedulers which fail to handle migration disabled tasks can be
noticed easily.
While at it, adjust scx_ops_error() message for !task_allowed_on_cpu() case
for brevity and consistency.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")
Acked-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Jake Hillion <jakehillion@meta.com>
per-CPU cpumasks are dominantly accessed from their own local CPUs,
so allocate them node-local to improve performance.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Count the number of times a migration disabled task is automatically
dispatched to its local DSQ.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE wasn't quite right in two aspects:
- It counted both migration disabled and offline events.
- It didn't count events from scx_bpf_dsq_move() path.
Fix it by moving the counting into task_can_run_on_remote_rq() which is
shared by both paths and can distinguish the different rejection conditions.
The argument @trigger_error is renamed to @enforce as it now does more than
just triggering error.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
Pull to receive:
- 2fa0fbeb69 ("sched_ext: Implement auto local dispatching of migration disabled tasks")
- 3296682157 ("sched_ext: Fix migration disabled handling in targeted dispatches")
as planned for-6.15 changes depend on them (e.g. adding event counter for
implicit migration disabled task handling).
A dispatch operation that can target a specific local DSQ -
scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task
can be migrated to the target CPU using task_can_run_on_remote_rq(). If the
task can't be migrated to the targeted CPU, it is bounced through a global
DSQ.
task_can_run_on_remote_rq() assumes that the task is on a CPU that's
different from the targeted CPU but the callers doesn't uphold the
assumption and may call the function when the task is already on the target
CPU. When such task has migration disabled, task_can_run_on_remote_rq() ends
up returning %false incorrectly unnecessarily bouncing the task to a global
DSQ.
Fix it by updating the callers to only call task_can_run_on_remote_rq() when
the task is on a different CPU than the target CPU. As this is a bit subtle,
for clarity and documentation:
- Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on
the same CPU as the target CPU.
- is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger
if the task is on a different CPU than the target CPU as the preceding
task_allowed_on_cpu() test should fail beforehand. Convert the test into
SCHED_WARN_ON().
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Fixes: 0366017e09 ("sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Migration disabled tasks are special and pinned to their previous CPUs. They
tripped up some unsuspecting BPF schedulers as their ->nr_cpus_allowed may
not agree with the bits set in ->cpus_ptr. Make it easier for BPF schedulers
by automatically dispatching them to the pinned local DSQs by default. If a
BPF scheduler wants to handle migration disabled tasks explicitly, it can
set SCX_OPS_ENQ_MIGRATION_DISABLED.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
defensive SCHED_WARN_ON() on certain workloads. The bug is
believed to be (accidentally) self-correcting, hence no
behavioral side effects are expected.
Also print se.slice in debug output, since this value can now be
set via the syscall ABI and can be useful to track.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmenIcsRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iyZA//ZpzGiiZ8coXBk4PQ77c0+BOgSGdkbWeR
zicKvqWd+j9skOTVrIk8MPs7D3C5uNuDuVAeNWYalBHRO7ndLfCg36pHqR8tQ3xN
2GwziJzKpi1r4WXBSCtFe/abKrMhIsMYiqKDQz3Ry2Zfn+TuY4t4bzEYgfm481mO
FkLcXs97RI5sFsCBI0uhRu76A9jrNRZKW2zczHTb2MS4zjwGq1yL3vkx2M11d0Su
BqrcPjz6mZFuzCrA6pW0QeSP4Mn5xL8c2seoMYnvFrPYzq47Thxm4nOWPURCN9uX
6MmqNXisJjTe00noswhbcUia1n2cr37uu5d21tXhn7Rf2XTO4+WRE7zjzvxLKQCr
DtcH1XqvnChPGPFTsSlso4vbygfEZIk14DWZ71mXQJto6t06A/yjFiAR82ZIKzLA
ZlK6hnX2rQ8WST3XSixiRPU/5xDu3ptJ0aDRV/dScCMzb7c0Y2YSeqA4xBMcZDPq
A1TrNUsnsvH0TJXXqeslDSxxEFUKHNiOI2itzMCxAteEZXPysa8Rvcl4zFOZ5A6F
nb8hDN94dydNnPpf4g+ZIVEHs5ES3Yeexh1dGwCjQSp72qAJK3wkHQ6KfPN/T3Gb
PbDVw93cYJI/fhB2BX9DcnN+uNLTrqr1kfY7uFBE1feMdnQNBHILdjbByILfQAQ+
0r7+aYwn0F0=
=5mot
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Fix a cfs_rq->h_nr_runnable accounting bug that trips up a defensive
SCHED_WARN_ON() on certain workloads. The bug is believed to be
(accidentally) self-correcting, hence no behavioral side effects are
expected.
Also print se.slice in debug output, since this value can now be set
via the syscall ABI and can be useful to track"
* tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/debug: Provide slice length for fair tasks
sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Clarify that wake_up_q() does an atomic write to task->wake_q.next, after
which a concurrent __wake_q_add() can immediately overwrite
task->wake_q.next again.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250129-sched-wakeup-prettier-v1-1-2f51f5f663fa@google.com
Add a core event, SCX_EV_ENQ_SLICE_DFL, which represents how many
tasks have been enqueued (or pick_task-ed or select_cpu-ed) with
a default time slice (SCX_SLICE_DFL).
Scheduling a task with SCX_SLICE_DFL unintentionally would be a source
of latency spikes because SCX_SLICE_DFL is relatively long (20 msec).
Thus, soaring the SCX_EV_ENQ_SLICE_DFL value would be a sign of BPF
scheduler bugs, causing latency spikes, especially when ops.select_cpu()
is provided.
__scx_add_event() is used since the caller holds an rq lock or p->pi_lock,
so the preemption has already been disabled.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Update comment in __cond_resched() clarifying how urgently needed
quiescent state are provided.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Add a core event, SCX_EV_BYPASS_DURATION, which represents the
total duration of bypass modes in nanoseconds.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a core event, SCX_EV_BYPASS_DISPATCH, which represents how many
tasks have been dispatched in the bypass mode.
__scx_add_event() is used since the caller holds an rq lock or
p->pi_lock, so the preemption has already been disabled.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a core event, SCX_EV_BYPASS_ACTIVATE, which represents how many
times the bypass mode has been triggered.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a core event, SCX_EV_ENQ_SKIP_EXITING, which represents how many
times a task is enqueued to a local DSQ when exiting if
SCX_OPS_ENQ_EXITING is not set.
__scx_add_event() is used since the caller holds an rq lock,
so the preemption has already been disabled.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a core event, SCX_EV_DISPATCH_KEEP_LAST, which represents how many
times a task is continued to run without ops.enqueue() when
SCX_OPS_ENQ_LAST is not set.
__scx_add_event() is used since the caller holds an rq lock,
so the preemption has already been disabled.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a core event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which represents how
many times a BPF scheduler tries to dispatch to an offlined local DSQ.
__scx_add_event() is used since the caller holds an rq lock,
so the preemption has already been disabled.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a core event, SCX_EV_SELECT_CPU_FALLBACK, which represents how many times
ops.select_cpu() returns a CPU that the task can't use.
__scx_add_event() is used since the caller holds an rq lock,
so the preemption has already been disabled.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Collect the statistics of specific types of behavior in the sched_ext core,
which are not easily visible but still interesting to an scx scheduler.
An event type is defined in 'struct scx_event_stats.' When an event occurs,
its counter is accumulated using 'scx_add_event()' and '__scx_add_event()'
to per-CPU 'struct scx_event_stats' for efficiency. 'scx_bpf_events()'
aggregates all the per-CPU counters and exposes a system-wide counters.
For convenience and readability of the code, 'scx_agg_event()' and
'scx_dump_event()' are provided.
The collected events can be observed after a BPF scheduler is unloaded
beforea new BPF scheduler is loaded so the per-CPU 'struct scx_event_stats'
are reset.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since commit:
857b158dc5 ("sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion")
... we have the userspace per-task tunable slice length, which is
a key parameter that is otherwise difficult to obtain, so provide
it in /proc/$PID/sched.
[ mingo: Clarified the changelog. ]
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/453349b1-1637-42f5-a7b2-2385392b5956@arm.com
- Add missing error handling for syscore_suspend() to the hibernation
core code (Wentao Liang).
- Revert a commit that added unused macros (Andy Shevchenko).
- Synchronize the runtime PM status of devices that were runtime-
suspended before a system-wide suspend and need to be resumed during
the subsequent system-wide resume transition (Rafael Wysocki).
- Clean up the teo cpuidle governor and make the handling of short idle
intervals in it consistent regardless of the properties of idle
states supplied by the cpuidle driver (Rafael Wysocki).
- Fix some boost-related issues in cpufreq (Lifeng Zheng).
- Fix build issues in the s3c64xx and airoha cpufreq drivers (Viresh
Kumar).
- Remove unconditional binding of schedutil governor kthreads to the
affected CPUs if the cpufreq driver indicates that updates can happen
from any CPU (Christian Loehle).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmeb5xYSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxcFsP/2FIoEI2G6J7pk8zChWT225qkkaieh5P
tHIkcFINlgzyjLnqmyWELUdt+sB7re6/dMmoLor+abudHimvBvUfAj6Oiz1F3p2F
utE9TpfhOkXi1ci5zBl9h6+iDj2Z5op3Qe/qw/W3DTlcManAD+6r60A2tOEy0jhi
GTbp2SEEU28+LU/2J59IfxEuRTTH4pbQGXi+iKv/k9bmtLvQofa1saXyQCBSZrvO
z3MBdqnAxLeZCg/qILmEGsBvbv1wpugvp3yoMLVwGNyul12Augcs8PreQz7e5tFq
spEuCfpBJwyJLAGlOnjOYgsPbJBXWRkIBeLH7JealfZr9TX0y4LZSAHi/xe0Asd3
BBZLLDojxhYMLzmqSkuafHlQd5J7jKl++RS1A9Qm6aqglKjeSOC9Ca9fmrc9Ub9P
Jpf1SVJ3kJsv1Z7wuUcaj6oLxD8wlAgCo5pigNgWTP2HllhP2bmc22M8JWxpAz+m
nMeW8nr8bAu4XViZeb74YKGgUDngO/uKRwBpthSkE2fqM7q0Wr5E7G0u9M91mzG6
nd/XDwta5TeznMQpSy339NgT61i4HHfyc/SDIdpkBxI0C5l6jknNawq79i9gy7/E
In4MyooOlls/iX/JR0uxx2hXEByltF0IHqwsRYeJ6dmIajYgASXR1Hhh5Iy9fPJ/
JTJ7vR5oZPB/
=EgsM
-----END PGP SIGNATURE-----
Merge tag 'pm-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are mostly fixes on top of the previously merged power
management material with the addition of some teo cpuidle governor
updates, some of which may also be regarded as fixes:
- Add missing error handling for syscore_suspend() to the hibernation
core code (Wentao Liang)
- Revert a commit that added unused macros (Andy Shevchenko)
- Synchronize the runtime PM status of devices that were runtime-
suspended before a system-wide suspend and need to be resumed
during the subsequent system-wide resume transition (Rafael
Wysocki)
- Clean up the teo cpuidle governor and make the handling of short
idle intervals in it consistent regardless of the properties of
idle states supplied by the cpuidle driver (Rafael Wysocki)
- Fix some boost-related issues in cpufreq (Lifeng Zheng)
- Fix build issues in the s3c64xx and airoha cpufreq drivers (Viresh
Kumar)
- Remove unconditional binding of schedutil governor kthreads to the
affected CPUs if the cpufreq driver indicates that updates can
happen from any CPU (Christian Loehle)"
* tag 'pm-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: core: Synchronize runtime PM status of parents and children
cpufreq: airoha: Depends on OF
PM: Revert "Add EXPORT macros for exporting PM functions"
PM: hibernate: Add error handling for syscore_suspend()
cpufreq/schedutil: Only bind threads if needed
cpufreq: ACPI: Remove set_boost in acpi_cpufreq_cpu_init()
cpufreq: CPPC: Fix wrong max_freq in policy initialization
cpufreq: Introduce a more generic way to set default per-policy boost flag
cpufreq: Fix re-boost issue after hotplugging a CPU
cpufreq: s3c64xx: Fix compilation warning
cpuidle: teo: Skip sleep length computation for low latency constraints
cpuidle: teo: Replace time_span_ns with a flag
cpuidle: teo: Simplify handling of total events count
cpuidle: teo: Skip getting the sleep length if wakeups are very frequent
cpuidle: teo: Simplify counting events used for tick management
cpuidle: teo: Clarify two code comments
cpuidle: teo: Drop local variable prev_intercept_idx
cpuidle: teo: Combine candidate state index checks against 0
cpuidle: teo: Reorder candidate state index checks
cpuidle: teo: Rearrange idle state lookup code
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.
Created this by running an spatch followed by a sed command:
Spatch:
virtual patch
@
depends on !(file in "net")
disable optional_qualifier
@
identifier table_name != {
watchdog_hardlockup_sysctl,
iwcm_ctl_table,
ucma_ctl_table,
memory_allocation_profiling_sysctls,
loadpin_sysctl_table
};
@@
+ const
struct ctl_table table_name [] = { ... };
sed:
sed --in-place \
-e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
kernel/utsname_sysctl.c
Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
As ext.c is becoming quite large, move the idle CPU selection policy to
separate files (ext_idle.c / ext_idle.h) for better code readability.
Moreover, group together all the idle CPU selection kfunc's to the same
btf_kfunc_id_set block.
No functional changes, this is purely code reorganization.
Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
While performing the rq locking dance in dispatch_to_local_dsq(), we may
trigger the following lock imbalance condition, in particular when
multiple tasks are rapidly changing CPU affinity (i.e., running a
`stress-ng --race-sched 0`):
[ 13.413579] =====================================
[ 13.413660] WARNING: bad unlock balance detected!
[ 13.413729] 6.13.0-virtme #15 Not tainted
[ 13.413792] -------------------------------------
[ 13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at:
[ 13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0
[ 13.414111] but there are no more locks to release!
[ 13.414176]
[ 13.414176] other info that might help us debug this:
[ 13.414258] 1 lock held by kworker/1:1/80:
[ 13.414318] #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[ 13.414612]
[ 13.414612] stack backtrace:
[ 13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
[ 13.415505] Workqueue: 0x0 (events)
[ 13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
[ 13.415570] Call Trace:
[ 13.415700] <TASK>
[ 13.415744] dump_stack_lvl+0x78/0xe0
[ 13.415806] ? dispatch_to_local_dsq+0x108/0x1a0
[ 13.415884] print_unlock_imbalance_bug+0x11b/0x130
[ 13.415965] ? dispatch_to_local_dsq+0x108/0x1a0
[ 13.416226] lock_release+0x231/0x2c0
[ 13.416326] _raw_spin_unlock+0x1b/0x40
[ 13.416422] dispatch_to_local_dsq+0x108/0x1a0
[ 13.416554] flush_dispatch_buf+0x199/0x1d0
[ 13.416652] balance_one+0x194/0x370
[ 13.416751] balance_scx+0x61/0x1e0
[ 13.416848] prev_balance+0x43/0xb0
[ 13.416947] __pick_next_task+0x6b/0x1b0
[ 13.417052] __schedule+0x20d/0x1740
This happens because dispatch_to_local_dsq() is racing with
dispatch_dequeue() and, when the latter wins, we incorrectly assume that
the task has been moved to dst_rq.
Fix by properly tracking the currently locked rq.
Fixes: 4d3ca89bdd ("sched_ext: Refactor consume_remote_task()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_move_task() is called from sched_move_task() and tells the BPF scheduler
that cgroup migration is being committed. sched_move_task() is used by both
cgroup and autogroup migrations and scx_move_task() tried to filter out
autogroup migrations by testing the destination cgroup and PF_EXITING but
this is not enough. In fact, without explicitly tagging the thread which is
doing the cgroup migration, there is no good way to tell apart
scx_move_task() invocations for racing migration to the root cgroup and an
autogroup migration.
This led to scx_move_task() incorrectly ignoring a migration from non-root
cgroup to an autogroup of the root cgroup triggering the following warning:
WARNING: CPU: 7 PID: 1 at kernel/sched/ext.c:3725 scx_cgroup_can_attach+0x196/0x340
...
Call Trace:
<TASK>
cgroup_migrate_execute+0x5b1/0x700
cgroup_attach_task+0x296/0x400
__cgroup_procs_write+0x128/0x140
cgroup_procs_write+0x17/0x30
kernfs_fop_write_iter+0x141/0x1f0
vfs_write+0x31d/0x4a0
__x64_sys_write+0x72/0xf0
do_syscall_64+0x82/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix it by adding an argument to sched_move_task() that indicates whether the
moving is for a cgroup or autogroup migration. After the change,
scx_move_task() is called only for cgroup migrations and renamed to
scx_cgroup_move_task().
Link: https://github.com/sched-ext/scx/issues/370
Fixes: 8195136669 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes the
page allocator so we end up with the ability to allocate and free
zero-refcount pages. So that callers (ie, slab) can avoid a refcount
inc & dec.
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use
large folios other than PMD-sized ones.
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and
fixes for this small built-in kernel selftest.
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part of
the mapletree code.
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups.
- "simplify split calculation" from Wei Yang provides a few fixes and a
test for the mapletree code.
- "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes
continues the work of moving vma-related code into the (relatively) new
mm/vma.c.
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the page
allocator.
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue. It
should reduce the amount of unnecessary reading.
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated
(https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/).
Qi's series addresses this windup by synchronously freeing PTE memory
within the context of madvise(MADV_DONTNEED).
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests code
when optional compiler warnings are enabled.
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from David
Hildenbrand tightens the allocator's observance of __GFP_HARDWALL.
- "pkeys kselftests improvements" from Kevin Brodsky implements various
fixes and cleanups in the MM selftests code, mainly pertaining to the
pkeys tests.
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size.
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic.
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based
kernel build was demonstrated.
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of zram_write_page().
A watchdog softlockup warning was eliminated.
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky
cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed.
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging logic.
- "Account page tables at all levels" from Kevin Brodsky cleans up and
regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy.
- "mm/damon: replace most damon_callback usages in sysfs with new core
functions" from SeongJae Park cleans up and generalizes DAMON's sysfs
file interface logic.
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is presented in
response to DAMOS actions.
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes
DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs
is completed.
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter
Xu cleans up and generalizes the hugetlb reservation accounting.
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface.
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting), but
also inclusion (allowing) behavior.
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
"introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to reduce
the size of struct page and to enable dynamic allocation of memory
descriptors."
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes and
simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel build
time with swap-on-zram.
- "mm: update mips to use do_mmap(), make mmap_region() internal" from
Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal.
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU
regressions and otherwise improves MGLRU performance.
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park
updates DAMON documentation.
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing.
- "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand
provides various cleanups in the areas of hugetlb folios, THP folios and
migration.
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache
reading and writing. To permite userspace to address issues with
massive buildup of useless pagecache when reading/writing fast devices.
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA
jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz
uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE=
=Ib2s
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes
the page allocator so we end up with the ability to allocate and
free zero-refcount pages. So that callers (ie, slab) can avoid a
refcount inc & dec
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
use large folios other than PMD-sized ones
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
and fixes for this small built-in kernel selftest
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part
of the mapletree code
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups
- "simplify split calculation" from Wei Yang provides a few fixes and
a test for the mapletree code
- "mm/vma: make more mmap logic userland testable" from Lorenzo
Stoakes continues the work of moving vma-related code into the
(relatively) new mm/vma.c
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the
page allocator
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue.
It should reduce the amount of unnecessary reading
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated:
https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
Qi's series addresses this windup by synchronously freeing PTE
memory within the context of madvise(MADV_DONTNEED)
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests
code when optional compiler warnings are enabled
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from
David Hildenbrand tightens the allocator's observance of
__GFP_HARDWALL
- "pkeys kselftests improvements" from Kevin Brodsky implements
various fixes and cleanups in the MM selftests code, mainly
pertaining to the pkeys tests
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a
tmpfs-based kernel build was demonstrated
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of
zram_write_page(). A watchdog softlockup warning was eliminated
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
Brodsky cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging
logic
- "Account page tables at all levels" from Kevin Brodsky cleans up
and regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy
- "mm/damon: replace most damon_callback usages in sysfs with new
core functions" from SeongJae Park cleans up and generalizes
DAMON's sysfs file interface logic
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is
presented in response to DAMOS actions
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park
removes DAMON's long-deprecated debugfs interfaces. Thus the
migration to sysfs is completed
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
Peter Xu cleans up and generalizes the hugetlb reservation
accounting
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting),
but also inclusion (allowing) behavior
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to
reduce the size of struct page and to enable dynamic allocation of
memory descriptors
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes
and simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel
build time with swap-on-zram
- "mm: update mips to use do_mmap(), make mmap_region() internal"
from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few
MGLRU regressions and otherwise improves MGLRU performance
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
Park updates DAMON documentation
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing
- "mm: hugetlb+THP folio and migration cleanups" from David
Hildenbrand provides various cleanups in the areas of hugetlb
folios, THP folios and migration
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for
pagecache reading and writing. To permite userspace to address
issues with massive buildup of useless pagecache when
reading/writing fast devices
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests"
* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm/compaction: fix UBSAN shift-out-of-bounds warning
s390/mm: add missing ctor/dtor on page table upgrade
kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
tools: add VM_WARN_ON_VMG definition
mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
seqlock: add missing parameter documentation for raw_seqcount_try_begin()
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
mm/page_alloc: remove the incorrect and misleading comment
zram: remove zcomp_stream_put() from write_incompressible_page()
mm: separate move/undo parts from migrate_pages_batch()
mm/kfence: use str_write_read() helper in get_access_type()
selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
selftests/mm: vm_util: split up /proc/self/smaps parsing
selftests/mm: virtual_address_range: unmap chunks after validation
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
selftests/memfd/memfd_test: fix possible NULL pointer dereference
mm: add FGP_DONTCACHE folio creation flag
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
...
this pull are:
- "lib min_heap: Improve min_heap safety, testing, and documentation"
from Kuan-Wei Chiu provides various tightenings to the min_heap library
code.
- "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some
cleanup and Rust preparation in the xarray library code.
- "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes
pathnames in some code comments.
- "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the
new secs_to_jiffies() in various places where that is appropriate.
- "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
switches two filesystems to the new mount API.
- "Convert ocfs2 to use folios" from Matthew Wilcox does that.
- "Remove get_task_comm() and print task comm directly" from Yafang Shao
removes now-unneeded calls to get_task_comm() in various places.
- "squashfs: reduce memory usage and update docs" from Phillip Lougher
implements some memory savings in squashfs and performs some
maintainability work.
- "lib: clarify comparison function requirements" from Kuan-Wei Chiu
tightens the sort code's behaviour and adds some maintenance work.
- "nilfs2: protect busy buffer heads from being force-cleared" from
Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a
corrupted image.
- "nilfs2: fix kernel-doc comments for function return values" from
Ryusuke Konishi fixes some nilfs kerneldoc.
- "nilfs2: fix issues with rename operations" from Ryusuke Konishi
addresses some nilfs BUG_ONs which syzbot was able to trigger.
- "minmax.h: Cleanups and minor optimisations" from David Laight
does some maintenance work on the min/max library code.
- "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work
on the xarray library code.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5SP5QAKCRDdBJ7gKXxA
jqN7AQChvwXGG43n4d5SDiA/rH7ddvowQcDqhC9cAMJ1ReR7qwEA8/LIWDE4PdMX
mJnaZ1/ibpEpearrChCViApQtcyEGQI=
=ti4E
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Mainly individually changelogged singleton patches. The patch series
in this pull are:
- "lib min_heap: Improve min_heap safety, testing, and documentation"
from Kuan-Wei Chiu provides various tightenings to the min_heap
library code
- "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms
some cleanup and Rust preparation in the xarray library code
- "Update reference to include/asm-<arch>" from Geert Uytterhoeven
fixes pathnames in some code comments
- "Converge on using secs_to_jiffies()" from Easwar Hariharan uses
the new secs_to_jiffies() in various places where that is
appropriate
- "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
switches two filesystems to the new mount API
- "Convert ocfs2 to use folios" from Matthew Wilcox does that
- "Remove get_task_comm() and print task comm directly" from Yafang
Shao removes now-unneeded calls to get_task_comm() in various
places
- "squashfs: reduce memory usage and update docs" from Phillip
Lougher implements some memory savings in squashfs and performs
some maintainability work
- "lib: clarify comparison function requirements" from Kuan-Wei Chiu
tightens the sort code's behaviour and adds some maintenance work
- "nilfs2: protect busy buffer heads from being force-cleared" from
Ryusuke Konishi fixes an issues in nlifs when the fs is presented
with a corrupted image
- "nilfs2: fix kernel-doc comments for function return values" from
Ryusuke Konishi fixes some nilfs kerneldoc
- "nilfs2: fix issues with rename operations" from Ryusuke Konishi
addresses some nilfs BUG_ONs which syzbot was able to trigger
- "minmax.h: Cleanups and minor optimisations" from David Laight does
some maintenance work on the min/max library code
- "Fixes and cleanups to xarray" from Kemeng Shi does maintenance
work on the xarray library code"
* tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits)
ocfs2: use str_yes_no() and str_no_yes() helper functions
include/linux/lz4.h: add some missing macros
Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent
Xarray: remove repeat check in xas_squash_marks()
Xarray: distinguish large entries correctly in xas_split_alloc()
Xarray: move forward index correctly in xas_pause()
Xarray: do not return sibling entries from xas_find_marked()
ipc/util.c: complete the kernel-doc function descriptions
gcov: clang: use correct function param names
latencytop: use correct kernel-doc format for func params
minmax.h: remove some #defines that are only expanded once
minmax.h: simplify the variants of clamp()
minmax.h: move all the clamp() definitions after the min/max() ones
minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
minmax.h: reduce the #define expansion of min(), max() and clamp()
minmax.h: update some comments
minmax.h: add whitespace around operators and after commas
nilfs2: do not update mtime of renamed directory that is not moved
nilfs2: handle errors that nilfs_prepare_chunk() may return
CREDITS: fix spelling mistake
...
Resending this patch as I haven't received feedback on my initial
submission https://lore.kernel.org/all/20241204182953.10854-1-oxana@cloudflare.com/
For the processes which are terminated abnormally the kernel can provide
a coredump if enabled. When the coredump is performed, the process and
all its threads are put into the D state
(TASK_UNINTERRUPTIBLE | TASK_FREEZABLE).
On the other hand, we have kernel thread khungtaskd which monitors the
processes in the D state. If the task stuck in the D state more than
kernel.hung_task_timeout_secs, the hung_task alert appears in the kernel
log.
The higher memory usage of a process, the longer it takes to create
coredump, the longer tasks are in the D state. We have hung_task alerts
for the processes with memory usage above 10Gb. Although, our
kernel.hung_task_timeout_secs is 10 sec when the default is 120 sec.
Adding additional information to the log that the task is blocked by
coredump will help with monitoring. Another approach might be to
completely filter out alerts for such tasks, but in that case we would
lose transparency about what is putting pressure on some system
resources, e.g. we saw an increase in I/O when coredump occurs due its
writing to disk.
Additionally, it would be helpful to have task_struct->flags in the log
from the function sched_show_task(). Currently it prints
task_struct->thread_info->flags, this seems misleading as the line
starts with "task:xxxx".
[akpm@linux-foundation.org: fix printk control string]
Link: https://lkml.kernel.org/r/20250110160328.64947-1-oxana@cloudflare.com
Signed-off-by: Oxana Kharitonova <oxana@cloudflare.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Segall <bsegall@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Report the task weight when dumping the task state during an error exit.
Moreover, adjust the output format to display dsq_vtime, slice, and
weight on the same line.
This can help identify whether certain tasks were excessively
prioritized or de-prioritized due to large niceness gaps.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- scx_bpf_now() added so that BPF scheduler can access the cached timestamp
in struct rq to avoid reading TSC multiple times within a locked
scheduling operation.
- Minor updates to the built-in idle CPU selection logic.
- tool/sched_ext updates and other misc changes.
Pulling sched_ext/for-6.14 into master causes a merge conflict between the
following two commits (first commit in master, second in for-6.14):
a2a3374c47 sched_ext: idle: Refresh idle masks during idle-to-idle transitions
9cf9aceed2 sched_ext: idle: use assign_cpu() to update the idle cpumask
static void update_builtin_idle(int cpu, bool idle)
{
<<<<<<< HEAD
if (idle)
cpumask_set_cpu(cpu, idle_masks.cpu);
else
cpumask_clear_cpu(cpu, idle_masks.cpu);
=======
int cpu = cpu_of(rq);
if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
if (!static_branch_unlikely(&scx_builtin_idle_enabled))
return;
}
assign_cpu(cpu, idle_masks.cpu, idle);
>>>>>>> 987ce79b52
The first commit factored out update_builtin_idle() and the second replaced
cpumask_set/clear_cpu() calls with assign_cpu(). The conflict can be
resolved by taking the code from the first and then replacing the
cpumask_set/clear_cpu() calls with assign_cpu():
static void update_builtin_idle(int cpu, bool idle)
{
assign_cpu(cpu, idle_masks.cpu, idle);
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ4/yjw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGc/3AQChJ2zxBLMtQOfNVWgqhAyatCnFquMncl/IeGAs
Sf4eawD/QHNWYBPx0nOFQS8RZNmUcXuEDeHMy+1MvTUaIDL5MQM=
=6t0t
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- scx_bpf_now() added so that BPF scheduler can access the cached
timestamp in struct rq to avoid reading TSC multiple times within a
locked scheduling operation.
- Minor updates to the built-in idle CPU selection logic.
- tool/sched_ext updates and other misc changes.
* tag 'sched_ext-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: fix kernel-doc warnings
sched_ext: Use time helpers in BPF schedulers
sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now()
sched_ext: Add time helpers for BPF schedulers
sched_ext: Add scx_bpf_now() for BPF scheduler
sched_ext: Implement scx_bpf_now()
sched_ext: Relocate scx_enabled() related code
sched_ext: Add option -l in selftest runner to list all available tests
sched_ext: Include remaining task time slice in error state dump
sched_ext: update scx_bpf_dsq_insert() doc for SCX_DSQ_LOCAL_ON
sched_ext: idle: small CPU iteration refactoring
sched_ext: idle: introduce check_builtin_idle_enabled() helper
sched_ext: idle: clarify comments
sched_ext: idle: use assign_cpu() to update the idle cpumask
sched_ext: Use str_enabled_disabled() helper in update_selcpu_topology()
sched_ext: Use sizeof_field for key_len in dsq_hash_params
tools/sched_ext: Receive updates from SCX repo
sched_ext: Use the NUMA scheduling domain for NUMA optimizations
Remove the unconditional binding of sugov kthreads to the affected CPUs
if the cpufreq driver indicates that updates can happen from any CPU.
This allows userspace to set affinities to either save power (waking up
bigger CPUs on HMP can be expensive) or increasing performance (by
letting the utilized CPUs run without preemption of the sugov kthread).
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/5a8deed4-7764-4729-a9d4-9520c25fa7e8@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Use str_enable_disable()-like helpers in cpufreq (Krzysztof
Kozlowski).
- Extend the Apple cpufreq driver to support more SoCs (Hector Martin,
Nick Chan).
- Add new cpufreq driver for Airoha SoCs (Christian Marangi).
- Fix using cpufreq-dt as module (Andreas Kemnade).
- Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter
Edwards, Sibi Sankar, Manivannan Sadhasivam).
- Fix the maximum supported frequency computation in the ACPI cpufreq
driver to avoid relying on unfounded assumptions (Gautham Shenoy).
- Fix an amd-pstate driver regression with preferred core rankings not
being used (Mario Limonciello).
- Fix a precision issue with frequency calculation in the amd-pstate
driver (Naresh Solanki).
- Add ftrace event to the amd-pstate driver for active mode (Mario
Limonciello).
- Set default EPP policy on Ryzen processors in amd-pstate (Mario
Limonciello).
- Clean up the amd-pstate cpufreq driver and optimize it to increase
code reuse (Mario Limonciello, Dhananjay Ugwekar).
- Use CPPC to get scaling factors between HWP performance levels and
frequency in the intel_pstate driver and make it stop using a built
-in scaling factor for Arrow Lake processors (Rafael Wysocki).
- Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN for
consistency with CPU offline (Christian Loehle).
- Fix superfluous updates caused by need_freq_update in the schedutil
cpufreq governor (Sultan Alsawaf).
- Allow configuring the system suspend-resume (DPM) watchdog to warn
earlier than panic (Douglas Anderson).
- Implement devm_device_init_wakeup() helper and introduce a device-
managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan).
- Remove direct inclusions of 'pm_wakeup.h' which should be only
included via 'device.h' (Wolfram Sang).
- Clean up two comments in the core system-wide PM code (Rafael
Wysocki, Randy Dunlap).
- Add Clearwater Forest processor support to the intel_idle cpuidle
driver (Artem Bityutskiy).
- Clean up the Exynos devfreq driver and devfreq core (Markus Elfring,
Jeongjun Park).
- Minor cleanups and fixes for OPP (Dan Carpenter, Neil Armstrong, Joe
Hattori).
- Implement dev_pm_opp_get_bw() (Neil Armstrong).
- Expose OPP reference counting helpers for Rust (Viresh Kumar).
- Fix TSC MHz calculation in cpupower (He Rongguang).
- Add install and uninstall options to bindings Makefile and add header
changes for cpufreq.h to SWIG bindings in cpupower (John B. Wyatt IV).
- Add missing residency header changes in cpuidle.h to SWIG bindings in
cpupower (John B. Wyatt IV).
- Add output files to .gitignore and clean them up in "make clean" in
selftests/cpufreq (Li Zhijian).
- Fix cross-compilation in cpupower Makefile (Peng Fan).
- Revise the is_valid flag handling for idle_monitor in the cpupower
utility (wangfushuai).
- Extend and clean up AMD processors support in cpupower (Mario
Limonciello).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmeOthsSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxqQsP/ivDt8nqDnxdKB7cKFQIsEK+tl0RnFVD
o5regvYeRcGWpUXuMaqBtTmCMjsB8bUkcj2yLquM54ubjHAGF6zJuw9ZytMPHVcC
b2xk3RCFlXSBFXVK8eOh3XRviA9nGhuY97ZnPsQOlvoECrxT2xyeL+mWo7s+t+q9
2NUH+yfRoi5FM+nqqDhsm0xXxJuPaNg6eAjIASuMjXap48rNk3L5kW6W/6nw7i0I
xQWd/pKLHaI5e7DRF/QdMKu8+Fm4BbN0jMqLblKPOmTe9KggvBkck5q1Um20sYkJ
vdKMAT02ClGavIC7DtY092Xik84NZfID4ZUchS6e2hJIQ3Uaw/eDvAo/jlT8gIzq
fnXPdApRIzQGDvMxFaAsKaGlwxiVlAGHPDSTH6MVWzsp+1DSkbloSwVPAfeYIn44
Jhov+6Ydux3597sSjo+YmD58acimXl7urVuk8P6m3U5+gb8/jlgbxpIn+vbxH3Ka
o44Vt7axD63gezOQY134sj5gic5JL0GuZovOlvzrF6+FsjvVqcax6FZ4n3uIXu7P
C1nwai+Wdzo7wvuz7RfO0g15Y15wYLQLYsRq/osRlf+sOmGVv7nA9tSzZ0LUdD5D
Pp6PxppF6anM0Kjen8Ppuu+Bcr11JfVvhnVTJqhs6u71XdAy4TnG1JjL4lPWYJ4D
Gfz2hyPNjiQX
=AoMC
-----END PGP SIGNATURE-----
Merge tag 'pm-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"The majority of changes here are cpufreq updates which are dominated
by amd-pstate driver changes, like in the previous cycle. Moreover,
changes related to amd-pstate are also the majority of cpupower
utility updates.
Included are some pieces of new hardware support, like the addition of
Clearwater Forest processors support to intel_idle, new cpufreq driver
for Airoha SoCs, and Apple cpufreq driver extensions to support more
SoCs. The intel_pstate driver is also extended to be able to support
new platforms by using ACPI CPPC to compute scaling factors between
HWP performance states and frequency.
The rest is mostly fixes and cleanups in assorted pieces of power
management code.
Specifics:
- Use str_enable_disable()-like helpers in cpufreq (Krzysztof
Kozlowski)
- Extend the Apple cpufreq driver to support more SoCs (Hector
Martin, Nick Chan)
- Add new cpufreq driver for Airoha SoCs (Christian Marangi)
- Fix using cpufreq-dt as module (Andreas Kemnade)
- Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter
Edwards, Sibi Sankar, Manivannan Sadhasivam)
- Fix the maximum supported frequency computation in the ACPI cpufreq
driver to avoid relying on unfounded assumptions (Gautham Shenoy)
- Fix an amd-pstate driver regression with preferred core rankings
not being used (Mario Limonciello)
- Fix a precision issue with frequency calculation in the amd-pstate
driver (Naresh Solanki)
- Add ftrace event to the amd-pstate driver for active mode (Mario
Limonciello)
- Set default EPP policy on Ryzen processors in amd-pstate (Mario
Limonciello)
- Clean up the amd-pstate cpufreq driver and optimize it to increase
code reuse (Mario Limonciello, Dhananjay Ugwekar)
- Use CPPC to get scaling factors between HWP performance levels and
frequency in the intel_pstate driver and make it stop using a
built-in scaling factor for Arrow Lake processors (Rafael Wysocki)
- Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN
for consistency with CPU offline (Christian Loehle)
- Fix superfluous updates caused by need_freq_update in the schedutil
cpufreq governor (Sultan Alsawaf)
- Allow configuring the system suspend-resume (DPM) watchdog to warn
earlier than panic (Douglas Anderson)
- Implement devm_device_init_wakeup() helper and introduce a device-
managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan)
- Remove direct inclusions of 'pm_wakeup.h' which should be only
included via 'device.h' (Wolfram Sang)
- Clean up two comments in the core system-wide PM code (Rafael
Wysocki, Randy Dunlap)
- Add Clearwater Forest processor support to the intel_idle cpuidle
driver (Artem Bityutskiy)
- Clean up the Exynos devfreq driver and devfreq core (Markus
Elfring, Jeongjun Park)
- Minor cleanups and fixes for OPP (Dan Carpenter, Neil Armstrong,
Joe Hattori)
- Implement dev_pm_opp_get_bw() (Neil Armstrong)
- Expose OPP reference counting helpers for Rust (Viresh Kumar)
- Fix TSC MHz calculation in cpupower (He Rongguang)
- Add install and uninstall options to bindings Makefile and add
header changes for cpufreq.h to SWIG bindings in cpupower (John B.
Wyatt IV)
- Add missing residency header changes in cpuidle.h to SWIG bindings
in cpupower (John B. Wyatt IV)
- Add output files to .gitignore and clean them up in "make clean" in
selftests/cpufreq (Li Zhijian)
- Fix cross-compilation in cpupower Makefile (Peng Fan)
- Revise the is_valid flag handling for idle_monitor in the cpupower
utility (wangfushuai)
- Extend and clean up AMD processors support in cpupower (Mario
Limonciello)"
* tag 'pm-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (67 commits)
PM / OPP: Add reference counting helpers for Rust implementation
PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq()
cpufreq: Use str_enable_disable()-like helpers
cpufreq: airoha: Add EN7581 CPUFreq SMCCC driver
PM: sleep: Allow configuring the DPM watchdog to warn earlier than panic
PM: sleep: convert comment from kernel-doc to plain comment
cpufreq: ACPI: Fix max-frequency computation
pm: cpupower: Add missing residency header changes in cpuidle.h to SWIG
PM / devfreq: exynos: remove unused function parameter
OPP: OF: Fix an OF node leak in _opp_add_static_v2()
cpufreq/amd-pstate: Refactor max frequency calculation
cpufreq/amd-pstate: Fix prefcore rankings
pm: cpupower: Add header changes for cpufreq.h to SWIG bindings
cpufreq: sparc: change kzalloc to kcalloc
cpufreq: qcom: Implement clk_ops::determine_rate() for qcom_cpufreq* clocks
cpufreq: qcom: Fix qcom_cpufreq_hw_recalc_rate() to query LUT if LMh IRQ is not available
cpufreq: apple-soc: Add Apple A7-A8X SoC cpufreq support
cpufreq: apple-soc: Set fallback transition latency to APPLE_DVFS_TRANSITION_TIMEOUT
cpufreq: apple-soc: Increase cluster switch timeout to 400us
cpufreq: apple-soc: Use 32-bit read for status register
...
1) Per-CPU kthreads must stay affine to a single CPU and never execute
relevant code on any other CPU. This is currently handled by smpboot
code which takes care of CPU-hotplug operations. Affinity here is
a correctness constraint.
2) Some kthreads _have_ to be affine to a specific set of CPUs and can't
run anywhere else. The affinity is set through kthread_bind_mask()
and the subsystem takes care by itself to handle CPU-hotplug
operations. Affinity here is assumed to be a correctness constraint.
3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This
is not a correctness constraint but merely a preference in terms of
memory locality. kswapd and kcompactd both fall into this category.
The affinity is set manually like for any other task and CPU-hotplug
is supposed to be handled by the relevant subsystem so that the task
is properly reaffined whenever a given CPU from the node comes up.
Also care should be taken so that the node affinity doesn't cross
isolated (nohz_full) cpumask boundaries.
4) Similar to the previous point except kthreads have a _preferred_
affinity different than a node. Both RCU boost kthreads and RCU
exp kworkers fall into this category as they refer to "RCU nodes"
from a distinctly distributed tree.
Currently the preferred affinity patterns (3 and 4) have at least 4
identified users, with more or less success when it comes to handle
CPU-hotplug operations and CPU isolation. Each of which do it in its own
ad-hoc way.
This is an infrastructure proposal to handle this with the following API
changes:
_ kthread_create_on_node() automatically affines the created kthread to
its target node unless it has been set as per-cpu or bound with
kthread_bind[_mask]() before the first wake-up.
- kthread_affine_preferred() is a new function that can be called right
after kthread_create_on_node() to specify a preferred affinity
different than the specified node.
When the preferred affinity can't be applied because the possible
targets are offline or isolated (nohz_full), the kthread is affine
to the housekeeping CPUs (which means to all online CPUs most of the
time or only the non-nohz_full CPUs when nohz_full= is set).
kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
converted, along with a few old drivers.
Summary of the changes:
* Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu()
* Introduce task_cpu_fallback_mask() that defines the default last
resort affinity of a task to become nohz_full aware
* Add some correctness check to ensure kthread_bind() is always called
before the first kthread wake up.
* Default affine kthread to its preferred node.
* Convert kswapd / kcompactd and remove their halfway working ad-hoc
affinity implementation
* Implement kthreads preferred affinity
* Unify kthread worker and kthread API's style
* Convert RCU kthreads to the new API and remove the ad-hoc affinity
implementation.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmeNf8gACgkQhSRUR1CO
jHedQQ/+IxTjjqQiItzrq41TES2S0desHDq8lNJFb7rsR/DtKFyLx3s67cOYV+cM
Yx54QHg2m/Fz4nXMQ7Po5ygOtJGCKBc5C5QQy7y0lVKeTQK+daDfEtBSa3oG7j3C
u+E3tTY6qxkbCzymUyaKkHN4/ay2vLvjFS50luV7KMyI3x47Aji+t7VdCX4LCPP2
eAwOALWD0+7qLJ/VF6gsmQLKA4Qx7PQAzBa3KSBmUN9UcN8Gk1bQHCTIQKDHP9LQ
v8BXrNZtYX1o2+snNYpX2z6/ECjxkdwriOgqqZY5306hd9RAQ1u46Dx3byrIqjGn
ULG/XQ2istPyhTqb/h+RbrobdOcwEUIeqk8hRRbBXE8bPpqUz9EMuaCMxWDbQjgH
NTuKG4ifKJ/IqstkkuDkdOiByE/ysMmwqrTXgSnu2ITNL9yY3BEgFbvA95hgo42s
f7QCxEfZb1MHcNEMENSMwM3xw5lLMGMpxVZcMQ3gLwyotMBRrhFZm1qZJG7TITYW
IDIeCbH4JOMdQwLs3CcWTXio0N5/85NhRNFV+IDn96OrgxObgnMtV8QwNgjXBAJ5
wGeJWt8s34W1Zo3qS9gEuVzEhW4XaxISQQMkHe8faKkK6iHmIB/VjSQikDwwUNQ/
AspYj82RyWBCDZsqhiYh71kpxjvS6Xp0bj39Ce1sNsOnuksxKkQ=
=g8In
-----END PGP SIGNATURE-----
Merge tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
"Kthreads affinity follow either of 4 existing different patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never
execute relevant code on any other CPU. This is currently handled
by smpboot code which takes care of CPU-hotplug operations.
Affinity here is a correctness constraint.
2) Some kthreads _have_ to be affine to a specific set of CPUs and
can't run anywhere else. The affinity is set through
kthread_bind_mask() and the subsystem takes care by itself to
handle CPU-hotplug operations. Affinity here is assumed to be a
correctness constraint.
3) Per-node kthreads _prefer_ to be affine to a specific NUMA node.
This is not a correctness constraint but merely a preference in
terms of memory locality. kswapd and kcompactd both fall into this
category. The affinity is set manually like for any other task and
CPU-hotplug is supposed to be handled by the relevant subsystem so
that the task is properly reaffined whenever a given CPU from the
node comes up. Also care should be taken so that the node affinity
doesn't cross isolated (nohz_full) cpumask boundaries.
4) Similar to the previous point except kthreads have a _preferred_
affinity different than a node. Both RCU boost kthreads and RCU
exp kworkers fall into this category as they refer to "RCU nodes"
from a distinctly distributed tree.
Currently the preferred affinity patterns (3 and 4) have at least 4
identified users, with more or less success when it comes to handle
CPU-hotplug operations and CPU isolation. Each of which do it in its
own ad-hoc way.
This is an infrastructure proposal to handle this with the following
API changes:
- kthread_create_on_node() automatically affines the created kthread
to its target node unless it has been set as per-cpu or bound with
kthread_bind[_mask]() before the first wake-up.
- kthread_affine_preferred() is a new function that can be called
right after kthread_create_on_node() to specify a preferred
affinity different than the specified node.
When the preferred affinity can't be applied because the possible
targets are offline or isolated (nohz_full), the kthread is affine to
the housekeeping CPUs (which means to all online CPUs most of the time
or only the non-nohz_full CPUs when nohz_full= is set).
kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
converted, along with a few old drivers.
Summary of the changes:
- Consolidate a bunch of ad-hoc implementations of
kthread_run_on_cpu()
- Introduce task_cpu_fallback_mask() that defines the default last
resort affinity of a task to become nohz_full aware
- Add some correctness check to ensure kthread_bind() is always
called before the first kthread wake up.
- Default affine kthread to its preferred node.
- Convert kswapd / kcompactd and remove their halfway working ad-hoc
affinity implementation
- Implement kthreads preferred affinity
- Unify kthread worker and kthread API's style
- Convert RCU kthreads to the new API and remove the ad-hoc affinity
implementation"
* tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
kthread: modify kernel-doc function name to match code
rcu: Use kthread preferred affinity for RCU exp kworkers
treewide: Introduce kthread_run_worker[_on_cpu]()
kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format
rcu: Use kthread preferred affinity for RCU boost
kthread: Implement preferred affinity
mm: Create/affine kswapd to its preferred node
mm: Create/affine kcompactd to its preferred node
kthread: Default affine kthread to its preferred NUMA node
kthread: Make sure kthread hasn't started while binding it
sched,arm64: Handle CPU isolation on last resort fallback rq selection
arm64: Exclude nohz_full CPUs from 32bits el0 support
lib: test_objpool: Use kthread_run_on_cpu()
kallsyms: Use kthread_run_on_cpu()
soc/qman: test: Use kthread_run_on_cpu()
arm/bL_switcher: Use kthread_run_on_cpu()
- Fair scheduler (SCHED_FAIR) enhancements:
- Behavioral improvements:
- Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
- Delayed-dequeue enhancements & fixes: (Vincent Guittot)
- Rename h_nr_running into h_nr_queued
- Add new cfs_rq.h_nr_runnable
- Use the new cfs_rq.h_nr_runnable
- Removed unsued cfs_rq.h_nr_delayed
- Rename cfs_rq.idle_h_nr_running into h_nr_idle
- Remove unused cfs_rq.idle_nr_running
- Rename cfs_rq.nr_running into nr_queued
- Do not try to migrate delayed dequeue task
- Fix variable declaration position
- Encapsulate set custom slice in a __setparam_fair() function
- Fixes:
- Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
- Fix CPU bandwidth limit bypass during CPU hotplug (Vishal Chourasia)
- Cleanups:
- Clean up in migrate_degrades_locality() to improve
readability (Peter Zijlstra)
- Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
- Update comments after sched_tick() rename (Sebastian Andrzej Siewior)
- Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
(Valentin Schneider)
- Deadline scheduler (SCHED_DL) enhancements:
- Restore dl_server bandwidth on non-destructive root domain
changes (Juri Lelli)
- Correctly account for allocated bandwidth during
hotplug (Juri Lelli)
- Check bandwidth overflow earlier for hotplug (Juri Lelli)
- Clean up goto label in pick_earliest_pushable_dl_task()
(John Stultz)
- Consolidate timer cancellation (Wander Lairson Costa)
- Load-balancer enhancements:
- Improve performance by prioritizing migrating eligible
tasks in sched_balance_rq() (Hao Jia)
- Do not compute NUMA Balancing stats unnecessarily during
load-balancing (K Prateek Nayak)
- Do not compute overloaded status unnecessarily during
load-balancing (K Prateek Nayak)
- Generic scheduling code enhancements:
- Use READ_ONCE() in task_on_rq_queued(), to consistently use
the WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)
- Isolated CPUs support enhancements: (Waiman Long)
- Make "isolcpus=nohz" equivalent to "nohz_full"
- Consolidate housekeeping cpumasks that are always identical
- Remove HK_TYPE_SCHED
- Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
- RSEQ enhancements:
- Validate read-only fields under DEBUG_RSEQ config
(Mathieu Desnoyers)
- PSI enhancements:
- Fix race when task wakes up before psi_sched_switch()
adjusts flags (Chengming Zhou)
- IRQ time accounting performance enhancements: (Yafang Shao)
- Define sched_clock_irqtime as static key
- Don't account irq time if sched_clock_irqtime is disabled
- Virtual machine scheduling enhancements:
- Don't try to catch up excess steal time (Suleiman Souhlal)
- Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak)
- Convert "sysctl_sched_itmt_enabled" to boolean
- Use guard() for itmt_update_mutex
- Move the "sched_itmt_enabled" sysctl to debugfs
- Remove x86_smt_flags and use cpu_smt_flags directly
- Use x86_sched_itmt_flags for PKG domain unconditionally
- Debugging code & instrumentation enhancements:
- Change need_resched warnings to pr_err() (David Rientjes)
- Print domain name in /proc/schedstat (K Prateek Nayak)
- Fix value reported by hot tasks pulled in /proc/schedstat (Peter Zijlstra)
- Report the different kinds of imbalances in /proc/schedstat (Swapnil Sapkal)
- Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
- Update Schedstat version to 17 (Swapnil Sapkal)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmePSRcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hrdBAAjYiLl5Q8SHM0xnl+kbvuUkCTgEB/gSgA
mfrZtHRUgRZuA89NZ9NljlCkQSlsLTOjnpNuaeFzs529GMg9iemc99dbnz3BP5F3
V5qpYvWe7yIkJ3hd0TOGLmYEPMNQaAW57YBOrxcPjWNLJ4cr9iMdccVA1OQtcmqD
ZUh3nibv81QI8HDmT2G+figxEIqH3yBV1+SmEIxbrdkQpIJ5702Ng6+0KQK5TShN
xwjFELWZUl2TfkoCc4nkIpkImV6cI1DvXSw1xK6gbb1xEVOrsmFW3TYFw4trKHBu
2RBG4wtmzNjh+12GmSdIBJHogPNcay+JIJW9EG/unT7jirqzkkeP1X2eJEbh+X1L
CMa7GsD9Vy72jCzeJDMuiy7bKfG/MiKUtDXrAZQDo2atbw7H88QOzMuTE5a5WSV+
tRxXGI/dgFVOk+JQUfctfJbYeXjmG8GAflawvXtGDAfDZsja6M+65fH8p0AOgW1E
HHmXUzAe2E2xQBiSok/DYHPQeCDBAjoJvU93YhGiXv8UScb2UaD4BAfzfmc8P+Zs
Eox6444ah5U0jiXmZ3HU707n1zO+Ql4qKoyyMJzSyP+oYHE/Do7NYTElw2QovVdN
FX/9Uae8T4ttA/5lFe7FNoXgKvSxXDKYyKLZcysjVrWJF866Ui/TWtmxA6w8Osn7
sfucuLawLPM=
=5ZNW
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduler (SCHED_FAIR) enhancements:
- Behavioral improvements:
- Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
- Delayed-dequeue enhancements & fixes: (Vincent Guittot)
- Rename h_nr_running into h_nr_queued
- Add new cfs_rq.h_nr_runnable
- Use the new cfs_rq.h_nr_runnable
- Removed unsued cfs_rq.h_nr_delayed
- Rename cfs_rq.idle_h_nr_running into h_nr_idle
- Remove unused cfs_rq.idle_nr_running
- Rename cfs_rq.nr_running into nr_queued
- Do not try to migrate delayed dequeue task
- Fix variable declaration position
- Encapsulate set custom slice in a __setparam_fair() function
- Fixes:
- Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
- Fix CPU bandwidth limit bypass during CPU hotplug (Vishal
Chourasia)
- Cleanups:
- Clean up in migrate_degrades_locality() to improve readability
(Peter Zijlstra)
- Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
- Update comments after sched_tick() rename (Sebastian Andrzej
Siewior)
- Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
(Valentin Schneider)
Deadline scheduler (SCHED_DL) enhancements:
- Restore dl_server bandwidth on non-destructive root domain changes
(Juri Lelli)
- Correctly account for allocated bandwidth during hotplug (Juri
Lelli)
- Check bandwidth overflow earlier for hotplug (Juri Lelli)
- Clean up goto label in pick_earliest_pushable_dl_task() (John
Stultz)
- Consolidate timer cancellation (Wander Lairson Costa)
Load-balancer enhancements:
- Improve performance by prioritizing migrating eligible tasks in
sched_balance_rq() (Hao Jia)
- Do not compute NUMA Balancing stats unnecessarily during
load-balancing (K Prateek Nayak)
- Do not compute overloaded status unnecessarily during
load-balancing (K Prateek Nayak)
Generic scheduling code enhancements:
- Use READ_ONCE() in task_on_rq_queued(), to consistently use the
WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)
Isolated CPUs support enhancements: (Waiman Long)
- Make "isolcpus=nohz" equivalent to "nohz_full"
- Consolidate housekeeping cpumasks that are always identical
- Remove HK_TYPE_SCHED
- Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
RSEQ enhancements:
- Validate read-only fields under DEBUG_RSEQ config (Mathieu
Desnoyers)
PSI enhancements:
- Fix race when task wakes up before psi_sched_switch() adjusts flags
(Chengming Zhou)
IRQ time accounting performance enhancements: (Yafang Shao)
- Define sched_clock_irqtime as static key
- Don't account irq time if sched_clock_irqtime is disabled
Virtual machine scheduling enhancements:
- Don't try to catch up excess steal time (Suleiman Souhlal)
Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak)
- Convert "sysctl_sched_itmt_enabled" to boolean
- Use guard() for itmt_update_mutex
- Move the "sched_itmt_enabled" sysctl to debugfs
- Remove x86_smt_flags and use cpu_smt_flags directly
- Use x86_sched_itmt_flags for PKG domain unconditionally
Debugging code & instrumentation enhancements:
- Change need_resched warnings to pr_err() (David Rientjes)
- Print domain name in /proc/schedstat (K Prateek Nayak)
- Fix value reported by hot tasks pulled in /proc/schedstat (Peter
Zijlstra)
- Report the different kinds of imbalances in /proc/schedstat
(Swapnil Sapkal)
- Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
- Update Schedstat version to 17 (Swapnil Sapkal)"
* tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
rseq: Fix rseq unregistration regression
psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
sched, psi: Don't account irq time if sched_clock_irqtime is disabled
sched: Don't account irq time if sched_clock_irqtime is disabled
sched: Define sched_clock_irqtime as static key
sched/fair: Do not compute overloaded status unnecessarily during lb
sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
x86/itmt: Use guard() for itmt_update_mutex
x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
sched/debug: Change need_resched warnings to pr_err
sched/fair: Encapsulate set custom slice in a __setparam_fair() function
sched: Fix race between yield_to() and try_to_wake_up()
docs: Update Schedstat version to 17
sched/stats: Print domain name in /proc/schedstat
sched: Move sched domain name out of CONFIG_SCHED_DEBUG
sched: Report the different kinds of imbalances in /proc/schedstat
...
set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
entity is delayed irrespective of whether the entity corresponds to a
task or a cfs_rq.
Consider the following scenario:
root
/ \
A B (*) delayed since B is no longer eligible on root
| |
Task0 Task1 <--- dequeue_task_fair() - task blocks
When Task1 blocks (dequeue_entity() for task's se returns true),
dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
hierarchy of Task1. However, when the sched_entity corresponding to
cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
hierarchy too leading to both dequeue_entity() and set_delayed()
decrementing h_nr_runnable for the dequeue of the same task.
A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
dequeue_entities() like below:
cfs_rq->h_nr_runnable -= h_nr_runnable;
SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
is consistently tripped when running wakeup intensive workloads like
hackbench in a cgroup.
This error is self correcting since cfs_rq are per-cpu and cannot
migrate. The entitiy is either picked for full dequeue or is requeued
when a task wakes up below it. Both those paths call clear_delayed()
which again increments h_nr_runnable of the hierarchy without
considering if the entity corresponds to a task or not.
h_nr_runnable will eventually reflect the correct value however in the
interim, the incorrect values can still influence PELT calculation which
uses se->runnable_weight or cfs_rq->h_nr_runnable.
Since only delayed tasks take the early return path in
dequeue_entities() and enqueue_task_fair(), adjust the
h_nr_runnable in {set,clear}_delayed() only when a task is delayed as
this path skips the h_nr_* update loops and returns early.
For entities corresponding to cfs_rq, the h_nr_* update loop in the
caller will do the right thing.
Fixes: 76f2f78329 ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Link: https://lkml.kernel.org/r/20250117105852.23908-1-kprateek.nayak@amd.com
Merge cpufreq updates for 6.14:
- Use str_enable_disable()-like helpers in cpufreq (Krzysztof
Kozlowski).
- Extend the Apple cpufreq driver to support more SoCs (Hector Martin,
Nick Chan).
- Add new cpufreq driver for Airoha SoCs (Christian Marangi).
- Fix using cpufreq-dt as module (Andreas Kemnade).
- Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter
Edwards, Sibi Sankar, Manivannan Sadhasivam).
- Fix the maximum supported frequency computation in the ACPI cpufreq
driver to avoid relying on unfounded assumptions (Gautham Shenoy).
- Fix an amd-pstate driver regression with preferred core rankings not
being used (Mario Limonciello).
- Fix a precision issue with frequency calculation in the amd-pstate
driver (Naresh Solanki).
- Add ftrace event to the amd-pstate driver for active mode (Mario
Limonciello).
- Set default EPP policy on Ryzen processors in amd-pstate (Mario
Limonciello).
- Clean up the amd-pstate cpufreq driver and optimize it to increase
code reuse (Mario Limonciello, Dhananjay Ugwekar).
- Use CPPC to get scaling factors between HWP performance levels and
frequency in the intel_pstate driver and make it stop using a built
-in scaling factor for the Arrow Lake processor (Rafael Wysocki).
- Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN for
consistency with CPU offline (Christian Loehle).
- Fix superfluous updates caused by need_freq_update in the schedutil
cpufreq governor (Sultan Alsawaf).
* pm-cpufreq: (40 commits)
cpufreq: Use str_enable_disable()-like helpers
cpufreq: airoha: Add EN7581 CPUFreq SMCCC driver
cpufreq: ACPI: Fix max-frequency computation
cpufreq/amd-pstate: Refactor max frequency calculation
cpufreq/amd-pstate: Fix prefcore rankings
cpufreq: sparc: change kzalloc to kcalloc
cpufreq: qcom: Implement clk_ops::determine_rate() for qcom_cpufreq* clocks
cpufreq: qcom: Fix qcom_cpufreq_hw_recalc_rate() to query LUT if LMh IRQ is not available
cpufreq: apple-soc: Add Apple A7-A8X SoC cpufreq support
cpufreq: apple-soc: Set fallback transition latency to APPLE_DVFS_TRANSITION_TIMEOUT
cpufreq: apple-soc: Increase cluster switch timeout to 400us
cpufreq: apple-soc: Use 32-bit read for status register
cpufreq: apple-soc: Allow per-SoC configuration of APPLE_DVFS_CMD_PS1
cpufreq: apple-soc: Drop setting the PS2 field on M2+
dt-bindings: cpufreq: apple,cluster-cpufreq: Add A7-A11, T2 compatibles
dt-bindings: cpufreq: Document support for Airoha EN7581 CPUFreq
cpufreq: fix using cpufreq-dt as module
cpufreq: scmi: Register for limit change notifications
cpufreq: schedutil: Fix superfluous updates caused by need_freq_update
cpufreq: intel_pstate: Use CPUFREQ_POLICY_UNKNOWN
...
Merge updates related to system sleep, a cpuidle update and an Energy
Model handling code update for 6.14-rc1:
- Allow configuring the system suspend-resume (DPM) watchdog to warn
earlier than panic (Douglas Anderson).
- Implement devm_device_init_wakeup() helper and introduce a device-
managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan).
- Remove direct inclusions of 'pm_wakeup.h' which should be only
included via 'device.h' (Wolfram Sang).
- Clean up two comments in the core system-wide PM code (Rafael
Wysocki, Randy Dunlap).
- Add Clearwater Forest processor support to the intel_idle cpuidle
driver (Artem Bityutskiy).
- Move sched domains rebuild function from the schedutil cpufreq
governor to the Energy Model handling code (Rafael Wysocki).
* pm-sleep:
PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq()
PM: sleep: Allow configuring the DPM watchdog to warn earlier than panic
PM: sleep: convert comment from kernel-doc to plain comment
PM: wakeup: implement devm_device_init_wakeup() helper
PM: sleep: sysfs: don't include 'pm_wakeup.h' directly
PM: sleep: autosleep: don't include 'pm_wakeup.h' directly
PM: sleep: Update stale comment in device_resume()
* pm-cpuidle:
intel_idle: add Clearwater Forest SoC support
* pm-em:
PM: EM: Move sched domains rebuild function from schedutil to EM
artifacts
- Avoid scheduling lag by computing lag properly and thus address an EEVDF
entity placement issue
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmeM2+8ACgkQEsHwGGHe
VUrqHA/+PyPsILNfDFnaXD/Shv7kCgk3hH2YVrCnyIQn8fhTCL+K0luXRRmAPhl1
pEbRyEtVRXZM9WyHJ1r7VmnIHVanUKWcuZzAx8HT9b7Tkrqy/xSriH/jAV7FYG9h
gIhWA4vXT34HG2MQHTecJ/4gvkeI5tV93nkZ6vwEwwN4pGxSvijPR3yr3yl286Nt
BlHP9cXBWWsxTq4CYb2j5q9nmymKuiDmhD8jc1fVjDKxjmrNHEwX5mtvtRLGI1dR
O4pjUEqYQ+TWIkX8ZIRBPzLd9b6750ncO/9yb24cY7Z3RMKov4wz8b819Dt77cTp
XF+bT+8fsaKNRjkzPEseZU4OL16ZO+33Kcn+JoNPWvbONgFKusZ4qkCCwvWpiQlV
0Eddh1XjKBoXA1tR6VrREcUKQ2zWoXrn8tlHUTMfPuPq1jlbQNtzN7OWcR+WBy7L
FiClWzWLv0fCeTXaAEcO9+/MN5z2lDQsJRiLtAlGh0JJU6/1H6IVX2tSZllkpR/R
5pyHCpYsh5cucx6cQPk+rp9O6PDYu2frMuJ8itWlQLHeSzjzOoeE6EwwZcMQG4xb
UG/azMbmqWrtmZqCgNCBtq7RAkVRe0IuxWuCtV1VkatU+2RXdtV85tVKn/JG2KLW
05c8GTdQZgnoY65/TZHwudhkzytclynKrrvhfKFllAWnb8D8gyQ=
=fZWm
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Do not adjust the weight of empty group entities and avoid
scheduling artifacts
- Avoid scheduling lag by computing lag properly and thus address
an EEVDF entity placement issue
* tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE
sched/fair: Fix EEVDF entity placement bug causing scheduling lag
CPU unplug first calls __cpu_disable(), and that's where powerpc calls
cleanup_cpu_mmu_context(), which clears this CPU from mm_cpumask() of all
mms in the system.
However this CPU may still be using a lazy tlb mm, and its mm_cpumask bit
will be cleared from it. The CPU does not switch away from the lazy tlb
mm until arch_cpu_idle_dead() calls idle_task_exit().
If that user mm exits in this window, it will not be subject to the lazy
tlb mm shootdown and may be freed while in use as a lazy mm by the CPU
that is being unplugged.
cleanup_cpu_mmu_context() could be moved later, but it looks better to
move the lazy tlb mm switching earlier. The problem with doing the lazy
mm switching in idle_task_exit() is explained in commit bf2c59fce4
("sched/core: Fix illegal RCU from offline CPUs"), which added a wart to
switch away from the mm but leave it set in active_mm to be cleaned up
later.
So instead, switch away from the lazy tlb mm at sched_cpu_wait_empty(),
which is the last hotplug state before teardown
(CPUHP_AP_SCHED_WAIT_EMPTY). This CPU will never switch to a user thread
from this point, so it has no chance to pick up a new lazy tlb mm. This
removes the lazy tlb mm handling wart in CPU unplug.
With this, idle_task_exit() is not needed anymore and can be cleaned up.
This leaves the prototype alone, to be cleaned after this change.
herton: took the suggestions from https://lore.kernel.org/all/87jzvyprsw.ffs@tglx/
and made adjustments on the initial patch proposed by Nicholas.
Link: https://lkml.kernel.org/r/20230524060455.147699-1-npiggin@gmail.com
Link: https://lore.kernel.org/all/20230525205253.E2FAEC433EF@smtp.kernel.org/
Link: https://lkml.kernel.org/r/20241104142318.3295663-1-herton@redhat.com
Fixes: 2655421ae6 ("lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kasan_record_aux_stack_noalloc() was introduced to record a stack trace
without allocating memory in the process. It has been added to callers
which were invoked while a raw_spinlock_t was held. More and more callers
were identified and changed over time. Is it a good thing to have this
while functions try their best to do a locklessly setup? The only
downside of having kasan_record_aux_stack() not allocate any memory is
that we end up without a stacktrace if stackdepot runs out of memory and
at the same stacktrace was not recorded before To quote Marco Elver from
https://lore.kernel.org/all/CANpmjNPmQYJ7pv1N3cuU8cP18u7PP_uoZD8YxwZd4jtbof9nVQ@mail.gmail.com/
| I'd be in favor, it simplifies things. And stack depot should be
| able to replenish its pool sufficiently in the "non-aux" cases
| i.e. regular allocations. Worst case we fail to record some
| aux stacks, but I think that's only really bad if there's a bug
| around one of these allocations. In general the probabilities
| of this being a regression are extremely small [...]
Make the kasan_record_aux_stack_noalloc() behaviour default as
kasan_record_aux_stack().
[bigeasy@linutronix.de: dressed the diff as patch]
Link: https://lkml.kernel.org/r/20241122155451.Mb2pmeyJ@linutronix.de
Fixes: 7cb3007ce2 ("kasan: generic: introduce kasan_record_aux_stack_noalloc()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reported-by: syzbot+39f85d612b7c20d8db48@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67275485.050a0220.3c8d68.0a37.GAE@google.com
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: syzkaller-bugs@googlegroups.com
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the correct function parameter names and function names.
Use the correct kernel-doc comment format for struct sched_ext_ops
to eliminate a bunch of warnings.
ext.c:1418: warning: Excess function parameter 'include_dead' description in 'scx_task_iter_next_locked'
ext.c:7261: warning: expecting prototype for scx_bpf_dump(). Prototype was for scx_bpf_dump_bstr() instead
ext.c:7352: warning: Excess function parameter 'flags' description in 'scx_bpf_cpuperf_set'
ext.c:3150: warning: Function parameter or struct member 'in_fi' not described in 'scx_prio_less'
ext.c:4711: warning: Function parameter or struct member 'dur_s' not described in 'scx_softlockup'
ext.c:4775: warning: Function parameter or struct member 'bypass' not described in 'scx_ops_bypass'
ext.c:7453: warning: Function parameter or struct member 'idle_mask' not described in 'scx_bpf_put_idle_cpumask'
ext.c:209: warning: Incorrect use of kernel-doc format: * select_cpu - Pick the target CPU for a task which is being woken up
ext.c:236: warning: Incorrect use of kernel-doc format: * enqueue - Enqueue a task on the BPF scheduler
ext.c:251: warning: Incorrect use of kernel-doc format: * dequeue - Remove a task from the BPF scheduler
ext.c:267: warning: Incorrect use of kernel-doc format: * dispatch - Dispatch tasks from the BPF scheduler and/or user DSQs
ext.c:290: warning: Incorrect use of kernel-doc format: * tick - Periodic tick
ext.c:300: warning: Incorrect use of kernel-doc format: * runnable - A task is becoming runnable on its associated CPU
ext.c:327: warning: Incorrect use of kernel-doc format: * running - A task is starting to run on its associated CPU
ext.c:335: warning: Incorrect use of kernel-doc format: * stopping - A task is stopping execution
ext.c:346: warning: Incorrect use of kernel-doc format: * quiescent - A task is becoming not runnable on its associated CPU
ext.c:366: warning: Incorrect use of kernel-doc format: * yield - Yield CPU
ext.c:381: warning: Incorrect use of kernel-doc format: * core_sched_before - Task ordering for core-sched
ext.c:399: warning: Incorrect use of kernel-doc format: * set_weight - Set task weight
ext.c:408: warning: Incorrect use of kernel-doc format: * set_cpumask - Set CPU affinity
ext.c:418: warning: Incorrect use of kernel-doc format: * update_idle - Update the idle state of a CPU
ext.c:439: warning: Incorrect use of kernel-doc format: * cpu_acquire - A CPU is becoming available to the BPF scheduler
ext.c:449: warning: Incorrect use of kernel-doc format: * cpu_release - A CPU is taken away from the BPF scheduler
ext.c:461: warning: Incorrect use of kernel-doc format: * init_task - Initialize a task to run in a BPF scheduler
ext.c:476: warning: Incorrect use of kernel-doc format: * exit_task - Exit a previously-running task from the system
ext.c:485: warning: Incorrect use of kernel-doc format: * enable - Enable BPF scheduling for a task
ext.c:494: warning: Incorrect use of kernel-doc format: * disable - Disable BPF scheduling for a task
ext.c:504: warning: Incorrect use of kernel-doc format: * dump - Dump BPF scheduler state on error
ext.c:512: warning: Incorrect use of kernel-doc format: * dump_cpu - Dump BPF scheduler state for a CPU on error
ext.c:524: warning: Incorrect use of kernel-doc format: * dump_task - Dump BPF scheduler state for a runnable task on error
ext.c:535: warning: Incorrect use of kernel-doc format: * cgroup_init - Initialize a cgroup
ext.c:550: warning: Incorrect use of kernel-doc format: * cgroup_exit - Exit a cgroup
ext.c:559: warning: Incorrect use of kernel-doc format: * cgroup_prep_move - Prepare a task to be moved to a different cgroup
ext.c:574: warning: Incorrect use of kernel-doc format: * cgroup_move - Commit cgroup move
ext.c:585: warning: Incorrect use of kernel-doc format: * cgroup_cancel_move - Cancel cgroup move
ext.c:597: warning: Incorrect use of kernel-doc format: * cgroup_set_weight - A cgroup's weight is being changed
ext.c:611: warning: Incorrect use of kernel-doc format: * cpu_online - A CPU became online
ext.c:620: warning: Incorrect use of kernel-doc format: * cpu_offline - A CPU is going offline
ext.c:633: warning: Incorrect use of kernel-doc format: * init - Initialize the BPF scheduler
ext.c:638: warning: Incorrect use of kernel-doc format: * exit - Clean up after the BPF scheduler
ext.c:648: warning: Incorrect use of kernel-doc format: * dispatch_max_batch - Max nr of tasks that dispatch() can dispatch
ext.c:653: warning: Incorrect use of kernel-doc format: * flags - %SCX_OPS_* flags
ext.c:658: warning: Incorrect use of kernel-doc format: * timeout_ms - The maximum amount of time, in milliseconds, that a
ext.c:667: warning: Incorrect use of kernel-doc format: * exit_dump_len - scx_exit_info.dump buffer length. If 0, the default
ext.c:673: warning: Incorrect use of kernel-doc format: * hotplug_seq - A sequence number that may be set by the scheduler to
ext.c:682: warning: Incorrect use of kernel-doc format: * name - BPF scheduler's name
ext.c:689: warning: Function parameter or struct member 'select_cpu' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'enqueue' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dequeue' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dispatch' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'tick' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'runnable' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'running' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'stopping' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'quiescent' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'yield' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'core_sched_before' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'set_weight' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'set_cpumask' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'update_idle' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cpu_acquire' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cpu_release' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'init_task' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'exit_task' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'enable' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'disable' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dump' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dump_cpu' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dump_task' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_init' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_exit' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_prep_move' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_move' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_cancel_move' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cgroup_set_weight' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cpu_online' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'cpu_offline' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'init' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'exit' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'dispatch_max_batch' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'flags' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'timeout_ms' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'exit_dump_len' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'hotplug_seq' not described in 'sched_ext_ops'
ext.c:689: warning: Function parameter or struct member 'name' not described in 'sched_ext_ops'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <changwoo@igalia.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When running hackbench in a cgroup with bandwidth throttling enabled,
following PSI splat was observed:
psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
When investigating the series of events leading up to the splat,
following sequence was observed:
[008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
...
[008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
[008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
# CPU8 goes into newidle balance and releases the rq lock
...
# CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
[015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
[015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
[008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...
psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
for the blocked entity, however, with the introduction of DELAY_DEQUEUE,
the block task can wakeup when newidle balance drops the runqueue lock
during __schedule().
If a task wakes before psi_sched_switch() adjusts the PSI flags, skip
any modifications in psi_enqueue() which would still see the flags of a
running task and not a blocked one. Instead, rely on psi_sched_switch()
to do the right thing.
Since the status returned by try_to_block_task() may no longer be true
by the time schedule reaches psi_sched_switch(), check if the task is
blocked or not using a combination of task_on_rq_queued() and
p->se.sched_delayed checks.
[ prateek: Commit message, testing, early bailout in psi_enqueue() ]
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue") # 1a6151017e
Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Link: https://lore.kernel.org/r/20241227061941.2315-1-kprateek.nayak@amd.com
sched_clock_irqtime may be disabled due to the clock source. When disabled,
irq_time_read() won't change over time, so there is nothing to account. We
can save iterating the whole hierarchy on every tick and context switch.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20250103022409.2544-4-laoar.shao@gmail.com
sched_clock_irqtime may be disabled due to the clock source, in which case
IRQ time should not be accounted. Let's add a conditional check to avoid
unnecessary logic.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250103022409.2544-3-laoar.shao@gmail.com
Since CPU time accounting is a performance-critical path, let's define
sched_clock_irqtime as a static key to minimize potential overhead.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250103022409.2544-2-laoar.shao@gmail.com
Only set sg_overloaded when computing sg_lb_stats() at the highest sched
domain since rd->overloaded status is updated only when load balancing
at the highest domain. While at it, move setting of sg_overloaded below
idle_cpu() check since an idle CPU can never be overloaded.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20241223043407.1611-8-kprateek.nayak@amd.com
Aggregate nr_numa_running and nr_preferred_running when load balancing
at NUMA domains only. While at it, also move the aggregation below the
idle_cpu() check since an idle CPU cannot have any preferred tasks.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20241223043407.1611-7-kprateek.nayak@amd.com
When the PLACE_LAG scheduling feature is enabled and
dst_cfs_rq->nr_queued is greater than 1, if a task is
ineligible (lag < 0) on the source cpu runqueue, it will
also be ineligible when it is migrated to the destination
cpu runqueue. Because we will keep the original equivalent
lag of the task in place_entity(). So if the task was
ineligible before, it will still be ineligible after
migration.
So in sched_balance_rq(), we prioritize migrating eligible
tasks, and we soft-limit ineligible tasks, allowing them
to migrate only when nr_balance_failed is non-zero to
avoid load-balancing trying very hard to balance the load.
Below are some benchmark test results. From my test results,
this patch shows a slight improvement on hackbench.
Benchmark
=========
All of the benchmarks are done inside a normal cpu cgroup in a
clean environment with cpu turbo disabled, and test machine is:
Single NUMA machine model is 13th Gen Intel(R) Core(TM)
i7-13700, 12 Core/24 HT.
Based on master b86545e02e.
Results
=======
hackbench-process-pipes
vanilla patched
Amean 1 0.5837 ( 0.00%) 0.5733 ( 1.77%)
Amean 4 1.4423 ( 0.00%) 1.4503 ( -0.55%)
Amean 7 2.5147 ( 0.00%) 2.4773 ( 1.48%)
Amean 12 3.9347 ( 0.00%) 3.8880 ( 1.19%)
Amean 21 5.3943 ( 0.00%) 5.3873 ( 0.13%)
Amean 30 6.7840 ( 0.00%) 6.6660 ( 1.74%)
Amean 48 9.8313 ( 0.00%) 9.6100 ( 2.25%)
Amean 79 15.4403 ( 0.00%) 14.9580 ( 3.12%)
Amean 96 18.4970 ( 0.00%) 17.9533 ( 2.94%)
hackbench-process-sockets
vanilla patched
Amean 1 0.6297 ( 0.00%) 0.6223 ( 1.16%)
Amean 4 2.1517 ( 0.00%) 2.0887 ( 2.93%)
Amean 7 3.6377 ( 0.00%) 3.5670 ( 1.94%)
Amean 12 6.1277 ( 0.00%) 5.9290 ( 3.24%)
Amean 21 10.0380 ( 0.00%) 9.7623 ( 2.75%)
Amean 30 14.1517 ( 0.00%) 13.7513 ( 2.83%)
Amean 48 24.7253 ( 0.00%) 24.2287 ( 2.01%)
Amean 79 43.9523 ( 0.00%) 43.2330 ( 1.64%)
Amean 96 54.5310 ( 0.00%) 53.7650 ( 1.40%)
tbench4 Throughput
vanilla patched
Hmean 1 255.97 ( 0.00%) 275.01 ( 7.44%)
Hmean 2 511.60 ( 0.00%) 544.27 ( 6.39%)
Hmean 4 996.70 ( 0.00%) 1006.57 ( 0.99%)
Hmean 8 1646.46 ( 0.00%) 1649.15 ( 0.16%)
Hmean 16 2259.42 ( 0.00%) 2274.35 ( 0.66%)
Hmean 32 4725.48 ( 0.00%) 4735.57 ( 0.21%)
Hmean 64 4411.47 ( 0.00%) 4400.05 ( -0.26%)
Hmean 96 4284.31 ( 0.00%) 4267.39 ( -0.39%)
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241223091446.90208-1-jiahao.kernel@gmail.com
need_resched warnings, if enabled, are treated as WARNINGs. If
kernel.panic_on_warn is enabled, then this causes a kernel panic.
It's highly unlikely that a panic is desired for these warnings, only a
stack trace is normally required to debug and resolve.
Thus, switch need_resched warnings to simply be a printk with an
associated stack trace so they are no longer in scope for panic_on_warn.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Acked-by: Josh Don <joshdon@google.com>
Link: https://lkml.kernel.org/r/e8d52023-5291-26bd-5299-8bb9eb604929@google.com
Similarly to dl, create a __setparam_fair() function to set parameters
related to fair class and move it in the fair.c file.
No functional changes expected
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20250110144656.484601-1-vincent.guittot@linaro.org
We met a SCHED_WARN in set_next_buddy():
__warn_printk
set_next_buddy
yield_to_task_fair
yield_to
kvm_vcpu_yield_to [kvm]
...
After a short dig, we found the rq_lock held by yield_to() may not
be exactly the rq that the target task belongs to. There is a race
window against try_to_wake_up().
CPU0 target_task
blocking on CPU1
lock rq0 & rq1
double check task_rq == p_rq, ok
woken to CPU2 (lock task_pi & rq2)
task_rq = rq2
yield_to_task_fair (w/o lock rq2)
In this race window, yield_to() is operating the task w/o the correct
lock. Fix this by taking task pi_lock first.
Fixes: d95f412200 ("sched: Add yield_to(task, preempt) functionality")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241231055020.6521-1-dtcccc@linux.alibaba.com
Normally dequeue_entities() will continue to dequeue an empty group entity;
except DELAY_DEQUEUE changes things -- it retains empty entities such that they
might continue to compete and burn off some lag.
However, doing this results in update_cfs_group() re-computing the cgroup
weight 'slice' for an empty group, which it (rightly) figures isn't much at
all. This in turn means that the delayed entity is not competing at the
expected weight. Worse, the very low weight causes its lag to be inflated,
which combined with avg_vruntime() using scale_load_down(), leads to artifacts.
As such, don't adjust the weight for empty group entities and let them compete
at their original weight.
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250110115720.GA17405@noisy.programming.kicks-ass.net
Delay accounting can now calculate the average delay of processes, detect
the overall system load, and also record the 'delay max' to identify
potential abnormal delays. However, 'delay min' can help us identify
another useful delay peak. By comparing the difference between 'delay
max' and 'delay min', we can understand the optimization space for
latency, providing a reference for the optimization of latency
performance.
Use case
=========
bash-4.4# ./getdelays -d -t 242
print delayacct stats ON
TGID 242
CPU count real total virtual total delay total delay average delay max delay min
39 156000000 156576579 2111069 0.054ms 0.212296ms 0.031307ms
IO count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
SWAP count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
RECLAIM count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
THRASHING count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
COMPACT count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
WPCOPY count delay total delay average delay max delay min
156 11215873 0.072ms 0.207403ms 0.033913ms
IRQ count delay total delay average delay max delay min
0 0 0.000ms 0.000000ms 0.000000ms
Link: https://lkml.kernel.org/r/20241220173105906EOdsPhzjMLYNJJBqgz1ga@zte.com.cn
Co-developed-by: Wang Yong <wang.yong12@zte.com.cn>
Signed-off-by: Wang Yong <wang.yong12@zte.com.cn>
Co-developed-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Co-developed-by: Kun Jiang <jiang.kun2@zte.com.cn>
Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Peilin He <he.peilin@zte.com.cn>
Cc: tuqiang <tu.qiang35@zte.com.cn>
Cc: ye xingchen <ye.xingchen@zte.com.cn>
Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce the use cases of delay max, which can help quickly detect
potential abnormal delays in the system and record the types and specific
details of delay spikes.
Problem
========
Delay accounting can track the average delay of processes to show
system workload. However, when a process experiences a significant
delay, maybe a delay spike, which adversely affects performance,
getdelays can only display the average system delay over a period
of time. Yet, average delay is unhelpful for diagnosing delay peak.
It is not even possible to determine which type of delay has spiked,
as this information might be masked by the average delay.
Solution
=========
the 'delay max' can display delay peak since the system's startup,
which can record potential abnormal delays over time, including
the type of delay and the maximum delay. This is helpful for
quickly identifying crash caused by delay.
Use case
=========
bash# ./getdelays -d -p 244
print delayacct stats ON
PID 244
CPU count real total virtual total delay total delay average delay max
68 192000000 213676651 705643 0.010ms 0.306381ms
IO count delay total delay average delay max
0 0 0.000ms 0.000000ms
SWAP count delay total delay average delay max
0 0 0.000ms 0.000000ms
RECLAIM count delay total delay average delay max
0 0 0.000ms 0.000000ms
THRASHING count delay total delay average delay max
0 0 0.000ms 0.000000ms
COMPACT count delay total delay average delay max
0 0 0.000ms 0.000000ms
WPCOPY count delay total delay average delay max
235 15648284 0.067ms 0.263842ms
IRQ count delay total delay average delay max
0 0 0.000ms 0.000000ms
[wang.yaxin@zte.com.cn: update docs and fix some spelling errors]
Link: https://lkml.kernel.org/r/20241213192700771XKZ8H30OtHSeziGqRVMs0@zte.com.cn
Link: https://lkml.kernel.org/r/20241203164848805CS62CQPQWG9GLdQj2_BxS@zte.com.cn
Co-developed-by: Wang Yong <wang.yong12@zte.com.cn>
Signed-off-by: Wang Yong <wang.yong12@zte.com.cn>
Co-developed-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Co-developed-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Peilin He <he.peilin@zte.com.cn>
Cc: tuqiang <tu.qiang35@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: ye xingchen <ye.xingchen@zte.com.cn>
Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Fix corner case bug where ops.dispatch() couldn't extend the execution of
the current task if SCX_OPS_ENQ_LAST is set.
- Fix ops.cpu_release() not being called when a SCX task is preempted by a
higher priority sched class task.
- Fix buitin idle mask being incorrectly left as busy after an idle CPU is
picked and kicked.
- scx_ops_bypass() was unnecessarily using rq_lock() which comes with rq
pinning related sanity checks which could trigger spuriously. Switch to
raw_spin_rq_lock().
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ4Gmpw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGVntAP0b4i4PEIkupj9+i8ZzlwqvYX3gFJ7E4v3wmjDp
1VYdrAD/ZetrhrM+9RyyKpMIDFnN+xE6YbslBSlAzGzgfdsbXA0=
=zGXi
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix corner case bug where ops.dispatch() couldn't extend the
execution of the current task if SCX_OPS_ENQ_LAST is set.
- Fix ops.cpu_release() not being called when a SCX task is preempted
by a higher priority sched class task.
- Fix buitin idle mask being incorrectly left as busy after an idle CPU
is picked and kicked.
- scx_ops_bypass() was unnecessarily using rq_lock() which comes with
rq pinning related sanity checks which could trigger spuriously.
Switch to raw_spin_rq_lock().
* tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: idle: Refresh idle masks during idle-to-idle transitions
sched_ext: switch class when preempted by higher priority scheduler
sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass()
sched_ext: keep running prev when prev->scx.slice != 0
With the consolidation of put_prev_task/set_next_task(), see
commit 436f3eed5c ("sched: Combine the last put_prev_task() and the
first set_next_task()"), we are now skipping the transition between
these two functions when the previous and the next tasks are the same.
As a result, the scx idle state of a CPU is updated only when
transitioning to or from the idle thread. While this is generally
correct, it can lead to uneven and inefficient core utilization in
certain scenarios [1].
A typical scenario involves proactive wake-ups: scx_bpf_pick_idle_cpu()
selects and marks an idle CPU as busy, followed by a wake-up via
scx_bpf_kick_cpu(), without dispatching any tasks. In this case, the CPU
continues running the idle thread, returns to idle, but remains marked
as busy, preventing it from being selected again as an idle CPU (until a
task eventually runs on it and releases the CPU).
For example, running a workload that uses 20% of each CPU, combined with
an scx scheduler using proactive wake-ups, results in the following core
utilization:
CPU 0: 25.7%
CPU 1: 29.3%
CPU 2: 26.5%
CPU 3: 25.5%
CPU 4: 0.0%
CPU 5: 25.5%
CPU 6: 0.0%
CPU 7: 10.5%
To address this, refresh the idle state also in pick_task_idle(), during
idle-to-idle transitions, but only trigger ops.update_idle() on actual
state changes to prevent unnecessary updates to the scx scheduler and
maintain balanced state transitions.
With this change in place, the core utilization in the previous example
becomes the following:
CPU 0: 18.8%
CPU 1: 19.4%
CPU 2: 18.0%
CPU 3: 18.7%
CPU 4: 19.3%
CPU 5: 18.9%
CPU 6: 18.7%
CPU 7: 19.3%
[1] https://github.com/sched-ext/scx/pull/1139
Fixes: 7c65ae81ea ("sched_ext: Don't call put_prev_task_scx() before picking the next task")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Returns a high-performance monotonically non-decreasing clock for the current
CPU. The clock returned is in nanoseconds.
It provides the following properties:
1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently
to account for execution time and track tasks' runtime properties.
Unfortunately, in some hardware platforms, bpf_ktime_get_ns() -- which
eventually reads a hardware timestamp counter -- is neither performant nor
scalable. scx_bpf_now() aims to provide a high-performance clock by
using the rq clock in the scheduler core whenever possible.
2) High enough resolution for the BPF scheduler use cases: In most BPF
scheduler use cases, the required clock resolution is lower than the most
accurate hardware clock (e.g., rdtsc in x86). scx_bpf_now() basically
uses the rq clock in the scheduler core whenever it is valid. It considers
that the rq clock is valid from the time the rq clock is updated
(update_rq_clock) until the rq is unlocked (rq_unpin_lock).
3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now()
guarantees the clock never goes backward when comparing them in the same
CPU. On the other hand, when comparing clocks in different CPUs, there
is no such guarantee -- the clock can go backward. It provides a
monotonically *non-decreasing* clock so that it would provide the same
clock values in two different scx_bpf_now() calls in the same CPU
during the same period of when the rq clock is valid.
An rq clock becomes valid when it is updated using update_rq_clock()
and invalidated when the rq is unlocked using rq_unpin_lock().
Let's suppose the following timeline in the scheduler core:
T1. rq_lock(rq)
T2. update_rq_clock(rq)
T3. a sched_ext BPF operation
T4. rq_unlock(rq)
T5. a sched_ext BPF operation
T6. rq_lock(rq)
T7. update_rq_clock(rq)
For [T2, T4), we consider that rq clock is valid (SCX_RQ_CLK_VALID is
set), so scx_bpf_now() calls during [T2, T4) (including T3) will
return the rq clock updated at T2. For duration [T4, T7), when a BPF
scheduler can still call scx_bpf_now() (T5), we consider the rq clock
is invalid (SCX_RQ_CLK_VALID is unset at T4). So when calling
scx_bpf_now() at T5, we will return a fresh clock value by calling
sched_clock_cpu() internally. Also, to prevent getting outdated rq clocks
from a previous scx scheduler, invalidate all the rq clocks when unloading
a BPF scheduler.
One example of calling scx_bpf_now(), when the rq clock is invalid
(like T5), is in scx_central [1]. The scx_central scheduler uses a BPF
timer for preemptive scheduling. In every msec, the timer callback checks
if the currently running tasks exceed their timeslice. At the beginning of
the BPF timer callback (central_timerfn in scx_central.bpf.c), scx_central
gets the current time. When the BPF timer callback runs, the rq clock could
be invalid, the same as T5. In this case, scx_bpf_now() returns a fresh
clock value rather than returning the old one (T2).
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_central.bpf.c
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_enabled() will be used in scx_rq_clock_update/invalidate()
in the following patch, so relocate the scx_enabled() related code
to the proper location.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
I noticed this in my traces today:
turbostat-1222 [006] d..2. 311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
{ weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } ->
{ weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 }
turbostat-1222 [006] d..2. 311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
{ weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } ->
{ weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 }
Which is a weight transition: 1048576 -> 2 -> 1048576.
One would expect the lag to shoot out *AND* come back, notably:
-1123*1048576/2 = -588775424
-588775424*2/1048576 = -1123
Except the trace shows it is all off. Worse, subsequent cycles shoot it
out further and further.
This made me have a very hard look at reweight_entity(), and
specifically the ->on_rq case, which is more prominent with
DELAY_DEQUEUE.
And indeed, it is all sorts of broken. While the computation of the new
lag is correct, the computation for the new vruntime, using the new lag
is broken for it does not consider the logic set out in place_entity().
With the below patch, I now see things like:
migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
{ weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
{ weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
{ weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
{ weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }
Which isn't perfect yet, but much closer.
Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: eab03c23c2 ("sched/eevdf: Fix vruntime adjustment on reweight")
Link: https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.net
kthread_create() creates a kthread without running it yet. kthread_run()
creates a kthread and runs it.
On the other hand, kthread_create_worker() creates a kthread worker and
runs it.
This difference in behaviours is confusing. Also there is no way to
create a kthread worker and affine it using kthread_bind_mask() or
kthread_affine_preferred() before starting it.
Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]()
that behaves just like kthread_run(). kthread_create_worker[_on_cpu]()
will now only create a kthread worker without starting it.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
When a kthread or any other task has an affinity mask that is fully
offline or unallowed, the scheduler reaffines the task to all possible
CPUs as a last resort.
This default decision doesn't mix up very well with nohz_full CPUs that
are part of the possible cpumask but don't want to be disturbed by
unbound kthreads or even detached pinned user tasks.
Make the fallback affinity setting aware of nohz_full.
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
ops.cpu_release() function, if defined, must be invoked when preempted by
a higher priority scheduler class task. This scenario was skipped in
commit f422316d74 ("sched_ext: Remove switch_class_scx()"). Let's fix
it.
Fixes: f422316d74 ("sched_ext: Remove switch_class_scx()")
Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_ops_bypass() iterates all CPUs to re-enqueue all the scx tasks.
For each CPU, it acquires a lock using rq_lock() regardless of whether
a CPU is offline or the CPU is currently running a task in a higher
scheduler class (e.g., deadline). The rq_lock() is supposed to be used
for online CPUs, and the use of rq_lock() may trigger an unnecessary
warning in rq_pin_lock(). Therefore, replace rq_lock() to
raw_spin_rq_lock() in scx_ops_bypass().
Without this change, we observe the following warning:
===== START =====
[ 6.615205] rq->balance_callback && rq->balance_callback != &balance_push_callback
[ 6.615208] WARNING: CPU: 2 PID: 0 at kernel/sched/sched.h:1730 __schedule+0x1130/0x1c90
===== END =====
Fixes: 0e7ffff1b8 ("scx: Fix raciness in scx_ops_bypass()")
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When %SCX_OPS_ENQ_LAST is set and prev->scx.slice != 0,
@prev will be dispacthed into the local DSQ in put_prev_task_scx().
However, pick_task_scx() is executed before put_prev_task_scx(),
so it will not pick @prev.
Set %SCX_RQ_BAL_KEEP in balance_one() to ensure that pick_task_scx()
can pick @prev.
Signed-off-by: Henry Huang <henry.hj@antgroup.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Report the remaining time slice when dumping task information during an
error exit.
This information can be useful for tracking incorrect or excessively
long time slices in schedulers that implement dynamic time slice logic.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
With commit 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct
dispatches"), scx_bpf_dsq_insert() can use SCX_DSQ_LOCAL_ON for direct
dispatch from ops.enqueue() to target the local DSQ of any CPU.
Update the documentation accordingly.
Fixes: 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Replace the loop to check if all SMT CPUs are idle with
cpumask_subset(). This simplifies the code and slightly improves
efficiency, while preserving the original behavior.
Note that idle_masks.smt handling remains racy, which is acceptable as
it serves as an optimization and is self-correcting.
Suggested-and-reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Fix the bug where bpf_iter_scx_dsq_new() was not initializing the
iterator's flags and could inadvertently enable e.g. reverse iteration.
- Fix the bug where scx_ops_bypass() could call irq_restore twice.
- Add Andrea and Changwoo as maintainers for better review coverage.
- selftests and tools/sched_ext build and other fixes.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ3hpXg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGS/lAQDOZDfcJtO1VEsLoPY9NhFHPuBDTfoJyjSi/4mh
GsjgDAD/Sx0rN6C9S/+ToUjmq3FA+ft0m2+97VqgLwkzwA9YxwI=
=jaZ6
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix a bug where bpf_iter_scx_dsq_new() was not initializing the
iterator's flags and could inadvertently enable e.g. reverse
iteration
- Fix a bug where scx_ops_bypass() could call irq_restore twice
- Add Andrea and Changwoo as maintainers for better review coverage
- selftests and tools/sched_ext build and other fixes
* tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix dsq_local_on selftest
sched_ext: initialize kit->cursor.flags
sched_ext: Fix invalid irq restore in scx_ops_bypass()
MAINTAINERS: add me as reviewer for sched_ext
MAINTAINERS: add self as reviewer for sched_ext
scx: Fix maximal BPF selftest prog
sched_ext: fix application of sizeof to pointer
selftests/sched_ext: fix build after renames in sched_ext API
sched_ext: Add __weak to fix the build errors
Minor refactoring to add a helper function for checking if the built-in
idle CPU selection policy is enabled.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a comments to clarify about the usage of cpumask_intersects().
Moreover, update scx_select_cpu_dfl() description clarifying that the
final step of the idle selection logic involves searching for any idle
CPU in the system that the task can use.
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use the assign_cpu() helper to set or clear the CPU in the idle mask,
based on the idle condition.
Acked-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
struct bpf_iter_scx_dsq *it maybe not initialized.
If we didn't call scx_bpf_dsq_move_set_vtime and scx_bpf_dsq_move_set_slice
before scx_bpf_dsq_move, it would cause unexpected behaviors:
1. Assign a huge slice into p->scx.slice
2. Assign a invalid vtime into p->scx.dsq_vtime
Signed-off-by: Henry Huang <henry.hj@antgroup.com>
Fixes: 6462dd53a2 ("sched_ext: Compact struct bpf_iter_scx_dsq_kern")
Cc: stable@vger.kernel.org # v6.12
Signed-off-by: Tejun Heo <tj@kernel.org>
Update the Schedstat version to 17 as more fields are added to report
different kinds of imbalances in the sched domain. Also domain field
started printing corresponding domain name.
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241220063224.17767-7-swapnil.sapkal@amd.com
Currently, there does not exist a straightforward way to extract the
names of the sched domains and match them to the per-cpu domain entry in
/proc/schedstat other than looking at the debugfs files which are only
visible after enabling "verbose" debug after commit 34320745df
("sched/debug: Put sched/domains files under the verbose flag")
Since tools like `perf sched stats`[1] require displaying per-domain
information in user friendly manner, display the names of sched domain,
alongside their level in /proc/schedstat.
Domain names also makes the /proc/schedstat data unambiguous when some
of the cpus are offline. For example, on a 128 cpus AMD Zen3 machine
where CPU0 and CPU64 are SMT siblings and CPU64 is offline:
Before:
cpu0 ...
domain0 ...
domain1 ...
cpu1 ...
domain0 ...
domain1 ...
domain2 ...
After:
cpu0 ...
domain0 MC ...
domain1 PKG ...
cpu1 ...
domain0 SMT ...
domain1 MC ...
domain2 PKG ...
[1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20241220063224.17767-6-swapnil.sapkal@amd.com
In /proc/schedstat, lb_imbalance reports the sum of imbalances
discovered in sched domains with each call to sched_balance_rq(), which is
not very useful because lb_imbalance does not mention whether the imbalance
is due to load, utilization, nr_tasks or misfit_tasks. Remove this field
from /proc/schedstat.
Currently there is no field in /proc/schedstat to report different types
of imbalances. Introduce new fields in /proc/schedstat to report the
total imbalances in load, utilization, nr_tasks or misfit_tasks.
Added fields to /proc/schedstat:
- lb_imbalance_load: Total imbalance due to load.
- lb_imbalance_util: Total imbalance due to utilization.
- lb_imbalance_task: Total imbalance due to number of tasks.
- lb_imbalance_misfit: Total imbalance due to misfit tasks.
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20241220063224.17767-4-swapnil.sapkal@amd.com
migrate_degrade_locality() would return {1, 0, -1} respectively to
indicate that migration would degrade-locality, would improve
locality, would be ambivalent to locality improvements.
This patch improves readability by changing the return value to mean:
* Any positive value degrades locality
* 0 migration doesn't affect locality
* Any negative value improves locality
[Swapnil: Fixed comments around code and wrote commit log]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241220063224.17767-3-swapnil.sapkal@amd.com
In /proc/schedstat, lb_hot_gained reports the number hot tasks pulled
during load balance. This value is incremented in can_migrate_task()
if the task is migratable and hot. After incrementing the value,
load balancer can still decide not to migrate this task leading to wrong
accounting. Fix this by incrementing stats when hot tasks are detached.
This issue only exists in detach_tasks() where we can decide to not
migrate hot task even if it is migratable. However, in detach_one_task(),
we migrate it unconditionally.
[Swapnil: Handled the case where nr_failed_migrations_hot was not accounted properly and wrote commit log]
Fixes: d31980846f ("sched: Move up affinity check to mitigate useless redoing overhead")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241220063224.17767-2-swapnil.sapkal@amd.com
scheduler_tick() was renamed to sched_tick() in 86dd6c04ef
("sched/balancing: Rename scheduler_tick() => sched_tick()").
Update comments still referring to scheduler_tick.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241219085839.302378-1-bigeasy@linutronix.de
Function sugov_eas_rebuild_sd() defined in the schedutil cpufreq governor
implements generic functionality that may be useful in other places. In
particular, there is a plan to use it in the intel_pstate driver in the
future.
For this reason, move it from schedutil to the energy model code and
rename it to em_rebuild_sched_domains().
This also helps to get rid of some #ifdeffery in schedutil which is a
plus.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
A redundant frequency update is only truly needed when there is a policy
limits change with a driver that specifies CPUFREQ_NEED_UPDATE_LIMITS.
In spite of that, drivers specifying CPUFREQ_NEED_UPDATE_LIMITS receive a
frequency update _all the time_, not just for a policy limits change,
because need_freq_update is never cleared.
Furthermore, ignore_dl_rate_limit()'s usage of need_freq_update also leads
to a redundant frequency update, regardless of whether or not the driver
specifies CPUFREQ_NEED_UPDATE_LIMITS, when the next chosen frequency is the
same as the current one.
Fix the superfluous updates by only honoring CPUFREQ_NEED_UPDATE_LIMITS
when there's a policy limits change, and clearing need_freq_update when a
requisite redundant update occurs.
This is neatly achieved by moving up the CPUFREQ_NEED_UPDATE_LIMITS test
and instead setting need_freq_update to false in sugov_update_next_freq().
Fixes: 600f5badb7 ("cpufreq: schedutil: Don't skip freq update when limits change")
Signed-off-by: Sultan Alsawaf (unemployed) <sultan@kerneltoast.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20241212015734.41241-2-sultan@kerneltoast.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
CPU controller limits are not properly enforced during CPU hotplug
operations, particularly during CPU offline. When a CPU goes offline,
throttled processes are unintentionally being unthrottled across all CPUs
in the system, allowing them to exceed their assigned quota limits.
Consider below for an example,
Assigning 6.25% bandwidth limit to a cgroup
in a 8 CPU system, where, workload is running 8 threads for 20 seconds at
100% CPU utilization, expected (user+sys) time = 10 seconds.
$ cat /sys/fs/cgroup/test/cpu.max
50000 100000
$ ./ebizzy -t 8 -S 20 // non-hotplug case
real 20.00 s
user 10.81 s // intended behaviour
sys 0.00 s
$ ./ebizzy -t 8 -S 20 // hotplug case
real 20.00 s
user 14.43 s // Workload is able to run for 14 secs
sys 0.00 s // when it should have only run for 10 secs
During CPU hotplug, scheduler domains are rebuilt and cpu_attach_domain
is called for every active CPU to update the root domain. That ends up
calling rq_offline_fair which un-throttles any throttled hierarchies.
Unthrottling should only occur for the CPU being hotplugged to allow its
throttled processes to become runnable and get migrated to other CPUs.
With current patch applied,
$ ./ebizzy -t 8 -S 20 // hotplug case
real 21.00 s
user 10.16 s // intended behaviour
sys 0.00 s
This also has another symptom, when a CPU goes offline, and if the cfs_rq
is not in throttled state and the runtime_remaining still had plenty
remaining, it gets reset to 1 here, causing the runtime_remaining of
cfs_rq to be quickly depleted.
Note: hotplug operation (online, offline) was performed in while(1) loop
v3: https://lore.kernel.org/all/20241210102346.228663-2-vishalc@linux.ibm.com
v2: https://lore.kernel.org/all/20241207052730.1746380-2-vishalc@linux.ibm.com
v1: https://lore.kernel.org/all/20241126064812.809903-2-vishalc@linux.ibm.com
Suggested-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Tested-by: Samir Mulani <samir@linux.ibm.com>
Link: https://lore.kernel.org/r/20241212043102.584863-2-vishalc@linux.ibm.com
its time accounting
- Properly track the CFS runqueue runnable stats
- Check the total number of all queued tasks in a sched fair's runqueue
hierarchy before deciding to stop the tick
- Fix the scheduling of the task that got woken last (NEXT_BUDDY) by
preventing those from being delayed
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmdexEsACgkQEsHwGGHe
VUpFqA//SIIbNJEIQEwGkFrYpGwVpSISm94L4ENsrkWbJWQlALwQEBJF9Me/DOZH
vHaX3o+cMxt26W7o0NKyPcvYtulnOr33HZA/uxK35MDaUinSA3Spt3jXHfR3n0mL
ljNQQraWHGaJh7dzKMZoxP6DR78/Z0yotXjt33xeBFMSJuzGsklrbIiSJ6c4m/3u
Y1lrQT8LncsxJMYIPAKtBAc9hvJfGFV6IOTaTfxP0oTuDo/2qTNVHm7to40wk3NW
kb0lf2kjVtE6mwMfEm49rtjE3h0VnPJKGKoEkLi9IQoPbQq9Uf4i9VSmRe3zqPAz
yBxV8BAu2koscMZzqw1CTnd9c/V+/A9qOOHfDo72I5MriJ1qVWCEsqB1y3u2yT6n
XjwFDbPiVKI8H9YlsZpWERocCRypshevPNlYOF93PlK+YTXoMWaXMQhec5NDzLLw
Se1K2sCi3U8BMdln0dH6nhk0unzNKQ8UKzrMFncSjnpWhpJ69uxyUZ/jL//6bvfi
Z+7G4U54mUhGyOAaUSGH/20TnZRWJ7NJC542omFgg9v0VLxx+wnZyX4zJIV0jvRr
6voYmYDCO8zn/hO67VBJuei97ayIzxDNP1tVl15LzcvRcIGWNUPOwp5jijv8vDJG
lJhQrMF6w4fgPItC20FvptlDvpP9cItSzyyOeg074HjDS53QN2Y=
=jOb3
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Prevent incorrect dequeueing of the deadline dlserver helper task and
fix its time accounting
- Properly track the CFS runqueue runnable stats
- Check the total number of all queued tasks in a sched fair's runqueue
hierarchy before deciding to stop the tick
- Fix the scheduling of the task that got woken last (NEXT_BUDDY) by
preventing those from being delayed
* tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/dlserver: Fix dlserver time accounting
sched/dlserver: Fix dlserver double enqueue
sched/eevdf: More PELT vs DELAYED_DEQUEUE
sched/fair: Fix sched_can_stop_tick() for fair tasks
sched/fair: Fix NEXT_BUDDY
Update the `dsq_hash_params` initialization to use `sizeof_field`
for the `key_len` field instead of a hardcoded value.
This improves code readability and ensures the key length dynamically
matches the size of the `id` field in the `scx_dispatch_q` structure.
Signed-off-by: Liang Jie <liangjie@lixiang.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
dlserver time is accounted when:
- dlserver is active and the dlserver proxies the cfs task.
- dlserver is active but deferred and cfs task runs after being picked
through the normal fair class pick.
dl_server_update is called in two places to make sure that both the
above times are accounted for. But it doesn't check if dlserver is
active or not. Now that we have this dl_server_active flag, we can
consolidate dl_server_update into one place and all we need to check is
whether dlserver is active or not. When dlserver is active there is only
two possible conditions:
- dlserver is deferred.
- cfs task is running on behalf of dlserver.
Fixes: a110a81c52 ("sched/deadline: Deferrable dl server")
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lore.kernel.org/r/20241213032244.877029-2-vineeth@bitbyteword.org
dlserver can get dequeued during a dlserver pick_task due to the delayed
deueue feature and this can lead to issues with dlserver logic as it
still thinks that dlserver is on the runqueue. The dlserver throttling
and replenish logic gets confused and can lead to double enqueue of
dlserver.
Double enqueue of dlserver could happend due to couple of reasons:
Case 1
------
Delayed dequeue feature[1] can cause dlserver being stopped during a
pick initiated by dlserver:
__pick_next_task
pick_task_dl -> server_pick_task
pick_task_fair
pick_next_entity (if (sched_delayed))
dequeue_entities
dl_server_stop
server_pick_task goes ahead with update_curr_dl_se without knowing that
dlserver is dequeued and this confuses the logic and may lead to
unintended enqueue while the server is stopped.
Case 2
------
A race condition between a task dequeue on one cpu and same task's enqueue
on this cpu by a remote cpu while the lock is released causing dlserver
double enqueue.
One cpu would be in the schedule() and releasing RQ-lock:
current->state = TASK_INTERRUPTIBLE();
schedule();
deactivate_task()
dl_stop_server();
pick_next_task()
pick_next_task_fair()
sched_balance_newidle()
rq_unlock(this_rq)
at which point another CPU can take our RQ-lock and do:
try_to_wake_up()
ttwu_queue()
rq_lock()
...
activate_task()
dl_server_start() --> first enqueue
wakeup_preempt() := check_preempt_wakeup_fair()
update_curr()
update_curr_task()
if (current->dl_server)
dl_server_update()
enqueue_dl_entity() --> second enqueue
This bug was not apparent as the enqueue in dl_server_start doesn't
usually happen because of the defer logic. But as a side effect of the
first case(dequeue during dlserver pick), dl_throttled and dl_yield will
be set and this causes the time accounting of dlserver to messup and
then leading to a enqueue in dl_server_start.
Have an explicit flag representing the status of dlserver to avoid the
confusion. This is set in dl_server_start and reset in dlserver_stop.
Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org
Commit 8b5e770ed7 ("sched/deadline: Optimize pull_dl_task()")
added a goto label seems would be better written as a while
loop.
So replace the goto with a while loop, to make it easier to read.
Reported-by: Todd Kjos <tkjos@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20241206000009.1226085-1-jstultz@google.com
a RT mutex
- Do not setup a new deadline entity on a boosted task as that has happened
already
- Update preempt= kernel command line param
- Prevent needless softirqd wakeups in the idle task's context
- Detect the case where the idle load balancer CPU becomes busy and avoid
unnecessary load balancing invocation
- Remove an unnecessary load balancing need_resched() call in nohz_csd_func()
- Allow for raising of SCHED_SOFTIRQ softirq type on RT but retain the warning
to catch any other cases
- Remove a wrong warning when a cpuset update makes the task affinity no
longer a subset of the cpuset
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmdWvHYACgkQEsHwGGHe
VUrJ6g//eEwmHpa9+at3UvXrVlcYQmZsQpgL2ksjVE0n4KXFeUavwCR4h6SJzvcD
RDF9AyDuPAoCqy5DhL5wTXPG/4AnnISqAEkoP7h7YO76P7ks6+HD7t31pCF/uqCH
yqS4vc1RJ6yW8otcCpR7rOPEQ49Klqc1KTFTNAFLc6MNEb/SVH5Ih+wFL5Mj/W3I
UkBEtUy1oR2Q4QPhJ+0sr0LAI1AwjykdbkWzOhs6D1kPaRqdV4Atgc2fwioLIvhO
s++lev9BmGx02dmrRWRmIBL9S9ycSLT1qx28sbzlS+PZMGYqOnImVOW5+EPr+ovK
fILc0m8sLD6GyZHIPgeIT2+DqSvDTQOGQwXyUYmoarI+BWGGSz6iZGn4RrZHMRQo
cpqYV9z7F2t3X1hPfhrH+40BXJeMMX+wd4ahXNA44QD6Bf7I+zPUfsrfnrR4BwV7
qpXhBzXOuZrgOKolIwJmHIxyLtd79idYccGvjIME5rwj8eBg0J7zmjzoVewqUXsb
F9ualvq6twxUIdD4XiClpi+E16Z2Ot3PplNIohosVrUDRDUgvTBbTuDZnUuOkXbb
wV26XKuYKQYfx5UfJBSYL3DCfCttkKCVrPX2oiqw6PKNXw9BM8BQIux+XQH2jvIg
wOPqZWZf2VIoQJU2N+twc/BAIRAF7CNr/ioTJlXQ1hsOIlTp3kk=
=XLf1
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Remove wrong enqueueing of a task for a later wakeup when a task
blocks on a RT mutex
- Do not setup a new deadline entity on a boosted task as that has
happened already
- Update preempt= kernel command line param
- Prevent needless softirqd wakeups in the idle task's context
- Detect the case where the idle load balancer CPU becomes busy and
avoid unnecessary load balancing invocation
- Remove an unnecessary load balancing need_resched() call in
nohz_csd_func()
- Allow for raising of SCHED_SOFTIRQ softirq type on RT but retain the
warning to catch any other cases
- Remove a wrong warning when a cpuset update makes the task affinity
no longer a subset of the cpuset
* tag 'sched_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking: rtmutex: Fix wake_q logic in task_blocks_on_rt_mutex
sched/deadline: Fix warning in migrate_enable for boosted tasks
sched/core: Update kernel boot parameters for LAZY preempt.
sched/core: Prevent wakeup of ksoftirqd during idle load balance
sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy
sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel
sched: fix warning in sched_setaffinity
sched/deadline: Fix replenish_dl_new_period dl_server condition
There are 3 sites using set_next_buddy() and only one is conditional
on NEXT_BUDDY, the other two sites are unconditional; to note:
- yield_to_task()
- cgroup dequeue / pick optimization
However, having NEXT_BUDDY control both the wakeup-preemption and the
picking side of things means its near useless.
Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241129101541.GA33464@noisy.programming.kicks-ass.net
When max_vruntime() is unused, it prevents kernel builds with clang,
`make W=1` and CONFIG_WERROR=y:
kernel/sched/fair.c:526:19: error: unused function 'max_vruntime' [-Werror,-Wunused-function]
526 | static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime)
| ^~~~~~~~~~~~
Fix this by marking them with __maybe_unused (all cases for the sake of
symmetry).
See also commit 6863f5643d ("kbuild: allow Clang to find unused static
inline functions for W=1 build").
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241202173546.634433-1-andriy.shevchenko@linux.intel.com
Move variable declaration at the beginning of the function
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-12-vincent.guittot@linaro.org
Migrating a delayed dequeued task doesn't help in balancing the number
of runnable tasks in the system.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-11-vincent.guittot@linaro.org
Rename cfs_rq.nr_running into cfs_rq.nr_queued which better reflects the
reality as the value includes both the ready to run tasks and the delayed
dequeue tasks.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-10-vincent.guittot@linaro.org
cfs_rq.idle_nr_running field is not used anywhere so we can remove the
useless associated computation. Last user went in commit 5e963f2bd4
("sched/fair: Commit to EEVDF").
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-9-vincent.guittot@linaro.org
Use same naming convention as others starting with h_nr_* and rename
idle_h_nr_running into h_nr_idle.
The "running" is not correct anymore as it includes delayed dequeue tasks
as well.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-8-vincent.guittot@linaro.org
h_nr_delayed is not used anymore. We now have:
- h_nr_runnable which tracks tasks ready to run
- h_nr_queued which tracks enqueued tasks either ready to run or
delayed dequeue
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-7-vincent.guittot@linaro.org
Use the new h_nr_runnable that tracks only queued and runnable tasks in the
statistics that are used to balance the system:
- PELT runnable_avg
- deciding if a group is overloaded or has spare capacity
- numa stats
- reduced capacity management
- load balance
- nohz kick
It should be noticed that the rq->nr_running still counts the delayed
dequeued tasks as delayed dequeue is a fair feature that is meaningless
at core level.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-6-vincent.guittot@linaro.org
With delayed dequeued feature, a sleeping sched_entity remains queued in
the rq until its lag has elapsed. As a result, it stays also visible
in the statistics that are used to balance the system and in particular
the field cfs.h_nr_queued when the sched_entity is associated to a task.
Create a new h_nr_runnable that tracks only queued and runnable tasks.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-5-vincent.guittot@linaro.org
With delayed dequeued feature, a sleeping sched_entity remains queued
in the rq until its lag has elapsed but can't run.
Rename h_nr_running into h_nr_queued to reflect this new behavior.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-4-vincent.guittot@linaro.org
Vincent and Dietmar noted that while
commit fc1892becd ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") fixes
the entity runnable stats, it does not adjust the cfs_rq runnable stats,
which are based off of h_nr_running.
Track h_nr_delayed such that we can discount those and adjust the
signal.
Fixes: fc1892becd ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE")
Closes: https://lore.kernel.org/lkml/a9a45193-d0c6-4ba2-a822-464ad30b550e@arm.com/
Closes: https://lore.kernel.org/lkml/CAKfTPtCNUvWE_GX5LyvTF-WdxUT=ZgvZZv-4t=eWntg5uOFqiQ@mail.gmail.com/
[ Fixes checkpatch warnings and rebased ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-3-vincent.guittot@linaro.org
We can't stop the tick of a rq if there are at least 2 tasks enqueued in
the whole hierarchy and not only at the root cfs rq.
rq->cfs.nr_running tracks the number of sched_entity at one level
whereas rq->cfs.h_nr_running tracks all queued tasks in the
hierarchy.
Fixes: 11cc374f46 ("sched_ext: Simplify scx_can_stop_tick() invocation in sched_can_stop_tick()")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-2-vincent.guittot@linaro.org
Adam reports that enabling NEXT_BUDDY insta triggers a WARN in
pick_next_entity().
Moving clear_buddies() up before the delayed dequeue bits ensures
no ->next buddy becomes delayed. Further ensure no new ->next buddy
ever starts as delayed.
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Reported-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Adam Li <adamli@os.amperecomputing.com>
Link: https://lkml.kernel.org/r/670a0d54-e398-4b1f-8a6e-90784e2fdf89@amd.com
[Problem Description]
When running the hackbench program of LTP, the following memory leak is
reported by kmemleak.
# /opt/ltp/testcases/bin/hackbench 20 thread 1000
Running with 20*40 (== 800) tasks.
# dmesg | grep kmemleak
...
kmemleak: 480 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 665 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff888cd8ca2c40 (size 64):
comm "hackbench", pid 17142, jiffies 4299780315
hex dump (first 32 bytes):
ac 74 49 00 01 00 00 00 4c 84 49 00 01 00 00 00 .tI.....L.I.....
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc bff18fd4):
[<ffffffff81419a89>] __kmalloc_cache_noprof+0x2f9/0x3f0
[<ffffffff8113f715>] task_numa_work+0x725/0xa00
[<ffffffff8110f878>] task_work_run+0x58/0x90
[<ffffffff81ddd9f8>] syscall_exit_to_user_mode+0x1c8/0x1e0
[<ffffffff81dd78d5>] do_syscall_64+0x85/0x150
[<ffffffff81e0012b>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
...
This issue can be consistently reproduced on three different servers:
* a 448-core server
* a 256-core server
* a 192-core server
[Root Cause]
Since multiple threads are created by the hackbench program (along with
the command argument 'thread'), a shared vma might be accessed by two or
more cores simultaneously. When two or more cores observe that
vma->numab_state is NULL at the same time, vma->numab_state will be
overwritten.
Although current code ensures that only one thread scans the VMAs in a
single 'numa_scan_period', there might be a chance for another thread
to enter in the next 'numa_scan_period' while we have not gotten till
numab_state allocation [1].
Note that the command `/opt/ltp/testcases/bin/hackbench 50 process 1000`
cannot the reproduce the issue. It is verified with 200+ test runs.
[Solution]
Use the cmpxchg atomic operation to ensure that only one thread executes
the vma->numab_state assignment.
[1] https://lore.kernel.org/lkml/1794be3c-358c-4cdc-a43d-a1f841d91ef7@amd.com/
Link: https://lkml.kernel.org/r/20241113102146.2384-1-ahuang12@lenovo.com
Fixes: ef6a22b70f ("sched/numa: apply the scan delay to every new vma")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reported-by: Jiwei Sun <sunjw10@lenovo.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rely on the NUMA scheduling domain topology, instead of accessing NUMA
topology information directly.
There is basically no functional change, but in this way we ensure
consistent use of the same topology information determined by the
scheduling subsystem.
Fixes: f6ce6b9493 ("sched_ext: Do not enable LLC/NUMA optimizations when domains overlap")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
As all the non-domain and non-managed_irq housekeeping types have been
unified to HK_TYPE_KERNEL_NOISE, replace all these references in the
scheduler to use HK_TYPE_KERNEL_NOISE.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20241030175253.125248-5-longman@redhat.com
The housekeeping cpumasks are only set by two boot commandline
parameters: "nohz_full" and "isolcpus". When there is more than one of
"nohz_full" or "isolcpus", the extra ones must have the same CPU list
or the setup will fail partially.
The HK_TYPE_DOMAIN and HK_TYPE_MANAGED_IRQ types are settable by
"isolcpus" only and their settings can be independent of the other
types. The other housekeeping types are all set by "nohz_full" or
"isolcpus=nohz" without a way to set them individually. So they all
have identical cpumasks.
There is actually no point in having different cpumasks for these
"nohz_full" only housekeeping types. Consolidate these types to use the
same cpumask by aliasing them to the same value. If there is a need to
set any of them independently in the future, we can break them out to
their own cpumasks again.
With this change, the number of cpumasks in the housekeeping structure
drops from 9 to 3. Other than that, there should be no other functional
change.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20241030175253.125248-4-longman@redhat.com
The "isolcpus=nohz" boot parameter and flag were used to disable tick
when running a single task. Nowsdays, this "nohz" flag is seldomly used
as it is included as part of the "nohz_full" parameter. Extend this
flag to cover other kernel noises disabled by the "nohz_full" parameter
to make them equivalent. This also eliminates the need to use both the
"isolcpus" and the "nohz_full" parameters to fully isolated a given
set of CPUs.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20241030175253.125248-3-longman@redhat.com
The HK_TYPE_SCHED housekeeping type is defined but not set anywhere. So
any code that try to use HK_TYPE_SCHED are essentially dead code. So
remove HK_TYPE_SCHED and any code that use it.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20241030175253.125248-2-longman@redhat.com
Andy reported that clang gets upset with CONFIG_CFS_BANDWIDTH=n:
kernel/sched/fair.c:6580:20: error: unused function 'cfs_bandwidth_used' [-Werror,-Wunused-function]
6580 | static inline bool cfs_bandwidth_used(void)
| ^~~~~~~~~~~~~~~~~~
Indeed, cfs_bandwidth_used() is only used within functions defined under
CONFIG_CFS_BANDWIDTH=y. Remove its CONFIG_CFS_BANDWIDTH=n declaration &
definition.
Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lore.kernel.org/r/20241127165501.160004-1-vschneid@redhat.com
After commit b58652db66 ("sched/deadline: Fix task_struct reference
leak"), I identified additional calls to hrtimer_try_to_cancel that
might also require a dl_server check. It remains unclear whether this
omission was intentional or accidental in those contexts.
This patch consolidates the timer cancellation logic into dedicated
functions, ensuring consistent behavior across all calls.
Additionally, it reduces code duplication and improves overall code
cleanliness.
Note the use of the __always_inline keyword. In some instances, we
have a task_struct pointer, dereference the dl member, and then use
the container_of macro to retrieve the task_struct pointer again. By
inlining the code, the compiler can potentially optimize out this
redundant round trip.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20240724142253.27145-3-wander@redhat.com
Currently we check for bandwidth overflow potentially due to hotplug
operations at the end of sched_cpu_deactivate(), after the cpu going
offline has already been removed from scheduling, active_mask, etc.
This can create issues for DEADLINE tasks, as there is a substantial
race window between the start of sched_cpu_deactivate() and the moment
we possibly decide to roll-back the operation if dl_bw_deactivate()
returns failure in cpuset_cpu_inactive(). An example is a throttled
task that sees its replenishment timer firing while the cpu it was
previously running on is considered offline, but before
dl_bw_deactivate() had a chance to say no and roll-back happened.
Fix this by directly calling dl_bw_deactivate() first thing in
sched_cpu_deactivate() and do the required calculation in the former
function considering the cpu passed as an argument as offline already.
By doing so we also simplify sched_cpu_deactivate(), as there is no need
anymore for any kind of roll-back if we fail early.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/Zzc1DfPhbvqDDIJR@jlelli-thinkpadt14gen4.remote.csb
For hotplug operations, DEADLINE needs to check that there is still enough
bandwidth left after removing the CPU that is going offline. We however
fail to do so currently.
Restore the correct behavior by restructuring dl_bw_manage() a bit, so
that overflow conditions (not enough bandwidth left) are properly
checked. Also account for dl_server bandwidth, i.e. discount such
bandwidth in the calculation since NORMAL tasks will be anyway moved
away from the CPU as a result of the hotplug operation.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20241114142810.794657-3-juri.lelli@redhat.com
When root domain non-destructive changes (e.g., only modifying one of
the existing root domains while the rest is not touched) happen we still
need to clear DEADLINE bandwidth accounting so that it's then properly
restored, taking into account DEADLINE tasks associated to each cpuset
(associated to each root domain). After the introduction of dl_servers,
we fail to restore such servers contribution after non-destructive
changes (as they are only considered on destructive changes when
runqueues are attached to the new domains).
Fix this by making sure we iterate over the dl_servers attached to
domains that have not been destroyed and add their bandwidth
contribution back correctly.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20241114142810.794657-2-juri.lelli@redhat.com
task_on_rq_queued read p->on_rq without READ_ONCE, though p->on_rq is
set with WRITE_ONCE in {activate|deactivate}_task and smp_store_release
in __block_task, and also read with READ_ONCE in task_on_rq_migrating.
Make all of these accesses pair together by adding READ_ONCE in the
task_on_rq_queued.
Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20241114210812.1836587-1-jon@nutanix.com
When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for the future things using
clock_task, in some situations, as they end up getting additional steal
time that did not actually happen.
This is because there is a window between reading the elapsed time in
update_rq_clock() and sampling the steal time in update_rq_clock_task().
If the VCPU gets preempted between those two points, any additional
steal time is accounted to the outgoing task even though the calculated
delta did not actually contain any of that "stolen" time.
When this race happens, we can end up with steal time that exceeds the
calculated delta, and the previous code would try to catch up that excess
steal time in future clock updates, which is given to the next,
incoming task, even though it did not actually have any time stolen.
This behavior is particularly bad when steal time can be very long,
which we've seen when trying to extend steal time to contain the duration
that the host was suspended [0]. When this happens, clock_task stays
frozen, during which the running task stays running for the whole
duration, since its run time doesn't increase.
However the race can happen even under normal operation.
Ideally we would read the elapsed cpu time and the steal time atomically,
to prevent this race from happening in the first place, but doing so
is non-trivial.
Since the time between those two points isn't otherwise accounted anywhere,
neither to the outgoing task nor the incoming task (because the "end of
outgoing task" and "start of incoming task" timestamps are the same),
I would argue that the right thing to do is to simply drop any excess steal
time, in order to prevent these issues.
[0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
When running the following command:
while true; do
stress-ng --cyclic 30 --timeout 30s --minimize --quiet
done
a warning is eventually triggered:
WARNING: CPU: 43 PID: 2848 at kernel/sched/deadline.c:794
setup_new_dl_entity+0x13e/0x180
...
Call Trace:
<TASK>
? show_trace_log_lvl+0x1c4/0x2df
? enqueue_dl_entity+0x631/0x6e0
? setup_new_dl_entity+0x13e/0x180
? __warn+0x7e/0xd0
? report_bug+0x11a/0x1a0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x14/0x70
? asm_exc_invalid_op+0x16/0x20
enqueue_dl_entity+0x631/0x6e0
enqueue_task_dl+0x7d/0x120
__do_set_cpus_allowed+0xe3/0x280
__set_cpus_allowed_ptr_locked+0x140/0x1d0
__set_cpus_allowed_ptr+0x54/0xa0
migrate_enable+0x7e/0x150
rt_spin_unlock+0x1c/0x90
group_send_sig_info+0xf7/0x1a0
? kill_pid_info+0x1f/0x1d0
kill_pid_info+0x78/0x1d0
kill_proc_info+0x5b/0x110
__x64_sys_kill+0x93/0xc0
do_syscall_64+0x5c/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f0dab31f92b
This warning occurs because set_cpus_allowed dequeues and enqueues tasks
with the ENQUEUE_RESTORE flag set. If the task is boosted, the warning
is triggered. A boosted task already had its parameters set by
rt_mutex_setprio, and a new call to setup_new_dl_entity is unnecessary,
hence the WARN_ON call.
Check if we are requeueing a boosted task and avoid calling
setup_new_dl_entity if that's the case.
Fixes: 295d6d5e37 ("sched/deadline: Fix switching to -deadline")
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20240724142253.27145-2-wander@redhat.com
Scheduler raises a SCHED_SOFTIRQ to trigger a load balancing event on
from the IPI handler on the idle CPU. If the SMP function is invoked
from an idle CPU via flush_smp_call_function_queue() then the HARD-IRQ
flag is not set and raise_softirq_irqoff() needlessly wakes ksoftirqd
because soft interrupts are handled before ksoftirqd get on the CPU.
Adding a trace_printk() in nohz_csd_func() at the spot of raising
SCHED_SOFTIRQ and enabling trace events for sched_switch, sched_wakeup,
and softirq_entry (for SCHED_SOFTIRQ vector alone) helps observing the
current behavior:
<idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ from nohz_csd_func
<idle>-0 [000] dN.4.: sched_wakeup: comm=ksoftirqd/0 pid=16 prio=120 target_cpu=000
<idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]
<idle>-0 [000] .Ns1.: softirq_exit: vec=7 [action=SCHED]
<idle>-0 [000] d..2.: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=16 next_prio=120
ksoftirqd/0-16 [000] d..2.: sched_switch: prev_comm=ksoftirqd/0 prev_pid=16 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
...
Use __raise_softirq_irqoff() to raise the softirq. The SMP function call
is always invoked on the requested CPU in an interrupt handler. It is
guaranteed that soft interrupts are handled at the end.
Following are the observations with the changes when enabling the same
set of events:
<idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ for nohz_idle_balance
<idle>-0 [000] dN.1.: softirq_raise: vec=7 [action=SCHED]
<idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]
No unnecessary ksoftirqd wakeups are seen from idle task's context to
service the softirq.
Fixes: b2a02fc43a ("smp: Optimize send_call_function_single_ipi()")
Closes: https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/ [1]
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20241119054432.6405-5-kprateek.nayak@amd.com
Commit b2a02fc43a ("smp: Optimize send_call_function_single_ipi()")
optimizes IPIs to idle CPUs in TIF_POLLING_NRFLAG mode by setting the
TIF_NEED_RESCHED flag in idle task's thread info and relying on
flush_smp_call_function_queue() in idle exit path to run the
call-function. A softirq raised by the call-function is handled shortly
after in do_softirq_post_smp_call_flush() but the TIF_NEED_RESCHED flag
remains set and is only cleared later when schedule_idle() calls
__schedule().
need_resched() check in _nohz_idle_balance() exists to bail out of load
balancing if another task has woken up on the CPU currently in-charge of
idle load balancing which is being processed in SCHED_SOFTIRQ context.
Since the optimization mentioned above overloads the interpretation of
TIF_NEED_RESCHED, check for idle_cpu() before going with the existing
need_resched() check which can catch a genuine task wakeup on an idle
CPU processing SCHED_SOFTIRQ from do_softirq_post_smp_call_flush(), as
well as the case where ksoftirqd needs to be preempted as a result of
new task wakeup or slice expiry.
In case of PREEMPT_RT or threadirqs, although the idle load balancing
may be inhibited in some cases on the ilb CPU, the fact that ksoftirqd
is the only fair task going back to sleep will trigger a newidle balance
on the CPU which will alleviate some imbalance if it exists if idle
balance fails to do so.
Fixes: b2a02fc43a ("smp: Optimize send_call_function_single_ipi()")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119054432.6405-4-kprateek.nayak@amd.com
The need_resched() check currently in nohz_csd_func() can be tracked
to have been added in scheduler_ipi() back in 2011 via commit
ca38062e57 ("sched: Use resched IPI to kick off the nohz idle balance")
Since then, it has travelled quite a bit but it seems like an idle_cpu()
check currently is sufficient to detect the need to bail out from an
idle load balancing. To justify this removal, consider all the following
case where an idle load balancing could race with a task wakeup:
o Since commit f3dd3f6745 ("sched: Remove the limitation of WF_ON_CPU
on wakelist if wakee cpu is idle") a target perceived to be idle
(target_rq->nr_running == 0) will return true for
ttwu_queue_cond(target) which will offload the task wakeup to the idle
target via an IPI.
In all such cases target_rq->ttwu_pending will be set to 1 before
queuing the wake function.
If an idle load balance races here, following scenarios are possible:
- The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
IPI is sent to the CPU to wake it out of idle. If the
nohz_csd_func() queues before sched_ttwu_pending(), the idle load
balance will bail out since idle_cpu(target) returns 0 since
target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
sched_ttwu_pending() it should see rq->nr_running to be non-zero and
bail out of idle load balancing.
- The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
the sender will simply set TIF_NEED_RESCHED for the target to put it
out of idle and flush_smp_call_function_queue() in do_idle() will
execute the call function. Depending on the ordering of the queuing
of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
target_rq->nr_running to be non-zero if there is a genuine task
wakeup racing with the idle load balance kick.
o The waker CPU perceives the target CPU to be busy
(targer_rq->nr_running != 0) but the CPU is in fact going idle and due
to a series of unfortunate events, the system reaches a case where the
waker CPU decides to perform the wakeup by itself in ttwu_queue() on
the target CPU but target is concurrently selected for idle load
balance (XXX: Can this happen? I'm not sure, but we'll consider the
mother of all coincidences to estimate the worst case scenario).
ttwu_do_activate() calls enqueue_task() which would increment
"rq->nr_running" post which it calls wakeup_preempt() which is
responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU) The key
thing to note in this case is that rq->nr_running is already non-zero
in case of a wakeup before TIF_NEED_RESCHED is set which would
lead to idle_cpu() check returning false.
In all cases, it seems that need_resched() check is unnecessary when
checking for idle_cpu() first since an impending wakeup racing with idle
load balancer will either set the "rq->ttwu_pending" or indicate a newly
woken task via "rq->nr_running".
Chasing the reason why this check might have existed in the first place,
I came across Peter's suggestion on the fist iteration of Suresh's
patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was:
sched_ttwu_do_pending(list);
if (unlikely((rq->idle == current) &&
rq->nohz_balance_kick &&
!need_resched()))
raise_softirq_irqoff(SCHED_SOFTIRQ);
Since the condition to raise the SCHED_SOFIRQ was preceded by
sched_ttwu_do_pending() (which is equivalent of sched_ttwu_pending()) in
the current upstream kernel, the need_resched() check was necessary to
catch a newly queued task. Peter suggested modifying it to:
if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
raise_softirq_irqoff(SCHED_SOFTIRQ);
where idle_cpu() seems to have replaced "rq->idle == current" check.
Even back then, the idle_cpu() check would have been sufficient to catch
a new task being enqueued. Since commit b2a02fc43a ("smp: Optimize
send_call_function_single_ipi()") overloads the interpretation of
TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove the
need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
on Peter's suggestion.
Fixes: b2a02fc43a ("smp: Optimize send_call_function_single_ipi()")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119054432.6405-3-kprateek.nayak@amd.com
Commit 8f9ea86fdf added some logic to sched_setaffinity that included
a WARN when a per-task affinity assignment races with a cpuset update.
Specifically, we can have a race where a cpuset update results in the
task affinity no longer being a subset of the cpuset. That's fine; we
have a fallback to instead use the cpuset mask. However, we have a WARN
set up that will trigger if the cpuset mask has no overlap at all with
the requested task affinity. This shouldn't be a warning condition; its
trivial to create this condition.
Reproduced the warning by the following setup:
- $PID inside a cpuset cgroup
- another thread repeatedly switching the cpuset cpus from 1-2 to just 1
- another thread repeatedly setting the $PID affinity (via taskset) to 2
Fixes: 8f9ea86fdf ("sched: Always preserve the user requested cpumask")
Signed-off-by: Josh Don <joshdon@google.com>
Acked-and-tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://lkml.kernel.org/r/20241111182738.1832953-1-joshdon@google.com
The condition in replenish_dl_new_period() that checks if a reservation
(dl_server) is deferred and is not handling a starvation case is
obviously wrong.
Fix it.
Fixes: a110a81c52 ("sched/deadline: Deferrable dl server")
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20241127063740.8278-1-juri.lelli@redhat.com
- Improve the default select_cpu() implementation making it topology aware
and handle WAKE_SYNC better.
- set_arg_maybe_null() was used to inform the verifier which ops args could
be NULL in a rather hackish way. Use the new __nullable CFI stub tags
instead.
- On Sapphire Rapids multi-socket systems, a BPF scheduler, by hammering on
the same queue across sockets, could live-lock the system to the point
where the system couldn't make reasonable forward progress. This could
lead to soft-lockup triggered resets or stalling out bypass mode switch
and thus BPF scheduler ejection for tens of minutes if not hours. After
trying a number of mitigations, the following set worked reliably:
- Injecting artificial cpu_relax() loops in two places while sched_ext is
trying to turn on the bypass mode.
- Triggering scheduler ejection when soft-lockup detection is imminent (a
quarter of threshold left).
While not the prettiest, the impact both in terms of code complexity and
overhead is minimal.
- A common complaint on the API is the overuse of the word "dispatch" and
the confusion around "consume". This is due to how the dispatch queues
became more generic over time. Rename the affected kfuncs for clarity.
Thanks to BPF's compatibility features, this change can be made in a way
that's both forward and backward compatible. The compatibility code will
be dropped in a few releases.
- Pull sched_ext/for-6.12-fixes to receive a prerequisite change. Other misc
changes.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZztuXA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGePUAP4nFTDaUDngVlxGv5hpYz8/Gcv1bPsWEydRRmH/
3F+pNgEAmGIGAEwFYfc9Zn8Kbjf0eJAduf2RhGRatQO6F/+GSwo=
=AcyC
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- Improve the default select_cpu() implementation making it topology
aware and handle WAKE_SYNC better.
- set_arg_maybe_null() was used to inform the verifier which ops args
could be NULL in a rather hackish way. Use the new __nullable CFI
stub tags instead.
- On Sapphire Rapids multi-socket systems, a BPF scheduler, by
hammering on the same queue across sockets, could live-lock the
system to the point where the system couldn't make reasonable forward
progress.
This could lead to soft-lockup triggered resets or stalling out
bypass mode switch and thus BPF scheduler ejection for tens of
minutes if not hours. After trying a number of mitigations, the
following set worked reliably:
- Injecting artificial cpu_relax() loops in two places while
sched_ext is trying to turn on the bypass mode.
- Triggering scheduler ejection when soft-lockup detection is
imminent (a quarter of threshold left).
While not the prettiest, the impact both in terms of code complexity
and overhead is minimal.
- A common complaint on the API is the overuse of the word "dispatch"
and the confusion around "consume". This is due to how the dispatch
queues became more generic over time. Rename the affected kfuncs for
clarity. Thanks to BPF's compatibility features, this change can be
made in a way that's both forward and backward compatible. The
compatibility code will be dropped in a few releases.
- Other misc changes
* tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (21 commits)
sched_ext: Replace scx_next_task_picked() with switch_class() in comment
sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()
sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()
sched_ext: scx_bpf_dispatch_from_dsq_set_*() are allowed from unlocked context
sched_ext: add a missing rcu_read_lock/unlock pair at scx_select_cpu_dfl()
sched_ext: Clarify sched_ext_ops table for userland scheduler
sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
sched_ext: Avoid live-locking bypass mode switching
sched_ext: Fix incorrect use of bitwise AND
sched_ext: Do not enable LLC/NUMA optimizations when domains overlap
sched_ext: Introduce NUMA awareness to the default idle selection policy
sched_ext: Replace set_arg_maybe_null() with __nullable CFI stub tags
sched_ext: Rename CFI stubs to names that are recognized by BPF
sched_ext: Introduce LLC awareness to the default idle selection policy
sched_ext: Clarify ops.select_cpu() for single-CPU tasks
sched_ext: improve WAKE_SYNC behavior for default idle CPU selection
sched_ext: Use btf_ids to resolve task_struct
sched/ext: Use tg_cgroup() to elieminate duplicate code
sched/ext: Fix unmatch trailing comment of CONFIG_EXT_GROUP_SCHED
...
- The final step to get rid of auto-rearming posix-timers
posix-timers are currently auto-rearmed by the kernel when the signal
of the timer is ignored so that the timer signal can be delivered once
the corresponding signal is unignored.
This requires to throttle the timer to prevent a DoS by small intervals
and keeps the system pointlessly out of low power states for no value.
This is a long standing non-trivial problem due to the lock order of
posix-timer lock and the sighand lock along with life time issues as
the timer and the sigqueue have different life time rules.
Cure this by:
* Embedding the sigqueue into the timer struct to have the same life
time rules. Aside of that this also avoids the lookup of the timer
in the signal delivery and rearm path as it's just a always valid
container_of() now.
* Queuing ignored timer signals onto a seperate ignored list.
* Moving queued timer signals onto the ignored list when the signal is
switched to SIG_IGN before it could be delivered.
* Walking the ignored list when SIG_IGN is lifted and requeue the
signals to the actual signal lists. This allows the signal delivery
code to rearm the timer.
This also required to consolidate the signal delivery rules so they are
consistent across all situations. With that all self test scenarios
finally succeed.
- Core infrastructure for VFS multigrain timestamping
This is required to allow the kernel to use coarse grained time stamps
by default and switch to fine grained time stamps when inode attributes
are actively observed via getattr().
These changes have been provided to the VFS tree as well, so that the
VFS specific infrastructure could be built on top.
- Cleanup and consolidation of the sleep() infrastructure
* Move all sleep and timeout functions into one file
* Rework udelay() and ndelay() into proper documented inline functions
and replace the hardcoded magic numbers by proper defines.
* Rework the fsleep() implementation to take the reality of the timer
wheel granularity on different HZ values into account. Right now the
boundaries are hard coded time ranges which fail to provide the
requested accuracy on different HZ settings.
* Update documentation for all sleep/timeout related functions and fix
up stale documentation links all over the place
* Fixup a few usage sites
- Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks
A system can have multiple PTP clocks which are participating in
seperate and independent PTP clock domains. So far the kernel only
considers the PTP clock which is based on CLOCK TAI relevant as that's
the clock which drives the timekeeping adjustments via the various user
space daemons through adjtimex(2).
The non TAI based clock domains are accessible via the file descriptor
based posix clocks, but their usability is very limited. They can't be
accessed fast as they always go all the way out to the hardware and
they cannot be utilized in the kernel itself.
As Time Sensitive Networking (TSN) gains traction it is required to
provide fast user and kernel space access to these clocks.
The approach taken is to utilize the timekeeping and adjtimex(2)
infrastructure to provide this access in a similar way how the kernel
provides access to clock MONOTONIC, REALTIME etc.
Instead of creating a duplicated infrastructure this rework converts
timekeeping and adjtimex(2) into generic functionality which operates
on pointers to data structures instead of using static variables.
This allows to provide time accessors and adjtimex(2) functionality for
the independent PTP clocks in a subsequent step.
- Consolidate hrtimer initialization
hrtimers are set up by initializing the data structure and then
seperately setting the callback function for historical reasons.
That's an extra unnecessary step and makes Rust support less straight
forward than it should be.
Provide a new set of hrtimer_setup*() functions and convert the core
code and a few usage sites of the less frequently used interfaces over.
The bulk of the htimer_init() to hrtimer_setup() conversion is already
prepared and scheduled for the next merge window.
- Drivers:
* Ensure that the global timekeeping clocksource is utilizing the
cluster 0 timer on MIPS multi-cluster systems.
Otherwise CPUs on different clusters use their cluster specific
clocksource which is not guaranteed to be synchronized with other
clusters.
* Mostly boring cleanups, fixes, improvements and code movement
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7kPITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZKkD/9OUL6fOJrDUmOYBa4QVeMyfTef4EaL
tvwIMM/29XQFeiq3xxCIn+EMnHjXn2lvIhYGQ7GKsbKYwvJ7ZBDpQb+UMhZ2nKI9
6D6BP6WomZohKeH2fZbJQAdqOi3KRYdvQdIsVZUexkqiaVPphRvOH9wOr45gHtZM
EyMRSotPlQTDqcrbUejDMEO94GyjDCYXRsyATLxjmTzL/N4xD4NRIiotjM2vL/a9
8MuCgIhrKUEyYlFoOxxeokBsF3kk3/ez2jlG9b/N8VLH3SYIc2zgL58FBgWxlmgG
bY71nVG3nUgEjxBd2dcXAVVqvb+5widk8p6O7xxOAQKTLMcJ4H0tQDkMnzBtUzvB
DGAJDHAmAr0g+ja9O35Pkhunkh4HYFIbq0Il4d1HMKObhJV0JumcKuQVxrXycdm3
UZfq3seqHsZJQbPgCAhlFU0/2WWScocbee9bNebGT33KVwSp5FoVv89C/6Vjb+vV
Gusc3thqrQuMAZW5zV8g4UcBAA/xH4PB0I+vHib+9XPZ4UQ7/6xKl2jE0kd5hX7n
AAUeZvFNFqIsY+B6vz+Jx/yzyM7u5cuXq87pof5EHVFzv56lyTp4ToGcOGYRgKH5
JXeYV1OxGziSDrd5vbf9CzdWMzqMvTefXrHbWrjkjhNOe8E1A8O88RZ5uRKZhmSw
hZZ4hdM9+3T7cg==
=2VC6
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers
posix-timers are currently auto-rearmed by the kernel when the
signal of the timer is ignored so that the timer signal can be
delivered once the corresponding signal is unignored.
This requires to throttle the timer to prevent a DoS by small
intervals and keeps the system pointlessly out of low power states
for no value. This is a long standing non-trivial problem due to
the lock order of posix-timer lock and the sighand lock along with
life time issues as the timer and the sigqueue have different life
time rules.
Cure this by:
- Embedding the sigqueue into the timer struct to have the same
life time rules. Aside of that this also avoids the lookup of
the timer in the signal delivery and rearm path as it's just a
always valid container_of() now.
- Queuing ignored timer signals onto a seperate ignored list.
- Moving queued timer signals onto the ignored list when the
signal is switched to SIG_IGN before it could be delivered.
- Walking the ignored list when SIG_IGN is lifted and requeue the
signals to the actual signal lists. This allows the signal
delivery code to rearm the timer.
This also required to consolidate the signal delivery rules so they
are consistent across all situations. With that all self test
scenarios finally succeed.
- Core infrastructure for VFS multigrain timestamping
This is required to allow the kernel to use coarse grained time
stamps by default and switch to fine grained time stamps when inode
attributes are actively observed via getattr().
These changes have been provided to the VFS tree as well, so that
the VFS specific infrastructure could be built on top.
- Cleanup and consolidation of the sleep() infrastructure
- Move all sleep and timeout functions into one file
- Rework udelay() and ndelay() into proper documented inline
functions and replace the hardcoded magic numbers by proper
defines.
- Rework the fsleep() implementation to take the reality of the
timer wheel granularity on different HZ values into account.
Right now the boundaries are hard coded time ranges which fail
to provide the requested accuracy on different HZ settings.
- Update documentation for all sleep/timeout related functions
and fix up stale documentation links all over the place
- Fixup a few usage sites
- Rework of timekeeping and adjtimex(2) to prepare for multiple PTP
clocks
A system can have multiple PTP clocks which are participating in
seperate and independent PTP clock domains. So far the kernel only
considers the PTP clock which is based on CLOCK TAI relevant as
that's the clock which drives the timekeeping adjustments via the
various user space daemons through adjtimex(2).
The non TAI based clock domains are accessible via the file
descriptor based posix clocks, but their usability is very limited.
They can't be accessed fast as they always go all the way out to
the hardware and they cannot be utilized in the kernel itself.
As Time Sensitive Networking (TSN) gains traction it is required to
provide fast user and kernel space access to these clocks.
The approach taken is to utilize the timekeeping and adjtimex(2)
infrastructure to provide this access in a similar way how the
kernel provides access to clock MONOTONIC, REALTIME etc.
Instead of creating a duplicated infrastructure this rework
converts timekeeping and adjtimex(2) into generic functionality
which operates on pointers to data structures instead of using
static variables.
This allows to provide time accessors and adjtimex(2) functionality
for the independent PTP clocks in a subsequent step.
- Consolidate hrtimer initialization
hrtimers are set up by initializing the data structure and then
seperately setting the callback function for historical reasons.
That's an extra unnecessary step and makes Rust support less
straight forward than it should be.
Provide a new set of hrtimer_setup*() functions and convert the
core code and a few usage sites of the less frequently used
interfaces over.
The bulk of the htimer_init() to hrtimer_setup() conversion is
already prepared and scheduled for the next merge window.
- Drivers:
- Ensure that the global timekeeping clocksource is utilizing the
cluster 0 timer on MIPS multi-cluster systems.
Otherwise CPUs on different clusters use their cluster specific
clocksource which is not guaranteed to be synchronized with
other clusters.
- Mostly boring cleanups, fixes, improvements and code movement"
* tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
posix-timers: Fix spurious warning on double enqueue versus do_exit()
clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties
clocksource/drivers/gpx: Remove redundant casts
clocksource/drivers/timer-ti-dm: Fix child node refcount handling
dt-bindings: timer: actions,owl-timer: convert to YAML
clocksource/drivers/ralink: Add Ralink System Tick Counter driver
clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource
clocksource/drivers/timer-ti-dm: Don't fail probe if int not found
clocksource/drivers:sp804: Make user selectable
clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions
hrtimers: Delete hrtimer_init_on_stack()
alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
io_uring: Switch to use hrtimer_setup_on_stack()
sched/idle: Switch to use hrtimer_setup_on_stack()
hrtimers: Delete hrtimer_init_sleeper_on_stack()
wait: Switch to use hrtimer_setup_sleeper_on_stack()
timers: Switch to use hrtimer_setup_sleeper_on_stack()
net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()
futex: Switch to use hrtimer_setup_sleeper_on_stack()
fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
...
- Update the amd-pstate driver to set the initial scaling frequency
policy lower bound to be the lowest non-linear frequency (Dhananjay
Ugwekar).
- Enable amd-pstate by default on servers starting with newer AMD Epyc
processors (Swapnil Sapkal).
- Align more codepaths between shared memory and MSR designs in
amd-pstate (Dhananjay Ugwekar).
- Clean up amd-pstate code to rename functions and remove redundant
calls (Dhananjay Ugwekar, Mario Limonciello).
- Do other assorted fixes and cleanups in amd-pstate (Dhananjay Ugwekar
and Mario Limonciello).
- Change the Balance-performance EPP value for Granite Rapids in the
intel_pstate driver to a more performance-biased one (Srinivas
Pandruvada).
- Simplify MSR read on the boot CPU in the ACPI cpufreq driver (Chang
S. Bae).
- Ensure sugov_eas_rebuild_sd() is always called when sugov_init()
succeeds to always enforce sched domains rebuild in case EAS needs
to be enabled (Christian Loehle).
- Switch cpufreq back to platform_driver::remove() (Uwe Kleine-König).
- Use proper frequency unit names in cpufreq (Marcin Juszkiewicz).
- Add a built-in idle states table for Granite Rapids Xeon D to the
intel_idle driver (Artem Bityutskiy).
- Fix some typos in comments in the cpuidle core and drivers (Shen
Lichuan).
- Remove iowait influence from the menu cpuidle governor (Christian
Loehle).
- Add min/max available performance state limits to the Energy Model
management code (Lukasz Luba).
- Update pm-graph to v5.13 (Todd Brandt).
- Add documentation for some recently introduced cpupower utility
options (Tor Vic).
- Make cpupower inform users where cpufreq-bench.conf should be located
when opening it fails (Peng Fan).
- Allow overriding cross-compiling env params in cpupower (Peng Fan).
- Add compile_commands.json to .gitignore in cpupower (John B. Wyatt
IV).
- Improve disable c_state block in cpupower bindings and add a test to
confirm that CPU state is disabled to it (John B. Wyatt IV).
- Add Chinese Simplified translation to cpupower (Kieran Moy).
- Add checks for xgettext and msgfmt to cpupower (Siddharth Menon).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmc3r6sSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxQMUQALNEbh/Ko1d+avq0sfvyPw18BZjEiQw7
M+L0GydLW6tXLYOrD+ZTASksdDhHbK0iuFr1Gca2cZi0Dl+1XF9sy70ITTqzCDIA
8qj1JrPmRYI0KXCfiSSke0W9fU18IdxVX3I7XezVqBl0ICzsroN5wliCkmEnVOU9
LQkw0fyYr7gev4GFEGSJ7WzfPxci0d6J9pYnafFlDEE28WpKz/cyOzYuSghX5lmG
ISHIVNIM6lqNgXyQirConvhrlg60XAyw5k5jqAYZbe78T+dqhH7lr9sDi7c4XxkG
syeiOOyjpiBMZv1rSjIUapi8AfJHyqH7B6KyTgiulIy31x8Dji62925B63CSahkM
AminAq0lYkqbhIcqEr4sW0JQ/oW3iX4cZ3TJXTUL+vFByR0ZF81tgQcXufhrcvBs
ViNugcX0q1vDX3lZsm9L6UHXN2yhUb36sgreUvbGfwnE79tuR/eUnAukTWBfXau/
TWnyDiQn1CjZcfHB+YAPYZNyUHHqjoIJwzfJLwnsaHgFA80YcSwfSC9kcogCawK1
NCyfs29lAccWsrOul5iARJu8pLw1X//UfDEmVNrBD+1hveKYMrjjiQXnPoVVnNhc
J5T2q5S1QeO05+wf8WaZ7MbRNzHLj0A3gYHSVPWNclxFwsQjqCHHZS2qz8MTX+f6
W6/eZuvmMbG7
=w8QT
-----END PGP SIGNATURE-----
Merge tag 'pm-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"The amd-pstate cpufreq driver gets the majority of changes this time.
They are mostly fixes and cleanups, but one of them causes it to
become the default cpufreq driver on some AMD server platforms.
Apart from that, the menu cpuidle governor is modified to not use
iowait any more, the intel_idle gets a custom C-states table for
Granite Rapids Xeon D, and the intel_pstate driver will use a more
aggressive Balance- performance default EPP value on Granite Rapids
now.
There are also some fixes, cleanups and tooling updates.
Specifics:
- Update the amd-pstate driver to set the initial scaling frequency
policy lower bound to be the lowest non-linear frequency (Dhananjay
Ugwekar)
- Enable amd-pstate by default on servers starting with newer AMD
Epyc processors (Swapnil Sapkal)
- Align more codepaths between shared memory and MSR designs in
amd-pstate (Dhananjay Ugwekar)
- Clean up amd-pstate code to rename functions and remove redundant
calls (Dhananjay Ugwekar, Mario Limonciello)
- Do other assorted fixes and cleanups in amd-pstate (Dhananjay
Ugwekar and Mario Limonciello)
- Change the Balance-performance EPP value for Granite Rapids in the
intel_pstate driver to a more performance-biased one (Srinivas
Pandruvada)
- Simplify MSR read on the boot CPU in the ACPI cpufreq driver (Chang
S. Bae)
- Ensure sugov_eas_rebuild_sd() is always called when sugov_init()
succeeds to always enforce sched domains rebuild in case EAS needs
to be enabled (Christian Loehle)
- Switch cpufreq back to platform_driver::remove() (Uwe Kleine-König)
- Use proper frequency unit names in cpufreq (Marcin Juszkiewicz)
- Add a built-in idle states table for Granite Rapids Xeon D to the
intel_idle driver (Artem Bityutskiy)
- Fix some typos in comments in the cpuidle core and drivers (Shen
Lichuan)
- Remove iowait influence from the menu cpuidle governor (Christian
Loehle)
- Add min/max available performance state limits to the Energy Model
management code (Lukasz Luba)
- Update pm-graph to v5.13 (Todd Brandt)
- Add documentation for some recently introduced cpupower utility
options (Tor Vic)
- Make cpupower inform users where cpufreq-bench.conf should be
located when opening it fails (Peng Fan)
- Allow overriding cross-compiling env params in cpupower (Peng Fan)
- Add compile_commands.json to .gitignore in cpupower (John B. Wyatt
IV)
- Improve disable c_state block in cpupower bindings and add a test
to confirm that CPU state is disabled to it (John B. Wyatt IV)
- Add Chinese Simplified translation to cpupower (Kieran Moy)
- Add checks for xgettext and msgfmt to cpupower (Siddharth Menon)"
* tag 'pm-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (38 commits)
cpufreq: intel_pstate: Update Balance-performance EPP for Granite Rapids
cpufreq: ACPI: Simplify MSR read on the boot CPU
sched/cpufreq: Ensure sd is rebuilt for EAS check
intel_idle: add Granite Rapids Xeon D support
PM: EM: Add min/max available performance state limits
cpufreq/amd-pstate: Move registration after static function call update
cpufreq/amd-pstate: Push adjust_perf vfunc init into cpu_init
cpufreq/amd-pstate: Align offline flow of shared memory and MSR based systems
cpufreq/amd-pstate: Call cppc_set_epp_perf in the reenable function
cpufreq/amd-pstate: Do not attempt to clear MSR_AMD_CPPC_ENABLE
cpufreq/amd-pstate: Rename functions that enable CPPC
cpufreq/amd-pstate-ut: Add fix for min freq unit test
amd-pstate: Switch to amd-pstate by default on some Server platforms
amd-pstate: Set min_perf to nominal_perf for active mode performance gov
cpufreq/amd-pstate: Remove the redundant amd_pstate_set_driver() call
cpufreq/amd-pstate: Remove the switch case in amd_pstate_init()
cpufreq/amd-pstate: Call amd_pstate_set_driver() in amd_pstate_register_driver()
cpufreq/amd-pstate: Call amd_pstate_register() in amd_pstate_init()
cpufreq/amd-pstate: Set the initial min_freq to lowest_nonlinear_freq
cpufreq/amd-pstate: Remove the redundant verify() function
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzchMwAKCRCRxhvAZXjc
okICAP4h6tDl7dgTv8GkL0tgaHi/36m+ilctXbEtIe9fbkc/fQD8D5t6jYaz47gu
zVY7qOrtQOQ/diNavzxyky99Uh3dKgo=
=lwkw
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.usercopy' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull copy_struct_to_user helper from Christian Brauner:
"This adds a copy_struct_to_user() helper which is a companion helper
to the already widely used copy_struct_from_user().
It copies a struct from kernel space to userspace, in a way that
guarantees backwards-compatibility for struct syscall arguments as
long as future struct extensions are made such that all new fields are
appended to the old struct, and zeroed-out new fields have the same
meaning as the old struct.
The first user is sched_getattr() system call but the new extensible
pidfs ioctl will be ported to it as well"
* tag 'vfs-6.13.usercopy' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
sched_getattr: port to copy_struct_to_user
uaccess: add copy_struct_to_user helper
ops.cpu_acquire() was being invoked with the wrong kfunc mask allowing the
operation to call kfuncs which shouldn't be allowed. Fix it by using
SCX_KF_REST instead, which is trivial and low risk.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZzamXw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGRReAP4/JQ1mKkJv+9nTZkW9OcFFHGVVhrprOUEEFk5j
pmHwPAD8DTBMMS/BCQOoXDdiB9uU7ut6M8VdsIj1jmJkMja+eQI=
=942J
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.12-rc7-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fix from Tejun Heo:
"One more fix for v6.12-rc7
ops.cpu_acquire() was being invoked with the wrong kfunc mask allowing
the operation to call kfuncs which shouldn't be allowed. Fix it by
using SCX_KF_REST instead, which is trivial and low risk"
* tag 'sched_ext-for-6.12-rc7-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: ops.cpu_acquire() should be called with SCX_KF_REST
scx_next_task_picked() has been replaced with siwtch_class(), but comment
is still referencing old one, so replace it.
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
ops.cpu_acquire() is currently called with 0 kf_maks which is interpreted as
SCX_KF_UNLOCKED which allows all unlocked kfuncs, but ops.cpu_acquire() is
called from balance_one() under the rq lock and should only be allowed call
kfuncs that are safe under the rq lock. Update it to use SCX_KF_REST.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Zhao Mengmeng <zhaomzhao@126.com>
Link: http://lkml.kernel.org/r/ZzYvf2L3rlmjuKzh@slm.duckdns.org
Fixes: 245254f708 ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
Ensure sugov_eas_rebuild_sd() is always called when sugov_init()
succeeds. The out goto initialized sugov without forcing the rebuild.
Previously the missing call to sugov_eas_rebuild_sd() could lead to EAS
not being enabled on boot when it should have been, because it requires
all policies to be controlled by schedutil while they might not have
been initialized yet.
Fixes: e7a1b32e43 ("cpufreq: Rebuild sched-domains when removing cpufreq driver")
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/35e572d9-1152-406a-9e34-2525f7548af9@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- The fair sched class currently has a bug where its balance() returns true
telling the sched core that it has tasks to run but then NULL from
pick_task(). This makes sched core call sched_ext's pick_task() without
preceding balance() which can lead to stalls in partial mode. For now,
work around by detecting the condition and forcing the CPU to go through
another scheduling cycle.
- Add a missing newline to an error message and fix drgn introspection tool
which went out of sync.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZzI8sw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGb5KAP40b/o6TyAFDG+Hn6GxyxQT7rcAUMXsdB2bcEpg
/IjmzQEAwbHU5KP5vQXV6XHv+2V7Rs7u6ZqFtDnL88N0A9hf3wk=
=7hL8
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- The fair sched class currently has a bug where its balance() returns
true telling the sched core that it has tasks to run but then NULL
from pick_task(). This makes sched core call sched_ext's pick_task()
without preceding balance() which can lead to stalls in partial mode.
For now, work around by detecting the condition and forcing the CPU
to go through another scheduling cycle.
- Add a missing newline to an error message and fix drgn introspection
tool which went out of sync.
* tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type
sched_ext: Add a missing newline at the end of an error message
In sched_ext API, a repeatedly reported pain point is the overuse of the
verb "dispatch" and confusion around "consume":
- ops.dispatch()
- scx_bpf_dispatch[_vtime]()
- scx_bpf_consume()
- scx_bpf_dispatch[_vtime]_from_dsq*()
This overloading of the term is historical. Originally, there were only
built-in DSQs and moving a task into a DSQ always dispatched it for
execution. Using the verb "dispatch" for the kfuncs to move tasks into these
DSQs made sense.
Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be
able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was
from a non-local DSQ to a local DSQ and this operation was named "consume".
This was already confusing as a task could be dispatched to a user DSQ from
ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch().
Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse
as "dispatch" in this context meant moving a task to an arbitrary DSQ from a
user DSQ.
Clean up the API with the following renames:
1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]()
2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local()
3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
This patch performs the third set of renames. Compatibility is maintained
by:
- The previous kfunc names are still provided by the kernel so that old
binaries can run. Kernel generates a warning when the old names are used.
- compat.bpf.h provides wrappers for the new names which automatically fall
back to the old names when running on older kernels. They also trigger
build error if old names are used for new builds.
- scx_bpf_dispatch[_vtime]_from_dsq*() were already wrapped in __COMPAT
macros as they were introduced during v6.12 cycle. Wrap new API in
__COMPAT macros too and trigger build errors on both __COMPAT prefixed and
naked usages of the old names.
The compat features will be dropped after v6.15.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Johannes Bechberger <me@mostlynerdless.de>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Ming Yang <yougmark94@gmail.com>
In sched_ext API, a repeatedly reported pain point is the overuse of the
verb "dispatch" and confusion around "consume":
- ops.dispatch()
- scx_bpf_dispatch[_vtime]()
- scx_bpf_consume()
- scx_bpf_dispatch[_vtime]_from_dsq*()
This overloading of the term is historical. Originally, there were only
built-in DSQs and moving a task into a DSQ always dispatched it for
execution. Using the verb "dispatch" for the kfuncs to move tasks into these
DSQs made sense.
Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be
able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was
from a non-local DSQ to a local DSQ and this operation was named "consume".
This was already confusing as a task could be dispatched to a user DSQ from
ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch().
Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse
as "dispatch" in this context meant moving a task to an arbitrary DSQ from a
user DSQ.
Clean up the API with the following renames:
1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]()
2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local()
3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
This patch performs the second rename. Compatibility is maintained by:
- The previous kfunc names are still provided by the kernel so that old
binaries can run. Kernel generates a warning when the old names are used.
- compat.bpf.h provides wrappers for the new names which automatically fall
back to the old names when running on older kernels. They also trigger
build error if old names are used for new builds.
The compat features will be dropped after v6.15.
v2: Comment and documentation updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Johannes Bechberger <me@mostlynerdless.de>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Ming Yang <yougmark94@gmail.com>
In sched_ext API, a repeatedly reported pain point is the overuse of the
verb "dispatch" and confusion around "consume":
- ops.dispatch()
- scx_bpf_dispatch[_vtime]()
- scx_bpf_consume()
- scx_bpf_dispatch[_vtime]_from_dsq*()
This overloading of the term is historical. Originally, there were only
built-in DSQs and moving a task into a DSQ always dispatched it for
execution. Using the verb "dispatch" for the kfuncs to move tasks into these
DSQs made sense.
Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be
able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was
from a non-local DSQ to a local DSQ and this operation was named "consume".
This was already confusing as a task could be dispatched to a user DSQ from
ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch().
Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse
as "dispatch" in this context meant moving a task to an arbitrary DSQ from a
user DSQ.
Clean up the API with the following renames:
1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]()
2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local()
3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
This patch performs the first set of renames. Compatibility is maintained
by:
- The previous kfunc names are still provided by the kernel so that old
binaries can run. Kernel generates a warning when the old names are used.
- compat.bpf.h provides wrappers for the new names which automatically fall
back to the old names when running on older kernels. They also trigger
build error if old names are used for new builds.
The compat features will be dropped after v6.15.
v2: Documentation updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Johannes Bechberger <me@mostlynerdless.de>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Ming Yang <yougmark94@gmail.com>
sched_ext dispatches tasks from the BPF scheduler from balance_scx() and
thus every pick_task_scx() call must be preceded by balance_scx(). While
this usually holds, due to a bug, there are cases where the fair class's
balance() returns true indicating that it has tasks to run on the CPU and
thus terminating balance() calls but fails to actually find the next task to
run when pick_task() is called. In such cases, pick_task_scx() can be called
without preceding balance_scx().
Detect this condition using SCX_RQ_BAL_PENDING flags. If detected, keep
running the previous task if possible and avoid stalling from entering idle
without balancing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/Ztj_h5c2LYsdXYbA@slm.duckdns.org
4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
added four kfuncs for dispatching while iterating. They are allowed from the
dispatch and unlocked contexts but two of the kfuncs were only added in the
dispatch section. Add missing declarations in the unlocked section.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Update the comments in sched_ext_ops to clarify this table is for
a BPF scheduler and a userland scheduler should also rely on the
sched_ext_ops table through the BPF scheduler.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
On 2 x Intel Sapphire Rapids machines with 224 logical CPUs, a poorly
behaving BPF scheduler can live-lock the system by making multiple CPUs bang
on the same DSQ to the point where soft-lockup detection triggers before
SCX's own watchdog can take action. It also seems possible that the machine
can be live-locked enough to prevent scx_ops_helper, which is an RT task,
from running in a timely manner.
Implement scx_softlockup() which is called when three quarters of
soft-lockup threshold has passed. The function immediately enables the ops
breather and triggers an ops error to initiate ejection of the BPF
scheduler.
The previous and this patch combined enable the kernel to reliably recover
the system from live-lock conditions that can be triggered by a poorly
behaving BPF scheduler on Intel dual socket systems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly
banging on the same DSQ on a large NUMA system to the point where switching
to the bypass mode can take a long time. Turning on the bypass mode requires
dequeueing and re-enqueueing currently runnable tasks, if the DSQs that they
are on are live-locked, this can take tens of seconds cascading into other
failures. This was observed on 2 x Intel Sapphire Rapids machines with 224
logical CPUs.
Inject artifical delays while the bypass mode is switching to guarantee
timely completion.
While at it, move __scx_ops_bypass_lock into scx_ops_bypass() and rename it
to bypass_lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Valentin Andrei <vandrei@meta.com>
Reported-by: Patrick Lu <patlu@meta.com>
Pull sched_ext/for-6.12-fixes to receive 0e7ffff1b8 ("scx: Fix raciness in
scx_ops_bypass()"). Planned updates for scx_ops_bypass() depends on it.
Signed-off-by: Tejun Heo <tj@kernel.org>
There is no reason to use a bitwise AND when checking the conditions to
enable NUMA optimization for the built-in CPU idle selection policy, so
use a logical AND instead.
Fixes: f6ce6b9493 ("sched_ext: Do not enable LLC/NUMA optimizations when domains overlap")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/lkml/20241108181753.GA2681424@thelio-3990X/
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When the LLC and NUMA domains fully overlap, enabling both optimizations
in the built-in idle CPU selection policy is redundant, as it leads to
searching for an idle CPU within the same domain twice.
Likewise, if all online CPUs are within a single LLC domain, LLC
optimization is unnecessary.
Therefore, detect overlapping domains and enable topology optimizations
only when necessary.
Moreover, rely on the online CPUs for this detection logic, instead of
using the possible CPUs.
Fixes: 860a45219b ("sched_ext: Introduce NUMA awareness to the default idle selection policy")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
hrtimer_setup_on_stack() takes the callback function pointer as argument
and initializes the timer completely.
Replace hrtimer_init_on_stack() and the open coded initialization of
hrtimer::function with the new setup mechanism.
The conversion was done with Coccinelle.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/17f9421fed6061df4ad26a4cc91873d2c078cb0f.1730386209.git.namcao@linutronix.de
In order to enable PREEMPT_DYNAMIC for PREEMPT_RT, remove PREEMPT_RT
from the 'Preemption Model' choice. Strictly speaking PREEMPT_RT is
not a change in how preemption works, but rather it makes a ton more
code preemptible.
Notably, take away NONE and VOLUNTARY options for PREEMPT_RT, they make
no sense (but are techincally possible).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20241007075055.441622332@infradead.org
Change fair to use resched_curr_lazy(), which, when the lazy
preemption model is selected, will set TIF_NEED_RESCHED_LAZY.
This LAZY bit will be promoted to the full NEED_RESCHED bit on tick.
As such, the average delay between setting LAZY and actually
rescheduling will be TICK_NSEC/2.
In short, Lazy preemption will delay preemption for fair class but
will function as Full preemption for all the other classes, most
notably the realtime (RR/FIFO/DEADLINE) classes.
The goal is to bridge the performance gap with Voluntary, such that we
might eventually remove that option entirely.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20241007075055.331243614@infradead.org
Add the basic infrastructure to split the TIF_NEED_RESCHED bit in two.
Either bit will cause a resched on return-to-user, but only
TIF_NEED_RESCHED will drive IRQ preemption.
No behavioural change intended.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20241007075055.219540785@infradead.org
Instead of solving the underlying problem of the double invocation of
__sched_fork() for idle tasks, sched-ext decided to hack around the issue
by partially clearing out the entity struct to preserve the already
enqueued node. A provided analysis and solution has been ignored for four
months.
Now that someone else has taken care of cleaning it up, remove the
disgusting hack and clear out the full structure. Remove the comment in the
structure declaration as well, as there is no requirement for @node being
the last element anymore.
Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/87ldy82wkc.ffs@tglx
Idle tasks are initialized via __sched_fork() twice:
fork_idle()
copy_process()
sched_fork()
__sched_fork()
init_idle()
__sched_fork()
Instead of cleaning this up, sched_ext hacked around it. Even when analyis
and solution were provided in a discussion, nobody cared to clean this up.
init_idle() is also invoked from sched_init() to initialize the boot CPU's
idle task, which requires the __sched_fork() invocation. But this can be
trivially solved by invoking __sched_fork() before init_idle() in
sched_init() and removing the __sched_fork() invocation from init_idle().
Do so and clean up the comments explaining this historical leftover.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241028103142.359584747@linutronix.de
- Plug a race between pick_next_task_fair() and try_to_wake_up() where
both try to write to the same task, even though both paths hold a
runqueue lock, but obviously from different runqueues.
The problem is that the store to task::on_rq in __block_task() is
visible to try_to_wake_up() which assumes that the task is not queued.
Both sides then operate on the same task.
Cure it by rearranging __block_task() so the the store to task::on_rq is
the last operation on the task.
- Prevent a potential NULL pointer dereference in task_numa_work()
task_numa_work() iterates the VMAs of a process. A concurrent unmap of
the address space can result in a NULL pointer return from vma_next()
which is unchecked.
Add the missing NULL pointer check to prevent this.
- Operate on the correct scheduler policy in task_should_scx()
task_should_scx() returns true when a task should be handled by sched
EXT. It checks the tasks scheduling policy.
This fails when the check is done before a policy has been set.
Cure it by handing the policy into task_should_scx() so it operates
on the requested value.
- Add the missing handling of sched EXT in the delayed dequeue
mechanism. This was simply forgotten.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmcnTqATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoX/aD/4yvskeG9i7wAj2NOdDTAs1K0gLURt+
nHDb1YkoIOXOfanaG7ZdBWb4sYnsnLX/KIhVsDQiXACFr6G0IjQ1zaN1iRtEkH79
5BfVi98gAXdFU3y+EGqyaqiAp7MFOBTmsfJi5095fX0L+2aViSAjDEvHzvvC/hXD
tmq47vFQEgIZPSxljEaKPaNmyDM+geusv5lX/lABH5MG0fYsT85VV6BQ2T1LsN1O
WFBLD/uPEOSXumyZW8nV8yE2PioLDJz8W+uSnr38/HCH99mtJApqZyskaagKtr0g
vLhOfoaYVR/j5ODUk6LExZ8zy140zDzUWzC5+RNnyb8jQf/Lx88fTNZY8/Wsm5m9
oKtoiGzkL0LG/c05Cjh/vqReK26qILK4+ynDGaowDmTlUTS2jeNZL1ABlIwWkaLP
5TDegJPkoUA1Z4YegxtRFROGHp1J+lfbqz537bghMaqdJXMaG84qjSszsPz9NbS9
F7K63JKjfXAF6N8bhKvZk4jAbD97EYf3B0o8E69TjoZxaiuKf00xK7HGWmuQD3u3
lOHkfIZzf5b7ELNgcketCYsbJvxbI4oQrp/9V425ORSr1Ih2GxCT51/x/NlFHoEH
ujIjAe2YQyLhb26M0RG8Xao3BPT7RGMR058C8lwxtPLuPNIwB8MqCsXmU9xlEypg
iexGnsj6zXTddg==
=4mie
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
- Plug a race between pick_next_task_fair() and try_to_wake_up() where
both try to write to the same task, even though both paths hold a
runqueue lock, but obviously from different runqueues.
The problem is that the store to task::on_rq in __block_task() is
visible to try_to_wake_up() which assumes that the task is not
queued. Both sides then operate on the same task.
Cure it by rearranging __block_task() so the the store to task::on_rq
is the last operation on the task.
- Prevent a potential NULL pointer dereference in task_numa_work()
task_numa_work() iterates the VMAs of a process. A concurrent unmap
of the address space can result in a NULL pointer return from
vma_next() which is unchecked.
Add the missing NULL pointer check to prevent this.
- Operate on the correct scheduler policy in task_should_scx()
task_should_scx() returns true when a task should be handled by sched
EXT. It checks the tasks scheduling policy.
This fails when the check is done before a policy has been set.
Cure it by handing the policy into task_should_scx() so it operates
on the requested value.
- Add the missing handling of sched EXT in the delayed dequeue
mechanism. This was simply forgotten.
* tag 'sched-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/ext: Fix scx vs sched_delayed
sched: Pass correct scheduling policy to __setscheduler_class
sched/numa: Fix the potential null pointer dereference in task_numa_work()
sched: Fix pick_next_task_fair() vs try_to_wake_up() race
- Instances of scx_ops_bypass() could race each other leading to
misbehavior. Fix by protecting the operation with a spinlock.
- selftest and userspace header fixes.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZyF/5Q4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGRi+AP4+jGUz+O1LS0bCNj44Xlr0v6kci5dfJR7TlBv5
hwROcgEA84i7nRq6oJ1IkK7ItLbZYwgZyxqdn0Pgsq+oMWhgAwE=
=R766
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Instances of scx_ops_bypass() could race each other leading to
misbehavior. Fix by protecting the operation with a spinlock.
- selftest and userspace header fixes
* tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix enq_last_no_enq_fails selftest
sched_ext: Make cast_mask() inline
scx: Fix raciness in scx_ops_bypass()
scx: Fix exit selftest to use custom DSQ
sched_ext: Fix function pointer type mismatches in BPF selftests
selftests/sched_ext: add order-only dependency of runner.o on BPFOBJ
Similarly to commit dfa4ed29b1 ("sched_ext: Introduce LLC awareness to
the default idle selection policy"), extend the built-in idle CPU
selection policy to also prioritize CPUs within the same NUMA node.
With this change applied, the built-in CPU idle selection policy follows
this logic:
- always prioritize CPUs from fully idle SMT cores,
- select the same CPU if possible,
- select a CPU within the same LLC domain,
- select a CPU within the same NUMA node.
Both NUMA and LLC awareness features are enabled only when the system
has multiple NUMA nodes or multiple LLC domains.
In the future, we may want to improve the NUMA node selection to account
the node distance from prev_cpu. Currently, the logic only tries to keep
tasks running on the same NUMA node. If all CPUs within a node are busy,
the next NUMA node is chosen randomly.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 98442f0ccd ("sched: Fix delayed_dequeue vs
switched_from_fair()") overlooked that __setscheduler_prio(), now
__setscheduler_class() relies on p->policy for task_should_scx(), and
moved the call before __setscheduler_params() updates it, causing it
to be using the old p->policy value.
Resolve this by changing task_should_scx() to take the policy itself
instead of a task pointer, such that __sched_setscheduler() can pass
in the updated policy.
Fixes: 98442f0ccd ("sched: Fix delayed_dequeue vs switched_from_fair()")
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
What psi needs to do on each enqueue and dequeue has gotten more
subtle, and the generic sched code trying to distill this into a bool
for the callbacks is awkward.
Pass the flags directly and let psi parse them. For that to work, the
#include "stats.h" (which has the psi callback implementations) needs
to be below the flag definitions in "sched.h". Move that section
further down, next to some of the other accounting stuff.
This also puts the ENQUEUE_SAVE/RESTORE branch behind the psi jump
label, slightly reducing overhead when PSI=y but runtime disabled.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241014144358.GB1021@cmpxchg.org
When running stress-ng-vm-segv test, we found a null pointer dereference
error in task_numa_work(). Here is the backtrace:
[323676.066985] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
......
[323676.067108] CPU: 35 PID: 2694524 Comm: stress-ng-vm-se
......
[323676.067113] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
[323676.067115] pc : vma_migratable+0x1c/0xd0
[323676.067122] lr : task_numa_work+0x1ec/0x4e0
[323676.067127] sp : ffff8000ada73d20
[323676.067128] x29: ffff8000ada73d20 x28: 0000000000000000 x27: 000000003e89f010
[323676.067130] x26: 0000000000080000 x25: ffff800081b5c0d8 x24: ffff800081b27000
[323676.067133] x23: 0000000000010000 x22: 0000000104d18cc0 x21: ffff0009f7158000
[323676.067135] x20: 0000000000000000 x19: 0000000000000000 x18: ffff8000ada73db8
[323676.067138] x17: 0001400000000000 x16: ffff800080df40b0 x15: 0000000000000035
[323676.067140] x14: ffff8000ada73cc8 x13: 1fffe0017cc72001 x12: ffff8000ada73cc8
[323676.067142] x11: ffff80008001160c x10: ffff000be639000c x9 : ffff8000800f4ba4
[323676.067145] x8 : ffff000810375000 x7 : ffff8000ada73974 x6 : 0000000000000001
[323676.067147] x5 : 0068000b33e26707 x4 : 0000000000000001 x3 : ffff0009f7158000
[323676.067149] x2 : 0000000000000041 x1 : 0000000000004400 x0 : 0000000000000000
[323676.067152] Call trace:
[323676.067153] vma_migratable+0x1c/0xd0
[323676.067155] task_numa_work+0x1ec/0x4e0
[323676.067157] task_work_run+0x78/0xd8
[323676.067161] do_notify_resume+0x1ec/0x290
[323676.067163] el0_svc+0x150/0x160
[323676.067167] el0t_64_sync_handler+0xf8/0x128
[323676.067170] el0t_64_sync+0x17c/0x180
[323676.067173] Code: d2888001 910003fd f9000bf3 aa0003f3 (f9401000)
[323676.067177] SMP: stopping secondary CPUs
[323676.070184] Starting crashdump kernel...
stress-ng-vm-segv in stress-ng is used to stress test the SIGSEGV error
handling function of the system, which tries to cause a SIGSEGV error on
return from unmapping the whole address space of the child process.
Normally this program will not cause kernel crashes. But before the
munmap system call returns to user mode, a potential task_numa_work()
for numa balancing could be added and executed. In this scenario, since the
child process has no vma after munmap, the vma_next() in task_numa_work()
will return a null pointer even if the vma iterator restarts from 0.
Recheck the vma pointer before dereferencing it in task_numa_work().
Fixes: 214dbc4281 ("sched: convert to vma iterator")
Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # v6.2+
Link: https://lkml.kernel.org/r/20241025022208.125527-1-shawnwang@linux.alibaba.com
ops.dispatch() and ops.yield() may be fed a NULL task_struct pointer.
set_arg_maybe_null() is used to tell the verifier that they should be NULL
checked before being dereferenced. BPF now has an a lot prettier way to
express this - tagging arguments in CFI stubs with __nullable. Replace
set_arg_maybe_null() with __nullable CFI stub tags.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
CFI stubs can be used to tag arguments with __nullable (and possibly other
tags in the future) but for that to work the CFI stubs must have names that
are recognized by BPF. Rename them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Rely on the scheduler topology information to implement basic LLC
awareness in the sched_ext build-in idle selection policy.
This allows schedulers using the built-in policy to make more informed
decisions when selecting an idle CPU in systems with multiple LLCs, such
as NUMA systems or chiplet-based architectures, and it helps keep tasks
within the same LLC domain, thereby improving cache locality.
For efficiency, LLC awareness is applied only to tasks that can run on
all the CPUs in the system for now. If a task's affinity is modified
from user space, it's the responsibility of user space to choose the
appropriate optimized scheduling domain.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Update ops.select_cpu() documentation to clarify that this method is not
called for tasks that are restricted to run on a single CPU, as these
tasks do not have the option to select a different CPU.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Syzkaller robot reported KCSAN tripping over the
ASSERT_EXCLUSIVE_WRITER(p->on_rq) in __block_task().
The report noted that both pick_next_task_fair() and try_to_wake_up()
were concurrently trying to write to the same p->on_rq, violating the
assertion -- even though both paths hold rq->__lock.
The logical consequence is that both code paths end up holding a
different rq->__lock. And looking through ttwu(), this is possible
when the __block_task() 'p->on_rq = 0' store is visible to the ttwu()
'p->on_rq' load, which then assumes the task is not queued and
continues to migrate it.
Rearrange things such that __block_task() releases @p with the store
and no code thereafter will use @p again.
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Reported-by: syzbot+0ec1e96c2cdf5c0e512a@syzkaller.appspotmail.com
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20241023093641.GE16066@noisy.programming.kicks-ass.net
In the sched_ext built-in idle CPU selection logic, when handling a
WF_SYNC wakeup, we always attempt to migrate the task to the waker's
CPU, as the waker is expected to yield the CPU after waking the task.
However, it may be preferable to keep the task on its previous CPU if
the waker's CPU is cache-affine.
The same approach is also used by the fair class and in other scx
schedulers, like scx_rusty and scx_bpfland.
Therefore, apply the same logic to the built-in idle CPU selection
policy as well.
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
Conflicts:
kernel/sched/ext.c
There's a context conflict between this upstream commit:
3fdb9ebcec sched_ext: Start schedulers with consistent p->scx.slice values
... and this fix in sched/urgent:
98442f0ccd sched: Fix delayed_dequeue vs switched_from_fair()
Resolve it.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- More issues reported in the enable/disable paths on large machines with
many tasks due to scx_tasks_lock being held too long. Break up the task
iterations.
- Remove ops.select_cpu() dependency in bypass mode so that a misbehaving
implementation can't live-lock the machine by pushing all tasks to few
CPUs in bypass mode.
- Other misc fixes.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZw7SMw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGU6+AQDZKNsyRw0c8+rd7eASAtCzvzBU1KGf3RPrsbNF
pb3IGgEA4/uVBlVny10ZyeKaQ4A3L2G4ROyJwNEygUvhkvbfcgk=
=z7d9
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.12-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- More issues reported in the enable/disable paths on large machines
with many tasks due to scx_tasks_lock being held too long. Break up
the task iterations
- Remove ops.select_cpu() dependency in bypass mode so that a
misbehaving implementation can't live-lock the machine by pushing all
tasks to few CPUs in bypass mode
- Other misc fixes
* tag 'sched_ext-for-6.12-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Remove unnecessary cpu_relax()
sched_ext: Don't hold scx_tasks_lock for too long
sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers
sched_ext: bypass mode shouldn't depend on ops.select_cpu()
sched_ext: Move scx_buildin_idle_enabled check to scx_bpf_select_cpu_dfl()
sched_ext: Start schedulers with consistent p->scx.slice values
Revert "sched_ext: Use shorter slice while bypassing"
sched_ext: use correct function name in pick_task_scx() warning message
selftests: sched_ext: Add sched_ext as proper selftest target
As described in commit b07996c7ab ("sched_ext: Don't hold
scx_tasks_lock for too long"), we're doing a cond_resched() every 32
calls to scx_task_iter_next() to avoid RCU and other stalls. That commit
also added a cpu_relax() to the codepath where we drop and reacquire the
lock, but as Waiman described in [0], cpu_relax() should only be
necessary in busy loops to avoid pounding on a cacheline (or to allow a
hypertwin to more fully utilize a core).
Let's remove the unnecessary cpu_relax().
[0]: https://lore.kernel.org/all/35b3889b-904a-4d26-981f-c8aa1557a7c7@redhat.com/
Cc: Waiman Long <llong@redhat.com>
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Let's define the "scheduling context" as all the scheduler state
in task_struct for the task chosen to run, which we'll call the
donor task, and the "execution context" as all state required to
actually run the task.
Currently both are intertwined in task_struct. We want to
logically split these such that we can use the scheduling
context of the donor task selected to be scheduled, but use
the execution context of a different task to actually be run.
To this purpose, introduce rq->donor field to point to the
task_struct chosen from the runqueue by the scheduler, and will
be used for scheduler state, and preserve rq->curr to indicate
the execution context of the task that will actually be run.
This patch introduces the donor field as a union with curr, so it
doesn't cause the contexts to be split yet, but adds the logic to
handle everything separately.
[add additional comments and update more sched_class code to use
rq::proxy]
[jstultz: Rebased and resolved minor collisions, reworked to use
accessors, tweaked update_curr_common to use rq_proxy fixing rt
scheduling issues]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-8-jstultz@google.com
As we're going to re-use the deactivation logic,
split it into a helper.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-7-jstultz@google.com
This patch consolidates rt and deadline pick_*_task functions to
a task_is_pushable() helper
This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.
[jstultz: split out from larger chain migration patch,
renamed helper function]
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-6-jstultz@google.com
Switch logic that deactivates, sets the task cpu,
and reactivates a task on a different rq to use a
helper that will be later extended to push entire
blocked task chains.
This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.
[jstultz: split out from larger chain migration patch]
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-5-jstultz@google.com
commit 223baf9d17 ("sched: Fix performance regression introduced by mm_cid")
introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
a reference to the concurrency id allocated for each CPU. This reference
expires shortly after a 100ms delay.
These per-CPU references keep the per-mm-cid data cache-local in
situations where threads are running at least once on each CPU within
each 100ms window, thus keeping the per-cpu reference alive.
However, intermittent workloads behaving in bursts spaced by more than
100ms on each CPU exhibit bad cache locality and degraded performance
compared to purely per-cpu data indexing, because concurrency IDs are
allocated over various CPUs and cores, therefore losing cache locality
of the associated data.
Introduce the following changes to improve per-mm-cid cache locality:
- Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
track of which mm_cid value was last used, and use it as a hint to
attempt re-allocating the same concurrency ID the next time this
mm/cpu needs to allocate a concurrency ID,
- Add a per-mm CPUs allowed mask, which keeps track of the union of
CPUs allowed for all threads belonging to this mm. This cpumask is
only set during the lifetime of the mm, never cleared, so it
represents the union of all the CPUs allowed since the beginning of
the mm lifetime (note that the mm_cpumask() is really arch-specific
and tailored to the TLB flush needs, and is thus _not_ a viable
approach for this),
- Add a per-mm nr_cpus_allowed to keep track of the weight of the
per-mm CPUs allowed mask (for fast access),
- Add a per-mm max_nr_cid to keep track of the highest number of
concurrency IDs allocated for the mm. This is used for expanding the
concurrency ID allocation within the upper bound defined by:
min(mm->nr_cpus_allowed, mm->mm_users)
When the next unused CID value reaches this threshold, stop trying
to expand the cid allocation and use the first available cid value
instead.
Spreading allocation to use all the cid values within the range
[ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]
improves cache locality while preserving mm_cid compactness within the
expected user limits,
- In __mm_cid_try_get, only return cid values within the range
[ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
prevents allocating cids above the number of allowed cpus in
rare scenarios where cid allocation races with a concurrent
remote-clear of the per-mm/cpu cid. This improvement is made
possible by the addition of the per-mm CPUs allowed mask,
- In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
t->nr_cpus_allowed. This criterion was really meant to compare
the number of mm->mm_users to the number of CPUs allowed for the
entire mm. Therefore, the prior comparison worked fine when all
threads shared the same CPUs allowed mask, but not so much in
scenarios where those threads have different masks (e.g. each
thread pinned to a single CPU). This improvement is made
possible by the addition of the per-mm CPUs allowed mask.
* Benchmarks
Each thread increments 16kB worth of 8-bit integers in bursts, with
a configurable delay between each thread's execution. Each thread run
one after the other (no threads run concurrently). The order of
thread execution in the sequence is random. The thread execution
sequence begins again after all threads have executed. The 16kB areas
are allocated with rseq_mempool and indexed by either cpu_id, mm_cid
(not cache-local), or cache-local mm_cid. Each thread is pinned to its
own core.
Testing configurations:
8-core/1-L3: Use 8 cores within a single L3
24-core/24-L3: Use 24 cores, 1 core per L3
192-core/24-L3: Use 192 cores (all cores in the system)
384-thread/24-L3: Use 384 HW threads (all HW threads in the system)
Intermittent workload delays between threads: 200ms, 10ms.
Hardware:
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Caches (sum of all):
L1d: 6 MiB (192 instances)
L1i: 6 MiB (192 instances)
L2: 192 MiB (192 instances)
L3: 768 MiB (24 instances)
Each result is an average of 5 test runs. The cache-local speedup
is calculated as: (cache-local mm_cid) / (mm_cid).
Intermittent workload delay: 200ms
per-cpu mm_cid cache-local mm_cid cache-local speedup
(ns) (ns) (ns)
8-core/1-L3 1374 19289 1336 14.4x
24-core/24-L3 2423 26721 1594 16.7x
192-core/24-L3 2291 15826 2153 7.3x
384-thread/24-L3 1874 13234 1907 6.9x
Intermittent workload delay: 10ms
per-cpu mm_cid cache-local mm_cid cache-local speedup
(ns) (ns) (ns)
8-core/1-L3 662 756 686 1.1x
24-core/24-L3 1378 3648 1035 3.5x
192-core/24-L3 1439 10833 1482 7.3x
384-thread/24-L3 1503 10570 1556 6.8x
[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
patch series with a simpler and more general approach. ]
[ This patch applies on top of v6.12-rc1. ]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/
The memory barrier rmb() in generic idle loop do_idle() function is not
needed, it doesn't order any load instruction, just remove it as needless
rmb() can cause performance impact.
The rmb() was introduced by the tglx/history.git commit f2f1b44c75c4
("[PATCH] Remove RCU abuse in cpu_idle()") to order the loads between
cpu_idle_map and pm_idle. It pairs with wmb() in function cpu_idle_wait().
And then with the removal of cpu_idle_state in function cpu_idle() and
wmb() in function cpu_idle_wait() in commit 783e391b7b ("x86: Simplify
cpu_idle_wait"), rmb() no longer has a reason to exist.
After that, commit d166991234 ("idle: Implement generic idle function")
implemented a generic idle function cpu_idle_loop() which resembles the
functionality found in arch/. And it retained the rmb() in generic idle
loop in file kernel/cpu/idle.c.
And at last, commit cf37b6b484 ("sched/idle: Move cpu/idle.c to
sched/idle.c") moved cpu/idle.c to sched/idle.c. And commit c1de45ca83
("sched/idle: Add support for tasks that inject idle") renamed function
cpu_idle_loop() to do_idle().
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241009093745.9504-1-quic_zhonhan@quicinc.com
Sean noted that ever since commit 152e11f6df ("sched/fair: Implement
delayed dequeue") KVM's preemption notifiers have started
mis-classifying preemption vs blocking.
Notably p->on_rq is no longer sufficient to determine if a task is
runnable or blocked -- the aforementioned commit introduces tasks that
remain on the runqueue even through they will not run again, and
should be considered blocked for many cases.
Add the task_is_runnable() helper to classify things and audit all
external users of the p->on_rq state. Also add a few comments.
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Reported-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20241010091843.GK33184@noisy.programming.kicks-ass.net
Since sched_delayed tasks remain queued even after blocking, the load
balancer can migrate them between runqueues while PSI considers them
to be asleep. As a result, it misreads the migration requeue followed
by a wakeup as a double queue:
psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=. set=4
First, call psi_enqueue() after p->sched_class->enqueue_task(). A
wakeup will clear p->se.sched_delayed while a migration will not, so
psi can use that flag to tell them apart.
Then teach psi to migrate any "sleep" state when delayed-dequeue tasks
are being migrated.
Delayed-dequeue tasks can be revived by ttwu_runnable(), which will
call down with a new ENQUEUE_DELAYED. Instead of further complicating
the wakeup conditional in enqueue_task(), identify migration contexts
instead and default to wakeup handling for all other cases.
It's not just the warning in dmesg, the task state corruption causes a
permanent CPU pressure indication, which messes with workload/machine
health monitoring.
Debugged-by-and-original-fix-by: K Prateek Nayak <kprateek.nayak@amd.com>
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Closes: https://lore.kernel.org/lkml/20240830123458.3557-1-spasswolf@web.de/
Closes: https://lore.kernel.org/all/cd67fbcd-d659-4822-bb90-7e8fbb40a856@molgen.mpg.de/
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20241010193712.GC181795@cmpxchg.org
psi_dequeue() in for blocked task expects psi_sched_switch() to clear
the TSK_.*RUNNING PSI flags and set the TSK_IOWAIT flags however
psi_sched_switch() uses "!task_on_rq_queued(prev)" to detect if the task
is blocked or still runnable which is no longer true with DELAY_DEQUEUE
since a blocking task can be left queued on the runqueue.
This can lead to PSI splats similar to:
psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=0 set=4
when the task is requeued since the TSK_RUNNING flag was not cleared
when the task was blocked.
Explicitly communicate that the task was blocked to psi_sched_switch()
even if it was delayed and is still on the runqueue.
[ prateek: Broke off the relevant part from [1], commit message ]
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Closes: https://lore.kernel.org/lkml/20240830123458.3557-1-spasswolf@web.de/
Closes: https://lore.kernel.org/all/cd67fbcd-d659-4822-bb90-7e8fbb40a856@molgen.mpg.de/
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/lkml/20241004123506.GR18071@noisy.programming.kicks-ass.net/ [1]
Commit 2e0199df25 ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue")
and its follow up fixes try to deal with a rather unfortunate
situation where is task is enqueued in a new class, even though it
shouldn't have been. Mostly because the existing ->switched_to/from()
hooks are in the wrong place for this case.
This all led to Paul being able to trigger failures at something like
once per 10k CPU hours of RCU torture.
For now, do the ugly thing and move the code to the right place by
ignoring the switch hooks.
Note: Clean up the whole sched_class::switch*_{to,from}() thing.
Fixes: 2e0199df25 ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241003185037.GA5594@noisy.programming.kicks-ass.net
With KASAN and PREEMPT_RT enabled, calling task_work_add() in
task_tick_mm_cid() may cause the following splat.
[ 63.696416] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[ 63.696416] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 610, name: modprobe
[ 63.696416] preempt_count: 10001, expected: 0
[ 63.696416] RCU nest depth: 1, expected: 1
This problem is caused by the following call trace.
sched_tick() [ acquire rq->__lock ]
-> task_tick_mm_cid()
-> task_work_add()
-> __kasan_record_aux_stack()
-> kasan_save_stack()
-> stack_depot_save_flags()
-> alloc_pages_mpol_noprof()
-> __alloc_pages_noprof()
-> get_page_from_freelist()
-> rmqueue()
-> rmqueue_pcplist()
-> __rmqueue_pcplist()
-> rmqueue_bulk()
-> rt_spin_lock()
The rq lock is a raw_spinlock_t. We can't sleep while holding
it. IOW, we can't call alloc_pages() in stack_depot_save_flags().
The task_tick_mm_cid() function with its task_work_add() call was
introduced by commit 223baf9d17 ("sched: Fix performance regression
introduced by mm_cid") in v6.4 kernel.
Fortunately, there is a kasan_record_aux_stack_noalloc() variant that
calls stack_depot_save_flags() while not allowing it to allocate
new pages. To allow task_tick_mm_cid() to use task_work without
page allocation, a new TWAF_NO_ALLOC flag is added to enable calling
kasan_record_aux_stack_noalloc() instead of kasan_record_aux_stack()
if set. The task_tick_mm_cid() function is modified to add this new flag.
The possible downside is the missing stack trace in a KASAN report due
to new page allocation required when task_work_add_noallloc() is called
which should be rare.
Fixes: 223baf9d17 ("sched: Fix performance regression introduced by mm_cid")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241010014432.194742-1-longman@redhat.com
The deadline server code moved one of the start_hrtick_dl() calls
but dropped the dl specific hrtick_enabled check. This causes hrticks
to get armed even when sched_feat(HRTICK_DL) is false. Fix it.
Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20241004123729.460668-1-pauld@redhat.com
Iterating with scx_task_iter involves scx_tasks_lock and optionally the rq
lock of the task being iterated. Both locks can be released during iteration
and the iteration can be continued after re-grabbing scx_tasks_lock.
Currently, all lock handling is pushed to the caller which is a bit
cumbersome and makes it difficult to add lock-aware behaviors. Make the
scx_task_iter helpers handle scx_tasks_lock.
- scx_task_iter_init/scx_taks_iter_exit() now grabs and releases
scx_task_lock, respectively. Renamed to
scx_task_iter_start/scx_task_iter_stop() to more clearly indicate that
there are non-trivial side-effects.
- Add __ prefix to scx_task_iter_rq_unlock() to indicate that the function
is internal.
- Add scx_task_iter_unlock/relock(). The former drops both rq lock (if held)
and scx_tasks_lock and the latter re-locks only scx_tasks_lock.
This doesn't cause behavior changes and will be used to implement stall
avoidance.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Bypass mode was depending on ops.select_cpu() which can't be trusted as with
the rest of the BPF scheduler. Always enable and use scx_select_cpu_dfl() in
bypass mode.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Move the sanity check from the inner function scx_select_cpu_dfl() to the
exported kfunc scx_bpf_select_cpu_dfl(). This doesn't cause behavior
differences and will allow using scx_select_cpu_dfl() in bypass mode
regardless of scx_builtin_idle_enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
The disable path caps p->scx.slice to SCX_SLICE_DFL. As the field is already
being ignored at this stage during disable, the only effect this has is that
when the next BPF scheduler is loaded, it won't see unreasonable left-over
slices. Ultimately, this shouldn't matter but it's better to start in a
known state. Drop p->scx.slice capping from the disable path and instead
reset it to SCX_SLICE_DFL in the enable path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
This reverts commit 6f34d8d382.
Slice length is ignored while bypassing and tasks are switched on every tick
and thus the patch does not make any difference. The perceived difference
was from test noise.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
pick_next_task_scx() was turned into pick_task_scx() since
commit 753e2836d1 ("sched_ext: Unify regular and core-sched pick
task paths"). Update the outdated message.
Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- ops.enqueue() didn't have a way to tell whether select_task_rq_scx() and
thus ops.select() were skipped. Some schedulers were incorrectly using
SCX_ENQ_WAKEUP. Add SCX_ENQ_CPU_SELECTED and fix scx_qmap using it.
- Remove a spurious WARN_ON_ONCE() in scx_cgroup_exit().
- Fix error information clobbering during load.
- Add missing __weak markers to BPF helper declarations.
- Doc update.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZwWKkA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGelnAQDTA8GSIahTEHKM0c3yXE6K1/M56zo8Spp5OOA7
kXHR3AD/Y0RcXgaCvMI13aozmQWq756gyB6/qczN0+X3jx6wZwI=
=6xbe
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- ops.enqueue() didn't have a way to tell whether select_task_rq_scx()
and thus ops.select() were skipped. Some schedulers were incorrectly
using SCX_ENQ_WAKEUP. Add SCX_ENQ_CPU_SELECTED and fix scx_qmap using
it.
- Remove a spurious WARN_ON_ONCE() in scx_cgroup_exit()
- Fix error information clobbering during load
- Add missing __weak markers to BPF helper declarations
- Doc update
* tag 'sched_ext-for-6.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Documentation: Update instructions for running example schedulers
sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED
sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called
sched/core: Make select_task_rq() take the pointer to wake_flags instead of value
sched_ext: scx_cgroup_exit() may be called without successful scx_cgroup_init()
sched_ext: Improve error reporting during loading
sched_ext: Add __weak markers to BPF helper function decalarations
scx_qmap and other schedulers in the SCX repo are using SCX_ENQ_WAKEUP to
tell whether ops.select_cpu() was called. This is incorrect as
ops.select_cpu() can be skipped in the wakeup path and leads to e.g.
incorrectly skipping direct dispatch for tasks that are bound to a single
CPU.
sched core has been updated to specify ENQUEUE_RQ_SELECTED if
->select_task_rq() was called. Map it to SCX_ENQ_CPU_SELECTED and update
scx_qmap to test it instead of SCX_ENQ_WAKEUP.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
During ttwu, ->select_task_rq() can be skipped if only one CPU is allowed or
migration is disabled. sched_ext schedulers may perform operations such as
direct dispatch from ->select_task_rq() path and it is useful for them to
know whether ->select_task_rq() was skipped in the ->enqueue_task() path.
Currently, sched_ext schedulers are using ENQUEUE_WAKEUP for this purpose
and end up assuming incorrectly that ->select_task_rq() was called for tasks
that are bound to a single CPU or migration disabled.
Make select_task_rq() indicate whether ->select_task_rq() was called by
setting WF_RQ_SELECTED in *wake_flags and make ttwu_do_activate() map that
to ENQUEUE_RQ_SELECTED for ->enqueue_task().
This will be used by sched_ext to fix ->select_task_rq() skip detection.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
This will be used to allow select_task_rq() to indicate whether
->select_task_rq() was called by modifying *wake_flags.
This makes try_to_wake_up() call all functions that take wake_flags with
WF_TTWU set. Previously, only select_task_rq() was. Using the same flags is
more consistent, and, as the flag is only tested by ->select_task_rq()
implementations, it doesn't cause any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
bit_wait_io_timeout has been unused since 2016's
commit 6290602709 ("mm: add PageWaiters indicating tasks are waiting for a page bit")
Remove it.
Signed-off-by: "Dr. David Alan Gilbert" <linux@treblig.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241001234016.231696-1-linux@treblig.org
We do not have RESPECT_SLICE, we only have RUN_TO_PARITY.
Change RESPECT_SLICE to RUN_TO_PARITY, makes it more clear.
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
Link: https://lkml.kernel.org/r/20241001070456.10939-1-shijie@os.amperecomputing.com
When PLACE_LAG is enabled, from the relationship:
vl_i = (W + w_i)*vl'_i / W
we know that if vl'_i(se->vlag) is zero, the vl_i is zero too.
So if se->vlag is zero, there is no need to waste cycles to
do the calculation.
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
Link: https://lkml.kernel.org/r/20241001070021.10626-1-shijie@os.amperecomputing.com
The patch "5e963f2bd46 sched/fair: Commit to EEVDF"
removed the code following the DOUBLE_TICK:
-
- if (!sched_feat(EEVDF) && cfs_rq->nr_running > 1)
- check_preempt_tick(cfs_rq, curr);
The DOUBLE_TICK feature becomes dead code now, so remove it.
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: "Christoph Lameter (Ampere)" <cl@linux.com>
Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
Link: https://lore.kernel.org/r/20241001065451.10356-1-shijie@os.amperecomputing.com
wake_up_var(), wait_var_event() and related interfaces are not
documented but have important ordering requirements. This patch adds
documentation and makes these requirements explicit.
The return values for those wait_var_event_* functions which return a
value are documented. Note that these are, perhaps surprisingly,
sometimes different from comparable wait_on_bit() functions.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240925053405.3960701-4-neilb@suse.de
This patch revises the documention for wake_up_bit(),
clear_and_wake_up_bit(), and all the wait_on_bit() family of functions.
The new documentation places less emphasis on the pool of waitqueues
used (an implementation detail) and focuses instead on details of how
the functions behave.
The barriers included in the wait functions and clear_and_wake_up_bit()
and those required for wake_up_bit() are spelled out more clearly.
The error statuses returned are given explicitly.
The fact that the wait_on_bit_lock() function sets the bit is made more
obvious.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240925053405.3960701-3-neilb@suse.de
wake_up_bit() currently allows a "void *". While this isn't strictly a
problem as the address is never dereferenced, it is inconsistent with
the corresponding wait_on_bit() which requires "unsigned long *" and
does dereference the pointer.
Any code that needs to wait for a change in something other than an
unsigned long would be better served by wake_up_var()/wait_var_event().
This patch changes all related "void *" to "unsigned long *".
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240925053405.3960701-2-neilb@suse.de
568894edbe ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations
and fix scx_tg_online()") assumed that scx_cgroup_exit() is only called
after scx_cgroup_init() finished successfully. This isn't true.
scx_cgroup_exit() can be called without scx_cgroup_init() being called at
all or after scx_cgroup_init() failed in the middle.
As init state is tracked per cgroup, scx_cgroup_exit() can be used safely to
clean up in all cases. Remove the incorrect WARN_ON_ONCE().
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 568894edbe ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()")
When the BPF scheduler fails, ops.exit() allows rich error reporting through
scx_exit_info. Use scx.exit() path consistently for all failures which can
be caused by the BPF scheduler:
- scx_ops_error() is called after ops.init() and ops.cgroup_init() failure
to record error information.
- ops.init_task() failure now uses scx_ops_error() instead of pr_err().
- The err_disable path updated to automatically trigger scx_ops_error() to
cover cases that the error message hasn't already been generated and
always return 0 indicating init success so that the error is reported
through ops.exit().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Brandon reports sporadic, non-sensical spikes in cumulative pressure
time (total=) when reading cpu.pressure at a high rate. This is due to
a race condition between reader aggregation and tasks changing states.
While it affects all states and all resources captured by PSI, in
practice it most likely triggers with CPU pressure, since scheduling
events are so frequent compared to other resource events.
The race context is the live snooping of ongoing stalls during a
pressure read. The read aggregates per-cpu records for stalls that
have concluded, but will also incorporate ad-hoc the duration of any
active state that hasn't been recorded yet. This is important to get
timely measurements of ongoing stalls. Those ad-hoc samples are
calculated on-the-fly up to the current time on that CPU; since the
stall hasn't concluded, it's expected that this is the minimum amount
of stall time that will enter the per-cpu records once it does.
The problem is that the path that concludes the state uses a CPU clock
read that is not synchronized against aggregators; the clock is read
outside of the seqlock protection. This allows aggregators to race and
snoop a stall with a longer duration than will actually be recorded.
With the recorded stall time being less than the last snapshot
remembered by the aggregator, a subsequent sample will underflow and
observe a bogus delta value, resulting in an erratic jump in pressure.
Fix this by moving the clock read of the state change into the seqlock
protection. This ensures no aggregation can snoop live stalls past the
time that's recorded when the state concludes.
Reported-by: Brandon Duffany <brandon@buildbuddy.io>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
Link: https://lore.kernel.org/lkml/20240827121851.GB438928@cmpxchg.org/
Fixes: df77430639 ("psi: Reduce calls to sched_clock() in psi")
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 85e511df3c ("sched/eevdf: Allow shorter slices to wakeup-preempt")
introduced a mechanism that a wakee with shorter slice could preempt
the current running task. It also lower the bar for the current task
to be preempted, by checking the rq->nr_running instead of cfs_rq->nr_running
when the current task has ran out of time slice. But there is a scenario
that is problematic. Say, if there is 1 cfs task and 1 rt task, before
85e511df3c, update_deadline() will not trigger a reschedule, and after
85e511df3c, since rq->nr_running is 2 and resched is true, a resched_curr()
would happen.
Some workloads (like the hackbench reported by lkp) do not like
over-scheduling. We can see that the preemption rate has been
increased by 2.2%:
1.654e+08 +2.2% 1.69e+08 hackbench.time.involuntary_context_switches
Restore its previous check criterion.
Fixes: 85e511df3c ("sched/eevdf: Allow shorter slices to wakeup-preempt")
Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@intel.com
Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Honglei Wang <jameshongleiwang@126.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240925085440.358138-1-yu.c.chen@intel.com
Meeting an unfinished DELAY_DEQUEUE treated entity in unthrottle_cfs_rq()
leads to a couple terminal scenarios. Finish it first, so ENQUEUE_WAKEUP
can proceed as it would have sans DELAY_DEQUEUE treatment.
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/7515d2e64c989b9e3b828a9e21bcd959b99df06a.camel@gmx.de
The #endif trailing comment of CONFIG_EXT_GROUP_SCHED is unmatched, so fix
it.
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The enable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
cpus_read_lock. Currently, the locks are grabbed together which is prone to
locking order problems.
For example, currently, there is a possible deadlock involving
scx_fork_rwsem and cpus_read_lock. cpus_read_lock has to nest inside
scx_fork_rwsem due to locking order existing in other subsystems. However,
there exists a dependency in the other direction during hotplug if hotplug
needs to fork a new task, which happens in some cases. This leads to the
following deadlock:
scx_ops_enable() hotplug
percpu_down_write(&cpu_hotplug_lock)
percpu_down_write(&scx_fork_rwsem)
block on cpu_hotplug_lock
kthread_create() waits for kthreadd
kthreadd blocks on scx_fork_rwsem
Note that this doesn't trigger lockdep because the hotplug side dependency
bounces through kthreadd.
With the preceding scx_cgroup_enabled change, this can be solved by
decoupling cpus_read_lock, which is needed for static_key manipulations,
from the other two locks.
- Move the first block of static_key manipulations outside of scx_fork_rwsem
and scx_cgroup_rwsem. This is now safe with the preceding
scx_cgroup_enabled change.
- Drop scx_cgroup_rwsem and scx_fork_rwsem between the two task iteration
blocks so that __scx_ops_enabled static_key enabling is outside the two
rwsems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
The disable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
cpus_read_lock. Currently, the locks are grabbed together which is prone to
locking order problems. With the preceding scx_cgroup_enabled change, we can
decouple them:
- As cgroup disabling no longer requires modifying a static_key which
requires cpus_read_lock(), no need to grab cpus_read_lock() before
grabbing scx_cgroup_rwsem.
- cgroup can now be independently disabled before tasks are moved back to
the fair class.
Relocate scx_cgroup_exit() invocation before scx_fork_rwsem is grabbed, drop
now unnecessary cpus_read_lock() and move static_key operations out of
scx_fork_rwsem. This decouples all three locks in the disable path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
If the BPF scheduler does not implement ops.cgroup_init(), scx_tg_online()
didn't set SCX_TG_INITED which meant that ops.cgroup_exit(), even if
implemented, won't be called from scx_tg_offline(). This is because
SCX_HAS_OP(cgroupt_init) is used to test both whether SCX cgroup operations
are enabled and ops.cgroup_init() exists.
Fix it by introducing a separate bool scx_cgroup_enabled to gate cgroup
operations and use SCX_HAS_OP(cgroup_init) only to test whether
ops.cgroup_init() exists. Make all cgroup operations consistently use
scx_cgroup_enabled to test whether cgroup operations are enabled.
scx_cgroup_enabled is added instead of using scx_enabled() to ease planned
locking updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_ops_init_task() and the follow-up scx_ops_enable_task() in the fork path
were gated by scx_enabled() test and thus __scx_ops_enabled had to be turned
on before the first scx_ops_init_task() loop in scx_ops_enable(). However,
if an external entity causes sched_class switch before the loop is complete,
tasks which are not initialized could be switched to SCX.
The following can be reproduced by running a program which keeps toggling a
process between SCHED_OTHER and SCHED_EXT using sched_setscheduler(2).
sched_ext: Invalid task state transition 0 -> 3 for fish[1623]
WARNING: CPU: 1 PID: 1650 at kernel/sched/ext.c:3392 scx_ops_enable_task+0x1a1/0x200
...
Sched_ext: simple (enabling)
RIP: 0010:scx_ops_enable_task+0x1a1/0x200
...
switching_to_scx+0x13/0xa0
__sched_setscheduler+0x850/0xa50
do_sched_setscheduler+0x104/0x1c0
__x64_sys_sched_setscheduler+0x18/0x30
do_syscall_64+0x7b/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix it by gating scx_ops_init_task() separately using
scx_ops_init_task_enabled. __scx_ops_enabled is now set after all tasks are
finished with scx_ops_init_task().
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_ops_enable() has two task iteration loops. The first one calls
scx_ops_init_task() on every task and the latter switches the eligible ones
into SCX. The first loop left the tasks in SCX_TASK_INIT state and then the
second loop switched it into READY before switching the task into SCX.
The distinction between INIT and READY is only meaningful in the fork path
where it's used to tell whether the task finished forking so that we can
tell ops.exit_task() accordingly. Leaving task in INIT state between the two
loops is incosistent with the fork path and incorrect. The following can be
triggered by running a program which keeps toggling a task between
SCHED_OTHER and SCHED_SCX while enabling a task:
sched_ext: Invalid task state transition 1 -> 3 for fish[1526]
WARNING: CPU: 2 PID: 1615 at kernel/sched/ext.c:3393 scx_ops_enable_task+0x1a1/0x200
...
Sched_ext: qmap (enabling+all)
RIP: 0010:scx_ops_enable_task+0x1a1/0x200
...
switching_to_scx+0x13/0xa0
__sched_setscheduler+0x850/0xa50
do_sched_setscheduler+0x104/0x1c0
__x64_sys_sched_setscheduler+0x18/0x30
do_syscall_64+0x7b/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix it by transitioning to READY in the first loop right after
scx_ops_init_task() succeeds.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
scx_ops_enable() used preempt_disable() around the task iteration loop to
switch tasks into SCX to guarantee forward progress of the task which is
running scx_ops_enable(). However, in the gap between setting
__scx_ops_enabled and preeempt_disable(), an external entity can put tasks
including the enabling one into SCX prematurely, which can lead to
malfunctions including stalls.
The bypass mode can wrap the entire enabling operation and guarantee forward
progress no matter what the BPF scheduler does. Use the bypass mode instead
to guarantee forward progress while enabling.
While at it, release and regrab scx_tasks_lock between the two task
iteration locks in scx_ops_enable() for clarity as there is no reason to
keep holding the lock between them.
Signed-off-by: Tejun Heo <tj@kernel.org>
The distinction between SCX_OPS_PREPPING and SCX_OPS_ENABLING is not used
anywhere and only adds confusion. Drop SCX_OPS_PREPPING.
Signed-off-by: Tejun Heo <tj@kernel.org>
check_hotplug_seq() is used to detect CPU hotplug event which occurred while
the BPF scheduler is being loaded so that initialization can be retried if
CPU hotplug events take place before the CPU hotplug callbacks are online.
As such, the best place to call it is in the same cpu_read_lock() section
that enables the CPU hotplug ops. Currently, it is called in the next
cpus_read_lock() block in scx_ops_enable(). The side effect of this
placement is a small window in which hotplug sequence detection can trigger
unnecessarily, which isn't critical.
Move check_hotplug_seq() invocation to the same cpus_read_lock() block as
the hotplug operation enablement to close the window and get the invocation
out of the way for planned locking updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
While bypassing, tasks are scheduled in FIFO order which favors tasks that
hog CPUs. This can slow down e.g. unloading of the BPF scheduler. While
bypassing, guaranteeing timely forward progress is the main goal. There's no
point in giving long slices. Shorten the time slice used while bypassing
from 20ms to 5ms.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
In the bypass mode, the global DSQ is used to schedule all tasks in simple
FIFO order. All tasks are queued into the global DSQ and all CPUs try to
execute tasks from it. This creates a lot of cross-node cacheline accesses
and scheduling across the node boundaries, and can lead to live-lock
conditions where the system takes tens of minutes to disable the BPF
scheduler while executing in the bypass mode.
Split the global DSQ per NUMA node. Each node has its own global DSQ. When a
task is dispatched to SCX_DSQ_GLOBAL, it's put into the global DSQ local to
the task's CPU and all CPUs in a node only consume its node-local global
DSQ.
This resolves a livelock condition which could be reliably triggered on an
2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with
`stress-ng --workload 80 --workload-threads 10` while repeatedly enabling
and disabling a SCX scheduler.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
To prepare for the addition of find_global_dsq(). No functional changes.
Signed-off-by: tejun heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
SCX_DSQ_GLOBAL is special in that it can't be used as a priority queue and
is consumed implicitly, but all BPF DSQ related kfuncs could be used on it.
SCX_DSQ_GLOBAL will be split per-node for scalability and those operations
won't make sense anymore. Disallow SCX_DSQ_GLOBAL on scx_bpf_consume(),
scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new(). This means that
SCX_DSQ_GLOBAL can only be used as a dispatch target from BPF schedulers.
With scx_flatcg, which was using SCX_DSQ_GLOBAL as the fallback DSQ,
updated, this shouldn't affect any schedulers.
This leaves find_dsq_for_dispatch() the only user of find_non_local_dsq().
Open code and remove find_non_local_dsq().
Signed-off-by: tejun heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
move_remote_task_to_local_dsq() is only defined on SMP configs but
scx_disaptch_from_dsq() was calling move_remote_task_to_local_dsq() on UP
configs too causing build failures. Add a dummy
move_remote_task_to_local_dsq() which triggers a warning.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409241108.jaocHiDJ-lkp@intel.com/
As discussed during the distro-centric session within the sched_ext
Microconference at LPC 2024, introduce a sequence counter that is
incremented every time a BPF scheduler is loaded.
This feature can help distributions in diagnosing potential performance
regressions by identifying systems where users are running (or have ran)
custom BPF schedulers.
Example:
arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
0
arighi@virtme-ng~> sudo scx_simple
local=1 global=0
^CEXIT: unregistered from user space
arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
1
In this way user-space tools (such as Ubuntu's apport and similar) are
able to gather and include this information in bug reports.
Cc: Giovanni Gherdovich <giovanni.gherdovich@suse.com>
Cc: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Cc: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Cc: Phil Auld <pauld@redhat.com>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
a2f4b16e73 ("sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]") tried
fixing build when !CONFIG_STACKTRACE but didn't so fully. Also put
stack_trace_print() and stack_trace_save() inside CONFIG_STACKTRACE to fix
build when !CONFIG_STACKTRACE.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409220642.fDW2OmWc-lkp@intel.com/
Disable the rq empty path when scx is enabled. SCX must consult the BPF
scheduler (via the dispatch path in balance) to determine if rq is empty.
This fixes stalls when scx is enabled.
Signed-off-by: Pat Somaru <patso@likewhatevs.io>
Fixes: 3dcac251b0 ("sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()")
Signed-off-by: Tejun Heo <tj@kernel.org>
When build with CONFIG_GROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED,
the idle member is not defined:
kernel/sched/ext.c:3701:16: error: 'struct task_group' has no member named 'idle'
3701 | if (!tg->idle)
| ^~
Fix this by putting 'idle' under new CONFIG_GROUP_SCHED_WEIGHT.
tj: Move idle field upward to avoid breaking up CONFIG_FAIR_GROUP_SCHED block.
Fixes: e179e80c5d ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix the following error when build with CONFIG_GROUP_SCHED_WEIGHT &&
!CONFIG_FAIR_GROUP_SCHED:
kernel/sched/core.c:9634:15: error: implicit declaration of function
'sched_group_set_idle'; did you mean 'scx_group_set_idle'? [-Wimplicit-function-declaration]
9634 | ret = sched_group_set_idle(css_tg(css), idle);
| ^~~~~~~~~~~~~~~~~~~~
| scx_group_set_idle
Fixes: e179e80c5d ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This is the initial pull request of sched_ext. The v7 patchset
(https://lkml.kernel.org/r/20240618212056.2833381-1-tj@kernel.org) is
applied on top of tip/sched/core + bpf/master as of Jun 18th.
tip/sched/core 793a62823d1c ("sched/core: Drop spinlocks on contention iff kernel is preempti
ble")
bpf/master f6afdaf72a ("Merge branch 'bpf-support-resilient-split-btf'")
Since then, the following pulls were made:
- v6.11-rc1 is pulled to keep up with the mainline.
- tip/sched/core was pulled several times:
- 7b9f6c864a, 0df340ceae, 5ac998574f, 0b1777f0fa: To resolve
conflicts. See each commit for details on conflicts and their
resolutions.
- d7b01aef9d: To receive fd03c5b858 ("sched: Rework pick_next_task()")
and related commits. @prev in added to sched_class->put_prev_task() and
put_prev_task() is reordered after ->pick_task(), which makes
sched_class->switch_class() unnecessary. The follow-up commits update
sched_ext accordingly and drop sched_class->switch_class().
- bpf/master was pulled to receive baebe9aaba ("bpf: allow passing struct
bpf_iter_<type> as kfunc arguments") and related changes in preparation
for the DSQ iterator patchset
To obtain the net sched_ext changes, diff against:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.12-base
which is the merge of:
tip/sched/core bc9057da1a ("sched/cpufreq: Use NSEC_PER_MSEC for deadline task")
bpf/master 2ad6d23f46 ("selftests/bpf: Do not update vmlinux.h unnecessarily")
Since the v7 patchset, the following changes were made:
- cpuperf support which was a part of the v6 patchset was posted separately
and then applied after reviews.
- cgroup support which was a part of the v6 patchset was posted seprately,
iterated and then applied.
- Improve integration with sched core.
- Double locking usage in migration paths dropped. Depend on
TASK_ON_RQ_MIGRATING synchronization instead.
- The BPF scheduler couldn't directly dispatch to the local DSQ of another
CPU using a SCX_DSQ_LOCAL_ON verdict. This caused difficulties around
handling non-wakeup enqueues. Updated so that SCX_DSQ_LOCAL_ON can be used
in the enqueue path too.
- DSQ iterator which was a part of the v6 patchset was posted separately.
The iterator itself was applied after a couple revisions. The associated
selective consumption kfunc can use further improvements and is still
being worked on.
- scx_bpf_dispatch[_vtime]_from_dsq() added to increase flexibility. A task
can now be transferred between two DSQs from almost any context. This
involved significant refactoring of migration code.
- Various fixes and improvements.
As the branch is based on top of tip/sched/core + bpf/master, please merge
after both are applied.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZuOSuA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGVZyAQDBU3WPkYKB8gl6a6YQ+/PzBXorOK7mioS9A2iJ
vBR3FgEAg1vtcss1S+2juWmVq7ItiFNWCqtXzUr/bVmL9CqqDwA=
=bOOC
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext support from Tejun Heo:
"This implements a new scheduler class called ‘ext_sched_class’, or
sched_ext, which allows scheduling policies to be implemented as BPF
programs.
The goals of this are:
- Ease of experimentation and exploration: Enabling rapid iteration
of new scheduling policies.
- Customization: Building application-specific schedulers which
implement policies that are not applicable to general-purpose
schedulers.
- Rapid scheduler deployments: Non-disruptive swap outs of scheduling
policies in production environments"
See individual commits for more documentation, but also the cover letter
for the latest series:
Link: https://lore.kernel.org/all/20240618212056.2833381-1-tj@kernel.org/
* tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (110 commits)
sched: Move update_other_load_avgs() to kernel/sched/pelt.c
sched_ext: Don't trigger ops.quiescent/runnable() on migrations
sched_ext: Synchronize bypass state changes with rq lock
scx_qmap: Implement highpri boosting
sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
sched_ext: Compact struct bpf_iter_scx_dsq_kern
sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
sched_ext: Move consume_local_task() upward
sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
sched_ext: Reorder args for consume_local/remote_task()
sched_ext: Restructure dispatch_to_local_dsq()
sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
sched_ext: Refactor consume_remote_task()
sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
sched_ext: Add missing static to scx_dump_data
sched_ext: Add missing static to scx_has_op[]
sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()
sched_ext: Add a cgroup scheduler which uses flattened hierarchy
sched_ext: Add cgroup support
...
this pull request are:
"Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
consistency to the APIs and behaviour of these two core allocation
functions. This also simplifies/enables Rustification.
"Some cleanups for shmem" from Baolin Wang. No functional changes - mode
code reuse, better function naming, logic simplifications.
"mm: some small page fault cleanups" from Josef Bacik. No functional
changes - code cleanups only.
"Various memory tiering fixes" from Zi Yan. A small fix and a little
cleanup.
"mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
simplifications and .text shrinkage.
"Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This
is a feature, it adds new feilds to /proc/vmstat such as
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0
which tells us that 11391 processes used 4k of stack while none at all
used 16k. Useful for some system tuning things, but partivularly useful
for "the dynamic kernel stack project".
"kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov.
Teaches kmemleak to detect leaksage of percpu memory.
"mm: memcg: page counters optimizations" from Roman Gushchin. "3
independent small optimizations of page counters".
"mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David
Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work
correctly by design rather than by accident.
"mm: remove arch_make_page_accessible()" from David Hildenbrand. Some
folio conversions which make arch_make_page_accessible() unneeded.
"mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel.
Cleans up and fixes our handling of the resetting of the cgroup/process
peak-memory-use detector.
"Make core VMA operations internal and testable" from Lorenzo Stoakes.
Rationalizaion and encapsulation of the VMA manipulation APIs. With a
view to better enable testing of the VMA functions, even from a
userspace-only harness.
"mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in
the zswap global shrinker, resulting in improved performance.
"mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in
some missing info in /proc/zoneinfo.
"mm: replace follow_page() by folio_walk" from David Hildenbrand. Code
cleanups and rationalizations (conversion to folio_walk()) resulting in
the removal of follow_page().
"improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some
tuning to improve zswap's dynamic shrinker. Significant reductions in
swapin and improvements in performance are shown.
"mm: Fix several issues with unaccepted memory" from Kirill Shutemov.
Improvements to the new unaccepted memory feature,
"mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX
PUDs. This was missing, although nobody seems to have notied yet.
"Introduce a store type enum for the Maple tree" from Sidhartha Kumar.
Cleanups and modest performance improvements for the maple tree library
code.
"memcg: further decouple v1 code from v2" from Shakeel Butt. Move more
cgroup v1 remnants away from the v2 memcg code.
"memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds
various warnings telling users that memcg v1 features are deprecated.
"mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li.
Greatly improves the success rate of the mTHP swap allocation.
"mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate
per-arch implementations of numa_memblk code into generic code.
"mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
improves the performance of munmap() of swap-filled ptes.
"support large folio swap-out and swap-in for shmem" from Baolin Wang.
With this series we no longer split shmem large folios into simgle-page
folios when swapping out shmem.
"mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance
improvements and code reductions for gigantic folios.
"support shmem mTHP collapse" from Baolin Wang. Adds support for
khugepaged's collapsing of shmem mTHP folios.
"mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
performance regression due to the addition of mseal().
"Increase the number of bits available in page_type" from Matthew Wilcox.
Increases the number of bits available in page_type!
"Simplify the page flags a little" from Matthew Wilcox. Many legacy page
flags are now folio flags, so the page-based flags and their
accessors/mutators can be removed.
"mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An
optimization which permits us to avoid writing/reading zero-filled zswap
pages to backing store.
"Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window
which occurs when a MAP_FIXED operqtion is occurring during an unrelated
vma tree walk.
"mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the
vma_merge() functionality, making ot cleaner, more testable and better
tested.
"misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor
fixups of DAMON selftests and kunit tests.
"mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code
cleanups and folio conversions.
"Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups
for shmem controls and stats.
"mm: count the number of anonymous THPs per size" from Barry Song. Expose
additional anon THP stats to userspace for improved tuning.
"mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio
conversions and removal of now-unused page-based APIs.
"replace per-quota region priorities histogram buffer with per-context
one" from SeongJae Park. DAMON histogram rationalization.
"Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae
Park. DAMON documentation updates.
"mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve
related doc and warn" from Jason Wang: fixes usage of page allocator
__GFP_NOFAIL and GFP_ATOMIC flags.
"mm: split underused THPs" from Yu Zhao. Improve THP=always policy - this
was overprovisioning THPs in sparsely accessed memory areas.
"zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add
support for zram run-time compression algorithm tuning.
"mm: Care about shadow stack guard gap when getting an unmapped area" from
Mark Brown. Fix up the various arch_get_unmapped_area() implementations
to better respect guard areas.
"Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of
mem_cgroup_iter() and various code cleanups.
"mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
pfnmap support.
"resource: Fix region_intersects() vs add_memory_driver_managed()" from
Huang Ying. Fix a bug in region_intersects() for systems with CXL memory.
"mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a
couple more code paths to correctly recover from the encountering of
poisoned memry.
"mm: enable large folios swap-in support" from Barry Song. Support the
swapin of mTHP memory into appropriately-sized folios, rather than into
single-page folios.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu1BBwAKCRDdBJ7gKXxA
jlWNAQDYlqQLun7bgsAN4sSvi27VUuWv1q70jlMXTfmjJAvQqwD/fBFVR6IOOiw7
AkDbKWP2k0hWPiNJBGwoqxdHHx09Xgo=
=s0T+
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Along with the usual shower of singleton patches, notable patch series
in this pull request are:
- "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
consistency to the APIs and behaviour of these two core allocation
functions. This also simplifies/enables Rustification.
- "Some cleanups for shmem" from Baolin Wang. No functional changes -
mode code reuse, better function naming, logic simplifications.
- "mm: some small page fault cleanups" from Josef Bacik. No
functional changes - code cleanups only.
- "Various memory tiering fixes" from Zi Yan. A small fix and a
little cleanup.
- "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
simplifications and .text shrinkage.
- "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
Butt. This is a feature, it adds new feilds to /proc/vmstat such as
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0
which tells us that 11391 processes used 4k of stack while none at
all used 16k. Useful for some system tuning things, but
partivularly useful for "the dynamic kernel stack project".
- "kmemleak: support for percpu memory leak detect" from Pavel
Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory.
- "mm: memcg: page counters optimizations" from Roman Gushchin. "3
independent small optimizations of page counters".
- "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
David Hildenbrand. Improves PTE/PMD splitlock detection, makes
powerpc/8xx work correctly by design rather than by accident.
- "mm: remove arch_make_page_accessible()" from David Hildenbrand.
Some folio conversions which make arch_make_page_accessible()
unneeded.
- "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
Finkel. Cleans up and fixes our handling of the resetting of the
cgroup/process peak-memory-use detector.
- "Make core VMA operations internal and testable" from Lorenzo
Stoakes. Rationalizaion and encapsulation of the VMA manipulation
APIs. With a view to better enable testing of the VMA functions,
even from a userspace-only harness.
- "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
issues in the zswap global shrinker, resulting in improved
performance.
- "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
in some missing info in /proc/zoneinfo.
- "mm: replace follow_page() by folio_walk" from David Hildenbrand.
Code cleanups and rationalizations (conversion to folio_walk())
resulting in the removal of follow_page().
- "improving dynamic zswap shrinker protection scheme" from Nhat
Pham. Some tuning to improve zswap's dynamic shrinker. Significant
reductions in swapin and improvements in performance are shown.
- "mm: Fix several issues with unaccepted memory" from Kirill
Shutemov. Improvements to the new unaccepted memory feature,
- "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
DAX PUDs. This was missing, although nobody seems to have notied
yet.
- "Introduce a store type enum for the Maple tree" from Sidhartha
Kumar. Cleanups and modest performance improvements for the maple
tree library code.
- "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
more cgroup v1 remnants away from the v2 memcg code.
- "memcg: initiate deprecation of v1 features" from Shakeel Butt.
Adds various warnings telling users that memcg v1 features are
deprecated.
- "mm: swap: mTHP swap allocator base on swap cluster order" from
Chris Li. Greatly improves the success rate of the mTHP swap
allocation.
- "mm: introduce numa_memblks" from Mike Rapoport. Moves various
disparate per-arch implementations of numa_memblk code into generic
code.
- "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
improves the performance of munmap() of swap-filled ptes.
- "support large folio swap-out and swap-in for shmem" from Baolin
Wang. With this series we no longer split shmem large folios into
simgle-page folios when swapping out shmem.
- "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
performance improvements and code reductions for gigantic folios.
- "support shmem mTHP collapse" from Baolin Wang. Adds support for
khugepaged's collapsing of shmem mTHP folios.
- "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
performance regression due to the addition of mseal().
- "Increase the number of bits available in page_type" from Matthew
Wilcox. Increases the number of bits available in page_type!
- "Simplify the page flags a little" from Matthew Wilcox. Many legacy
page flags are now folio flags, so the page-based flags and their
accessors/mutators can be removed.
- "mm: store zero pages to be swapped out in a bitmap" from Usama
Arif. An optimization which permits us to avoid writing/reading
zero-filled zswap pages to backing store.
- "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
window which occurs when a MAP_FIXED operqtion is occurring during
an unrelated vma tree walk.
- "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
the vma_merge() functionality, making ot cleaner, more testable and
better tested.
- "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
Minor fixups of DAMON selftests and kunit tests.
- "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
Code cleanups and folio conversions.
- "Shmem mTHP controls and stats improvements" from Ryan Roberts.
Cleanups for shmem controls and stats.
- "mm: count the number of anonymous THPs per size" from Barry Song.
Expose additional anon THP stats to userspace for improved tuning.
- "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
folio conversions and removal of now-unused page-based APIs.
- "replace per-quota region priorities histogram buffer with
per-context one" from SeongJae Park. DAMON histogram
rationalization.
- "Docs/damon: update GitHub repo URLs and maintainer-profile" from
SeongJae Park. DAMON documentation updates.
- "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
improve related doc and warn" from Jason Wang: fixes usage of page
allocator __GFP_NOFAIL and GFP_ATOMIC flags.
- "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
This was overprovisioning THPs in sparsely accessed memory areas.
- "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
Add support for zram run-time compression algorithm tuning.
- "mm: Care about shadow stack guard gap when getting an unmapped
area" from Mark Brown. Fix up the various arch_get_unmapped_area()
implementations to better respect guard areas.
- "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
of mem_cgroup_iter() and various code cleanups.
- "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
pfnmap support.
- "resource: Fix region_intersects() vs add_memory_driver_managed()"
from Huang Ying. Fix a bug in region_intersects() for systems with
CXL memory.
- "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
a couple more code paths to correctly recover from the encountering
of poisoned memry.
- "mm: enable large folios swap-in support" from Barry Song. Support
the swapin of mTHP memory into appropriately-sized folios, rather
than into single-page folios"
* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
zram: free secondary algorithms names
uprobes: turn xol_area->pages[2] into xol_area->page
uprobes: introduce the global struct vm_special_mapping xol_mapping
Revert "uprobes: use vm_special_mapping close() functionality"
mm: support large folios swap-in for sync io devices
mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
set_memory: add __must_check to generic stubs
mm/vma: return the exact errno in vms_gather_munmap_vmas()
memcg: cleanup with !CONFIG_MEMCG_V1
mm/show_mem.c: report alloc tags in human readable units
mm: support poison recovery from copy_present_page()
mm: support poison recovery from do_cow_fault()
resource, kunit: add test case for region_intersects()
resource: make alloc_free_mem_region() works for iomem_resource
mm: z3fold: deprecate CONFIG_Z3FOLD
vfio/pci: implement huge_fault support
mm/arm64: support large pfn mappings
mm/x86: support large pfn mappings
...
- Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot de Oliveira's
last major contribution to the kernel:
"SCHED_DEADLINE servers can help fixing starvation issues of low priority
tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
cycles. Today we have RT Throttling; DEADLINE servers should be able to
replace and improve that."
(Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes,
Youssef Esmat, Huang Shijie)
- Preparatory changes for sched_ext integration:
- Use set_next_task(.first) where required
- Fix up set_next_task() implementations
- Clean up DL server vs. core sched
- Split up put_prev_task_balance()
- Rework pick_next_task()
- Combine the last put_prev_task() and the first set_next_task()
- Rework dl_server
- Add put_prev_task(.next)
(Peter Zijlstra, with a fix by Tejun Heo)
- Complete the EEVDF transition and refine EEVDF scheduling:
- Implement delayed dequeue
- Allow shorter slices to wakeup-preempt
- Use sched_attr::sched_runtime to set request/slice suggestion
- Document the new feature flags
- Remove unused and duplicate-functionality fields
- Simplify & unify pick_next_task_fair()
- Misc debuggability enhancements
(Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann,
Valentin Schneider and Chuyi Zhou)
- Initialize the vruntime of a new task when it is first enqueued,
resulting in significant decrease in latency of newly woken tasks.
(Zhang Qiao)
- Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
(K Prateek Nayak, Peter Zijlstra)
- Clean up and clarify the usage of Clean up usage of rt_task()
(Qais Yousef)
- Preempt SCHED_IDLE entities in strict cgroup hierarchies
(Tianchen Ding)
- Clarify the documentation of time units for deadline scheduler
parameters. (Christian Loehle)
- Remove the HZ_BW chicken-bit feature flag introduced a year ago,
the original change seems to be working fine.
(Phil Auld)
- Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie,
Peilin He, Qais Yousefm and Vincent Guittot)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmbr8qcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gdbw/+Mj3zWfYP+dtUkfgrR2FClPAJoo1/9Dz0
LYD8XgYHu8rEJ0Aq+VbdkgYGUt9utvzUFPIxvWFDcldQl57KwhF4hp9Ir+PqJyYC
NolQ1q8ddo1hnslxnEg6SgHVzQq/4FqMM0nDNUkQETCx6zTyFFeRf+q7o/2c2m5B
uI9dSU1Wrx7XrXm2D3kB8+xP+ZRy+qhbFN5Pfuz96mhelfklylgKMfPzgAiCT/7T
JTbQhQ2HdcCNgiLoSrWsHBDy2UYpouP4zb4jyd+lDQzhSUJrj3u4Xy4vVmuTKq+y
sTgWlgKB+MTuh9UuJ4UYzSnMqg161UlMvtXeH84ABmAqDNGHRPtOKrrlcLtJ3D4x
m1SPhNnsvpjOu2pH0XLIS8al3VUesWND5S+rucHRYSq6Nvhivf4MTvRJlicXXurL
Mt2APnIlhGJuKBNWnmyZovVdtO0ZUUPlaZWfr3rCS4txAVo+HwWhsm3uhtTycQqN
gazsCiuGh6Jds90ZqA/BvdLWG+DY8J0xLlV3ex4pCXuQ/HFrabVWTyThJsULhrZ2
5mTdWIsocPctNMO9/RHMy7vJI7G7ljgHEquWVn5kiGGzXhK6VwVwKAMpfgXGw+YA
yVP6/M7a7g2yEzj69gXkcDa8k/kedMVquJ/G/8YhZM7u7sPqsMjpmaGsqsJRfnpT
ChngAzap+kA=
=TEC6
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot
de Oliveira's last major contribution to the kernel:
"SCHED_DEADLINE servers can help fixing starvation issues of low
priority tasks (e.g., SCHED_OTHER) when higher priority tasks
monopolize CPU cycles. Today we have RT Throttling; DEADLINE
servers should be able to replace and improve that."
(Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef
Esmat, Huang Shijie)
- Preparatory changes for sched_ext integration:
- Use set_next_task(.first) where required
- Fix up set_next_task() implementations
- Clean up DL server vs. core sched
- Split up put_prev_task_balance()
- Rework pick_next_task()
- Combine the last put_prev_task() and the first set_next_task()
- Rework dl_server
- Add put_prev_task(.next)
(Peter Zijlstra, with a fix by Tejun Heo)
- Complete the EEVDF transition and refine EEVDF scheduling:
- Implement delayed dequeue
- Allow shorter slices to wakeup-preempt
- Use sched_attr::sched_runtime to set request/slice suggestion
- Document the new feature flags
- Remove unused and duplicate-functionality fields
- Simplify & unify pick_next_task_fair()
- Misc debuggability enhancements
(Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin
Schneider and Chuyi Zhou)
- Initialize the vruntime of a new task when it is first enqueued,
resulting in significant decrease in latency of newly woken tasks
(Zhang Qiao)
- Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
(K Prateek Nayak, Peter Zijlstra)
- Clean up and clarify the usage of Clean up usage of rt_task()
(Qais Yousef)
- Preempt SCHED_IDLE entities in strict cgroup hierarchies
(Tianchen Ding)
- Clarify the documentation of time units for deadline scheduler
parameters (Christian Loehle)
- Remove the HZ_BW chicken-bit feature flag introduced a year ago,
the original change seems to be working fine (Phil Auld)
- Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie,
Peilin He, Qais Yousefm and Vincent Guittot)
* tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
sched/cpufreq: Use NSEC_PER_MSEC for deadline task
cpufreq/cppc: Use NSEC_PER_MSEC for deadline task
sched/deadline: Clarify nanoseconds in uapi
sched/deadline: Convert schedtool example to chrt
sched/debug: Fix the runnable tasks output
sched: Fix sched_delayed vs sched_core
kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
kthread: Fix task state in kthread worker if being frozen
sched/pelt: Use rq_clock_task() for hw_pressure
sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c
sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
sched: Add put_prev_task(.next)
sched: Rework dl_server
sched: Combine the last put_prev_task() and the first set_next_task()
sched: Rework pick_next_task()
sched: Split up put_prev_task_balance()
sched: Clean up DL server vs core sched
sched: Fixup set_next_task() implementations
sched: Use set_next_task(.first) where required
sched/fair: Properly deactivate sched_delayed task upon class change
...
This pull request contains the following branches:
context_tracking.15.08.24a: Rename context tracking state related
symbols and remove references to "dynticks" in various context
tracking state variables and related helpers; force
context_tracking_enabled_this_cpu() to be inlined to avoid
leaving a noinstr section.
csd.lock.15.08.24a: Enhance CSD-lock diagnostic reports; add an API
to provide an indication of ongoing CSD-lock stall.
nocb.09.09.24a: Update and simplify RCU nocb code to handle
(de-)offloading of callbacks only for offline CPUs; fix RT
throttling hrtimer being armed from offline CPU.
rcutorture.14.08.24a: Remove redundant rcu_torture_ops get_gp_completed
fields; add SRCU ->same_gp_state and ->get_comp_state
functions; add generic test for NUM_ACTIVE_*RCU_POLL* for
testing RCU and SRCU polled grace periods; add CFcommon.arch
for arch-specific Kconfig options; print number of update types
in rcu_torture_write_types();
add rcutree.nohz_full_patience_delay testing to the TREE07
scenario; add a stall_cpu_repeat module parameter to test
repeated CPU stalls; add argument to limit number of CPUs a
guest OS can use in torture.sh;
rcustall.09.09.24a: Abbreviate RCU CPU stall warnings during CSD-lock
stalls; Allow dump_cpu_task() to be called without disabling
preemption; defer printing stall-warning backtrace when holding
rcu_node lock.
srcu.12.08.24a: Make SRCU gp seq wrap-around faster; add KCSAN checks
for concurrent updates to ->srcu_n_exp_nodelay and
->reschedule_count which are used in heuristics governing
auto-expediting of normal SRCU grace periods and
grace-period-state-machine delays; mark idle SRCU-barrier
callbacks to help identify stuck SRCU-barrier callback.
rcu.tasks.14.08.24a: Remove RCU Tasks Rude asynchronous APIs as they
are no longer used; stop testing RCU Tasks Rude asynchronous
APIs; fix access to non-existent percpu regions; check
processor-ID assumptions during chosen CPU calculation for
callback enqueuing; update description of rtp->tasks_gp_seq
grace-period sequence number; add rcu_barrier_cb_is_done()
to identify whether a given rcu_barrier callback is stuck;
mark idle Tasks-RCU-barrier callbacks; add
*torture_stats_print() functions to print detailed
diagnostics for Tasks-RCU variants; capture start time of
rcu_barrier_tasks*() operation to help distinguish a hung
barrier operation from a long series of barrier operations.
rcu_scaling_tests.15.08.24a:
refscale: Add a TINY scenario to support tests of Tiny RCU
and Tiny SRCU; Optimize process_durations() operation;
rcuscale: Dump stacks of stalled rcu_scale_writer() instances;
dump grace-period statistics when rcu_scale_writer() stalls;
mark idle RCU-barrier callbacks to identify stuck RCU-barrier
callbacks; print detailed grace-period and barrier diagnostics
on rcu_scale_writer() hangs for Tasks-RCU variants; warn if
async module parameter is specified for RCU implementations
that do not have async primitives such as RCU Tasks Rude;
make all writer tasks report upon hang; tolerate repeated
GFP_KERNEL failure in rcu_scale_writer(); use special allocator
for rcu_scale_writer(); NULL out top-level pointers to heap
memory to avoid double-free bugs on modprobe failures; maintain
per-task instead of per-CPU callbacks count to avoid any issues
with migration of either tasks or callbacks; constify struct
ref_scale_ops.
fixes.12.08.24a: Use system_unbound_wq for kfree_rcu work to avoid
disturbing isolated CPUs.
misc.11.08.24a: Warn on unexpected rcu_state.srs_done_tail state;
Better define "atomic" for list_replace_rcu() and
hlist_replace_rcu() routines; annotate struct
kvfree_rcu_bulk_data with __counted_by().
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSi2tPIQIc2VEtjarIAHS7/6Z0wpQUCZt8+8wAKCRAAHS7/6Z0w
pTqoAPwPN//tlEoJx2PRs6t0q+nD1YNvnZawPaRmdzgdM8zJogD+PiSN+XhqRr80
jzyvMDU4Aa0wjUNP3XsCoaCxo7L/lQk=
=bZ9z
-----END PGP SIGNATURE-----
Merge tag 'rcu.release.v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Neeraj Upadhyay:
"Context tracking:
- rename context tracking state related symbols and remove references
to "dynticks" in various context tracking state variables and
related helpers
- force context_tracking_enabled_this_cpu() to be inlined to avoid
leaving a noinstr section
CSD lock:
- enhance CSD-lock diagnostic reports
- add an API to provide an indication of ongoing CSD-lock stall
nocb:
- update and simplify RCU nocb code to handle (de-)offloading of
callbacks only for offline CPUs
- fix RT throttling hrtimer being armed from offline CPU
rcutorture:
- remove redundant rcu_torture_ops get_gp_completed fields
- add SRCU ->same_gp_state and ->get_comp_state functions
- add generic test for NUM_ACTIVE_*RCU_POLL* for testing RCU and SRCU
polled grace periods
- add CFcommon.arch for arch-specific Kconfig options
- print number of update types in rcu_torture_write_types()
- add rcutree.nohz_full_patience_delay testing to the TREE07 scenario
- add a stall_cpu_repeat module parameter to test repeated CPU stalls
- add argument to limit number of CPUs a guest OS can use in
torture.sh
rcustall:
- abbreviate RCU CPU stall warnings during CSD-lock stalls
- Allow dump_cpu_task() to be called without disabling preemption
- defer printing stall-warning backtrace when holding rcu_node lock
srcu:
- make SRCU gp seq wrap-around faster
- add KCSAN checks for concurrent updates to ->srcu_n_exp_nodelay and
->reschedule_count which are used in heuristics governing
auto-expediting of normal SRCU grace periods and
grace-period-state-machine delays
- mark idle SRCU-barrier callbacks to help identify stuck
SRCU-barrier callback
rcu tasks:
- remove RCU Tasks Rude asynchronous APIs as they are no longer used
- stop testing RCU Tasks Rude asynchronous APIs
- fix access to non-existent percpu regions
- check processor-ID assumptions during chosen CPU calculation for
callback enqueuing
- update description of rtp->tasks_gp_seq grace-period sequence
number
- add rcu_barrier_cb_is_done() to identify whether a given
rcu_barrier callback is stuck
- mark idle Tasks-RCU-barrier callbacks
- add *torture_stats_print() functions to print detailed diagnostics
for Tasks-RCU variants
- capture start time of rcu_barrier_tasks*() operation to help
distinguish a hung barrier operation from a long series of barrier
operations
refscale:
- add a TINY scenario to support tests of Tiny RCU and Tiny
SRCU
- optimize process_durations() operation
rcuscale:
- dump stacks of stalled rcu_scale_writer() instances and
grace-period statistics when rcu_scale_writer() stalls
- mark idle RCU-barrier callbacks to identify stuck RCU-barrier
callbacks
- print detailed grace-period and barrier diagnostics on
rcu_scale_writer() hangs for Tasks-RCU variants
- warn if async module parameter is specified for RCU implementations
that do not have async primitives such as RCU Tasks Rude
- make all writer tasks report upon hang
- tolerate repeated GFP_KERNEL failure in rcu_scale_writer()
- use special allocator for rcu_scale_writer()
- NULL out top-level pointers to heap memory to avoid double-free
bugs on modprobe failures
- maintain per-task instead of per-CPU callbacks count to avoid any
issues with migration of either tasks or callbacks
- constify struct ref_scale_ops
Fixes:
- use system_unbound_wq for kfree_rcu work to avoid disturbing
isolated CPUs
Misc:
- warn on unexpected rcu_state.srs_done_tail state
- better define "atomic" for list_replace_rcu() and
hlist_replace_rcu() routines
- annotate struct kvfree_rcu_bulk_data with __counted_by()"
* tag 'rcu.release.v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (90 commits)
rcu: Defer printing stall-warning backtrace when holding rcu_node lock
rcu/nocb: Remove superfluous memory barrier after bypass enqueue
rcu/nocb: Conditionally wake up rcuo if not already waiting on GP
rcu/nocb: Fix RT throttling hrtimer armed from offline CPU
rcu/nocb: Simplify (de-)offloading state machine
context_tracking: Tag context_tracking_enabled_this_cpu() __always_inline
context_tracking, rcu: Rename rcu_dyntick trace event into rcu_watching
rcu: Update stray documentation references to rcu_dynticks_eqs_{enter, exit}()
rcu: Rename rcu_momentary_dyntick_idle() into rcu_momentary_eqs()
rcu: Rename rcu_implicit_dynticks_qs() into rcu_watching_snap_recheck()
rcu: Rename dyntick_save_progress_counter() into rcu_watching_snap_save()
rcu: Rename struct rcu_data .exp_dynticks_snap into .exp_watching_snap
rcu: Rename struct rcu_data .dynticks_snap into .watching_snap
rcu: Rename rcu_dynticks_zero_in_eqs() into rcu_watching_zero_in_eqs()
rcu: Rename rcu_dynticks_in_eqs_since() into rcu_watching_snap_stopped_since()
rcu: Rename rcu_dynticks_in_eqs() into rcu_watching_snap_in_eqs()
rcu: Rename rcu_dynticks_eqs_online() into rcu_watching_online()
context_tracking, rcu: Rename rcu_dynticks_curr_cpu_in_eqs() into rcu_is_watching_curr_cpu()
context_tracking, rcu: Rename rcu_dynticks_task*() into rcu_task*()
refscale: Constify struct ref_scale_ops
...
- Core:
- Overhaul of posix-timers in preparation of removing the
workaround for periodic timers which have signal delivery
ignored.
- Remove the historical extra jiffie in msleep()
msleep() adds an extra jiffie to the timeout value to ensure
minimal sleep time. The timer wheel ensures minimal sleep
time since the large rewrite to a non-cascading wheel, but the
extra jiffie in msleep() remained unnoticed. Remove it.
- Make the timer slack handling correct for realtime tasks.
The procfs interface is inconsistent and does neither reflect
reality nor conforms to the man page. Show the correct 0 slack
for real time tasks and enforce it at the core level instead of
having inconsistent individual checks in various timer setup
functions.
- The usual set of updates and enhancements all over the place.
- Drivers:
- Allow the ACPI PM timer to be turned off during suspend
- No new drivers
- The usual updates and enhancements in various drivers
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn7jQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYobqnD/9COlU0nwsulABI/aNIrsh6iYvnCC9v
14CcNta7Qn+157Wfw9BWOyHdNhR1/fPCXE8jJ71zTyIOeW27HV2JyTtxTwe9ZcdK
ViHAaj7YcIjcVUEC3StCoRCPnvLslEw4qJA5AOQuDyMivdQn+YVa2c0baJxKaXZt
xk4HZdMj4NAS0jRKnoZSwtKW/+Oz6rR4GAWrZo+Zs1/8ur3HfqnQfi8lJ1hJtLLW
V7XDCVRvamVi6Ah3ocYPPp/1P6yeQDA1ge9aMddqaza5STWISXRtSnFMUmYP3rbS
FaL8TyL+ilfny8pkGB2WlG6nLuSbtvogtdEh1gG1k1RmZt44kAtk8ba/KiWFPBSb
zK9cjojRMBS71f9G4kmb5F4rnXoLsg1YbD1Nzhz3wq2Cs1Z90dc2QwMren0zoQ1x
Fn56ueRyAiagBlnrSaKyso/2RvqJTNoSdi3RkpjYeAph0UoDCqvTvKjGAf1mWiw1
T/1lUWSVqWHnzZbM7XXzzajIN9bl6A7bbqlcAJ2O9vZIDt7273DG+bQym9Vh6Why
0LTGGERHxzKBsG7WRg+2Gmvv6S18UPKRo8tLtlA758rHlFuPTZCShWrIriwSNl1K
Hxon+d4BparSnm1h9W/NHPKJA574UbWRCBjdk58IkAj8DxZZY4ORD9SMP+ggkV7G
F6p9cgoDNP9KFg==
=jE0N
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Core:
- Overhaul of posix-timers in preparation of removing the workaround
for periodic timers which have signal delivery ignored.
- Remove the historical extra jiffie in msleep()
msleep() adds an extra jiffie to the timeout value to ensure
minimal sleep time. The timer wheel ensures minimal sleep time
since the large rewrite to a non-cascading wheel, but the extra
jiffie in msleep() remained unnoticed. Remove it.
- Make the timer slack handling correct for realtime tasks.
The procfs interface is inconsistent and does neither reflect
reality nor conforms to the man page. Show the correct 0 slack for
real time tasks and enforce it at the core level instead of having
inconsistent individual checks in various timer setup functions.
- The usual set of updates and enhancements all over the place.
Drivers:
- Allow the ACPI PM timer to be turned off during suspend
- No new drivers
- The usual updates and enhancements in various drivers"
* tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
ntp: Make sure RTC is synchronized when time goes backwards
treewide: Fix wrong singular form of jiffies in comments
cpu: Use already existing usleep_range()
timers: Rename next_expiry_recalc() to be unique
platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function
clocksource/drivers/jcore: Use request_percpu_irq()
clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent
clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init
clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init()
clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers
platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended
clocksource: acpi_pm: Add external callback for suspend/resume
clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped()
dt-bindings: timer: rockchip: Add rk3576 compatible
timers: Annotate possible non critical data race of next_expiry
timers: Remove historical extra jiffie for timeout in msleep()
hrtimer: Use and report correct timerslack values for realtime tasks
hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.
timers: Add sparse annotation for timer_sync_wait_running().
signal: Replace BUG_ON()s
...
- Core:
- Remove a global lock in the affinity setting code
The lock protects a cpumask for intermediate results and the lock
causes a bottleneck on simultaneous start of multiple virtual
machines. Replace the lock and the static cpumask with a per CPU
cpumask which is nicely serialized by raw spinlock held when
executing this code.
- Provide support for giving a suffix to interrupt domain names.
That's required to support devices with subfunctions so that the
domain names are distinct even if they originate from the same
device node.
- The usual set of cleanups and enhancements all over the place
- Drivers:
- Support for longarch AVEC interrupt chip
- Refurbishment of the Armada driver so it can be extended for new
variants.
- The usual set of cleanups and enhancements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn5p8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRFtD/43eB3h5usY2OPW0JmDqrE6qnzsvjPZ
1H52BcmMcOuI6yCfTnbi/fBB52mwSEGq9Dmt1GXradyq9/CJDIqZ1ajI1rA2jzW2
YdbeTDpKm1rS2ddzfp2LT2BryrNt+7etrRO7qHn4EKSuOcNuV2f58WPbIIqasvaK
uPbUDVDPrvXxLNcjoab6SqaKrEoAaHSyKpd0MvDd80wHrtcSC/QouW7JDSUXv699
RwvLebN1OF6mQ2J8Z3DLeCQpcbAs+UT8UvID7kYUJi1g71J/ZY+xpMLoX/gHiDNr
isBtsuEAiZeNaFpksc7A6Jgu5ljZf2/aLCqbPLlHaduHFNmo94x9KUbIF2cpEMN+
rsf5Ff7AVh1otz3cUwLLsm+cFLWRRoZdLuncn7rrgB4Yg0gll7qzyLO6YGvQHr8U
Ocj1RXtvvWsMk4XzhgCt1AH/42cO6go+bhA4HspeYykNpsIldIUl1MeFbO8sWiDJ
kybuwiwHp3oaMLjEK4Lpq65u7Ll8Lju2zRde65YUJN2nbNmJFORrOLmeC1qsr6ri
dpend6n2qD9UD1oAt32ej/uXnG160nm7UKescyxiZNeTm1+ez8GW31hY128ifTY3
4R3urGS38p3gazXBsfw6eqkeKx0kEoDNoQqrO5gBvb8kowYTvoZtkwMGAN9OADwj
w6vvU0i+NIyVMA==
=JlJ2
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Core:
- Remove a global lock in the affinity setting code
The lock protects a cpumask for intermediate results and the lock
causes a bottleneck on simultaneous start of multiple virtual
machines. Replace the lock and the static cpumask with a per CPU
cpumask which is nicely serialized by raw spinlock held when
executing this code.
- Provide support for giving a suffix to interrupt domain names.
That's required to support devices with subfunctions so that the
domain names are distinct even if they originate from the same
device node.
- The usual set of cleanups and enhancements all over the place
Drivers:
- Support for longarch AVEC interrupt chip
- Refurbishment of the Armada driver so it can be extended for new
variants.
- The usual set of cleanups and enhancements all over the place"
* tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits)
genirq: Use cpumask_intersects()
genirq/cpuhotplug: Use cpumask_intersects()
irqchip/apple-aic: Only access system registers on SoCs which provide them
irqchip/apple-aic: Add a new "Global fast IPIs only" feature level
irqchip/apple-aic: Skip unnecessary enabling of use_fast_ipi
dt-bindings: apple,aic: Document A7-A11 compatibles
irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy()
genirq/msi: Use kmemdup_array() instead of kmemdup()
genirq/proc: Change the return value for set affinity permission error
genirq/proc: Use irq_move_pending() in show_irq_affinity()
genirq/proc: Correctly set file permissions for affinity control files
genirq: Get rid of global lock in irq_do_set_affinity()
genirq: Fix typo in struct comment
irqchip/loongarch-avec: Add AVEC irqchip support
irqchip/loongson-pch-msi: Prepare get_pch_msi_handle() for AVECINTC
irqchip/loongson-eiointc: Rename CPUHP_AP_IRQ_LOONGARCH_STARTING
LoongArch: Architectural preparation for AVEC irqchip
LoongArch: Move irqchip function prototypes to irq-loongson.h
irqchip/loongson-pch-msi: Switch to MSI parent domains
softirq: Remove unused 'action' parameter from action callback
...
96fd6c65ef ("sched: Factor out update_other_load_avgs() from
__update_blocked_others()") added update_other_load_avgs() in
kernel/sched/syscalls.c right above effective_cpu_util(). This location
didn't fit that well in the first place, and with 5d871a6399 ("sched/fair:
Move effective_cpu_util() and effective_cpu_util() in fair.c") moving
effective_cpu_util() to kernel/sched/fair.c, it looks even more out of
place.
Relocate the function to kernel/sched/pelt.c where all its callees are.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Pull in tip/sched/core to resolve two merge conflicts:
- 96fd6c65ef ("sched: Factor out update_other_load_avgs() from __update_blocked_others()")
5d871a6399 ("sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c")
A simple context conflict. The former added __update_blocked_others() in
the same #ifdef CONFIG_SMP block that effective_cpu_util() and
sched_cpu_util() are in and the latter moved those functions to fair.c.
This makes __update_blocked_others() more out of place. Will follow up
with a patch to relocate.
- 96fd6c65ef ("sched: Factor out update_other_load_avgs() from __update_blocked_others()")
84d265281d ("sched/pelt: Use rq_clock_task() for hw_pressure")
The former factored out the body of __update_blocked_others() into
update_other_load_avgs(). The latter changed how update_hw_load_avg() is
called in the body. Resolved by applying the change to
update_other_load_avgs() instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Convert the sugov deadline task attributes to use the available
definitions to make them more readable.
No functional change.
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20240813144348.1180344-5-christian.loehle@arm.com
A task moving across CPUs should not trigger quiescent/runnable task state
events as the task is staying runnable the whole time and just stopping and
then starting on different CPUs. Suppress quiescent/runnable task state
events if task_on_rq_migrating().
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
While the BPF scheduler is being unloaded, the following warning messages
trigger sometimes:
NOHZ tick-stop error: local softirq work is pending, handler #80!!!
This is caused by the CPU entering idle while there are pending softirqs.
The main culprit is the bypassing state assertion not being synchronized
with rq operations. As the BPF scheduler cannot be trusted in the disable
path, the first step is entering the bypass mode where the BPF scheduler is
ignored and scheduling becomes global FIFO.
This is implemented by turning scx_ops_bypassing() true. However, the
transition isn't synchronized against anything and it's possible for enqueue
and dispatch paths to have different ideas on whether bypass mode is on.
Make each rq track its own bypass state with SCX_RQ_BYPASSING which is
modified while rq is locked.
This removes most of the NOHZ tick-stop messages but not completely. I
believe the stragglers are from the sched core bug where pick_task_scx() can
be called without preceding balance_scx(). Once that bug is fixed, we should
verify that all occurrences of this error message are gone too.
v2: scx_enabled() test moved inside the for_each_possible_cpu() loop so that
the per-cpu states are always synchronized with the global state.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Vernet <void@manifault.com>
Remove delayed tasks from util_est even they are runnable.
Exclude delayed task which are (a) migrating between rq's or (b) in a
SAVE/RESTORE dequeue/enqueue.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com
commit 97450eb909 ("sched/pelt: Remove shift of thermal clock")
removed the decay_shift for hw_pressure. This commit uses the
sched_clock_task() in sched_tick() while it replaces the
sched_clock_task() with rq_clock_pelt() in __update_blocked_others().
This could bring inconsistence. One possible scenario I can think of
is in ___update_load_sum():
u64 delta = now - sa->last_update_time
'now' could be calculated by rq_clock_pelt() from
__update_blocked_others(), and last_update_time was calculated by
rq_clock_task() previously from sched_tick(). Usually the former
chases after the latter, it cause a very large 'delta' and brings
unexpected behavior.
Fixes: 97450eb909 ("sched/pelt: Remove shift of thermal clock")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20240827112607.181206-1-yu.c.chen@intel.com
Move effective_cpu_util() and sched_cpu_util() functions in fair.c file
with others utilization related functions.
No functional change.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240904092417.20660-1-vincent.guittot@linaro.org
Since commit b2a02fc43a ("smp: Optimize
send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an
IPI without actually sending an interrupt. Even in cases where the IPI
handler does not queue a task on the idle CPU, do_idle() will call
__schedule() since need_resched() returns true in these cases.
Introduce and use SM_IDLE to identify call to __schedule() from
schedule_idle() and shorten the idle re-entry time by skipping
pick_next_task() when nr_running is 0 and the previous task is the idle
task.
With the SM_IDLE fast-path, the time taken to complete a fixed set of
IPIs using ipistorm improves noticeably. Following are the numbers
from a dual socket Intel Ice Lake Xeon server (2 x 32C/64T) and
3rd Generation AMD EPYC system (2 x 64C/128T) (boost on, C2 disabled)
running ipistorm between CPU8 and CPU16:
cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1
==================================================================
Test : ipistorm (modified)
Units : Normalized runtime
Interpretation: Lower is better
Statistic : AMean
======================= Intel Ice Lake Xeon ======================
kernel: time [pct imp]
tip:sched/core 1.00 [baseline]
tip:sched/core + SM_IDLE 0.80 [20.51%]
==================== 3rd Generation AMD EPYC =====================
kernel: time [pct imp]
tip:sched/core 1.00 [baseline]
tip:sched/core + SM_IDLE 0.90 [10.17%]
==================================================================
[ kprateek: Commit message, SM_RTLOCK_WAIT fix ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240809092240.6921-1-kprateek.nayak@amd.com
Once a task is put into a DSQ, the allowed operations are fairly limited.
Tasks in the built-in local and global DSQs are executed automatically and,
ignoring dequeue, there is only one way a task in a user DSQ can be
manipulated - scx_bpf_consume() moves the first task to the dispatching
local DSQ. This inflexibility sometimes gets in the way and is an area where
multiple feature requests have been made.
Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during
DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and
user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context
which dosen't hold a rq lock including BPF timers and SYSCALL programs.
This is an expansion of an earlier patch which only allowed moving into the
dispatching local DSQ:
http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org
v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as
they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument
count limit and often won't be needed anyway. Instead provide
scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called
only when needed and override the specified parameter for the subsequent
dispatch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
struct scx_iter_scx_dsq is defined as 6 u64's and scx_dsq_iter_kern was
using 5 of them. We want to add two more u64 fields but it's better if we do
so while staying within scx_iter_scx_dsq to maintain binary compatibility.
The way scx_iter_scx_dsq_kern is laid out is rather inefficient - the node
field takes up three u64's but only one bit of the last u64 is used. Turn
the bool into u32 flags and only use the lower 16 bits freeing up 48 bits -
16 bits for flags, 32 bits for a u32 - for use by struct
bpf_iter_scx_dsq_kern.
This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern
into the cursor field reducing the struct size by a full u64.
No behavior changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().
- Rename consume_local_task() to move_local_task_to_local_dsq() and remove
task_unlink_from_dsq() and source DSQ unlocking from it.
This is to make the migration code easier to reuse.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
So that the local case comes first and two CONFIG_SMP blocks can be merged.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into
task_unlink_from_dsq(). Also move sanity check into it.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Reorder args for consistency in the order of:
current_rq, p, src_[rq|dsq], dst_[rq|dsq].
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that there's nothing left after the big if block, flip the if condition
and unindent the body.
No functional changes intended.
v2: Add BUG() to clarify control can't reach the end of
dispatch_to_local_dsq() in UP kernels per David.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
With the preceding update, the only return value which makes meaningful
difference is DTL_INVALID, for which one caller, finish_dispatch(), falls
back to the global DSQ and the other, process_ddsp_deferred_locals(),
doesn't do anything.
It should always fallback to the global DSQ. Move the global DSQ fallback
into dispatch_to_local_dsq() and remove the return value.
v2: Patch title and description updated to reflect the behavior fix for
process_ddsp_deferred_locals().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON.
Instead, each caller is hanlding SCX_DSQ_LOCAL_ON before calling it. Move
SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate
code in direct_dispatch() and dispatch_to_local_dsq().
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
The tricky p->scx.holding_cpu handling was split across
consume_remote_task() body and move_task_to_local_dsq(). Refactor such that:
- All the tricky part is now in the new unlink_dsq_and_lock_src_rq() with
consolidated documentation.
- move_task_to_local_dsq() now implements straightforward task migration
making it easier to use in other places.
- dispatch_to_local_dsq() is another user move_task_to_local_dsq(). The
usage is updated accordingly. This makes the local and remote cases more
symmetric.
No functional changes intended.
v2: s/task_rq/src_rq/ for consistency.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Sleepables don't need to be in its own kfunc set as each is tagged with
KF_SLEEPABLE. Rename to scx_kfunc_set_unlocked indicating that rq lock is
not held and relocate right above the any set. This will be used to add
kfuncs that are allowed to be called from SYSCALL but not TRACING.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
pick_task_scx() must be preceded by balance_scx() but there currently is a
bug where fair could say yes on balance() but no on pick_task(), which then
ends up calling pick_task_scx() without preceding balance_scx(). Work around
by dropping WARN_ON_ONCE() and ignoring cases which don't make sense.
This isn't great and can theoretically lead to stalls. However, for
switch_all cases, this happens only while a BPF scheduler is being loaded or
unloaded, and, for partial cases, fair will likely keep triggering this CPU.
This will be reverted once the fair behavior is fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Pull bpf/master to receive baebe9aaba ("bpf: allow passing struct
bpf_iter_<type> as kfunc arguments") and related changes in preparation for
the DSQ iterator patchset.
Signed-off-by: Tejun Heo <tj@kernel.org>