linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-09-02 16:44:59 +00:00

Author	SHA1	Message	Date
Andrea Righi	ddf7233fca	sched/ext: Fix invalid task state transitions on class switch When enabling a sched_ext scheduler, we may trigger invalid task state transitions, resulting in warnings like the following (which can be easily reproduced by running the hotplug selftest in a loop): sched_ext: Invalid task state transition 0 -> 3 for fish[770] WARNING: CPU: 18 PID: 787 at kernel/sched/ext.c:3862 scx_set_task_state+0x7c/0xc0 ... RIP: 0010:scx_set_task_state+0x7c/0xc0 ... Call Trace: <TASK> scx_enable_task+0x11f/0x2e0 switching_to_scx+0x24/0x110 scx_enable.isra.0+0xd14/0x13d0 bpf_struct_ops_link_create+0x136/0x1a0 __sys_bpf+0x1edd/0x2c30 __x64_sys_bpf+0x21/0x30 do_syscall_64+0xbb/0x370 entry_SYSCALL_64_after_hwframe+0x77/0x7f This happens because we skip initialization for tasks that are already dead (with their usage counter set to zero), but we don't exclude them during the scheduling class transition phase. Fix this by also skipping dead tasks during class swiching, preventing invalid task state transitions. Fixes: `a8532fac7b` ("sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-08-11 06:56:37 -10:00
Linus Torvalds	6a68cec16b	sched_ext: Changes for v6.17 - Add support for cgroup "cpu.max" interface. - Code organization cleanup so that ext_idle.c doesn't depend on the source-file-inclusion build method of sched/. - Drop UP paths in accordance with sched core changes. - Documentation and other misc changes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaIqnxg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGUh5AQC6YM7ggRPYRmy28m5B0nubpKtCHqPOAHSd/QbY MCiThgD+JuE9ewg3wYO/jvJx3NyIRB1McMnAaG59hf6R0Plh5Qo= =TeLF -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Add support for cgroup "cpu.max" interface - Code organization cleanup so that ext_idle.c doesn't depend on the source-file-inclusion build method of sched/ - Drop UP paths in accordance with sched core changes - Documentation and other misc changes * tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix scx_bpf_reenqueue_local() reference sched_ext: Drop kfuncs marked for removal in 6.15 sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic kernel/sched/ext.c: fix typo "occured" -> "occurred" in comments sched_ext: Add support for cgroup bandwidth control interface sched_ext, sched/core: Factor out struct scx_task_group sched_ext: Return NULL in llc_span sched_ext: Always use SMP versions in kernel/sched/ext_idle.h sched_ext: Always use SMP versions in kernel/sched/ext_idle.c sched_ext: Always use SMP versions in kernel/sched/ext.h sched_ext: Always use SMP versions in kernel/sched/ext.c sched_ext: Documentation: Clarify time slice handling in task lifecycle sched_ext: Make scx_locked_rq() inline sched_ext: Make scx_rq_bypassing() inline sched_ext: idle: Make local functions static in ext_idle.c sched_ext: idle: Remove unnecessary ifdef in scx_bpf_cpu_node()	2025-07-31 16:29:46 -07:00
Christian Loehle	ae96bba1ca	sched_ext: Fix scx_bpf_reenqueue_local() reference The comment mentions bpf_scx_reenqueue_local(), but the function is provided for the BPF program implementing scx, as such the naming convention is scx_bpf_reenqueue_local(), fix the comment. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-07-17 08:17:26 -10:00
Breno Leitao	e14fd98c6d	sched/ext: Prevent update_locked_rq() calls with NULL rq Avoid invoking update_locked_rq() when the runqueue (rq) pointer is NULL in the SCX_CALL_OP and SCX_CALL_OP_RET macros. Previously, calling update_locked_rq(NULL) with preemption enabled could trigger the following warning: BUG: using __this_cpu_write() in preemptible [00000000] This happens because __this_cpu_write() is unsafe to use in preemptible context. rq is NULL when an ops invoked from an unlocked context. In such cases, we don't need to store any rq, since the value should already be NULL (unlocked). Ensure that update_locked_rq() is only called when rq is non-NULL, preventing calling __this_cpu_write() on preemptible context. Suggested-by: Peter Zijlstra <peterz@infradead.org> Fixes: `18853ba782` ("sched_ext: Track currently locked rq") Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v6.15	2025-07-16 15:02:12 -10:00
Jake Hillion	4ecf837414	sched_ext: Drop kfuncs marked for removal in 6.15 sched_ext performed a kfunc renaming pass in 6.13 and kept the old names around for compatibility with old binaries. These were scheduled for cleanup in 6.15 but were missed. Submitting for cleanup in for-next. Removed the kfuncs, their flags, and any references I could find to them in doc comments. Left the entries in include/scx/compat.bpf.h as they're still useful to make new binaries compatible with old kernels. Tested by applying to my kernel. It builds and a modern version of scx_lavd loads fine. Signed-off-by: Jake Hillion <jake@hillion.co.uk> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-25 13:13:20 -10:00
David Dai	cb444006a6	sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic For systems using a sched_ext scheduler and has panic_on_rcu_stall enabled, try kicking out the current scheduler before issuing a panic. While there are numerous reasons for RCU CPU stalls that are not directly attributed to the scheduler, deferring the panic gives sched_ext an opportunity to provide additional debug info when ejecting the current scheduler. Also, handling the event more gracefully allows us to potentially recover the system instead of incurring additional down time. Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: David Dai <david.dai@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-24 13:05:26 -10:00
Ke Ma	e2a37c277c	kernel/sched/ext.c: fix typo "occured" -> "occurred" in comments Fixes a minor spelling mistake in two comment lines Signed-off-by: Ke Ma <makebit1999@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-23 08:11:16 -10:00
Tejun Heo	ddceadce63	sched_ext: Add support for cgroup bandwidth control interface From 077814f57f8acce13f91dc34bbd2b7e4911fbf25 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Fri, 13 Jun 2025 15:06:47 -1000 - Add CONFIG_GROUP_SCHED_BANDWIDTH which is selected by both CONFIG_CFS_BANDWIDTH and EXT_GROUP_SCHED. - Put bandwidth control interface files for both cgroup v1 and v2 under CONFIG_GROUP_SCHED_BANDWIDTH. - Update tg_bandwidth() to fetch configuration parameters from fair if CONFIG_CFS_BANDWIDTH, SCX otherwise. - Update tg_set_bandwidth() to update the parameters for both fair and SCX. - Add bandwidth control parameters to struct scx_cgroup_init_args. - Add sched_ext_ops.cgroup_set_bandwidth() which is invoked on bandwidth control parameter updates. - Update scx_qmap and maximal selftest to test the new feature. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-20 17:03:51 -10:00
Tejun Heo	6e6558a6bc	sched_ext, sched/core: Factor out struct scx_task_group More sched_ext fields will be added to struct task_group. In preparation, factor out sched_ext fields into struct scx_task_group to reduce clutter in the common header. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-20 17:03:27 -10:00
Tejun Heo	e4e149dd2f	sched_ext: Merge branch 'for-6.16-fixes' into for-6.17 Pull sched_ext/for-6.16-fixes to receive: `c50784e99f` ("sched_ext: Make scx_group_set_weight() always update tg->scx.weight") `33796b9187` ("sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group()") which are needed to implement CPU bandwidth control interface support. Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-20 17:01:21 -10:00
Tejun Heo	33796b9187	sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group() During task_group creation, sched_create_group() calls scx_group_set_weight() with CGROUP_WEIGHT_DFL to initialize the sched_ext portion. This is premature and ends up calling ops.cgroup_set_weight() with an incorrect @cgrp before ops.cgroup_init() is called. sched_create_group() should just initialize SCX related fields in the new task_group. Fix it by factoring out scx_tg_init() from sched_init() and making sched_create_group() call that function instead of scx_group_set_weight(). v2: Retain CONFIG_EXT_GROUP_SCHED ifdef in sched_init() as removing it leads to build failures on !CONFIG_GROUP_SCHED configs. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `8195136669` ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+	2025-06-17 08:19:55 -10:00
Tejun Heo	c50784e99f	sched_ext: Make scx_group_set_weight() always update tg->scx.weight Otherwise, tg->scx.weight can go out of sync while scx_cgroup is not enabled and ops.cgroup_init() may be called with a stale weight value. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `8195136669` ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+	2025-06-17 08:19:43 -10:00
Cheng-Yang Chou	165af41516	sched_ext: Always use SMP versions in kernel/sched/ext.c Simplify the scheduler by making formerly SMP-only primitives and data structures unconditional. tj: Updated subject for clarity. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-13 14:29:39 -10:00
Andrea Righi	086ed90a64	sched_ext: Make scx_locked_rq() inline scx_locked_rq() is used both from ext.c and ext_idle.c, move it to ext.h as a static inline function. No functional changes. v2: Rename locked_rq to scx_locked_rq_state, expose it and make scx_locked_rq() inline, as suggested by Tejun. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-09 06:25:35 -10:00
Andrea Righi	e212743bd7	sched_ext: Make scx_rq_bypassing() inline scx_rq_bypassing() is used both from ext.c and ext_idle.c, move it to ext.h as a static inline function. No functional changes. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-06-09 06:25:24 -10:00
Linus Torvalds	90b83efa67	bpf-next-6.16 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmg3NqgACgkQ6rmadz2v bTpNUQ/8DPeYtn3nskpsP2OwFy6O3hhfCe6gjOAmUVSk000xbG+AcI/h1DnGZWgk xlVcEs93ekzUzHd7k1+RJ2c5yDLXieLJAtb66rbFU1enkxs2cWlcWSKE6K/gaoh3 G1BCARVlKwtrJhrVrsXtYP/eGZxKRSUZFK7xhtCk7lp7sRI3xkTLE+FJBcDkTJ6W HwF14i3zO+BkqNGdFwwlASCCqRItSNBBiM3KjW1DbETOTfAKlvCTrcgdUiODqxhF PNnULW+xmICABDFlKfDMlUAGNlSHKjiI3+g31LdblA5eyEhIqiCRgBGFYoCnsluk qUauRSie61KqC7fxN3qVpC3bXJfD1td7uIvoqSkDLtTv8a5+HAoiohzi1qBzCayl LAGkBYewAfDtdDDjNY38JLH2RCdyY6zG9DhqghPHdPlM7zj7L5zZgj34igEwesMM mfj9TuFFF99yfX5UUeSxKpDGR1eO4Ew0p7tg8CRs8Fqh6AIQSmboREZrsncVRCTS 4SDHSI4KcO4LO2pEKzy+X4dewganN7aESnQG34iG0liyvDDwJOgUnDWLRwPLas7k 3b/zIfBLxOJpA5R+0hhAMtjMA4NgyKJf4yFZwEieuasQjvzwTApi24YhZ/b3HSEB 2Dp8kHEEbwezv0OFFz/fJ88dNQnrDmtJ+QByN/liA8kj4Yuh2+Q= =j3t8 -----END PGP SIGNATURE----- Merge tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Fix and improve BTF deduplication of identical BTF types (Alan Maguire and Andrii Nakryiko) - Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and Alexis Lothoré) - Support load-acquire and store-release instructions in BPF JIT on riscv64 (Andrea Parri) - Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton Protopopov) - Streamline allowed helpers across program types (Feng Yang) - Support atomic update for hashtab of BPF maps (Hou Tao) - Implement json output for BPF helpers (Ihor Solodrai) - Several s390 JIT fixes (Ilya Leoshkevich) - Various sockmap fixes (Jiayuan Chen) - Support mmap of vmlinux BTF data (Lorenz Bauer) - Support BPF rbtree traversal and list peeking (Martin KaFai Lau) - Tests for sockmap/sockhash redirection (Michal Luczaj) - Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko) - Add support for dma-buf iterators in BPF (T.J. Mercier) - The verifier support for __bpf_trap() (Yonghong Song) * tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits) bpf, arm64: Remove unused-but-set function and variable. selftests/bpf: Add tests with stack ptr register in conditional jmp bpf: Do not include stack ptr register in precision backtracking bookkeeping selftests/bpf: enable many-args tests for arm64 bpf, arm64: Support up to 12 function arguments bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem() bpf: Avoid __bpf_prog_ret0_warn when jit fails bpftool: Add support for custom BTF path in prog load/loadall selftests/bpf: Add unit tests with __bpf_trap() kfunc bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable bpf: Remove special_kfunc_set from verifier selftests/bpf: Add test for open coded dmabuf_iter selftests/bpf: Add test for dmabuf_iter bpf: Add open coded dmabuf iterator bpf: Add dmabuf iterator dma-buf: Rename debugfs symbols bpf: Fix error return value in bpf_copy_from_user_dynptr libbpf: Use mmap to parse vmlinux BTF from sysfs selftests: bpf: Add a test for mmapable vmlinux BTF btf: Allow mmap of vmlinux btf ...	2025-05-28 15:52:42 -07:00
Andrea Righi	617a77018f	sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.c Relocate the scx_kf_allowed_if_unlocked(), so it can be used from other source files (e.g., ext_idle.c). No functional change. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-05-20 10:23:50 -10:00
Tejun Heo	cb4ff91492	sched_ext: Explain the temporary situation around scx_root dereferences Naked scx_root dereferences are being used as temporary markers to indicate that they need to be updated to point to the right scheduler instance. Explain the situation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>	2025-05-14 12:37:18 -04:00
Tejun Heo	a8433f7a26	sched_ext: Add @sch to SCX_CALL_OP() In preparation of hierarchical scheduling support, add @sch to scx_exit() and friends: - scx_exit/error() updated to take explicit @sch instead of assuming scx_root. - scx_kf_exit/error() added. These are to be used from kfuncs, don't take @sch and internally determine the scx_sched instance to abort. Currently, it's always scx_root but once multiple scheduler support is in place, it will be the scx_sched instance that invoked the kfunc. This simplifies many callsites and defers scx_sched lookup until error is triggered. - @sch is propagated to ops_cpu_valid() and ops_sanitize_err(). The CPU validity conditions in ops_cpu_valid() are factored into __cpu_valid() to implement kf_cpu_valid() which is the counterpart to scx_kf_exit/error(). - All users are converted. Most conversions are straightforward. check_rq_for_timeouts() and scx_softlockup() are updated to use explicit rcu_dereference(scx_root) for safety as they may execute asynchronous to the exit path. scx_tick() is also updated to use rcu_dereference(). While not strictly necessary due to the preceding scx_enabled() test and IRQ disabled context, this removes the subtlety at no noticeable cost. No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2025-05-14 11:11:48 -04:00
Tejun Heo	c4c286d747	sched_ext: Cleanup [__]scx_exit/error*() __scx_exit() is the base exit implementation and there are three wrappers on top of it - scx_exit(), __scx_error() and scx_error(). This is more confusing than helpful especially given that there are only a couple users of scx_exit() and __scx_error(). To simplify the situation: - Make __scx_exit() take va_list and rename it to scx_vexit(). This is to ease implementing more complex extensions on top. - Make scx_exit() a varargs wrapper around __scx_exit(). scx_exit() now takes both @kind and @exit_code. - Convert existing scx_exit() and __scx_error() users to use the new scx_exit(). - scx_error() remains unchanged. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2025-05-14 11:11:48 -04:00
Tejun Heo	ab3f497ac1	sched_ext: Add @sch to SCX_CALL_OP() In preparation of hierarchical scheduling support, make SCX_CALL_OP() take explicit @sch instead of assuming scx_root. As scx_root is still the only scheduler instance, this patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2025-05-14 11:11:48 -04:00
Tejun Heo	d310fb4009	sched_ext: Clean up scx_root usages - Always cache scx_root into local variable sch before using. - Don't use scx_root if cached sch is available. - Wrap !sch test with unlikely(). - Pass @scx into scx_cgroup_init/exit(). No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>	2025-05-14 11:11:48 -04:00
Feng Yang	8c112a428b	sched_ext: Remove bpf_scx_get_func_proto task_storage_{get,delete} has been moved to bpf_base_func_proto. Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20250506061434.94277-3-yangfeng59949@163.com	2025-05-09 11:01:45 -07:00
Tejun Heo	9b30400ff6	Merge branch 'for-6.15-fixes' into for-6.16 To receive `428dc9fc08` ("sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator") which conflicts with `cdf5a6faa8` ("sched_ext: Move dsq_hash into scx_sched"). The conflict is a simple context conflict which can be resolved by taking changes from both changes in the right order.	2025-05-07 06:25:39 -10:00
Tejun Heo	428dc9fc08	sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator BPF programs may call next() and destroy() on BPF iterators even after new() returns an error value (e.g. bpf_for_each() macro ignores error returns from new()). bpf_iter_scx_dsq_new() could leave the iterator in an uninitialized state after an error return causing bpf_iter_scx_dsq_next() to dereference garbage data. Make bpf_iter_scx_dsq_new() always clear $kit->dsq so that next() and destroy() become noops. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `650ba21b13` ("sched_ext: Implement DSQ iterator") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com>	2025-05-07 06:24:07 -10:00
Andrea Righi	c8fafb3485	sched_ext: Avoid NULL scx_root deref in __scx_exit() A sched_ext scheduler may trigger __scx_exit() from a BPF timer callback, where scx_root may not be safely dereferenced. This can lead to a NULL pointer dereference as shown below (triggered by scx_tickless): BUG: kernel NULL pointer dereference, address: 0000000000000330 ... CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full) RIP: 0010:__scx_exit+0x2b/0x190 ... Call Trace: <IRQ> scx_bpf_get_idle_smtmask+0x59/0x80 bpf_prog_8320d4217989178c_dispatch_all_cpus+0x35/0x1b6 ... bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264 bpf_timer_cb+0x7a/0x140 __hrtimer_run_queues+0x1f9/0x3a0 hrtimer_run_softirq+0x8c/0xd0 handle_softirqs+0xd3/0x3d0 __irq_exit_rcu+0x9a/0xc0 irq_exit_rcu+0xe/0x20 Fix this by checking for a valid scx_root and adding proper RCU protection. Fixes: `48e1267773` ("sched_ext: Introduce scx_sched") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-30 12:47:17 -10:00
Andrea Righi	c01adf4097	sched_ext: Add RCU protection to scx_root in DSQ iterator Using a DSQ iterators from a timer callback can trigger the following lockdep splat when accessing scx_root: ============================= WARNING: suspicious RCU usage 6.14.0-virtme #1 Not tainted ----------------------------- kernel/sched/ext.c:6907 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by swapper/0/0. stack backtrace: CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full) Sched_ext: tickless (enabled+all) Call Trace: <IRQ> dump_stack_lvl+0x6f/0xb0 lockdep_rcu_suspicious.cold+0x4e/0xa3 bpf_iter_scx_dsq_new+0xb1/0xd0 bpf_prog_63f4fd1bccc101e7_dispatch_cpu+0x3e/0x156 bpf_prog_8320d4217989178c_dispatch_all_cpus+0x153/0x1b6 bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264 ? hrtimer_run_softirq+0x4f/0xd0 bpf_timer_cb+0x7a/0x140 __hrtimer_run_queues+0x1f9/0x3a0 hrtimer_run_softirq+0x8c/0xd0 handle_softirqs+0xd3/0x3d0 __irq_exit_rcu+0x9a/0xc0 irq_exit_rcu+0xe/0x20 sysvec_apic_timer_interrupt+0x73/0x80 Add a proper dereference check to explicitly validate RCU-safe access to scx_root from rcu_read_lock() contexts and also from contexts that hold rcu_read_lock_bh(), such as timer callbacks. Fixes: `cdf5a6faa8` ("sched_ext: Move dsq_hash into scx_sched") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-30 12:39:50 -10:00
Tejun Heo	9ba7f37e5b	sched_ext: Clean up SCX_EXIT_NONE handling in scx_disable_workfn() With the global states and disable machinery moved into scx_sched, scx_disable_workfn() can only be scheduled and run for the specific scheduler instance. This makes it impossible for scx_disable_workfn() to see SCX_EXIT_NONE. Turn that condition into WARN_ON_ONCE(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>	2025-04-29 08:40:11 -10:00
Tejun Heo	bff3b5aec1	sched_ext: Move disable machinery into scx_sched Because disable can be triggered from any place and the scheduler cannot be trusted, disable path uses an irq_work to bounce and a kthread_work which is executed on an RT helper kthread to perform disable. These must be per scheduler instance to guarantee forward progress. Move them into scx_sched. - If an scx_sched is accessible, its helper kthread is always valid making the `helper` check in schedule_scx_disable_work() unnecessary. As the function becomes trivial after the removal of the test, inline it. - scx_create_rt_helper() has only one user - creation of the disable helper kthread. Inline it into scx_alloc_and_add_sched(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>	2025-04-29 08:40:10 -10:00
Tejun Heo	c201ea1578	sched_ext: Move event_stats_cpu into scx_sched The event counters are going to become per scheduler instance. Move event_stats_cpu into scx_sched. - [__]scx_add_event() are updated to take @sch. While at it, add missing parentheses around @cnt expansions. - scx_read_events() is updated to take @sch. - scx_bpf_events() accesses scx_root under RCU read lock. v2: - Replace stale scx_bpf_get_event_stat() reference in a comment with scx_bpf_events(). - Trivial goto label rename. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>	2025-04-29 08:40:10 -10:00
Tejun Heo	3a8facc424	sched_ext: Factor out scx_read_events() In prepration of moving event_stats_cpu into scx_sched, factor out scx_read_events() out of scx_bpf_events() and update the in-kernel users - scx_attr_events_show() and scx_dump_state() - to use scx_read_events() instead of scx_bpf_events(). No observable behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>	2025-04-29 08:40:10 -10:00
Tejun Heo	f97a79156a	sched_ext: Relocate scx_event_stats definition In prepration of moving event_stats_cpu into scx_sched, move scx_event_stats definitions above scx_sched definition. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>	2025-04-29 08:40:10 -10:00
Tejun Heo	8409b800a0	sched_ext: Move global_dsqs into scx_sched Global DSQs are going to become per scheduler instance. Move global_dsqs into scx_sched. find_global_dsq() already takes a task_struct pointer as an argument and should later be able to determine the scx_sched to use from that. For now, assume scx_root. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>	2025-04-29 08:40:10 -10:00
Tejun Heo	cdf5a6faa8	sched_ext: Move dsq_hash into scx_sched User DSQs are going to become per scheduler instance. Move dsq_hash into scx_sched. This shifts the code that assumes scx_root to be the only scx_sched instance up the call stack but doesn't remove them yet. v2: Add missing rcu_read_lock() in scx_bpf_destroy_dsq() as per Andrea. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>	2025-04-29 08:40:10 -10:00
Tejun Heo	d9f7546310	sched_ext: Factor out scx_alloc_and_add_sched() More will be moved into scx_sched. Factor out the allocation and kobject addition path into scx_alloc_and_add_sched(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>	2025-04-29 08:40:10 -10:00
Tejun Heo	392b7e08de	sched_ext: Inline create_dsq() into scx_bpf_create_dsq() create_dsq() is only used by scx_bpf_create_dsq() and the separation gets in the way of making dsq_hash per scx_sched. Inline it into scx_bpf_create_dsq(). While at it, add unlikely() around SCX_DSQ_FLAG_BUILTIN test. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>	2025-04-29 08:40:10 -10:00
Tejun Heo	17108735b4	sched_ext: Use dynamic allocation for scx_sched To prepare for supporting multiple schedulers, make scx_sched allocated dynamically. scx_sched->kobj is now an embedded field and the kobj's lifetime determines the lifetime of the containing scx_sched. - Enable path is updated so that kobj init and addition are performed later. - scx_sched freeing is initiated in scx_kobj_release() and also goes through an rcu_work so that scx_root can be accessed from an unsynchronized path - scx_disable(). - sched_ext_ops->priv is added and used to point to scx_sched instance created for the ops instance. This is used by bpf_scx_unreg() to determine the scx_sched instance to disable and put. No behavior changes intended. v2: Andrea reported kernel oops due to scx_bpf_unreg() trying to deref NULL scx_root after scheduler init failure. sched_ext_ops->priv added so that scx_bpf_unreg() can always find the scx_sched instance to unregister even if it failed early during init. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>	2025-04-29 08:40:10 -10:00
Tejun Heo	a77d10d032	sched_ext: Avoid NULL scx_root deref through SCX_HAS_OP() SCX_HAS_OP() tests scx_root->has_op bitmap. The bitmap is currently in a statically allocated struct scx_sched and initialized while loading the BPF scheduler and cleared while unloading, and thus can be tested anytime. However, scx_root will be switched to dynamic allocation and thus won't always be deferenceable. Most usages of SCX_HAS_OP() are already protected by scx_enabled() either directly or indirectly (e.g. through a task which is on SCX). However, there are a couple places that could try to dereference NULL scx_root. Update them so that scx_root is guaranteed to be valid before SCX_HAS_OP() is called. - In handle_hotplug(), test whether scx_root is NULL before doing anything else. This is safe because scx_root updates will be protected by cpus_read_lock(). - In scx_tg_offline(), test scx_cgroup_enabled before invoking SCX_HAS_OP(), which should guarnatee that scx_root won't turn NULL. This is also in line with other cgroup operations. As the code path is synchronized against scx_cgroup_init/exit() through scx_cgroup_rwsem, this shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>	2025-04-29 08:40:10 -10:00
Tejun Heo	48e1267773	sched_ext: Introduce scx_sched To support multiple scheduler instances, collect some of the global variables that should be specific to a scheduler instance into the new struct scx_sched. scx_root is the root scheduler instance and points to a static instance of struct scx_sched. Except for an extra dereference through the scx_root pointer, this patch makes no functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>	2025-04-29 08:40:10 -10:00
Tejun Heo	ce565f839c	Merge branch 'for-6.15-fixes' into for-6.16 To receive `e38be1c764` ("sched_ext: Fix rq lock state in hotplug ops") to avoid conflicts with scx_sched related patches pending for for-6.16.	2025-04-29 08:24:58 -10:00
Andrea Righi	e38be1c764	sched_ext: Fix rq lock state in hotplug ops The ops.cpu_online() and ops.cpu_offline() callbacks incorrectly assume that the rq involved in the operation is locked, which is not the case during hotplug, triggering the following warning: WARNING: CPU: 1 PID: 20 at kernel/sched/sched.h:1504 handle_hotplug+0x280/0x340 Fix by not tracking the target rq as locked in the context of ops.cpu_online() and ops.cpu_offline(). Fixes: `18853ba782` ("sched_ext: Track currently locked rq") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Tested-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-29 08:01:32 -10:00
Andrea Righi	e7dcd1304b	sched_ext: Remove duplicate BTF_ID_FLAGS definitions Some kfuncs specific to the idle CPU selection policy are registered in both the scx_kfunc_ids_any and scx_kfunc_ids_idle blocks, even though they should only be defined in the latter. Remove the duplicates from scx_kfunc_ids_any. Fixes: `337d1b354a` ("sched_ext: Move built-in idle CPU selection policy to a separate file") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-25 08:41:02 -10:00
Andrea Righi	069ac9e161	sched_ext: Clarify CPU context for running/stopping callbacks The ops.running() and ops.stopping() callbacks can be invoked from a CPU other than the one the task is assigned to, particularly when a task property is changed, as both scx_next_task_scx() and dequeue_task_scx() may run on CPUs different from the task's target CPU. This behavior can lead to confusion or incorrect assumptions if not properly clarified, potentially resulting in bugs (see [1]). Therefore, update the documentation to clarify this aspect and advise users to use scx_bpf_task_cpu() to determine the actual CPU the task will run on or was running on. [1] https://github.com/sched-ext/scx/pull/1728 Cc: Jake Hillion <jake@hillion.co.uk> Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-23 13:34:32 -10:00
Tejun Heo	ac47c272b2	Merge branch 'for-6.15-fixes' into for-6.16 `a11d6784d7` ("sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()") added a call to scx_ops_error() which was renamed to scx_error() in for-6.16. Fix it up.	2025-04-22 09:29:44 -10:00
Andrea Righi	a11d6784d7	sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set() scx_bpf_cpuperf_set() can be used to set a performance target level on any CPU. However, it doesn't correctly acquire the corresponding rq lock, which may lead to unsafe behavior and trigger the following warning, due to the lockdep_assert_rq_held() check: [ 51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0 ... [ 51.713836] Call trace: [ 51.713837] scx_bpf_cpuperf_set+0x1a0/0x1e0 (P) [ 51.713839] bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440 [ 51.713841] bpf__sched_ext_ops_init+0x54/0x8c [ 51.713843] scx_ops_enable.constprop.0+0x2c0/0x10f0 [ 51.713845] bpf_scx_reg+0x18/0x30 [ 51.713847] bpf_struct_ops_link_create+0x154/0x1b0 [ 51.713849] __sys_bpf+0x1934/0x22a0 Fix by properly acquiring the rq lock when possible or raising an error if we try to operate on a CPU that is not the one currently locked. Fixes: `d86adb4fc0` ("sched_ext: Add cpuperf support") Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-22 09:28:12 -10:00
Andrea Righi	18853ba782	sched_ext: Track currently locked rq Some kfuncs provided by sched_ext may need to operate on a struct rq, but they can be invoked from various contexts, specifically, different scx callbacks. While some of these callbacks are invoked with a particular rq already locked, others are not. This makes it impossible for a kfunc to reliably determine whether it's safe to access a given rq, triggering potential bugs or unsafe behaviors, see for example [1]. To address this, track the currently locked rq whenever a sched_ext callback is invoked via SCX_CALL_OP*(). This allows kfuncs that need to operate on an arbitrary rq to retrieve the currently locked one and apply the appropriate action as needed. [1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-22 09:27:50 -10:00
Honglei Wang	69120f8228	sched_ext: add helper for refill task with default slice Add helper for refilling task with default slice and event statistics accordingly. Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-18 17:27:00 -10:00
Honglei Wang	f203683c3e	sched_ext: change the variable name for slice refill event SCX_EV_ENQ_SLICE_DFL gives the impression that the event only occurs when the tasks were enqueued, which seems not accurate. What it actually means is the refilling with defalt slice, and this can occur either when enqueue or pick_task. Let's change the variable to SCX_EV_REFILL_SLICE_DFL. Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-18 17:25:39 -10:00
Tejun Heo	0b30461793	sched_ext: Make scx_has_op a bitmap scx_has_op is used to encode which ops are implemented by the BPF scheduler into an array of static_keys. While this saves a bit of branching overhead, that is unlikely to be noticeable compared to the overall cost. As the global static_keys can't work with the planned hierarchical multiple scheduler support, replace the static_key array with a bitmap. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>	2025-04-09 09:06:00 -10:00
Tejun Heo	743354e3bb	sched_ext: Remove scx_ops_allow_queued_wakeup static_key scx_ops_allow_queued_wakeup is used to encode SCX_OPS_ALLOW_QUEUED_WAKEUP into a static_key. The test is gated behind scx_enabled(), and, even when sched_ext is enabled, is unlikely for the static_key usage to make any meaningful difference. It is made to use a static_key mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_key and instead test SCX_OPS_ALLOW_QUEUED_WAKEUP directly. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>	2025-04-09 09:06:00 -10:00

1 2 3 4 5 ...

268 Commits