mirror_ubuntu-kernels/kernel/sched
Tejun Heo b07996c7ab sched_ext: Don't hold scx_tasks_lock for too long
While enabling and disabling a BPF scheduler, every task is iterated a
couple times by walking scx_tasks. Except for one, all iterations keep
holding scx_tasks_lock. On multi-socket systems under heavy rq lock
contention and high number of threads, this can can lead to RCU and other
stalls.

The following is triggered on a 2 x AMD EPYC 7642 system (192 logical CPUs)
running `stress-ng --workload 150 --workload-threads 10` with >400k idle
threads and RCU stall period reduced to 5s:

  rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  rcu:     91-...!: (10 ticks this GP) idle=0754/1/0x4000000000000000 softirq=18204/18206 fqs=17
  rcu:     186-...!: (17 ticks this GP) idle=ec54/1/0x4000000000000000 softirq=25863/25866 fqs=17
  rcu:     (detected by 80, t=10042 jiffies, g=89305, q=33 ncpus=192)
  Sending NMI from CPU 80 to CPUs 91:
  NMI backtrace for cpu 91
  CPU: 91 UID: 0 PID: 284038 Comm: sched_ext_ops_h Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471
  Hardware name: Supermicro Super Server/H11DSi, BIOS 2.8 12/14/2023
  Sched_ext: simple (disabling+all)
  RIP: 0010:queued_spin_lock_slowpath+0x17b/0x2f0
  Code: 02 c0 10 03 00 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8 48 8b 11 48 85 d2 74 09 0f 0d 0a eb 0a 31 d2 eb 06 31 d2 eb 02 f3 90 <8b> 07 66 85 c0 75 f7 39 d8 75 0d be 01 00 00 00 89 d8 f0 0f b1 37
  RSP: 0018:ffffc9000fadfcb8 EFLAGS: 00000002
  RAX: 0000000001700001 RBX: 0000000001700000 RCX: ffff88bfcaaf10c0
  RDX: 0000000000000000 RSI: 0000000000000101 RDI: ffff88bfca8f0080
  RBP: 0000000001700000 R08: 0000000000000090 R09: ffffffffffffffff
  R10: ffff88a74761b268 R11: 0000000000000000 R12: ffff88a6b6765460
  R13: ffffc9000fadfd60 R14: ffff88bfca8f0080 R15: ffff88bfcaac0000
  FS:  0000000000000000(0000) GS:ffff88bfcaac0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f5c55f526a0 CR3: 0000000afd474000 CR4: 0000000000350eb0
  Call Trace:
   <NMI>
   </NMI>
   <TASK>
   do_raw_spin_lock+0x9c/0xb0
   task_rq_lock+0x50/0x190
   scx_task_iter_next_locked+0x157/0x170
   scx_ops_disable_workfn+0x2c2/0xbf0
   kthread_worker_fn+0x108/0x2a0
   kthread+0xeb/0x110
   ret_from_fork+0x36/0x40
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  Sending NMI from CPU 80 to CPUs 186:
  NMI backtrace for cpu 186
  CPU: 186 UID: 0 PID: 51248 Comm: fish Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471

scx_task_iter can safely drop locks while iterating. Make
scx_task_iter_next() drop scx_tasks_lock every 32 iterations to avoid
stalls.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-10-10 11:41:44 -10:00
..
autogroup.c scheduler: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:54 +02:00
autogroup.h sched/headers: Add header guard to kernel/sched/stats.h and kernel/sched/autogroup.h 2022-02-23 08:22:00 +01:00
build_policy.c sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect 2024-07-08 09:30:13 -10:00
build_utility.c sched/headers: Remove duplicate header inclusions 2023-10-03 21:27:55 +02:00
clock.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
completion.c sched: add a few helpers to wake up tasks on the current cpu 2023-07-17 16:08:08 -07:00
core_sched.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
core.c sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called 2024-10-07 10:16:18 -10:00
cpuacct.c Merge branch 'sched/fast-headers' into sched/core 2022-03-15 09:05:05 +01:00
cpudeadline.c sched/topology: Consolidate and clean up access to a CPU's max compute capacity 2023-10-09 12:59:48 +02:00
cpudeadline.h
cpufreq_schedutil.c Merge branch 'tip/sched/core' into sched_ext/for-6.12 2024-09-11 08:43:26 -10:00
cpufreq.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
cpupri.c sched/rt: Fix live lock between select_fallback_rq() and RT push 2023-09-28 22:58:13 +02:00
cpupri.h
cputime.c sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime 2024-07-29 12:22:32 +02:00
deadline.c sched: Add put_prev_task(.next) 2024-09-03 15:26:32 +02:00
debug.c Merge branch 'tip/sched/core' into sched_ext/for-6.12 2024-09-11 08:43:26 -10:00
ext.c sched_ext: Don't hold scx_tasks_lock for too long 2024-10-10 11:41:44 -10:00
ext.h sched_ext: Add cgroup support 2024-09-04 10:24:59 -10:00
fair.c sched_ext: Initial pull request for v6.12 2024-09-21 09:44:57 -07:00
features.h sched/eevdf: Allow shorter slices to wakeup-preempt 2024-08-17 11:06:45 +02:00
idle.c Merge branch 'tip/sched/core' into for-6.12 2024-09-03 12:49:18 -10:00
isolation.c sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU 2024-04-28 10:08:21 +02:00
loadavg.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
Makefile sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
membarrier.c RISC-V Patches for the 6.9 Merge Window 2024-03-22 10:41:13 -07:00
pelt.c sched: Move update_other_load_avgs() to kernel/sched/pelt.c 2024-09-11 20:00:21 -10:00
pelt.h sched: Move update_other_load_avgs() to kernel/sched/pelt.c 2024-09-11 20:00:21 -10:00
psi.c Merge branch 'sched/urgent' into sched/core, to pick up fixes and refresh the branch 2024-07-11 10:42:33 +02:00
rt.c sched: Add put_prev_task(.next) 2024-09-03 15:26:32 +02:00
sched-pelt.h
sched.h sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called 2024-10-07 10:16:18 -10:00
smp.h sched, smp: Trace smp callback causing an IPI 2023-03-24 11:01:29 +01:00
stats.c profiling: remove profile=sleep support 2024-08-04 13:36:28 -07:00
stats.h Merge branch 'sched/urgent' into sched/core, to pick up fixes and refresh the branch 2024-07-11 10:42:33 +02:00
stop_task.c sched: Add put_prev_task(.next) 2024-09-03 15:26:32 +02:00
swait.c sched: add a few helpers to wake up tasks on the current cpu 2023-07-17 16:08:08 -07:00
syscalls.c sched_ext: Initial pull request for v6.12 2024-09-21 09:44:57 -07:00
topology.c sched/fair: Fair server interface 2024-07-29 12:22:36 +02:00
wait_bit.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
wait.c sched: remove wait bookmarks 2023-10-18 14:34:18 -07:00