Commit Graph

46625 Commits

Author SHA1 Message Date
Linus Torvalds
3f020399e4 Scheduler changes for v6.13:
- Core facilities:
 
     - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which optimizes
       fair-class preemption by delaying preemption requests to the
       tick boundary, while working as full preemption for RR/FIFO/DEADLINE
       classes. (Peter Zijlstra)
 
         - x86: Enable Lazy preemption (Peter Zijlstra)
         - riscv: Enable Lazy preemption (Jisheng Zhang)
 
     - Initialize idle tasks only once (Thomas Gleixner)
 
     - sched/ext: Remove sched_fork() hack (Thomas Gleixner)
 
  - Fair scheduler:
     - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)
 
  - Idle loop:
       Optimize the generic idle loop by removing unnecessary
       memory barrier (Zhongqiu Han)
 
  - RSEQ:
     - Improve cache locality of RSEQ concurrency IDs for
       intermittent workloads (Mathieu Desnoyers)
 
  - Waitqueues:
     - Make wake_up_{bit,var} less fragile (Neil Brown)
 
  - PSI:
     - Pass enqueue/dequeue flags to psi callbacks directly (Johannes Weiner)
 
  - Preparatory patches for proxy execution:
     - core: Add move_queued_task_locked helper (Connor O'Brien)
     - core: Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)
     - core: Split out __schedule() deactivate task logic into a helper (John Stultz)
     - core: Split scheduler and execution contexts (Peter Zijlstra)
     - locking/mutex: Make mutex::wait_lock irq safe (Juri Lelli)
     - locking/mutex: Expose __mutex_owner() (Juri Lelli)
     - locking/mutex: Remove wakeups from under mutex::wait_lock (Peter Zijlstra)
 
  - Misc fixes and cleanups:
     - core: Remove unused __HAVE_THREAD_FUNCTIONS hook support (David Disseldorp)
     - core: Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej Siewior)
     - wait: Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)
     - fair: remove the DOUBLE_TICK feature (Huang Shijie)
     - fair: fix the comment for PREEMPT_SHORT (Huang Shijie)
     - uclamp: Fix unnused variable warning (Christian Loehle)
     - rt: No PREEMPT_RT=y for all{yes,mod}config
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmc7fnQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hZTBAAozVdWA2m51aNa67HvAZta/olmrIagVbW
 inwbTgqa8b+UfeWEuKOfrZr5khjEh6pLgR3dBTib1uH6xxYj/Okds+qbPWSBPVLh
 yzavlm/zJZM1U1XtxE3eyVfqWik4GrY7DoIMDQQr+YH7rNXonJeJkll38OI2E5MC
 q3Q01qyMo8RJJX8qkf3f8ObOoP/51NsVniTw0Zb2fzEhXz8FjezLlxk6cMfgSkJG
 lg9gfIwUZ7Xg5neRo4kJcc3Ht31KYOhWSiupBJzRD1hss/N/AybvMcTX/Cm8d07w
 HIAdDDAn84o46miFo/a0V/hsJZ72idWbqxVJUCtaezrpOUiFkG+uInRvG/ynr0lF
 5dEI9f+6PUw8Nc7L72IyHkobjPqS2IefSaxYYCBKmxMX2qrenfTor/pKiWzzhBIl
 rX3MZSuUJ8NjV4rNGD/qXRM1IsMJrsDwxDyv+sRec3XdH33x286ds6aAUEPDQ6N7
 96VS0sOKcNUJN8776ErNjlIxRl8HTlpkaO3nZlQIfXgTlXUpRvOuKbEWqP+606lo
 oANgJTKgUhgJPWZnvmdRxDjSiOp93QcImjus9i1tN81FGiEDleONsJUxu2Di1E5+
 s1nCiytjq+cdvzCqFyiOZUh+g6kSZ4yXxNgLg2UvbXzX1zOeUQT3WtyKUhMPXhU8
 esh1TgbUbpE=
 =Zcqj
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core facilities:

   - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which
     optimizes fair-class preemption by delaying preemption requests to
     the tick boundary, while working as full preemption for
     RR/FIFO/DEADLINE classes. (Peter Zijlstra)
        - x86: Enable Lazy preemption (Peter Zijlstra)
        - riscv: Enable Lazy preemption (Jisheng Zhang)

   - Initialize idle tasks only once (Thomas Gleixner)

   - sched/ext: Remove sched_fork() hack (Thomas Gleixner)

  Fair scheduler:

   - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)

  Idle loop:

   - Optimize the generic idle loop by removing unnecessary memory
     barrier (Zhongqiu Han)

  RSEQ:

   - Improve cache locality of RSEQ concurrency IDs for intermittent
     workloads (Mathieu Desnoyers)

  Waitqueues:

   - Make wake_up_{bit,var} less fragile (Neil Brown)

  PSI:

   - Pass enqueue/dequeue flags to psi callbacks directly (Johannes
     Weiner)

  Preparatory patches for proxy execution:

   - Add move_queued_task_locked helper (Connor O'Brien)

   - Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)

   - Split out __schedule() deactivate task logic into a helper (John
     Stultz)

   - Split scheduler and execution contexts (Peter Zijlstra)

   - Make mutex::wait_lock irq safe (Juri Lelli)

   - Expose __mutex_owner() (Juri Lelli)

   - Remove wakeups from under mutex::wait_lock (Peter Zijlstra)

  Misc fixes and cleanups:

   - Remove unused __HAVE_THREAD_FUNCTIONS hook support (David
     Disseldorp)

   - Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej
     Siewior)

   - Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)

   - remove the DOUBLE_TICK feature (Huang Shijie)

   - fix the comment for PREEMPT_SHORT (Huang Shijie)

   - Fix unnused variable warning (Christian Loehle)

   - No PREEMPT_RT=y for all{yes,mod}config"

* tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY.
  sched: No PREEMPT_RT=y for all{yes,mod}config
  riscv: add PREEMPT_LAZY support
  sched, x86: Enable Lazy preemption
  sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
  sched: Add Lazy preemption model
  sched: Add TIF_NEED_RESCHED_LAZY infrastructure
  sched/ext: Remove sched_fork() hack
  sched: Initialize idle tasks only once
  sched: psi: pass enqueue/dequeue flags to psi callbacks directly
  sched/uclamp: Fix unnused variable warning
  sched: Split scheduler and execution contexts
  sched: Split out __schedule() deactivate task logic into a helper
  sched: Consolidate pick_*_task to task_is_pushable helper
  sched: Add move_queued_task_locked helper
  locking/mutex: Expose __mutex_owner()
  locking/mutex: Make mutex::wait_lock irq safe
  locking/mutex: Remove wakeups from under mutex::wait_lock
  sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
  sched: idle: Optimize the generic idle loop by removing needless memory barrier
  ...
2024-11-19 14:16:06 -08:00
Linus Torvalds
f41dac3efb Performance events changes for v6.13:
- Uprobes:
     - Add BPF session support (Jiri Olsa)
     - Switch to RCU Tasks Trace flavor for better performance (Andrii Nakryiko)
     - Massively increase uretprobe SMP scalability by SRCU-protecting
       the uretprobe lifetime (Andrii Nakryiko)
     - Kill xol_area->slot_count (Oleg Nesterov)
 
  - Core facilities:
     - Implement targeted high-frequency profiling by adding the ability
       for an event to "pause" or "resume" AUX area tracing (Adrian Hunter)
 
  - VM profiling/sampling:
     - Correct perf sampling with guest VMs (Colton Lewis)
 
  - New hardware support:
     - x86/intel: Add PMU support for Intel ArrowLake-H CPUs (Dapeng Mi)
 
  - Misc fixes and enhancements:
     - x86/intel/pt: Fix buffer full but size is 0 case (Adrian Hunter)
     - x86/amd: Warn only on new bits set (Breno Leitao)
     - x86/amd/uncore: Avoid a false positive warning about snprintf
                       truncation in amd_uncore_umc_ctx_init (Jean Delvare)
     - uprobes: Re-order struct uprobe_task to save some space (Christophe JAILLET)
     - x86/rapl: Move the pmu allocation out of CPU hotplug (Kan Liang)
     - x86/rapl: Clean up cpumask and hotplug (Kan Liang)
     - uprobes: Deuglify xol_get_insn_slot/xol_free_insn_slot paths (Oleg Nesterov)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmc7eKERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i57A/+KQ6TrIoICVTE+BPlDfUw8NU+N3DagVb0
 dzoyDxlDRsnsYzeXZipPn+3IitX1w+DrGxBNIojSoiFVCLnHIKgo4uHbj7cVrR7J
 fBTVSnoJ94SGAk5ySebvLwMLce/YhXBeHK2lx6W/pI6acNcxzDfIabjjETeqltUo
 g7hmT9lo10pzZEZyuUfYX9khlWBxda1dKHc9pMIq7baeLe4iz/fCGlJ0K4d4M4z3
 NPZw239Np6iHUwu3Lcs4gNKe4rcDe7Bt47hpedemHe0Y+7c4s2HaPxbXWxvDtE76
 mlsg93i28f8SYxeV83pREn0EOCptXcljhiek+US+GR7NSbltMnV+uUiDfPKIE9+Y
 vYP/DYF9hx73FsOucEFrHxYYcePorn3pne5/khBYWdQU6TnlrBYWpoLQsjgCKTTR
 4JhCFlBZ5cDpc6ihtpwCwVTQ4Q/H7vM1XOlDwx0hPhcIPPHDreaQD/wxo61jBdXf
 PY0EPAxh3BcQxfPYuDS+XiYjQ8qO8MtXMKz5bZyHBZlbHwccV6T4ExjsLKxFk5As
 6BG8pkBWLg7drXAgVdleIY0ux+34w/Zzv7gemdlQxvWLlZrVvpjiG93oU3PTpZeq
 A2UD9eAOuXVD6+HsF/dmn88sFmcLWbrMskFWujkvhEUmCvSGAnz3YSS/mLEawBiT
 2xI8xykNWSY=
 =ItOT
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance events updates from Ingo Molnar:
 "Uprobes:
    - Add BPF session support (Jiri Olsa)
    - Switch to RCU Tasks Trace flavor for better performance (Andrii
      Nakryiko)
    - Massively increase uretprobe SMP scalability by SRCU-protecting
      the uretprobe lifetime (Andrii Nakryiko)
    - Kill xol_area->slot_count (Oleg Nesterov)

  Core facilities:
    - Implement targeted high-frequency profiling by adding the ability
      for an event to "pause" or "resume" AUX area tracing (Adrian
      Hunter)

  VM profiling/sampling:
    - Correct perf sampling with guest VMs (Colton Lewis)

  New hardware support:
    - x86/intel: Add PMU support for Intel ArrowLake-H CPUs (Dapeng Mi)

  Misc fixes and enhancements:
    - x86/intel/pt: Fix buffer full but size is 0 case (Adrian Hunter)
    - x86/amd: Warn only on new bits set (Breno Leitao)
    - x86/amd/uncore: Avoid a false positive warning about snprintf
      truncation in amd_uncore_umc_ctx_init (Jean Delvare)
    - uprobes: Re-order struct uprobe_task to save some space
      (Christophe JAILLET)
    - x86/rapl: Move the pmu allocation out of CPU hotplug (Kan Liang)
    - x86/rapl: Clean up cpumask and hotplug (Kan Liang)
    - uprobes: Deuglify xol_get_insn_slot/xol_free_insn_slot paths (Oleg
      Nesterov)"

* tag 'perf-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  perf/core: Correct perf sampling with guest VMs
  perf/x86: Refactor misc flag assignments
  perf/powerpc: Use perf_arch_instruction_pointer()
  perf/core: Hoist perf_instruction_pointer() and perf_misc_flags()
  perf/arm: Drop unused functions
  uprobes: Re-order struct uprobe_task to save some space
  perf/x86/amd/uncore: Avoid a false positive warning about snprintf truncation in amd_uncore_umc_ctx_init
  perf/x86/intel: Do not enable large PEBS for events with aux actions or aux sampling
  perf/x86/intel/pt: Add support for pause / resume
  perf/core: Add aux_pause, aux_resume, aux_start_paused
  perf/x86/intel/pt: Fix buffer full but size is 0 case
  uprobes: SRCU-protect uretprobe lifetime (with timeout)
  uprobes: allow put_uprobe() from non-sleepable softirq context
  perf/x86/rapl: Clean up cpumask and hotplug
  perf/x86/rapl: Move the pmu allocation out of CPU hotplug
  uprobe: Add support for session consumer
  uprobe: Add data pointer to consumer handlers
  perf/x86/amd: Warn only on new bits set
  uprobes: fold xol_take_insn_slot() into xol_get_insn_slot()
  uprobes: kill xol_area->slot_count
  ...
2024-11-19 13:34:06 -08:00
Linus Torvalds
364eeb79a2 Locking changes for v6.13 are:
- lockdep:
     - Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING (Sebastian Andrzej Siewior)
     - Add lockdep_cleanup_dead_cpu() (David Woodhouse)
 
  - futexes:
     - Use atomic64_inc_return() in get_inode_sequence_number() (Uros Bizjak)
     - Use atomic64_try_cmpxchg_relaxed() in get_inode_sequence_number() (Uros Bizjak)
 
  - RT locking:
     - Add sparse annotation PREEMPT_RT's locking (Sebastian Andrzej Siewior)
 
  - spinlocks:
     - Use atomic_try_cmpxchg_release() in osq_unlock() (Uros Bizjak)
 
  - atomics:
     - x86: Use ALT_OUTPUT_SP() for __alternative_atomic64() (Uros Bizjak)
     - x86: Use ALT_OUTPUT_SP() for __arch_{,try_}cmpxchg64_emu() (Uros Bizjak)
 
  - KCSAN, seqlocks:
     - Support seqcount_latch_t (Marco Elver)
 
  - <linux/cleanup.h>:
     - Add if_not_cond_guard() conditional guard helper (David Lechner)
     - Adjust scoped_guard() macros to avoid potential warning (Przemek Kitszel)
     - Remove address space of returned pointer (Uros Bizjak)
 
  - WW mutexes:
     - locking/ww_mutex: Adjust to lockdep nest_lock requirements (Thomas Hellström)
 
  - Rust integration:
     - Fix raw_spin_lock initialization on PREEMPT_RT (Eder Zulian)
 
  - miscellaneous cleanups & fixes:
     - lockdep: Fix wait-type check related warnings (Ahmed Ehab)
     - lockdep: Use info level for initial info messages (Jiri Slaby)
     - spinlocks: Make __raw_* lock ops static (Geert Uytterhoeven)
     - pvqspinlock: Convert fields of 'enum vcpu_state' to uppercase (Qiuxu Zhuo)
     - iio: magnetometer: Fix if () scoped_guard() formatting (Stephen Rothwell)
     - rtmutex: Fix misleading comment (Peter Zijlstra)
     - percpu-rw-semaphores: Fix grammar in percpu-rw-semaphore.rst (Xiu Jianfeng)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmc7AkQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hGqQ/+KWR5arkoJjH/Nf5IyezYitOwqK7YAdJk
 mrWoZcez0DRopNTf8yZMv1m8jyx7W9KUQumEO/ghqJRlBW+AbxZ1t99kmqWI5Aw0
 +zmhpyo06JHeMYQAfKJXX3iRt2Rt59BPHtGzoop6b0e2i55+uPE+DZTNm2+FwCV9
 4vxmfpYyg5/sJB9/v5b0N9TTDe9a8caOHXU5F+HA1yWuxMmqFuDFIcpKrgS/sUeP
 NelOLbh2L3UOPWP6tRRfpajxCQTmRoeZOQQv0L9dd3jYpyQOCesgKqOhqNTCU8KK
 qamTPig2N00smSLp6I/OVyJ96vFYZrbhyq0kwMayaafAU7mB8lzcfUj+8qP0c90k
 1PROtD1XpF3Nobp1F+YUp3sQxEGdCgs+9VeLWWObv2b/Vt3MDZijdEiC/3OkRAUh
 LPCfl/ky41BmT8AlaxRDjkyrN7hH4oUOkGUdVx6yR389J0OR9MSwEX9qNaMw8bBg
 1ALvv9+OR3QhTWyG30PGqUf3Um230oIdWuWxwFrhaoMmDVEVMRZQMtvQahi5hDYq
 zyX79DKWtExEe/f2hY1m/6eNm6st5HE7X7scOba3TamQzvOzJkjzo7XoS2yeUAjb
 eByO2G0PvTrA0TFls6Hyrl6db5OW5KjQnVWr6W3fiWL5YIdh0SQMkWeaGVvGyfy8
 Q3vhk7POaZo=
 =BvPn
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "Lockdep:
   - Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING (Sebastian Andrzej
     Siewior)
   - Add lockdep_cleanup_dead_cpu() (David Woodhouse)

  futexes:
   - Use atomic64_inc_return() in get_inode_sequence_number() (Uros
     Bizjak)
   - Use atomic64_try_cmpxchg_relaxed() in get_inode_sequence_number()
     (Uros Bizjak)

  RT locking:
   - Add sparse annotation PREEMPT_RT's locking (Sebastian Andrzej
     Siewior)

  spinlocks:
   - Use atomic_try_cmpxchg_release() in osq_unlock() (Uros Bizjak)

  atomics:
   - x86: Use ALT_OUTPUT_SP() for __alternative_atomic64() (Uros Bizjak)
   - x86: Use ALT_OUTPUT_SP() for __arch_{,try_}cmpxchg64_emu() (Uros
     Bizjak)

  KCSAN, seqlocks:
   - Support seqcount_latch_t (Marco Elver)

  <linux/cleanup.h>:
   - Add if_not_guard() conditional guard helper (David Lechner)
   - Adjust scoped_guard() macros to avoid potential warning (Przemek
     Kitszel)
   - Remove address space of returned pointer (Uros Bizjak)

  WW mutexes:
   - locking/ww_mutex: Adjust to lockdep nest_lock requirements (Thomas
     Hellström)

  Rust integration:
   - Fix raw_spin_lock initialization on PREEMPT_RT (Eder Zulian)

  Misc cleanups & fixes:
   - lockdep: Fix wait-type check related warnings (Ahmed Ehab)
   - lockdep: Use info level for initial info messages (Jiri Slaby)
   - spinlocks: Make __raw_* lock ops static (Geert Uytterhoeven)
   - pvqspinlock: Convert fields of 'enum vcpu_state' to uppercase
     (Qiuxu Zhuo)
   - iio: magnetometer: Fix if () scoped_guard() formatting (Stephen
     Rothwell)
   - rtmutex: Fix misleading comment (Peter Zijlstra)
   - percpu-rw-semaphores: Fix grammar in percpu-rw-semaphore.rst (Xiu
     Jianfeng)"

* tag 'locking-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
  locking/Documentation: Fix grammar in percpu-rw-semaphore.rst
  iio: magnetometer: fix if () scoped_guard() formatting
  rust: helpers: Avoid raw_spin_lock initialization for PREEMPT_RT
  kcsan, seqlock: Fix incorrect assumption in read_seqbegin()
  seqlock, treewide: Switch to non-raw seqcount_latch interface
  kcsan, seqlock: Support seqcount_latch_t
  time/sched_clock: Broaden sched_clock()'s instrumentation coverage
  time/sched_clock: Swap update_clock_read_data() latch writes
  locking/atomic/x86: Use ALT_OUTPUT_SP() for __arch_{,try_}cmpxchg64_emu()
  locking/atomic/x86: Use ALT_OUTPUT_SP() for __alternative_atomic64()
  cleanup: Add conditional guard helper
  cleanup: Adjust scoped_guard() macros to avoid potential warning
  locking/osq_lock: Use atomic_try_cmpxchg_release() in osq_unlock()
  cleanup: Remove address space of returned pointer
  locking/rtmutex: Fix misleading comment
  locking/rt: Annotate unlock followed by lock for sparse.
  locking/rt: Add sparse annotation for RCU.
  locking/rt: Remove one __cond_lock() in RT's spin_trylock_irqsave()
  locking/rt: Add sparse annotation PREEMPT_RT's sleeping locks.
  locking/pvqspinlock: Convert fields of 'enum vcpu_state' to uppercase
  ...
2024-11-19 12:43:11 -08:00
Linus Torvalds
769ca7d4d2 Kernel Concurrency Sanitizer (KCSAN) updates for v6.13
- Fixes to make KCSAN compatible with PREEMPT_RT
 
 - Minor cleanups
 
 All changes have been in linux-next for the past 4 weeks.
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYIAC8WIQR7t4b/75lzOR3l5rcxsLN3bbyLnwUCZzMoFREcZWx2ZXJAZ29v
 Z2xlLmNvbQAKCRAxsLN3bbyLn6cVAP4l4IzMyRm+kAW8yqnMjfZBl2+cJ15J5Huy
 jQLqPSdruwD/W8ciiJvz9FhKtQQwVXtZF3WcNdkNgGLqhHbEkPBw4gA=
 =Lx19
 -----END PGP SIGNATURE-----

Merge tag 'kcsan-20241112-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/melver/linux

Pull Kernel Concurrency Sanitizer (KCSAN) updates from Marco Elver:

 - Make KCSAN compatible with PREEMPT_RT

 - Minor cleanup

* tag 'kcsan-20241112-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/melver/linux:
  kcsan: Remove redundant call of kallsyms_lookup_name()
  kcsan: Turn report_filterlist_lock into a raw_spinlock
2024-11-19 11:44:17 -08:00
Linus Torvalds
8cdf2d1903 RCU pull request for v6.13
SRCU:
 
 	- Introduction of the new SRCU-lite flavour with a new pair of
 	  srcu_read_[un]lock_lite() APIs. In practice the read side using
 	  this flavour becomes lighter by removing a full memory barrier on
 	  LOCK and a full memory barrier on UNLOCK. This comes at the
 	  expense of a higher latency write side with two (in the best case
 	  of a snaphot of unused read-sides) or more RCU grace periods on
 	  the update side which now assumes by itself the whole full
 	  ordering guarantee against the LOCK/UNLOCK counters on both
 	  indexes, along with the accesses performed inside.
 
 	  Uretprobes is a known potential user.
 
 	  Note this doesn't replace the default normal flavour of SRCU which
 	  still behaves the same as usual.
 
 	- Add testing of SRCU-lite through rcutorture and rcuscale
 
 	- Various cleanups on the way.
 
 FIXES:
 
 	- Allow short-circuiting RCU-TASKS-RUDE grace periods on architectures
 	  that have sane noinstr boundaries forbidding tracing on low-level
 	  idle and kernel entry code. RCU-TASKS is enough on such configurations
 	  because it involves an RCU grace period that waits for all idle
 	  tasks to either schedule out voluntarily or enter into RCU
 	  unwatched noinstr code.
 
 	- Allow and test start_poll_synchronize_rcu() with IRQs disabled.
 
 	- Mention rcuog kthreads in relevant documentation and Kconfig help
 
 	- Various fixes and consolidations
 
 RCUTORTURE:
 
 	- Add --no-affinity on tools to leave the affinity setting of guests
 	  up to the user.
 
 	- Add guest_os_delay parameter to rcuscale for better warm-up
 	  control.
 
 	- Fix and improve some rcuscale error handling.
 
 	- Various cleanups and fixes
 
 STALL:
 
 	- Remove dead code
 
 	- Stop dumping tasks if a stalled grace period eventually ended
 	  midway as that only produces confusing output.
 
 	- Optimize detection of stalling CPUs and avoid useless node
 	  locking otherwise.
 
 NOCB:
 
 	- Fix rcu_barrier() hang due to a race against callbacks
 	  deoffloading. This is not yet used, except by rcutorture, and
 	  waits for its promised cpusets interface.
 
 	- Remove leftover function declaration
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmc6gP0ACgkQhSRUR1CO
 jHcHfw/5AWg5wiapwJtLO9KNdtELflTTbT/NhhqwYVReHnOSvtPNwWgo984T3jYJ
 xikE4Ccn5Nu4zJVbTOtmwJ/RP6WWP1I28LgoTCdcz9BB9b+CRLogV/dR5r5uZbhD
 +jqXRAzDhEifR0pcfSK28MkXoh+puXMg4C78f7xtT1Oe3Gr67RLf6xvE59gHJrDg
 QrPStdwhOn2bhmbKcflw1bHYqpypL09P2WHuRLmsJJUMUGIHTohK05lJOkD3hV9g
 HTxOecNmeF/r8NyN8l/ERJgKmwDukIG02xih8UMEtqDEl04IxZFHbCfB6yyIsKDT
 fTFxnRCHnm/PxIKRA5ENvyg/6uArMJ0xuSTZRG4K5v0nx7okR8gbCPmwiwn1m5w3
 +/oppjCmG/gRgyiOytuEGKfaN9q/oJqQgeS7j8WruWj9V68FYUKr6COfQByw0xOc
 H6ftaLGeFHgHxk3nua2wFrfMtQhucYAMGAlVK82yd7Q1EFW47kzleO8w/HSvfrBt
 trX+9HZ77GVVmREJMstnIWRr5mbPtUf8yRZdA5bBrlEYz0A/ToNaFACid0fsaMC2
 Dbo9Q+wDqL2wwOpjZy+MA3k1IVyDdUTuOQmPt57LmFTxUNZ+AQQlJcrhrUqWVvdM
 Nne2EHdqCHADKd7g3i17HtvpTsapz+Qakpzx8UsPqNtfo1DSd5A=
 =MWrw
 -----END PGP SIGNATURE-----

Merge tag 'rcu.release.v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux

Pull RCU updates from Frederic Weisbecker:
 "SRCU:

   - Introduction of the new SRCU-lite flavour with a new pair of
     srcu_read_[un]lock_lite() APIs. In practice the read side using
     this flavour becomes lighter by removing a full memory barrier on
     LOCK and a full memory barrier on UNLOCK. This comes at the expense
     of a higher latency write side with two (in the best case of a
     snaphot of unused read-sides) or more RCU grace periods on the
     update side which now assumes by itself the whole full ordering
     guarantee against the LOCK/UNLOCK counters on both indexes, along
     with the accesses performed inside.

     Uretprobes is a known potential user.

     Note this doesn't replace the default normal flavour of SRCU which
     still behaves the same as usual.

   - Add testing of SRCU-lite through rcutorture and rcuscale

   - Various cleanups on the way.

  Fixes:

   - Allow short-circuiting RCU-TASKS-RUDE grace periods on
     architectures that have sane noinstr boundaries forbidding tracing
     on low-level idle and kernel entry code. RCU-TASKS is enough on
     such configurations because it involves an RCU grace period that
     waits for all idle tasks to either schedule out voluntarily or
     enter into RCU unwatched noinstr code.

   - Allow and test start_poll_synchronize_rcu() with IRQs disabled.

   - Mention rcuog kthreads in relevant documentation and Kconfig help

   - Various fixes and consolidations

  rcutorture:

   - Add --no-affinity on tools to leave the affinity setting of guests
     up to the user.

   - Add guest_os_delay parameter to rcuscale for better warm-up
     control.

   - Fix and improve some rcuscale error handling.

   - Various cleanups and fixes

  stall:

   - Remove dead code

   - Stop dumping tasks if a stalled grace period eventually ended
     midway as that only produces confusing output.

   - Optimize detection of stalling CPUs and avoid useless node locking
     otherwise.

  NOCB:

   - Fix rcu_barrier() hang due to a race against callbacks
     deoffloading. This is not yet used, except by rcutorture, and waits
     for its promised cpusets interface.

   - Remove leftover function declaration"

* tag 'rcu.release.v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (42 commits)
  rcuscale: Remove redundant WARN_ON_ONCE() splat
  rcuscale: Do a proper cleanup if kfree_scale_init() fails
  srcu: Unconditionally record srcu_read_lock_lite() in ->srcu_reader_flavor
  srcu: Check for srcu_read_lock_lite() across all CPUs
  srcu: Remove smp_mb() from srcu_read_unlock_lite()
  rcutorture: Avoid printing cpu=-1 for no-fault RCU boost failure
  rcuscale: Add guest_os_delay module parameter
  refscale: Correct affinity check
  torture: Add --no-affinity parameter to kvm.sh
  rcu/nocb: Fix missed RCU barrier on deoffloading
  rcu/kvfree: Fix data-race in __mod_timer / kvfree_call_rcu
  rcu/srcutiny: don't return before reenabling preemption
  rcu-tasks: Remove open-coded one-byte cmpxchg() emulation
  doc: Remove kernel-parameters.txt entry for rcutorture.read_exit
  rcutorture: Test start-poll primitives with interrupts disabled
  rcu: Permit start_poll_synchronize_rcu*() with interrupts disabled
  rcu: Allow short-circuiting of synchronize_rcu_tasks_rude()
  doc: Add rcuog kthreads to kernel-per-CPU-kthreads.rst
  rcu: Add rcuog kthreads to RCU_NOCB_CPU help text
  rcu: Use the BITS_PER_LONG macro
  ...
2024-11-19 11:27:07 -08:00
Linus Torvalds
ad52c55e1d Power management updates for 6.13-rc1
- Update the amd-pstate driver to set the initial scaling frequency
    policy lower bound to be the lowest non-linear frequency (Dhananjay
    Ugwekar).
 
  - Enable amd-pstate by default on servers starting with newer AMD Epyc
    processors (Swapnil Sapkal).
 
  - Align more codepaths between shared memory and MSR designs in
    amd-pstate (Dhananjay Ugwekar).
 
  - Clean up amd-pstate code to rename functions and remove redundant
    calls (Dhananjay Ugwekar, Mario Limonciello).
 
  - Do other assorted fixes and cleanups in amd-pstate (Dhananjay Ugwekar
    and Mario Limonciello).
 
  - Change the Balance-performance EPP value for Granite Rapids in the
    intel_pstate driver to a more performance-biased one (Srinivas
    Pandruvada).
 
  - Simplify MSR read on the boot CPU in the ACPI cpufreq driver (Chang
    S. Bae).
 
  - Ensure sugov_eas_rebuild_sd() is always called when sugov_init()
    succeeds to always enforce sched domains rebuild in case EAS needs
    to be enabled (Christian Loehle).
 
  - Switch cpufreq back to platform_driver::remove() (Uwe Kleine-König).
 
  - Use proper frequency unit names in cpufreq (Marcin Juszkiewicz).
 
  - Add a built-in idle states table for Granite Rapids Xeon D to the
    intel_idle driver (Artem Bityutskiy).
 
  - Fix some typos in comments in the cpuidle core and drivers (Shen
    Lichuan).
 
  - Remove iowait influence from the menu cpuidle governor (Christian
    Loehle).
 
  - Add min/max available performance state limits to the Energy Model
    management code (Lukasz Luba).
 
  - Update pm-graph to v5.13 (Todd Brandt).
 
  - Add documentation for some recently introduced cpupower utility
    options (Tor Vic).
 
  - Make cpupower inform users where cpufreq-bench.conf should be located
    when opening it fails (Peng Fan).
 
  - Allow overriding cross-compiling env params in cpupower (Peng Fan).
 
  - Add compile_commands.json to .gitignore in cpupower (John B. Wyatt
    IV).
 
  - Improve disable c_state block in cpupower bindings and add a test to
    confirm that CPU state is disabled to it (John B. Wyatt IV).
 
  - Add Chinese Simplified translation to cpupower (Kieran Moy).
 
  - Add checks for xgettext and msgfmt to cpupower (Siddharth Menon).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmc3r6sSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxQMUQALNEbh/Ko1d+avq0sfvyPw18BZjEiQw7
 M+L0GydLW6tXLYOrD+ZTASksdDhHbK0iuFr1Gca2cZi0Dl+1XF9sy70ITTqzCDIA
 8qj1JrPmRYI0KXCfiSSke0W9fU18IdxVX3I7XezVqBl0ICzsroN5wliCkmEnVOU9
 LQkw0fyYr7gev4GFEGSJ7WzfPxci0d6J9pYnafFlDEE28WpKz/cyOzYuSghX5lmG
 ISHIVNIM6lqNgXyQirConvhrlg60XAyw5k5jqAYZbe78T+dqhH7lr9sDi7c4XxkG
 syeiOOyjpiBMZv1rSjIUapi8AfJHyqH7B6KyTgiulIy31x8Dji62925B63CSahkM
 AminAq0lYkqbhIcqEr4sW0JQ/oW3iX4cZ3TJXTUL+vFByR0ZF81tgQcXufhrcvBs
 ViNugcX0q1vDX3lZsm9L6UHXN2yhUb36sgreUvbGfwnE79tuR/eUnAukTWBfXau/
 TWnyDiQn1CjZcfHB+YAPYZNyUHHqjoIJwzfJLwnsaHgFA80YcSwfSC9kcogCawK1
 NCyfs29lAccWsrOul5iARJu8pLw1X//UfDEmVNrBD+1hveKYMrjjiQXnPoVVnNhc
 J5T2q5S1QeO05+wf8WaZ7MbRNzHLj0A3gYHSVPWNclxFwsQjqCHHZS2qz8MTX+f6
 W6/eZuvmMbG7
 =w8QT
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "The amd-pstate cpufreq driver gets the majority of changes this time.
  They are mostly fixes and cleanups, but one of them causes it to
  become the default cpufreq driver on some AMD server platforms.

  Apart from that, the menu cpuidle governor is modified to not use
  iowait any more, the intel_idle gets a custom C-states table for
  Granite Rapids Xeon D, and the intel_pstate driver will use a more
  aggressive Balance- performance default EPP value on Granite Rapids
  now.

  There are also some fixes, cleanups and tooling updates.

  Specifics:

   - Update the amd-pstate driver to set the initial scaling frequency
     policy lower bound to be the lowest non-linear frequency (Dhananjay
     Ugwekar)

   - Enable amd-pstate by default on servers starting with newer AMD
     Epyc processors (Swapnil Sapkal)

   - Align more codepaths between shared memory and MSR designs in
     amd-pstate (Dhananjay Ugwekar)

   - Clean up amd-pstate code to rename functions and remove redundant
     calls (Dhananjay Ugwekar, Mario Limonciello)

   - Do other assorted fixes and cleanups in amd-pstate (Dhananjay
     Ugwekar and Mario Limonciello)

   - Change the Balance-performance EPP value for Granite Rapids in the
     intel_pstate driver to a more performance-biased one (Srinivas
     Pandruvada)

   - Simplify MSR read on the boot CPU in the ACPI cpufreq driver (Chang
     S. Bae)

   - Ensure sugov_eas_rebuild_sd() is always called when sugov_init()
     succeeds to always enforce sched domains rebuild in case EAS needs
     to be enabled (Christian Loehle)

   - Switch cpufreq back to platform_driver::remove() (Uwe Kleine-König)

   - Use proper frequency unit names in cpufreq (Marcin Juszkiewicz)

   - Add a built-in idle states table for Granite Rapids Xeon D to the
     intel_idle driver (Artem Bityutskiy)

   - Fix some typos in comments in the cpuidle core and drivers (Shen
     Lichuan)

   - Remove iowait influence from the menu cpuidle governor (Christian
     Loehle)

   - Add min/max available performance state limits to the Energy Model
     management code (Lukasz Luba)

   - Update pm-graph to v5.13 (Todd Brandt)

   - Add documentation for some recently introduced cpupower utility
     options (Tor Vic)

   - Make cpupower inform users where cpufreq-bench.conf should be
     located when opening it fails (Peng Fan)

   - Allow overriding cross-compiling env params in cpupower (Peng Fan)

   - Add compile_commands.json to .gitignore in cpupower (John B. Wyatt
     IV)

   - Improve disable c_state block in cpupower bindings and add a test
     to confirm that CPU state is disabled to it (John B. Wyatt IV)

   - Add Chinese Simplified translation to cpupower (Kieran Moy)

   - Add checks for xgettext and msgfmt to cpupower (Siddharth Menon)"

* tag 'pm-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (38 commits)
  cpufreq: intel_pstate: Update Balance-performance EPP for Granite Rapids
  cpufreq: ACPI: Simplify MSR read on the boot CPU
  sched/cpufreq: Ensure sd is rebuilt for EAS check
  intel_idle: add Granite Rapids Xeon D support
  PM: EM: Add min/max available performance state limits
  cpufreq/amd-pstate: Move registration after static function call update
  cpufreq/amd-pstate: Push adjust_perf vfunc init into cpu_init
  cpufreq/amd-pstate: Align offline flow of shared memory and MSR based systems
  cpufreq/amd-pstate: Call cppc_set_epp_perf in the reenable function
  cpufreq/amd-pstate: Do not attempt to clear MSR_AMD_CPPC_ENABLE
  cpufreq/amd-pstate: Rename functions that enable CPPC
  cpufreq/amd-pstate-ut: Add fix for min freq unit test
  amd-pstate: Switch to amd-pstate by default on some Server platforms
  amd-pstate: Set min_perf to nominal_perf for active mode performance gov
  cpufreq/amd-pstate: Remove the redundant amd_pstate_set_driver() call
  cpufreq/amd-pstate: Remove the switch case in amd_pstate_init()
  cpufreq/amd-pstate: Call amd_pstate_set_driver() in amd_pstate_register_driver()
  cpufreq/amd-pstate: Call amd_pstate_register() in amd_pstate_init()
  cpufreq/amd-pstate: Set the initial min_freq to lowest_nonlinear_freq
  cpufreq/amd-pstate: Remove the redundant verify() function
  ...
2024-11-19 11:05:00 -08:00
Linus Torvalds
8a7fa81137 Random number generator updates for Linux 6.13-rc1.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmc6oE0ACgkQSfxwEqXe
 A65n5BAAtNmfBJhYRiC6Svsg7+ktHmhCAHoHwnP7sv+bjs81FRAEv21CsfI+02Nb
 zUvaPuyiLtYzlWxzE5Yg44v1cADHAq+QZE1Fg5yl7ge6zPZ3+S1pv/8suNSyyI2M
 PKvh1sb4OkUtqplveYSuP1J87u55zAtV9mP9qC3hSlY3XkeQUObt9Awss8peOMdv
 sH2AxwBlRkqFXpY2worxlfg3p5iLemb3AUZ3f0Jc6fRmOagSJCt7i4mDrWo3EXke
 90Ao8ypY0x3YVGRFACHnxCS53X20HGwLxm7jdicfriMCzAJ6JQR6asO+NYnXR+Ev
 9Za3UquVHP6HbQGWj6d1k5k2nF+IbkTHTgFBPRK/CY9ZpVbP04B2K7tE1gmT81wj
 AscRGi9RBVBPKAUguyi99MXYlprFG/ZTLOux3hvdarv5u0bP94eXmy1FrRM+IO0r
 u4BiQ39FlkDdtRxjzKfCiKkMrf3NmFEciZJhxCnflzmOBaj64r1hRt/ea8Bjxvp3
 a4k0MfULmcEn2JwPiT1/Swz45ypZQc4OgbP87SCU8P0a23r21r2oK+9v3No/rCzB
 TI0fP6ykDTFQoiKUOSg1mJmkipdjeDyQ9E+0XIDsKd+T8Yv9rFoaV6RWoMrkt4AJ
 Yea9+V+XEI8F3SjhdD4OL/s3/+bjTjnRHDaXnJf2XzGmXcuvnbs=
 =o4ww
 -----END PGP SIGNATURE-----

Merge tag 'random-6.13-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random

Pull random number generator updates from Jason Donenfeld:
 "This contains a single series from Uros to replace uses of
  <linux/random.h> with prandom.h or other more specific headers
  as needed, in order to avoid a circular header issue.

  Uros' goal is to be able to use percpu.h from prandom.h, which
  will then allow him to define __percpu in percpu.h rather than
  in compiler_types.h"

* tag 'random-6.13-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
  prandom: Include <linux/percpu.h> in <linux/prandom.h>
  random: Do not include <linux/prandom.h> in <linux/random.h>
  netem: Include <linux/prandom.h> in sch_netem.c
  lib/test_scanf: Include <linux/prandom.h> instead of <linux/random.h>
  lib/test_parman: Include <linux/prandom.h> instead of <linux/random.h>
  bpf/tests: Include <linux/prandom.h> instead of <linux/random.h>
  lib/rbtree-test: Include <linux/prandom.h> instead of <linux/random.h>
  random32: Include <linux/prandom.h> instead of <linux/random.h>
  kunit: string-stream-test: Include <linux/prandom.h>
  lib/interval_tree_test.c: Include <linux/prandom.h> instead of <linux/random.h>
  bpf: Include <linux/prandom.h> instead of <linux/random.h>
  scsi: libfcoe: Include <linux/prandom.h> instead of <linux/random.h>
  fscrypt: Include <linux/once.h> in fs/crypto/keyring.c
  mtd: tests: Include <linux/prandom.h> instead of <linux/random.h>
  media: vivid: Include <linux/prandom.h> in vivid-vid-cap.c
  drm/lib: Include <linux/prandom.h> instead of <linux/random.h>
  drm/i915/selftests: Include <linux/prandom.h> instead of <linux/random.h>
  crypto: testmgr: Include <linux/prandom.h> instead of <linux/random.h>
  x86/kaslr: Include <linux/prandom.h> instead of <linux/random.h>
2024-11-19 10:43:44 -08:00
Linus Torvalds
02b2f1a7b8 This update includes the following changes:
API:
 
 - Add sig driver API.
 - Remove signing/verification from akcipher API.
 - Move crypto_simd_disabled_for_test to lib/crypto.
 - Add WARN_ON for return values from driver that indicates memory corruption.
 
 Algorithms:
 
 - Provide crc32-arch and crc32c-arch through Crypto API.
 - Optimise crc32c code size on x86.
 - Optimise crct10dif on arm/arm64.
 - Optimise p10-aes-gcm on powerpc.
 - Optimise aegis128 on x86.
 - Output full sample from test interface in jitter RNG.
 - Retry without padata when it fails in pcrypt.
 
 Drivers:
 
 - Add support for Airoha EN7581 TRNG.
 - Add support for STM32MP25x platforms in stm32.
 - Enable iproc-r200 RNG driver on BCMBCA.
 - Add Broadcom BCM74110 RNG driver.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmc6sQsACgkQxycdCkmx
 i6dfHxAAnkI65TE6agZq9DlkEU4ZqOsxxdk0MsGIhbCUTxW3KENzu9vtKjnvg9T/
 Ou0d2J49ny87Y4zaA59Wf/Q1+gg5YSQR5kelonpfrPLkCkJjr72HZpyCHv8TTzEC
 uHHoVj9cnPIF5/yfiqQsrWT1ACip9vn+slyVPaMJV1qR6gnvnSALtsg4e/vKHkn7
 ZMaf2pZ2ROYXdB02nMK5KQcCrxD64MQle/yQepY44eYjnT+XclkqPdi6o1nUSpj/
 RFAeY0jFSTu0pj3DqT48TnU/LiiNLlFOZrGjCdEySoac63vmTtKqfYDmrRaFz4hB
 sucxbgJ3xnnYseRijtfXnxaD/IkDJln+ipGNQKAZLfOVMDCTxPdYGmOpobMTXMS+
 0sY0eAHgqr23P9pOp+sOzcAEFIqg6llAYQVWx3Zl4vpXBUuxzg6AqmHnPicnck7y
 Lw1cJhQxij2De3dG2ZL/0dgQxMjGN/YfCM8SSg6l+Xn3j4j47rqJNH2ZsmXtbJ2n
 kTkmemmWdgRR1IvgQQGsvyKs9ThkcEDW+IzW26SUv3Clvru2NSkX4ZPHbezZQf+D
 R0wMZsW3Fw7Zymerz1GIBSqdLnsyFWtIAjukDpOR6ordPgOBeDt76v6tw5vL2/II
 KYoeN1pdEEecwuhAsEvCryT5ZG4noBeNirf/ElWAfEybgcXiTks=
 =T8pa
 -----END PGP SIGNATURE-----

Merge tag 'v6.13-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
 "API:
   - Add sig driver API
   - Remove signing/verification from akcipher API
   - Move crypto_simd_disabled_for_test to lib/crypto
   - Add WARN_ON for return values from driver that indicates memory
     corruption

  Algorithms:
   - Provide crc32-arch and crc32c-arch through Crypto API
   - Optimise crc32c code size on x86
   - Optimise crct10dif on arm/arm64
   - Optimise p10-aes-gcm on powerpc
   - Optimise aegis128 on x86
   - Output full sample from test interface in jitter RNG
   - Retry without padata when it fails in pcrypt

  Drivers:
   - Add support for Airoha EN7581 TRNG
   - Add support for STM32MP25x platforms in stm32
   - Enable iproc-r200 RNG driver on BCMBCA
   - Add Broadcom BCM74110 RNG driver"

* tag 'v6.13-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (112 commits)
  crypto: marvell/cesa - fix uninit value for struct mv_cesa_op_ctx
  crypto: cavium - Fix an error handling path in cpt_ucode_load_fw()
  crypto: aesni - Move back to module_init
  crypto: lib/mpi - Export mpi_set_bit
  crypto: aes-gcm-p10 - Use the correct bit to test for P10
  hwrng: amd - remove reference to removed PPC_MAPLE config
  crypto: arm/crct10dif - Implement plain NEON variant
  crypto: arm/crct10dif - Macroify PMULL asm code
  crypto: arm/crct10dif - Use existing mov_l macro instead of __adrl
  crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code
  crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply
  crypto: arm64/crct10dif - Remove obsolete chunking logic
  crypto: bcm - add error check in the ahash_hmac_init function
  crypto: caam - add error check to caam_rsa_set_priv_key_form
  hwrng: bcm74110 - Add Broadcom BCM74110 RNG driver
  dt-bindings: rng: add binding for BCM74110 RNG
  padata: Clean up in padata_do_multithreaded()
  crypto: inside-secure - Fix the return value of safexcel_xcbcmac_cra_init()
  crypto: qat - Fix missing destroy_workqueue in adf_init_aer()
  crypto: rsassa-pkcs1 - Reinstate support for legacy protocols
  ...
2024-11-19 10:28:41 -08:00
Linus Torvalds
311e062ad5 CSD-lock diagnostic updates for v6.13
This commit switches from sched_clock() to ktime_get_mono_fast_ns(), which
 on x86 switches from the rdtsc instruction to the rdtscp instruction,
 thus avoiding instruction reorderings that cause false-positive reports
 of CSD-lock stalls of almost 2^46 nanoseconds.  These false positives
 are rare, but really are seen in the wild.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmc5XvETHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jNUCD/9NqeuxsVcumybbjlHs/IbJt47qTPVk
 1O+mpLiKfscw/ndfvqJe1RU+IOUJUPBPzBPUWvZQZ2SzeU03oOI4/szFttDdXSi3
 0uI9qOJn3auk2+cdU7CxXOLSiWYEWlMjWvN6d34QeLh7smLkendxH2wo2fkL9kf0
 DzvosOrlyNWGZPUQrb1TRW7RKGE7vap8x7tK/p1qMO2xmaPeIX7dfiY38CJC5fjj
 +n8i1aZIxLFc65I0/Z+nGTMFrktzbYjJik6k++QZzHx+GiXaCkgfidZFspj3uPXW
 CPa6KxheCrdmFV4A/TVnKYJyutoGeheMjwlVfz0YOSe8J5/N3F9RfDFBYedt2fL+
 11gRpOg5hz61AsyxZ1+iViW0guXoVzn2uwQ5rkou9184fBXPuwH1MAwBcsKYwQig
 Frd0ZzyrqGHCHwDWtBfAb+qC17b5krsa+fKkjiPFDRDRB5N2hh67tcquOE3wzvrG
 oAHEZgeFwxZQYGIZ7uITebyThe9NvkBRyrJLvxUEpvF2MoI0yJaqoAwkHieSl1vD
 KJJ+o+HxVa3D/WCWxTNCjDyxvCMJpFHFWB3h8+hi+X2UleRGrDiIDmIdddMM1/gr
 meYjZ/c1/t7Y14zwzp/SxUHFJ4U8U2jI23/K5ldJVH34k6XwccU06NYWjHYokXgT
 ZRMCJ8sAcTtlhw==
 =8GQJ
 -----END PGP SIGNATURE-----

Merge tag 'csd-lock.2024.11.16a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull CSD-lock update from Paul McKenney:
 "This switches from sched_clock() to ktime_get_mono_fast_ns(), which on
  x86 switches from the rdtsc instruction to the rdtscp instruction,
  thus avoiding instruction reorderings that cause false-positive
  reports of CSD-lock stalls of almost 2^46 nanoseconds. These false
  positives are rare, but really are seen in the wild"

* tag 'csd-lock.2024.11.16a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()
2024-11-19 10:18:45 -08:00
Linus Torvalds
d7d4102f0a scftorture changes for v6.13
o	Avoid divide operation.
 
 o	Fix cleanup code waiting for IPI handlers.
 
 o	Move memory allocations out of preempt-disable region of code
 	for PREEMPT_RT compatibility.
 
 o	Use a lockless list to avoid freeing memory while interrupts
 	are disabled, again for PREEMPT_RT compatibility.
 
 o	Make lockless list scf_add_to_free_list() correctly handle
 	freeing a NULL pointer.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmc5X0gTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jDVMEACQRdJ0NYxygGFpUzDj2Er2wdOtBG0E
 n1NOqmNX7nlBL8BzseCFa2OiVbvggE7+ynAGcqISzDLZGE6aa4/HwKLkxSGB62UV
 WMXNiJE+t4bb1TsdMwLcQnOmmDniy6ID0NIEA8YHEEZltuDNQGQfjB8ynJewwNmY
 yMU90JDwVvDVmM9+AXUqYYRAar1gR5k7jknQbnXqb+6xT/kMEu+B1z5BGiMB3Z5L
 LylobI+3OZTY417tgJU/iSeRZbLZn7Xs6pxOcJMpeFvvYMn4mkYaUX+WUOU9oTQd
 h91wGxRouTQpS41zGNI5HcqnTtevrnmtXNROyUkei1aipvnq8N9HR11UJDXWgSV4
 24dH8qZVzTv+/cWIuNA3uUH+hu7kFZztQQQeIJdenm3CBtEYIK4ssrlyXUM7U5AY
 JQOjeEzApQLht++VTjGSS3CZhODLCTQU+IeQH1ChM1EZz2M9gsv9RqKfXrnFTDnO
 6UrLNa2YCpvQCEeNj2i8TaFHZAInGTcNFHjhxd+kA4SsCDygi9PYxKq6xVadLVZs
 Kwj6kpgPpatQzZ5w7Il9RF+qTgpOnbqB52JFt3rGjQg8uALfDo5S85wurhvu6+GC
 Qy7XvDWhmUn8fZwvlRO+DABBWOYmXeAVHKxWA3VBxO3O454Pxx5IuVSW4213GFVz
 58sAl0WwK8Jscg==
 =ElHE
 -----END PGP SIGNATURE-----

Merge tag 'scftorture.2024.11.16a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull scftorture updates from Paul McKenney:

 - Avoid divide operation

 - Fix cleanup code waiting for IPI handlers

 - Move memory allocations out of preempt-disable region of code for
   PREEMPT_RT compatibility

 - Use a lockless list to avoid freeing memory while interrupts are
   disabled, again for PREEMPT_RT compatibility

 - Make lockless list scf_add_to_free_list() correctly handle freeing a
   NULL pointer

* tag 'scftorture.2024.11.16a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  scftorture: Handle NULL argument passed to scf_add_to_free_list().
  scftorture: Use a lock-less list to free memory.
  scftorture: Move memory allocation outside of preempt_disable region.
  scftorture: Wait until scf_cleanup_handler() completes.
  scftorture: Avoid additional div operation.
2024-11-19 10:16:59 -08:00
Linus Torvalds
ba1f9c8fe3 arm64 updates for 6.13:
* Support for running Linux in a protected VM under the Arm Confidential
   Compute Architecture (CCA)
 
 * Guarded Control Stack user-space support. Current patches follow the
   x86 ABI of implicitly creating a shadow stack on clone(). Subsequent
   patches (already on the list) will add support for clone3() allowing
   finer-grained control of the shadow stack size and placement from libc
 
 * AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are
   getting close with the upcoming dpISA support)
 
 * Other arch features:
 
   - In-kernel use of the memcpy instructions, FEAT_MOPS (previously only
     exposed to user; uaccess support not merged yet)
 
   - MTE: hugetlbfs support and the corresponding kselftests
 
   - Optimise CRC32 using the PMULL instructions
 
   - Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG
 
   - Optimise the kernel TLB flushing to use the range operations
 
   - POE/pkey (permission overlays): further cleanups after bringing the
     signal handler in line with the x86 behaviour for 6.12
 
 * arm64 perf updates:
 
   - Support for the NXP i.MX91 PMU in the existing IMX driver
 
   - Support for Ampere SoCs in the Designware PCIe PMU driver
 
   - Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC
 
   - Support for Samsung's 'Mongoose' CPU PMU
 
   - Support for PMUv3.9 finer-grained userspace counter access control
 
   - Switch back to platform_driver::remove() now that it returns 'void'
 
   - Add some missing events for the CXL PMU driver
 
 * Miscellaneous arm64 fixes/cleanups:
 
   - Page table accessors cleanup: type updates, drop unused macros,
     reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity
     check addresses before runtime P4D/PUD folding
 
   - Command line override for ID_AA64MMFR0_EL1.ECV (advertising the
     FEAT_ECV for the generic timers) allowing Linux to boot with
     firmware deployments that don't set SCTLR_EL3.ECVEn
 
   - ACPI/arm64: tighten the check for the array of platform timer
     structures and adjust the error handling procedure in
     gtdt_parse_timer_block()
 
   - Optimise the cache flush for the uprobes xol slot (skip if no
     change) and other uprobes/kprobes cleanups
 
   - Fix the context switching of tpidrro_el0 when kpti is enabled
 
   - Dynamic shadow call stack fixes
 
   - Sysreg updates
 
   - Various arm64 kselftest improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmc5POIACgkQa9axLQDI
 XvEDYA//a3eeNkgMuGdnSCVcLz+zy+oNwAwboG/4X1DqL8jiCbI4npwugPx95RIA
 YZOUvo9T2aL3OyefpUHll4gFHqx9OwoZIig2F70TEUmlPsGUbh0KBkdfQF3xZPdl
 EwV0kHSGEqMWMBwsGJGwgCYrUaf1MUQzh1GBl7VJ2ts5XsJBaBeOyKkysij26wtZ
 V+aHq2IUx7qQS7+HC/4P6IoHxKziFcsCMovaKaynP4cw9xXBQbDMcNlHEwndOMyk
 pu2zrv7GG0j3KQuVP/2Alf5FKhmI0GVGP/6Nc/zsOmw96w8Kf7HfzEtkHawr2aRq
 rqg/c9ivzDn1p+fUBo4ZYtrRk4IAY+yKu6hdzdLTP5+bQrBTWTO9rjQVBm9FAGYT
 sCdEj1NqzvExvNHD7X6ut/GJ05lmce3K+qeSXSEysN9gqiT3eomYWMXrD2V2lxzb
 rIDDcb/icfaqjt14Mksh19r/rzNeq7noj9CGSmcqw0BHZfHzl38Lai6pdfYzCNyn
 vCM/c4c1D/WWX8/lifO1JZVbhDk1jy82Iphg2KEhL8iKPxDsKBBZLmYuU1oa7tMo
 WryGAz9+GQwd+W9chFuaOEtMnzvW2scEJ5Eb2fEf0Qj0aEurkL+C9dZR6o1GN77V
 DBUxtU628Ef4PJJGfbNCwZzdd8UPYG3a/mKfQQ3dz0oz2LySlW4=
 =wDot
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:

 - Support for running Linux in a protected VM under the Arm
   Confidential Compute Architecture (CCA)

 - Guarded Control Stack user-space support. Current patches follow the
   x86 ABI of implicitly creating a shadow stack on clone(). Subsequent
   patches (already on the list) will add support for clone3() allowing
   finer-grained control of the shadow stack size and placement from
   libc

 - AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are
   getting close with the upcoming dpISA support)

 - Other arch features:

     - In-kernel use of the memcpy instructions, FEAT_MOPS (previously
       only exposed to user; uaccess support not merged yet)

     - MTE: hugetlbfs support and the corresponding kselftests

     - Optimise CRC32 using the PMULL instructions

     - Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG

     - Optimise the kernel TLB flushing to use the range operations

     - POE/pkey (permission overlays): further cleanups after bringing
       the signal handler in line with the x86 behaviour for 6.12

 - arm64 perf updates:

     - Support for the NXP i.MX91 PMU in the existing IMX driver

     - Support for Ampere SoCs in the Designware PCIe PMU driver

     - Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC

     - Support for Samsung's 'Mongoose' CPU PMU

     - Support for PMUv3.9 finer-grained userspace counter access
       control

     - Switch back to platform_driver::remove() now that it returns
       'void'

     - Add some missing events for the CXL PMU driver

 - Miscellaneous arm64 fixes/cleanups:

     - Page table accessors cleanup: type updates, drop unused macros,
       reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity
       check addresses before runtime P4D/PUD folding

     - Command line override for ID_AA64MMFR0_EL1.ECV (advertising the
       FEAT_ECV for the generic timers) allowing Linux to boot with
       firmware deployments that don't set SCTLR_EL3.ECVEn

     - ACPI/arm64: tighten the check for the array of platform timer
       structures and adjust the error handling procedure in
       gtdt_parse_timer_block()

     - Optimise the cache flush for the uprobes xol slot (skip if no
       change) and other uprobes/kprobes cleanups

     - Fix the context switching of tpidrro_el0 when kpti is enabled

     - Dynamic shadow call stack fixes

     - Sysreg updates

     - Various arm64 kselftest improvements

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (168 commits)
  arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
  kselftest/arm64: Try harder to generate different keys during PAC tests
  kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
  arm64/ptrace: Clarify documentation of VL configuration via ptrace
  kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
  acpi/arm64: remove unnecessary cast
  arm64/mm: Change protval as 'pteval_t' in map_range()
  kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c
  kselftest/arm64: Add FPMR coverage to fp-ptrace
  kselftest/arm64: Expand the set of ZA writes fp-ptrace does
  kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
  kselftest/arm64: Enable build of PAC tests with LLVM=1
  kselftest/arm64: Check that SVCR is 0 in signal handlers
  selftests/mm: Fix unused function warning for aarch64_write_signal_pkey()
  kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests
  kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
  kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
  kselftest/arm64: Fix build with stricter assemblers
  arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
  arm64/scs: Deal with 64-bit relative offsets in FDE frames
  ...
2024-11-18 18:10:37 -08:00
Linus Torvalds
5591fd5e03 lsm/stable-6.13 PR 20241112
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmcztFcUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPvFQ/+KYwRe3g6gFSu7tRA34okHtUopvpF
 KGAaic06c8oy85gSX4B2Xk4HINCgXVUuRi9Z+0yExRWvvBXRRdQRUj1Vdbj4KOEG
 sRsIA1j1YhPU3wyhkAqwpJ97sQE1v9Xb3xizGwTfQKGQkd+cvtHg0QKM08/jPQYq
 bbbcSxoVsUzh8+idAq1UMfdoTsMh2xeCW7Q1+dbBINJykNzKiqEEc21xgBxeomST
 lSG9XFP3BJr1RBlb4Ux+J8YL+2G/rDBWZh1sR5+t31kgClSgs3CMBRFdTATvplKk
 e9vrcUF8wR7xWWnDmmdobHa462qUt6BWifYarX9RTomGBugZfYDOR/C+jpb+xZwd
 +tZfL6HSOVeBtQ/Zu1bs18eS5i2dj7GxFN7GPY2qXIPvsW5Acwcx1CCK6oNDmX05
 1cOaNuZRYBDye4eAnT3yufnJ34VO80UQIfKTE6dqrX0XtCFYomTxb+Km0qM3utl5
 ubr3Krp6GmVs65lIvtnIhDKSlcNIBbJfH64vdQNnOn/8FvkovGqp2eaX+0wBhROM
 8KgbqntXU4/DgQuDiP01g13mTDeTGdcfyRWKcKMI/CzI/WASPZBpVuqX6xWXh3bs
 NlZmJ/7+Y48Xp2FvaEchQ/A8ppyIrigMLloZ8yAHf2P1z9g6wBNRCrsScdSQVx63
 ArxHLRY44pUOnPs=
 =m/yY
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20241112' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:
 "Thirteen patches, all focused on moving away from the current 'secid'
  LSM identifier to a richer 'lsm_prop' structure.

  This move will help reduce the translation that is necessary in many
  LSMs, offering better performance, and make it easier to support
  different LSMs in the future"

* tag 'lsm-pr-20241112' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  lsm: remove lsm_prop scaffolding
  netlabel,smack: use lsm_prop for audit data
  audit: change context data from secid to lsm_prop
  lsm: create new security_cred_getlsmprop LSM hook
  audit: use an lsm_prop in audit_names
  lsm: use lsm_prop in security_inode_getsecid
  lsm: use lsm_prop in security_current_getsecid
  audit: update shutdown LSM data
  lsm: use lsm_prop in security_ipc_getsecid
  audit: maintain an lsm_prop in audit_context
  lsm: add lsmprop_to_secctx hook
  lsm: use lsm_prop in security_audit_rule_match
  lsm: add the lsm_prop data structure
2024-11-18 17:34:05 -08:00
Linus Torvalds
a8220b0ca7 audit/stable-6.13 PR 20241112
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmcztDIUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOI6g//dAY0z6TVYzGWSbsSim+ZBDycMAjA
 AVwQONdQTkQBO9MEw6C7HIQeECQn+jm52gTDSpRxeUfJBbO/KTPbm3e0TQ0vGT1A
 ED7QW2u3BkEM/8mMY8UOPPx+PWO7yb08gMZd+WSKGuhL34Ypsa1zm2Pf5hjiX7S+
 eGXJ/5IMaCQcCevR0EpMz8T1VgidJRRhl0HfaNALt4FR+4Ppsn6upMQtOZ9mmr7Q
 IQpL0ZlOJiSjoYRpOmNfGM94ikS+H8b7OC0EjJzRyetw7laaHqmM/OtLWqlgyOdZ
 B2oZ3q0J79wBOJMZxHf09rodNmhl686nHeDPOnpGKahjsNON7LFua13b+UqHzHHE
 QlMdquZpO2QNaXxfN+H9S8VOe7rcGfLO1yElhP+ydpfX4DHHUGSv22Gu1jmAmR8V
 Uyem7zZWTAkcK0zx0w9MjNN+IgD2uI+r175eL/jfOZUFqYnG2696KEBksnd8k4Vr
 fP/99MGn+juX8zhMTUUxcNfcYISPwJjLAT1mA/conhXHD8SSACFK963cGjuo8snI
 1QA1qMfW4sEMphTBiit4fDd2+2V6MPCR+uMovvzqTXOC1tO8FuSkJWmcXQbY0LkT
 JOQsVrp7XqZDecvmERYn2EAmO68XlVOJD+Bx8i4b4W3u5hlp5uG3WAv/QEDfVVdf
 y+gs16bgWb9fyS0=
 =SOyQ
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20241112' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "The audit patches are minimal this time around with one patch to
  correct some kdoc function parameters and one to leverage the
  `str_yes_no()` function; nothing very exciting"

* tag 'audit-pr-20241112' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: Use str_yes_no() helper function
  audit: Reorganize kerneldoc parameter names
2024-11-18 17:28:52 -08:00
Linus Torvalds
0f25f0e4ef the bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same
 scope where they'd been created, getting rid of reassignments
 and passing them by reference, converting to CLASS(fd{,_pos,_raw}).
 
 We are getting very close to having the memory safety of that stuff
 trivial to verify.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdikAAKCRBZ7Krx/gZQ
 69nJAQCmbQHK3TGUbQhOw6MJXOK9ezpyEDN3FZb4jsu38vTIdgEA6OxAYDO2m2g9
 CN18glYmD3wRyU6Bwl4vGODouSJvDgA=
 =gVH3
 -----END PGP SIGNATURE-----

Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull 'struct fd' class updates from Al Viro:
 "The bulk of struct fd memory safety stuff

  Making sure that struct fd instances are destroyed in the same scope
  where they'd been created, getting rid of reassignments and passing
  them by reference, converting to CLASS(fd{,_pos,_raw}).

  We are getting very close to having the memory safety of that stuff
  trivial to verify"

* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
  deal with the last remaing boolean uses of fd_file()
  css_set_fork(): switch to CLASS(fd_raw, ...)
  memcg_write_event_control(): switch to CLASS(fd)
  assorted variants of irqfd setup: convert to CLASS(fd)
  do_pollfd(): convert to CLASS(fd)
  convert do_select()
  convert vfs_dedupe_file_range().
  convert cifs_ioctl_copychunk()
  convert media_request_get_by_fd()
  convert spu_run(2)
  switch spufs_calls_{get,put}() to CLASS() use
  convert cachestat(2)
  convert do_preadv()/do_pwritev()
  fdget(), more trivial conversions
  fdget(), trivial conversions
  privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
  o2hb_region_dev_store(): avoid goto around fdget()/fdput()
  introduce "fd_pos" class, convert fdget_pos() users to it.
  fdget_raw() users: switch to CLASS(fd_raw)
  convert vmsplice() to CLASS(fd)
  ...
2024-11-18 12:24:06 -08:00
Tatsuya S
6ce5a6f0a0 tracing: Fix function name for trampoline
The issue that unrelated function name is shown on stack trace like
following even though it should be trampoline code address is caused by
the creation of trampoline code in the area where .init.text section
of module was freed after module is loaded.

bash-1344    [002] .....    43.644608: <stack trace>
=> (MODULE INIT FUNCTION)
=> vfs_write
=> ksys_write
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

To resolve this, when function address of stack trace entry is in
trampoline, output without looking up symbol name.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241021071454.34610-2-tatsuya.s2862@gmail.com
Signed-off-by: Tatsuya S <tatsuya.s2862@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-18 15:08:10 -05:00
Linus Torvalds
a5ca574796 vfs-6.13.usercopy
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzchMwAKCRCRxhvAZXjc
 okICAP4h6tDl7dgTv8GkL0tgaHi/36m+ilctXbEtIe9fbkc/fQD8D5t6jYaz47gu
 zVY7qOrtQOQ/diNavzxyky99Uh3dKgo=
 =lwkw
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.usercopy' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull copy_struct_to_user helper from Christian Brauner:
 "This adds a copy_struct_to_user() helper which is a companion helper
  to the already widely used copy_struct_from_user().

  It copies a struct from kernel space to userspace, in a way that
  guarantees backwards-compatibility for struct syscall arguments as
  long as future struct extensions are made such that all new fields are
  appended to the old struct, and zeroed-out new fields have the same
  meaning as the old struct.

  The first user is sched_getattr() system call but the new extensible
  pidfs ioctl will be ported to it as well"

* tag 'vfs-6.13.usercopy' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  sched_getattr: port to copy_struct_to_user
  uaccess: add copy_struct_to_user helper
2024-11-18 10:50:09 -08:00
Linus Torvalds
4c797b11a8 vfs-6.13.file
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcW4gAKCRCRxhvAZXjc
 okF+AP9xTMb2SlnRPBOBd9yFcmVXmQi86TSCUPAEVb+wIldGYwD/RIOdvXYJlp9v
 RgJkU1DC3ddkXtONNDY6gFaP+siIWA0=
 =gMc7
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs file updates from Christian Brauner:
 "This contains changes the changes for files for this cycle:

   - Introduce a new reference counting mechanism for files.

     As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop
     it has O(N^2) behaviour under contention with N concurrent
     operations and it is in a hot path in __fget_files_rcu().

     The rcuref infrastructures remedies this problem by using an
     unconditional increment relying on safe- and dead zones to make
     this work and requiring rcu protection for the data structure in
     question. This not just scales better it also introduces overflow
     protection.

     However, in contrast to generic rcuref, files require a memory
     barrier and thus cannot rely on *_relaxed() atomic operations and
     also require to be built on atomic_long_t as having massive amounts
     of reference isn't unheard of even if it is just an attack.

     This adds a file specific variant instead of making this a generic
     library.

     This has been tested by various people and it gives consistent
     improvement up to 3-5% on workloads with loads of threads.

   - Add a fastpath for find_next_zero_bit(). Skip 2-levels searching
     via find_next_zero_bit() when there is a free slot in the word that
     contains the next fd. This improves pts/blogbench-1.1.0 read by 8%
     and write by 4% on Intel ICX 160.

   - Conditionally clear full_fds_bits since it's very likely that a bit
     in full_fds_bits has been cleared during __clear_open_fds(). This
     improves pts/blogbench-1.1.0 read up to 13%, and write up to 5% on
     Intel ICX 160.

   - Get rid of all lookup_*_fdget_rcu() variants. They were used to
     lookup files without taking a reference count. That became invalid
     once files were switched to SLAB_TYPESAFE_BY_RCU and now we're
     always taking a reference count. Switch to an already existing
     helper and remove the legacy variants.

   - Remove pointless includes of <linux/fdtable.h>.

   - Avoid cmpxchg() in close_files() as nobody else has a reference to
     the files_struct at that point.

   - Move close_range() into fs/file.c and fold __close_range() into it.

   - Cleanup calling conventions of alloc_fdtable() and expand_files().

   - Merge __{set,clear}_close_on_exec() into one.

   - Make __set_open_fd() set cloexec as well instead of doing it in two
     separate steps"

* tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests: add file SLAB_TYPESAFE_BY_RCU recycling stressor
  fs: port files to file_ref
  fs: add file_ref
  expand_files(): simplify calling conventions
  make __set_open_fd() set cloexec state as well
  fs: protect backing files with rcu
  file.c: merge __{set,clear}_close_on_exec()
  alloc_fdtable(): change calling conventions.
  fs/file.c: add fast path in find_next_fd()
  fs/file.c: conditionally clear full_fds
  fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()
  move close_range(2) into fs/file.c, fold __close_range() into it
  close_files(): don't bother with xchg()
  remove pointless includes of <linux/fdtable.h>
  get rid of ...lookup...fdget_rcu() family
2024-11-18 10:30:29 -08:00
Linus Torvalds
6ac81fd55e vfs-6.13.mgtime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcScQAKCRCRxhvAZXjc
 oj+5AP4k822a77wc/3iPFk379naIvQ4dsrgemh0/Pb6ZvzvkFQEAi3vFCfzCDR2x
 SkJF/RwXXKZv6U31QXMRt2Qo6wfBuAc=
 =nVlm
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs multigrain timestamps from Christian Brauner:
 "This is another try at implementing multigrain timestamps. This time
  with significant help from the timekeeping maintainers to reduce the
  performance impact.

  Thomas provided a base branch that contains the required timekeeping
  interfaces for the VFS. It serves as the base for the multi-grain
  timestamp work:

   - Multigrain timestamps allow the kernel to use fine-grained
     timestamps when an inode's attributes is being actively observed
     via ->getattr(). With this support, it's possible for a file to get
     a fine-grained timestamp, and another modified after it to get a
     coarse-grained stamp that is earlier than the fine-grained time. If
     this happens then the files can appear to have been modified in
     reverse order, which breaks VFS ordering guarantees.

     To prevent this, a floor value is maintained for multigrain
     timestamps. Whenever a fine-grained timestamp is handed out, record
     it, and when later coarse-grained stamps are handed out, ensure
     they are not earlier than that value. If the coarse-grained
     timestamp is earlier than the fine-grained floor, return the floor
     value instead.

     The timekeeper changes add a static singleton atomic64_t into
     timekeeper.c that is used to keep track of the latest fine-grained
     time ever handed out. This is tracked as a monotonic ktime_t value
     to ensure that it isn't affected by clock jumps. Because it is
     updated at different times than the rest of the timekeeper object,
     the floor value is managed independently of the timekeeper via a
     cmpxchg() operation, and sits on its own cacheline.

     Two new public timekeeper interfaces are added:

      (1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the
          later of the coarse-grained clock and the floor time

      (2) ktime_get_real_ts64_mg() gets the fine-grained clock value,
          and tries to swap it into the floor. A timespec64 is filled
          with the result.

   - The VFS has always used coarse-grained timestamps when updating the
     ctime and mtime after a change. This has the benefit of allowing
     filesystems to optimize away a lot metadata updates, down to around
     1 per jiffy, even when a file is under heavy writes.

     Unfortunately, this has always been an issue when we're exporting
     via NFSv3, which relies on timestamps to validate caches. A lot of
     changes can happen in a jiffy, so timestamps aren't sufficient to
     help the client decide when to invalidate the cache. Even with
     NFSv4, a lot of exported filesystems don't properly support a
     change attribute and are subject to the same problems with
     timestamp granularity. Other applications have similar issues with
     timestamps (e.g backup applications).

     If we were to always use fine-grained timestamps, that would
     improve the situation, but that becomes rather expensive, as the
     underlying filesystem would have to log a lot more metadata
     updates.

     This adds a way to only use fine-grained timestamps when they are
     being actively queried. Use the (unused) top bit in
     inode->i_ctime_nsec as a flag that indicates whether the current
     timestamps have been queried via stat() or the like. When it's set,
     we allow the kernel to use a fine-grained timestamp iff it's
     necessary to make the ctime show a different value.

     This solves the problem of being able to distinguish the timestamp
     between updates, but introduces a new problem: it's now possible
     for a file being changed to get a fine-grained timestamp. A file
     that is altered just a bit later can then get a coarse-grained one
     that appears older than the earlier fine-grained time. This
     violates timestamp ordering guarantees.

     This is where the earlier mentioned timkeeping interfaces help. A
     global monotonic atomic64_t value is kept that acts as a timestamp
     floor. When we go to stamp a file, we first get the latter of the
     current floor value and the current coarse-grained time. If the
     inode ctime hasn't been queried then we just attempt to stamp it
     with that value.

     If it has been queried, then first see whether the current coarse
     time is later than the existing ctime. If it is, then we accept
     that value. If it isn't, then we get a fine-grained time and try to
     swap that into the global floor. Whether that succeeds or fails, we
     take the resulting floor time, convert it to realtime and try to
     swap that into the ctime.

     We take the result of the ctime swap whether it succeeds or fails,
     since either is just as valid.

     Filesystems can opt into this by setting the FS_MGTIME fstype flag.
     Others should be unaffected (other than being subject to the same
     floor value as multigrain filesystems)"

* tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: reduce pointer chasing in is_mgtime() test
  tmpfs: add support for multigrain timestamps
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  Documentation: add a new file documenting multigrain timestamps
  fs: add percpu counters for significant multigrain timestamp events
  fs: tracepoints around multigrain timestamp events
  fs: handle delegated timestamps in setattr_copy_mgtime
  timekeeping: Add percpu counter for tracking floor swap events
  timekeeping: Add interfaces for handling timestamps with a floor value
  fs: have setattr_copy handle multigrain timestamps appropriately
  fs: add infrastructure for multigrain timestamps
2024-11-18 09:15:39 -08:00
Frederic Weisbecker
cdc905d16b posix-timers: Fix spurious warning on double enqueue versus do_exit()
A timer sigqueue may find itself already pending when it is tried to
be enqueued. This situation can happen if the timer sigqueue is enqueued
but then the timer is reset afterwards and fires before the pending
signal managed to be delivered.

However when such a double enqueue occurs while the corresponding signal
is ignored, the sigqueue is expected to be found either on the dedicated
ignored list if the timer was periodic or dropped if the timer was
one-shot. In any case it is not supposed to be queued on the real signal
queue.

An assertion verifies the latter expectation on top of the return value
of prepare_signal(), assuming "false" means that the signal is being
ignored. But prepare_signal() may also fail if the target is exiting as
the last task of its group. In this case the double enqueue observes the
sigqueue queued, as in such a situation:

    TASK A (same group as B)                   TASK B (same group as A)
    ------------------------                   ------------------------

    // timer event
    // queue signal to TASK B
    posix_timer_queue_signal()
    // reset timer through syscall
    do_timer_settime()
    // exit, leaving task B alone
    do_exit()
                                               do_exit()
                                                  synchronize_group_exit()
                                                      signal->flags = SIGNAL_GROUP_EXIT
                                                  // ========> <IRQ> timer event
                                                  posix_timer_queue_signal()
                                                  // return false due to SIGNAL_GROUP_EXIT
                                                  if (!prepare_signal())
                                                     WARN_ON_ONCE(!list_empty(&q->list))

And this spuriously triggers this warning:

    WARNING: CPU: 0 PID: 5854 at kernel/signal.c:2008 posixtimer_send_sigqueue
    CPU: 0 UID: 0 PID: 5854 Comm: syz-executor139 Not tainted 6.12.0-rc6-next-20241108-syzkaller #0
    RIP: 0010:posixtimer_send_sigqueue+0x9da/0xbc0 kernel/signal.c:2008
    Call Trace:
     <IRQ>
     alarm_handle_timer
     alarmtimer_fired
     __run_hrtimer
     __hrtimer_run_queues
     hrtimer_interrupt
     local_apic_timer_interrupt
     __sysvec_apic_timer_interrupt
     instr_sysvec_apic_timer_interrupt
     sysvec_apic_timer_interrupt
     </IRQ>

Fortunately the recovery code in that case already does the right thing:
just exit from posixtimer_send_sigqueue() and wait for __exit_signal()
to flush the pending signal. Just make sure to warn only the case when
the sigqueue is queued and the signal is really ignored.

Fixes: df7a996b4d ("signal: Queue ignored posixtimers on ignore list")
Reported-by: syzbot+852e935b899bde73626e@syzkaller.appspotmail.com
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: syzbot+852e935b899bde73626e@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/20241116234823.28497-1-frederic@kernel.org
Closes: https://lore.kernel.org/all/673549c6.050a0220.1324f8.008c.GAE@google.com
2024-11-18 18:03:59 +01:00
Jeff Xie
60b1f578b5 ftrace: Get the true parent ip for function tracer
When using both function tracer and function graph simultaneously,
it is found that function tracer sometimes captures a fake parent ip
(return_to_handler) instead of the true parent ip.

This issue is easy to reproduce. Below are my reproduction steps:

jeff-labs:~/bin # ./trace-net.sh

jeff-labs:~/bin # cat /sys/kernel/debug/tracing/instances/foo/trace | grep return_to_handler
    trace-net.sh-405     [001] ...2.    31.859501: avc_has_perm+0x4/0x190 <-return_to_handler+0x0/0x40
    trace-net.sh-405     [001] ...2.    31.859503: simple_setattr+0x4/0x70 <-return_to_handler+0x0/0x40
    trace-net.sh-405     [001] ...2.    31.859503: truncate_pagecache+0x4/0x60 <-return_to_handler+0x0/0x40
    trace-net.sh-405     [001] ...2.    31.859505: unmap_mapping_range+0x4/0x140 <-return_to_handler+0x0/0x40
    trace-net.sh-405     [001] ...3.    31.859508: _raw_spin_unlock+0x4/0x30 <-return_to_handler+0x0/0x40
    [...]

The following is my simple trace script:

<snip>
jeff-labs:~/bin # cat ./trace-net.sh
TRACE_PATH="/sys/kernel/tracing"

set_events() {
        echo 1 > $1/events/net/enable
        echo 1 > $1/events/tcp/enable
        echo 1 > $1/events/sock/enable
        echo 1 > $1/events/napi/enable
        echo 1 > $1/events/fib/enable
        echo 1 > $1/events/neigh/enable
}

set_events ${TRACE_PATH}
echo 1 > ${TRACE_PATH}/options/sym-offset
echo 1 > ${TRACE_PATH}/options/funcgraph-tail
echo 1 > ${TRACE_PATH}/options/funcgraph-proc
echo 1 > ${TRACE_PATH}/options/funcgraph-abstime

echo 'tcp_orphan*' > ${TRACE_PATH}/set_ftrace_notrace
echo function_graph > ${TRACE_PATH}/current_tracer

INSTANCE_FOO=${TRACE_PATH}/instances/foo
if [ ! -e $INSTANCE_FOO ]; then
        mkdir ${INSTANCE_FOO}
fi
set_events ${INSTANCE_FOO}
echo 1 > ${INSTANCE_FOO}/options/sym-offset
echo 'tcp_orphan*' > ${INSTANCE_FOO}/set_ftrace_notrace
echo function > ${INSTANCE_FOO}/current_tracer

echo 1 > ${TRACE_PATH}/tracing_on
echo 1 > ${INSTANCE_FOO}/tracing_on

echo > ${TRACE_PATH}/trace
echo > ${INSTANCE_FOO}/trace
</snip>

Link: https://lore.kernel.org/20241008033159.22459-1-jeff.xie@linux.dev
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Jeff Xie <jeff.xie@linux.dev>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-18 12:02:43 -05:00
Thomas Weißschuh
e7240bd91f cpu: Remove spurious NULL in attribute_group definition
This NULL value is most-likely a copy-paste error from an array
definition. The NULL doesn't have any effect.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20241118-sysfs-const-attribute_group-fixes-v1-3-48e0b0ad8cba@weissschuh.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-18 16:20:46 +01:00
Nir Lichtman
24b2455fe8 kdb: fix ctrl+e/a/f/b/d/p/n broken in keyboard mode
Problem: When using kdb via keyboard it does not react to control
characters which are supported in serial mode.

Example: Chords such as ctrl+a/e/d/p do not work in keyboard mode

Solution: Before disregarding non-printable key characters, check if they
are one of the supported control characters, I have took the control
characters from the switch case upwards in this function that translates
scan codes of arrow keys/backspace/home/.. to the control characters.

Suggested-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Nir Lichtman <nir@lichtman.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20241111215622.GA161253@lichtman.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2024-11-18 15:20:22 +00:00
liujing
537affea16 ring-buffer: Correct a grammatical error in a comment
The word "trace" begins with a consonant sound,
so "a" should be used instead of "an".

Link: https://lore.kernel.org/20241107095327.6390-1-liujing@cmss.chinamobile.com
Signed-off-by: liujing <liujing@cmss.chinamobile.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-18 09:40:17 -05:00
Petr Mladek
34767e5357 Merge branch 'for-6.13-force-console' into for-linus 2024-11-18 14:07:05 +01:00
Linus Torvalds
4a5df37964 10 hotfixes, 7 of which are cc:stable. All singletons, please see the
changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzkr6AAKCRDdBJ7gKXxA
 jsb2AP9HCOI4w9rQTmBdnaefXytS7fiiPq+LVNpjJ0NGXX2FSgD/e1NM0wi8KevQ
 npcvlqTcXtRSJvYNF904aTNyDn+Kuw0=
 =KFGY
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "10 hotfixes, 7 of which are cc:stable. All singletons, please see the
  changelogs for details"

* tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: revert "mm: shmem: fix data-race in shmem_getattr()"
  ocfs2: uncache inode which has failed entering the group
  mm: fix NULL pointer dereference in alloc_pages_bulk_noprof
  mm, doc: update read_ahead_kb for MADV_HUGEPAGE
  fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args()
  sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers
  crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32
  mm/mremap: fix address wraparound in move_page_tables()
  tools/mm: fix compile error
  mm, swap: fix allocation and scanning race with swapoff
2024-11-16 16:00:38 -08:00
Linus Torvalds
b5a24181e4 Ring buffer fixes for 6.12:
- Revert: "ring-buffer: Do not have boot mapped buffers hook to CPU hotplug"
 
   A crash that happened on cpu hotplug was actually caused by the incorrect
   ref counting that was fixed by commit 2cf9733891 ("ring-buffer: Fix
   refcount setting of boot mapped buffers"). The removal of calling cpu
   hotplug callbacks on memory mapped buffers was not an issue even though
   the tests at the time pointed toward it. But in fact, there's a check in
   that code that tests to see if the buffers are already allocated or not,
   and will not allocate them again if they are. Not calling the cpu hotplug
   callbacks ended up not initializing the non boot CPU buffers.
 
   Simply remove that change.
 
 - Clear all CPU buffers when starting tracing in a boot mapped buffer
 
   To properly process events from a previous boot, the address space needs to
   be accounted for due to KASLR and the events in the buffer are updated
   accordingly when read. This also requires that when the buffer has tracing
   enabled again in the current boot that the buffers are reset so that events
   from the previous boot do not interact with the events of the current boot
   and cause confusing due to not having the proper meta data.
 
   It was found that if a CPU is taken offline, that its per CPU buffer is not
   reset when tracing starts. This allows for events to be from both the
   previous boot and the current boot to be in the buffer at the same time.
   Clear all CPU buffers when tracing is started in a boot mapped buffer.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZzdr5hQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qq3gAQDsqNNld3D3wW72VMJ52d9zdBXFUdrV
 hbszve+PSj/wuAD/TeCp0BcI8Az+G7/enMXnlEugLo3XKLr/YvPQ3nlb8QA=
 =VR4z
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.12-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring buffer fixes from Steven Rostedt:

 - Revert: "ring-buffer: Do not have boot mapped buffers hook to CPU
   hotplug"

   A crash that happened on cpu hotplug was actually caused by the
   incorrect ref counting that was fixed by commit 2cf9733891
   ("ring-buffer: Fix refcount setting of boot mapped buffers"). The
   removal of calling cpu hotplug callbacks on memory mapped buffers was
   not an issue even though the tests at the time pointed toward it. But
   in fact, there's a check in that code that tests to see if the
   buffers are already allocated or not, and will not allocate them
   again if they are. Not calling the cpu hotplug callbacks ended up not
   initializing the non boot CPU buffers.

   Simply remove that change.

 - Clear all CPU buffers when starting tracing in a boot mapped buffer

   To properly process events from a previous boot, the address space
   needs to be accounted for due to KASLR and the events in the buffer
   are updated accordingly when read. This also requires that when the
   buffer has tracing enabled again in the current boot that the buffers
   are reset so that events from the previous boot do not interact with
   the events of the current boot and cause confusing due to not having
   the proper meta data.

   It was found that if a CPU is taken offline, that its per CPU buffer
   is not reset when tracing starts. This allows for events to be from
   both the previous boot and the current boot to be in the buffer at
   the same time. Clear all CPU buffers when tracing is started in a
   boot mapped buffer.

* tag 'trace-ringbuffer-v6.12-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/ring-buffer: Clear all memory mapped CPU ring buffers on first recording
  Revert: "ring-buffer: Do not have boot mapped buffers hook to CPU hotplug"
2024-11-16 08:12:43 -08:00
Frederic Weisbecker
d8dfba2c60 Merge branches 'rcu/fixes', 'rcu/nocb', 'rcu/torture', 'rcu/stall' and 'rcu/srcu' into rcu/dev 2024-11-15 22:38:53 +01:00
Uladzislau Rezki (Sony)
c229d579d0 rcuscale: Remove redundant WARN_ON_ONCE() splat
There are two places where WARN_ON_ONCE() is called two times
in the error paths. One which is encapsulated into if() condition
and another one, which is unnecessary, is placed in the brackets.

Remove an extra WARN_ON_ONCE() splat which is in brackets.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-15 22:24:41 +01:00
Uladzislau Rezki (Sony)
812a1c3b9f rcuscale: Do a proper cleanup if kfree_scale_init() fails
A static analyzer for C, Smatch, reports and triggers below
warnings:

   kernel/rcu/rcuscale.c:1215 rcu_scale_init()
   warn: inconsistent returns 'global &fullstop_mutex'.

The checker complains about, we do not unlock the "fullstop_mutex"
mutex, in case of hitting below error path:

<snip>
...
    if (WARN_ON_ONCE(jiffies_at_lazy_cb - jif_start < 2 * HZ)) {
        pr_alert("ERROR: call_rcu() CBs are not being lazy as expected!\n");
        WARN_ON_ONCE(1);
        return -1;
        ^^^^^^^^^^
...
<snip>

it happens because "-1" is returned right away instead of
doing a proper unwinding.

Fix it by jumping to "unwind" label instead of returning -1.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Closes: https://lore.kernel.org/rcu/ZxfTrHuEGtgnOYWp@pc636/T/
Fixes: 084e04fff1 ("rcuscale: Add laziness and kfree tests")
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-15 22:23:50 +01:00
Paul E. McKenney
9407f5c3ec srcu: Unconditionally record srcu_read_lock_lite() in ->srcu_reader_flavor
Currently, srcu_read_lock_lite() uses the SRCU_READ_FLAVOR_LITE bit in
->srcu_reader_flavor to communicate to the grace-period processing in
srcu_readers_active_idx_check() that the smp_mb() must be replaced by a
synchronize_rcu().  Unfortunately, ->srcu_reader_flavor is not updated
unless the kernel is built with CONFIG_PROVE_RCU=y.  Therefore in all
kernels built with CONFIG_PROVE_RCU=n, srcu_readers_active_idx_check()
incorrectly uses smp_mb() instead of synchronize_rcu() for srcu_struct
structures whose readers use srcu_read_lock_lite().

This commit therefore causes Tree SRCU srcu_read_lock_lite()
to unconditionally update ->srcu_reader_flavor so that
srcu_readers_active_idx_check() can make the correct choice.

Reported-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Closes: https://lore.kernel.org/all/d07e8f4a-d5ff-4c8e-8e61-50db285c57e9@amd.com/
Fixes: c0f08d6b5a61 ("srcu: Add srcu_read_lock_lite() and srcu_read_unlock_lite()")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-15 22:13:37 +01:00
Rafael J. Wysocki
923c256e37 Merge branches 'pm-cpuidle' and 'pm-em'
Merge cpuidle and Energy Model changes for 6.13-rc1:

 - Add a built-in idle states table for Granite Rapids Xeon D to the
   intel_idle driver (Artem Bityutskiy).

 - Fix some typos in comments in the cpuidle core and drivers (Shen
   Lichuan).

 - Remove iowait influence from the menu cpuidle governor (Christian
   Loehle).

 - Add min/max available performance state limits to the Energy Model
   management code (Lukasz Luba).

* pm-cpuidle:
  intel_idle: add Granite Rapids Xeon D support
  cpuidle: Correct some typos in comments
  cpuidle: menu: Remove iowait influence

* pm-em:
  PM: EM: Add min/max available performance state limits
2024-11-15 19:54:05 +01:00
Andrii Nakryiko
96a30e469c bpf: use common instruction history across all states
Instead of allocating and copying instruction history each time we
enqueue child verifier state, switch to a model where we use one common
dynamically sized array of instruction history entries across all states.

The key observation for proving this is correct is that instruction
history is only relevant while state is active, which means it either is
a current state (and thus we are actively modifying instruction history
and no other state can interfere with us) or we are checkpointed state
with some children still active (either enqueued or being current).

In the latter case our portion of instruction history is finalized and
won't change or grow, so as long as we keep it immutable until the state
is finalized, we are good.

Now, when state is finalized and is put into state hash for potentially
future pruning lookups, instruction history is not used anymore. This is
because instruction history is only used by precision marking logic, and
we never modify precision markings for finalized states.

So, instead of each state having its own small instruction history, we
keep a global dynamically-sized instruction history, where each state in
current DFS path from root to active state remembers its portion of
instruction history. Current state can append to this history, but
cannot modify any of its parent histories.

Async callback state enqueueing, while logically detached from parent
state, still is part of verification backtracking tree, so has to follow
the same schema as normal state checkpoints.

Because the insn_hist array can be grown through realloc, states don't
keep pointers, they instead maintain two indices, [start, end), into
global instruction history array. End is exclusive index, so
`start == end` means there is no relevant instruction history.

This eliminates a lot of allocations and minimizes overall memory usage.

For instance, running a worst-case test from [0] (but without the
heuristics-based fix [1]), it took 12.5 minutes until we get -ENOMEM.
With the changes in this patch the whole test succeeds in 10 minutes
(very slow, so heuristics from [1] is important, of course).

To further validate correctness, veristat-based comparison was performed for
Meta production BPF objects and BPF selftests objects. In both cases there
were no differences *at all* in terms of verdict or instruction and state
counts, providing a good confidence in the change.

Having this low-memory-overhead solution of keeping dynamic
per-instruction history cheaply opens up some new possibilities, like
keeping extra information for literally every single validated
instruction. This will be used for simplifying precision backpropagation
logic in follow up patches.

  [0] https://lore.kernel.org/bpf/20241029172641.1042523-2-eddyz87@gmail.com/
  [1] https://lore.kernel.org/bpf/20241029172641.1042523-1-eddyz87@gmail.com/

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241115001303.277272-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-15 10:20:47 -08:00
Linus Torvalds
d79944b094 sched_ext: One more fix for v6.12-rc7
ops.cpu_acquire() was being invoked with the wrong kfunc mask allowing the
 operation to call kfuncs which shouldn't be allowed. Fix it by using
 SCX_KF_REST instead, which is trivial and low risk.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZzamXw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGRReAP4/JQ1mKkJv+9nTZkW9OcFFHGVVhrprOUEEFk5j
 pmHwPAD8DTBMMS/BCQOoXDdiB9uU7ut6M8VdsIj1jmJkMja+eQI=
 =942J
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.12-rc7-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fix from Tejun Heo:
 "One more fix for v6.12-rc7

  ops.cpu_acquire() was being invoked with the wrong kfunc mask allowing
  the operation to call kfuncs which shouldn't be allowed. Fix it by
  using SCX_KF_REST instead, which is trivial and low risk"

* tag 'sched_ext-for-6.12-rc7-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: ops.cpu_acquire() should be called with SCX_KF_REST
2024-11-15 09:59:51 -08:00
Wangyang Guo
85f0d8e39a workqueue: Reduce expensive locks for unbound workqueue
For unbound workqueue, pwqs usually map to just a few pools. Most of
the time, pwqs will be linked sequentially to wq->pwqs list by cpu
index.  Usually, consecutive CPUs have the same workqueue attribute
(e.g. belong to the same NUMA node). This makes pwqs with the same
pool cluster together in the pwq list.

Only do lock/unlock if the pool has changed in flush_workqueue_prep_pwqs().
This reduces the number of expensive lock operations.

The performance data shows this change boosts FIO by 65x in some cases
when multiple concurrent threads write to xfs mount points with fsync.

FIO Benchmark Details
- FIO version: v3.35
- FIO Options: ioengine=libaio,iodepth=64,norandommap=1,rw=write,
  size=128M,bs=4k,fsync=1
- FIO Job Configs: 64 jobs in total writing to 4 mount points (ramdisks
  formatted as xfs file system).
- Kernel Codebase: v6.12-rc5
- Test Platform: Xeon 8380 (2 sockets)

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-15 06:43:39 -10:00
Yonghong Song
4ff04abf9d bpf: Add necessary migrate_disable to range_tree.
When running bpf selftest (./test_progs -j), the following warnings
showed up:

  $ ./test_progs -t arena_atomics
  ...
  BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u19:0/12501
  caller is bpf_mem_free+0x128/0x330
  ...
  Call Trace:
   <TASK>
   dump_stack_lvl
   check_preemption_disabled
   bpf_mem_free
   range_tree_destroy
   arena_map_free
   bpf_map_free_deferred
   process_scheduled_works
   ...

For selftests arena_htab and arena_list, similar smp_process_id() BUGs are
dumped, and the following are two stack trace:

   <TASK>
   dump_stack_lvl
   check_preemption_disabled
   bpf_mem_alloc
   range_tree_set
   arena_map_alloc
   map_create
   ...

   <TASK>
   dump_stack_lvl
   check_preemption_disabled
   bpf_mem_alloc
   range_tree_clear
   arena_vm_fault
   do_pte_missing
   handle_mm_fault
   do_user_addr_fault
   ...

Add migrate_{disable,enable}() around related bpf_mem_{alloc,free}()
calls to fix the issue.

Fixes: b795379757 ("bpf: Introduce range_tree data structure and use it in bpf arena")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241115060354.2832495-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-15 08:11:53 -08:00
Viktor Malik
ab4dc30c53 bpf: Do not alloc arena on unsupported arches
Do not allocate BPF arena on arches that do not support it, instead
return EOPNOTSUPP. This is useful to prevent bugs such as soft lockups
while trying to free the arena which we have witnessed on ppc64le [1].

[1] https://lore.kernel.org/bpf/4afdcb50-13f2-4772-8db1-3fd02bd985b3@redhat.com/

Signed-off-by: Viktor Malik <vmalik@redhat.com>
Link: https://lore.kernel.org/r/20241115082548.74972-1-vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-15 08:10:13 -08:00
Dave Vasilevsky
31daa34315 crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32
Fixes boot failures on 6.9 on PPC_BOOK3S_32 machines using Open Firmware. 
On these machines, the kernel refuses to boot from non-zero
PHYSICAL_START, which occurs when CRASH_DUMP is on.

Since most PPC_BOOK3S_32 machines boot via Open Firmware, it should
default to off for them.  Users booting via some other mechanism can still
turn it on explicitly.

Does not change the default on any other architectures for the
time being.

Link: https://lkml.kernel.org/r/20240917163720.1644584-1-dave@vasilevsky.ca
Fixes: 75bc255a74 ("crash: clean up kdump related config items")
Signed-off-by: Dave Vasilevsky <dave@vasilevsky.ca>
Reported-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
Closes: https://lists.debian.org/debian-powerpc/2024/07/msg00001.html
Acked-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14 22:43:48 -08:00
Zhao Mengmeng
6b8950ef99 sched_ext: Replace scx_next_task_picked() with switch_class() in comment
scx_next_task_picked() has been replaced with siwtch_class(), but comment
is still referencing old one, so replace it.

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-14 15:30:24 -10:00
Sebastian Andrzej Siewior
f946cae86d scftorture: Handle NULL argument passed to scf_add_to_free_list().
Dan reported that after the rework the newly introduced
scf_add_to_free_list() may get a NULL pointer passed. This replaced
kfree() which was fine with a NULL pointer but scf_add_to_free_list()
isn't.

Let scf_add_to_free_list() handle NULL pointer.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/2375aa2c-3248-4ffa-b9b0-f0a24c50f237@stanley.mountain
Fixes: 4788c861ad ("scftorture: Use a lock-less list to free memory.")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-11-14 16:09:51 -08:00
Jakub Kicinski
a79993b5fc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.12-rc8).

Conflicts:

tools/testing/selftests/net/.gitignore
  252e01e682 ("selftests: net: add netlink-dumps to .gitignore")
  be43a6b238 ("selftests: ncdevmem: Move ncdevmem under drivers/net/hw")
https://lore.kernel.org/all/20241113122359.1b95180a@canb.auug.org.au/

drivers/net/phy/phylink.c
  671154f174 ("net: phylink: ensure PHY momentary link-fails are handled")
  7530ea26c8 ("net: phylink: remove "using_mac_select_pcs"")

Adjacent changes:

drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c
  5b366eae71 ("stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines")
  e96321fad3 ("net: ethernet: Switch back to struct platform_driver::remove()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-14 11:29:15 -08:00
Tejun Heo
a4af89cc50 sched_ext: ops.cpu_acquire() should be called with SCX_KF_REST
ops.cpu_acquire() is currently called with 0 kf_maks which is interpreted as
SCX_KF_UNLOCKED which allows all unlocked kfuncs, but ops.cpu_acquire() is
called from balance_one() under the rq lock and should only be allowed call
kfuncs that are safe under the rq lock. Update it to use SCX_KF_REST.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Zhao Mengmeng <zhaomzhao@126.com>
Link: http://lkml.kernel.org/r/ZzYvf2L3rlmjuKzh@slm.duckdns.org
Fixes: 245254f708 ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
2024-11-14 08:50:58 -10:00
Waiman Long
fbfbf86685 cgroup/cpuset: Disable cpuset_cpumask_can_shrink() test if not load balancing
With some recent proposed changes [1] in the deadline server code,
it has caused a test failure in test_cpuset_prs.sh when a change
is being made to an isolated partition. This is due to failing
the cpuset_cpumask_can_shrink() check for SCHED_DEADLINE tasks at
validate_change().

This is actually a false positive as the failed test case involves an
isolated partition with load balancing disabled. The deadline check
is not meaningful in this case and the users should know what they
are doing.

Fix this by doing the cpuset_cpumask_can_shrink() check only when loading
balanced is enabled. Also change its arguments to use effective_cpus
for the current cpuset and user_xcpus() as an approiximation for the
target effective_cpus as the real effective_cpus hasn't been fully
computed yet as this early stage.

As the check isn't comprehensive, there may be false positives or
negatives. We may have to revise the code to do a more thorough check
in the future if this becomes a concern.

[1] https://lore.kernel.org/lkml/82be06c1-6d6d-4651-86c9-bcc828cbcb80@redhat.com/T/#t

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-14 08:44:03 -10:00
Steven Rostedt
09663753bb tracing/ring-buffer: Clear all memory mapped CPU ring buffers on first recording
The events of a memory mapped ring buffer from the previous boot should
not be mixed in with events from the current boot. There's meta data that
is used to handle KASLR so that function names can be shown properly.

Also, since the timestamps of the previous boot have no meaning to the
timestamps of the current boot, having them intermingled in a buffer can
also cause confusion because there could possibly be events in the future.

When a trace is activated the meta data is reset so that the pointers of
are now processed for the new address space. The trace buffers are reset
when tracing starts for the first time. The problem here is that the reset
only happens on online CPUs. If a CPU is offline, it does not get reset.

To demonstrate the issue, a previous boot had tracing enabled in the boot
mapped ring buffer on reboot. On the following boot, tracing has not been
started yet so the function trace from the previous boot is still visible.

 # trace-cmd show -B boot_mapped -c 3 | tail
          <idle>-0       [003] d.h2.   156.462395: __rcu_read_lock <-cpu_emergency_disable_virtualization
          <idle>-0       [003] d.h2.   156.462396: vmx_emergency_disable_virtualization_cpu <-cpu_emergency_disable_virtualization
          <idle>-0       [003] d.h2.   156.462396: __rcu_read_unlock <-__sysvec_reboot
          <idle>-0       [003] d.h2.   156.462397: stop_this_cpu <-__sysvec_reboot
          <idle>-0       [003] d.h2.   156.462397: set_cpu_online <-stop_this_cpu
          <idle>-0       [003] d.h2.   156.462397: disable_local_APIC <-stop_this_cpu
          <idle>-0       [003] d.h2.   156.462398: clear_local_APIC <-disable_local_APIC
          <idle>-0       [003] d.h2.   156.462574: mcheck_cpu_clear <-stop_this_cpu
          <idle>-0       [003] d.h2.   156.462575: mce_intel_feature_clear <-stop_this_cpu
          <idle>-0       [003] d.h2.   156.462575: lmce_supported <-mce_intel_feature_clear

Now, if CPU 3 is taken offline, and tracing is started on the memory
mapped ring buffer, the events from the previous boot in the CPU 3 ring
buffer is not reset. Now those events are using the meta data from the
current boot and produces just hex values.

 # echo 0 > /sys/devices/system/cpu/cpu3/online
 # trace-cmd start -B boot_mapped -p function
 # trace-cmd show -B boot_mapped -c 3 | tail
          <idle>-0       [003] d.h2.   156.462395: 0xffffffff9a1e3194 <-0xffffffff9a0f655e
          <idle>-0       [003] d.h2.   156.462396: 0xffffffff9a0a1d24 <-0xffffffff9a0f656f
          <idle>-0       [003] d.h2.   156.462396: 0xffffffff9a1e6bc4 <-0xffffffff9a0f7323
          <idle>-0       [003] d.h2.   156.462397: 0xffffffff9a0d12b4 <-0xffffffff9a0f732a
          <idle>-0       [003] d.h2.   156.462397: 0xffffffff9a1458d4 <-0xffffffff9a0d12e2
          <idle>-0       [003] d.h2.   156.462397: 0xffffffff9a0faed4 <-0xffffffff9a0d12e7
          <idle>-0       [003] d.h2.   156.462398: 0xffffffff9a0faaf4 <-0xffffffff9a0faef2
          <idle>-0       [003] d.h2.   156.462574: 0xffffffff9a0e3444 <-0xffffffff9a0d12ef
          <idle>-0       [003] d.h2.   156.462575: 0xffffffff9a0e4964 <-0xffffffff9a0d12ef
          <idle>-0       [003] d.h2.   156.462575: 0xffffffff9a0e3fb0 <-0xffffffff9a0e496f

Reset all CPUs when starting a boot mapped ring buffer for the first time,
and not just the online CPUs.

Fixes: 7a1d1e4b96 ("tracing/ring-buffer: Add last_boot_info file to boot instance")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-14 11:54:34 -05:00
Steven Rostedt
580bb355bc Revert: "ring-buffer: Do not have boot mapped buffers hook to CPU hotplug"
A crash happened when testing cpu hotplug with respect to the memory
mapped ring buffers. It was assumed that the hot plug code was adding a
per CPU buffer that was already created that caused the crash. The real
problem was due to ref counting and was fixed by commit 2cf9733891
("ring-buffer: Fix refcount setting of boot mapped buffers").

When a per CPU buffer is created, it will not be created again even with
CPU hotplug, so the fix to not use CPU hotplug was a red herring. In fact,
it caused only the boot CPU buffer to be created, leaving the other CPU
per CPU buffers disabled.

Revert that change as it was not the culprit of the fix it was intended to
be.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241113230839.6c03640f@gandalf.local.home
Fixes: 912da2c384 ("ring-buffer: Do not have boot mapped buffers hook to CPU hotplug")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-14 10:01:00 -05:00
Catalin Marinas
5a4332062e Merge branches 'for-next/gcs', 'for-next/probes', 'for-next/asm-offsets', 'for-next/tlb', 'for-next/misc', 'for-next/mte', 'for-next/sysreg', 'for-next/stacktrace', 'for-next/hwcap3', 'for-next/kselftest', 'for-next/crc32', 'for-next/guest-cca', 'for-next/haft' and 'for-next/scs', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
  perf: Switch back to struct platform_driver::remove()
  perf: arm_pmuv3: Add support for Samsung Mongoose PMU
  dt-bindings: arm: pmu: Add Samsung Mongoose core compatible
  perf/dwc_pcie: Fix typos in event names
  perf/dwc_pcie: Add support for Ampere SoCs
  ARM: pmuv3: Add missing write_pmuacr()
  perf/marvell: Marvell PEM performance monitor support
  perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control
  perf/dwc_pcie: Convert the events with mixed case to lowercase
  perf/cxlpmu: Support missing events in 3.1 spec
  perf: imx_perf: add support for i.MX91 platform
  dt-bindings: perf: fsl-imx-ddr: Add i.MX91 compatible
  drivers perf: remove unused field pmu_node

* for-next/gcs: (42 commits)
  : arm64 Guarded Control Stack user-space support
  kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c
  arm64/gcs: Fix outdated ptrace documentation
  kselftest/arm64: Ensure stable names for GCS stress test results
  kselftest/arm64: Validate that GCS push and write permissions work
  kselftest/arm64: Enable GCS for the FP stress tests
  kselftest/arm64: Add a GCS stress test
  kselftest/arm64: Add GCS signal tests
  kselftest/arm64: Add test coverage for GCS mode locking
  kselftest/arm64: Add a GCS test program built with the system libc
  kselftest/arm64: Add very basic GCS test program
  kselftest/arm64: Always run signals tests with GCS enabled
  kselftest/arm64: Allow signals tests to specify an expected si_code
  kselftest/arm64: Add framework support for GCS to signal handling tests
  kselftest/arm64: Add GCS as a detected feature in the signal tests
  kselftest/arm64: Verify the GCS hwcap
  arm64: Add Kconfig for Guarded Control Stack (GCS)
  arm64/ptrace: Expose GCS via ptrace and core files
  arm64/signal: Expose GCS state in signal frames
  arm64/signal: Set up and restore the GCS context for signal handlers
  arm64/mm: Implement map_shadow_stack()
  ...

* for-next/probes:
  : Various arm64 uprobes/kprobes cleanups
  arm64: insn: Simulate nop instruction for better uprobe performance
  arm64: probes: Remove probe_opcode_t
  arm64: probes: Cleanup kprobes endianness conversions
  arm64: probes: Move kprobes-specific fields
  arm64: probes: Fix uprobes for big-endian kernels
  arm64: probes: Fix simulate_ldr*_literal()
  arm64: probes: Remove broken LDR (literal) uprobe support

* for-next/asm-offsets:
  : arm64 asm-offsets.c cleanup (remove unused offsets)
  arm64: asm-offsets: remove PREEMPT_DISABLE_OFFSET
  arm64: asm-offsets: remove DMA_{TO,FROM}_DEVICE
  arm64: asm-offsets: remove VM_EXEC and PAGE_SZ
  arm64: asm-offsets: remove MM_CONTEXT_ID
  arm64: asm-offsets: remove COMPAT_{RT_,SIGFRAME_REGS_OFFSET
  arm64: asm-offsets: remove VMA_VM_*
  arm64: asm-offsets: remove TSK_ACTIVE_MM

* for-next/tlb:
  : TLB flushing optimisations
  arm64: optimize flush tlb kernel range
  arm64: tlbflush: add __flush_tlb_range_limit_excess()

* for-next/misc:
  : Miscellaneous patches
  arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
  arm64/ptrace: Clarify documentation of VL configuration via ptrace
  acpi/arm64: remove unnecessary cast
  arm64/mm: Change protval as 'pteval_t' in map_range()
  arm64: uprobes: Optimize cache flushes for xol slot
  acpi/arm64: Adjust error handling procedure in gtdt_parse_timer_block()
  arm64: fix .data.rel.ro size assertion when CONFIG_LTO_CLANG
  arm64/ptdump: Test both PTE_TABLE_BIT and PTE_VALID for block mappings
  arm64/mm: Sanity check PTE address before runtime P4D/PUD folding
  arm64/mm: Drop setting PTE_TYPE_PAGE in pte_mkcont()
  ACPI: GTDT: Tighten the check for the array of platform timer structures
  arm64/fpsimd: Fix a typo
  arm64: Expose ID_AA64ISAR1_EL1.XS to sanitised feature consumers
  arm64: Return early when break handler is found on linked-list
  arm64/mm: Re-organize arch_make_huge_pte()
  arm64/mm: Drop _PROT_SECT_DEFAULT
  arm64: Add command-line override for ID_AA64MMFR0_EL1.ECV
  arm64: head: Drop SWAPPER_TABLE_SHIFT
  arm64: cpufeature: add POE to cpucap_is_possible()
  arm64/mm: Change pgattr_change_is_safe() arguments as pteval_t

* for-next/mte:
  : Various MTE improvements
  selftests: arm64: add hugetlb mte tests
  hugetlb: arm64: add mte support

* for-next/sysreg:
  : arm64 sysreg updates
  arm64/sysreg: Update ID_AA64MMFR1_EL1 to DDI0601 2024-09

* for-next/stacktrace:
  : arm64 stacktrace improvements
  arm64: preserve pt_regs::stackframe during exec*()
  arm64: stacktrace: unwind exception boundaries
  arm64: stacktrace: split unwind_consume_stack()
  arm64: stacktrace: report recovered PCs
  arm64: stacktrace: report source of unwind data
  arm64: stacktrace: move dump_backtrace() to kunwind_stack_walk()
  arm64: use a common struct frame_record
  arm64: pt_regs: swap 'unused' and 'pmr' fields
  arm64: pt_regs: rename "pmr_save" -> "pmr"
  arm64: pt_regs: remove stale big-endian layout
  arm64: pt_regs: assert pt_regs is a multiple of 16 bytes

* for-next/hwcap3:
  : Add AT_HWCAP3 support for arm64 (also wire up AT_HWCAP4)
  arm64: Support AT_HWCAP3
  binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4

* for-next/kselftest: (30 commits)
  : arm64 kselftest fixes/cleanups
  kselftest/arm64: Try harder to generate different keys during PAC tests
  kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
  kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
  kselftest/arm64: Add FPMR coverage to fp-ptrace
  kselftest/arm64: Expand the set of ZA writes fp-ptrace does
  kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
  kselftest/arm64: Enable build of PAC tests with LLVM=1
  kselftest/arm64: Check that SVCR is 0 in signal handlers
  kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests
  kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
  kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
  kselftest/arm64: Fix build with stricter assemblers
  kselftest/arm64: Test signal handler state modification in fp-stress
  kselftest/arm64: Provide a SIGUSR1 handler in the kernel mode FP stress test
  kselftest/arm64: Implement irritators for ZA and ZT
  kselftest/arm64: Remove unused ADRs from irritator handlers
  kselftest/arm64: Correct misleading comments on fp-stress irritators
  kselftest/arm64: Poll less often while waiting for fp-stress children
  kselftest/arm64: Increase frequency of signal delivery in fp-stress
  kselftest/arm64: Fix encoding for SVE B16B16 test
  ...

* for-next/crc32:
  : Optimise CRC32 using PMULL instructions
  arm64/crc32: Implement 4-way interleave using PMULL
  arm64/crc32: Reorganize bit/byte ordering macros
  arm64/lib: Handle CRC-32 alternative in C code

* for-next/guest-cca:
  : Support for running Linux as a guest in Arm CCA
  arm64: Document Arm Confidential Compute
  virt: arm-cca-guest: TSM_REPORT support for realms
  arm64: Enable memory encrypt for Realms
  arm64: mm: Avoid TLBI when marking pages as valid
  arm64: Enforce bounce buffers for realm DMA
  efi: arm64: Map Device with Prot Shared
  arm64: rsi: Map unprotected MMIO as decrypted
  arm64: rsi: Add support for checking whether an MMIO is protected
  arm64: realm: Query IPA size from the RMM
  arm64: Detect if in a realm and set RIPAS RAM
  arm64: rsi: Add RSI definitions

* for-next/haft:
  : Support for arm64 FEAT_HAFT
  arm64: pgtable: Warn unexpected pmdp_test_and_clear_young()
  arm64: Enable ARCH_HAS_NONLEAF_PMD_YOUNG
  arm64: Add support for FEAT_HAFT
  arm64: setup: name 'tcr2' register
  arm64/sysreg: Update ID_AA64MMFR1_EL1 register

* for-next/scs:
  : Dynamic shadow call stack fixes
  arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
  arm64/scs: Deal with 64-bit relative offsets in FDE frames
  arm64/scs: Fix handling of DWARF augmentation data in CIE/FDE frames
2024-11-14 12:07:16 +00:00
Paolo Bonzini
0586ade9e7 LoongArch KVM changes for v6.13
1. Add iocsr and mmio bus simulation in kernel.
 2. Add in-kernel interrupt controller emulation.
 3. Add virt extension support for eiointc irqchip.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmc0otUWHGNoZW5odWFj
 YWlAa2VybmVsLm9yZwAKCRAChivD8uImega1D/0Q91hUlKVp55QXDZrnpW7Z71v+
 I9u8avjRiISDMLkjku/HE9eoD7lVYndzkDDSH32W+UVpBharJvuR+MIoH4jtLf3k
 IImybEaBwXru0+8YxbMqIzqcUEbQda0U5u31Ju1U6xcp+y1PGJJJDVPk4vBXOQB3
 +wnLE6Q7orddw3s6G0QYtTv8jPDPOOL0Jv2ClqBaM8mTr2dIEpMjbZg2yGPMQVlE
 mVEgoked9OS5blkoxz2rEfUMQX5CVs20lyhfr05Qk2mTbeKITceqVlx183CyLMUO
 /9uJl7sD1ctxmQtU7ezeM7n7ItP9ehdAPECkt8WWSHM6mGbwHVTAtJoQGZjgoc6O
 pL1aSzhfGH3mdbwUCjhGsov6cZ4hliDQ76H3dlxrSr0JJX3zOPY5qDegmfDlxlyT
 uoKOAsx5D2N+WgshDPApZonkh38agaeTWposamseJbVNZXHmQV8Q8ipiNhgcgtVe
 mAReWfoYHL2mFIQNrfKS2i9J8mRj9SrjcQyNxgeU3L1s5Mr1p11yYXrkfVrZiHVk
 0KzPfNJZvHO7zvgAIbyqyXEAY2Cq6F2r7UIELUOzY2zayoZwbn2jIZrsUVVbUsWp
 G4FbTRQDK1UR1cCVqe9jLmf5BzlSZ+jXOgcg+CxGIAelZ0qRcK/IgkX6/KygSlgY
 49W45xpHtVUycsWDNA==
 =Jov3
 -----END PGP SIGNATURE-----

Merge tag 'loongarch-kvm-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD

LoongArch KVM changes for v6.13

1. Add iocsr and mmio bus simulation in kernel.
2. Add in-kernel interrupt controller emulation.
3. Add virt extension support for eiointc irqchip.
2024-11-14 07:06:24 -05:00
Paolo Bonzini
7b541d557f KVM/arm64 changes for 6.13, part #1
- Support for stage-1 permission indirection (FEAT_S1PIE) and
    permission overlays (FEAT_S1POE), including nested virt + the
    emulated page table walker
 
  - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
    was introduced in PSCIv1.3 as a mechanism to request hibernation,
    similar to the S4 state in ACPI
 
  - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
    part of it, introduce trivial initialization of the host's MPAM
    context so KVM can use the corresponding traps
 
  - PMU support under nested virtualization, honoring the guest
    hypervisor's trap configuration and event filtering when running a
    nested guest
 
  - Fixes to vgic ITS serialization where stale device/interrupt table
    entries are not zeroed when the mapping is invalidated by the VM
 
  - Avoid emulated MMIO completion if userspace has requested synchronous
    external abort injection
 
  - Various fixes and cleanups affecting pKVM, vCPU initialization, and
    selftests
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZzTZXRccb2xpdmVyLnVw
 dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFioUAP0cs2pYcwuCqLgmeHqfz6L5Xsw3
 hKBCNuvr5mjU0hZfLAEA5ml2eUKD7OnssAOmUZ/K/NoCdJFCe8mJWQDlURvr9g4=
 =u2/3
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.13' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 changes for 6.13, part #1

 - Support for stage-1 permission indirection (FEAT_S1PIE) and
   permission overlays (FEAT_S1POE), including nested virt + the
   emulated page table walker

 - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
   was introduced in PSCIv1.3 as a mechanism to request hibernation,
   similar to the S4 state in ACPI

 - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
   part of it, introduce trivial initialization of the host's MPAM
   context so KVM can use the corresponding traps

 - PMU support under nested virtualization, honoring the guest
   hypervisor's trap configuration and event filtering when running a
   nested guest

 - Fixes to vgic ITS serialization where stale device/interrupt table
   entries are not zeroed when the mapping is invalidated by the VM

 - Avoid emulated MMIO completion if userspace has requested synchronous
   external abort injection

 - Various fixes and cleanups affecting pKVM, vCPU initialization, and
   selftests
2024-11-14 07:05:36 -05:00
Geert Uytterhoeven
22293c3373 dma-mapping: save base/size instead of pointer to shared DMA pool
On RZ/Five, which is non-coherent, and uses CONFIG_DMA_GLOBAL_POOL=y:

    Oops - store (or AMO) access fault [#1]
    CPU: 0 UID: 0 PID: 1 Comm: swapper Not tainted 6.12.0-rc1-00015-g8a6e02d0c00e #201
    Hardware name: Renesas SMARC EVK based on r9a07g043f01 (DT)
    epc : __memset+0x60/0x100
     ra : __dma_alloc_from_coherent+0x150/0x17a
    epc : ffffffff8062d2bc ra : ffffffff80053a94 sp : ffffffc60000ba20
     gp : ffffffff812e9938 tp : ffffffd601920000 t0 : ffffffc6000d0000
     t1 : 0000000000000000 t2 : ffffffffe9600000 s0 : ffffffc60000baa0
     s1 : ffffffc6000d0000 a0 : ffffffc6000d0000 a1 : 0000000000000000
     a2 : 0000000000001000 a3 : ffffffc6000d1000 a4 : 0000000000000000
     a5 : 0000000000000000 a6 : ffffffd601adacc0 a7 : ffffffd601a841a8
     s2 : ffffffd6018573c0 s3 : 0000000000001000 s4 : ffffffd6019541e0
     s5 : 0000000200000022 s6 : ffffffd6018f8410 s7 : ffffffd6018573e8
     s8 : 0000000000000001 s9 : 0000000000000001 s10: 0000000000000010
     s11: 0000000000000000 t3 : 0000000000000000 t4 : ffffffffdefe62d1
     t5 : 000000001cd6a3a9 t6 : ffffffd601b2aad6
    status: 0000000200000120 badaddr: ffffffc6000d0000 cause: 0000000000000007
    [<ffffffff8062d2bc>] __memset+0x60/0x100
    [<ffffffff80053e1a>] dma_alloc_from_global_coherent+0x1c/0x28
    [<ffffffff80053056>] dma_direct_alloc+0x98/0x112
    [<ffffffff8005238c>] dma_alloc_attrs+0x78/0x86
    [<ffffffff8035fdb4>] rz_dmac_probe+0x3f6/0x50a
    [<ffffffff803a0694>] platform_probe+0x4c/0x8a

If CONFIG_DMA_GLOBAL_POOL=y, the reserved_mem structure passed to
rmem_dma_setup() is saved for later use, by saving the passed pointer.
However, when dma_init_reserved_memory() is called later, the pointer
has become stale, causing a crash.

E.g. in the RZ/Five case, the referenced memory now contains the
reserved_mem structure for the "mmode_resv0@30000" node (with base
0x30000 and size 0x10000), instead of the correct "pma_resv0@58000000"
node (with base 0x58000000 and size 0x8000000).

Fix this by saving the needed reserved_mem structure's contents instead.

Fixes: 8a6e02d0c0 ("of: reserved_mem: Restructure how the reserved memory regions are processed")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Oreoluwa Babatunde <quic_obabatun@quicinc.com>
Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-11-14 10:45:09 +01:00
Colton Lewis
2c47e7a74f perf/core: Correct perf sampling with guest VMs
Previously any PMU overflow interrupt that fired while a VCPU was
loaded was recorded as a guest event whether it truly was or not. This
resulted in nonsense perf recordings that did not honor
perf_event_attr.exclude_guest and recorded guest IPs where it should
have recorded host IPs.

Rework the sampling logic to only record guest samples for events with
exclude_guest = 0. This way any host-only events with exclude_guest
set will never see unexpected guest samples. The behaviour of events
with exclude_guest = 0 is unchanged.

Note that events configured to sample both host and guest may still
misattribute a PMI that arrived in the host as a guest event depending
on KVM arch and vendor behavior.

Signed-off-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20241113190156.2145593-6-coltonlewis@google.com
2024-11-14 10:40:01 +01:00
Colton Lewis
04782e6391 perf/core: Hoist perf_instruction_pointer() and perf_misc_flags()
For clarity, rename the arch-specific definitions of these functions
to perf_arch_* to denote they are arch-specifc. Define the
generic-named functions in one place where they can call the
arch-specific ones as needed.

Signed-off-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Acked-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20241113190156.2145593-3-coltonlewis@google.com
2024-11-14 10:40:01 +01:00
Alexei Starovoitov
b795379757 bpf: Introduce range_tree data structure and use it in bpf arena
Introduce range_tree data structure and use it in bpf arena to track
ranges of allocated pages. range_tree is a large bitmap that is
implemented as interval tree plus rbtree. The contiguous sequence of
bits represents unallocated pages.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20241108025616.17625-2-alexei.starovoitov@gmail.com
2024-11-13 13:52:45 -08:00
Alexei Starovoitov
8714381703 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR.

In particular to bring the fix in
commit aa30eb3260 ("bpf: Force checkpoint when jmp history is too long").
The follow up verifier work depends on it.
And the fix in
commit 6801cf7890 ("selftests/bpf: Use -4095 as the bad address for bits iterator").
It's fixing instability of BPF CI on s390 arch.

No conflicts.

Adjacent changes in:
Auto-merging arch/Kconfig
Auto-merging kernel/bpf/helpers.c
Auto-merging kernel/bpf/memalloc.c
Auto-merging kernel/bpf/verifier.c
Auto-merging mm/slab_common.c

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-13 12:52:51 -08:00
David Wang
f9ed1f7c2e genirq/proc: Use seq_put_decimal_ull_width() for decimal values
seq_printf() is more expensive than seq_put_decimal_ull_width() due to the
format string parsing costs.

Profiling on a x86 8-core system indicates seq_printf() takes ~47% samples
of show_interrupts(). Replacing it with seq_put_decimal_ull_width() yields
almost 30% performance gain.

[ tglx: Massaged changelog and fixed up coding style ]

Signed-off-by: David Wang <00107082@163.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241108160717.9547-1-00107082@163.com
2024-11-13 17:36:35 +01:00
Xu Kuohai
7c8ce4ffb6 bpf: Add kernel symbol for struct_ops trampoline
Without kernel symbols for struct_ops trampoline, the unwinder may
produce unexpected stacktraces.

For example, the x86 ORC and FP unwinders check if an IP is in kernel
text by verifying the presence of the IP's kernel symbol. When a
struct_ops trampoline address is encountered, the unwinder stops due
to the absence of symbol, resulting in an incomplete stacktrace that
consists only of direct and indirect child functions called from the
trampoline.

The arm64 unwinder is another example. While the arm64 unwinder can
proceed across a struct_ops trampoline address, the corresponding
symbol name is displayed as "unknown", which is confusing.

Thus, add kernel symbol for struct_ops trampoline. The name is
bpf__<struct_ops_name>_<member_name>, where <struct_ops_name> is the
type name of the struct_ops, and <member_name> is the name of
the member that the trampoline is linked to.

Below is a comparison of stacktraces captured on x86 by perf record,
before and after this patch.

Before:
ffffffff8116545d __lock_acquire+0xad ([kernel.kallsyms])
ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms])
ffffffff813088f4 __bpf_prog_enter+0x34 ([kernel.kallsyms])

After:
ffffffff811656bd __lock_acquire+0x30d ([kernel.kallsyms])
ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms])
ffffffff81309024 __bpf_prog_enter+0x34 ([kernel.kallsyms])
ffffffffc000d7e9 bpf__tcp_congestion_ops_cong_avoid+0x3e ([kernel.kallsyms])
ffffffff81f250a5 tcp_ack+0x10d5 ([kernel.kallsyms])
ffffffff81f27c66 tcp_rcv_established+0x3b6 ([kernel.kallsyms])
ffffffff81f3ad03 tcp_v4_do_rcv+0x193 ([kernel.kallsyms])
ffffffff81d65a18 __release_sock+0xd8 ([kernel.kallsyms])
ffffffff81d65af4 release_sock+0x34 ([kernel.kallsyms])
ffffffff81f15c4b tcp_sendmsg+0x3b ([kernel.kallsyms])
ffffffff81f663d7 inet_sendmsg+0x47 ([kernel.kallsyms])
ffffffff81d5ab40 sock_write_iter+0x160 ([kernel.kallsyms])
ffffffff8149c67b vfs_write+0x3fb ([kernel.kallsyms])
ffffffff8149caf6 ksys_write+0xc6 ([kernel.kallsyms])
ffffffff8149cb5d __x64_sys_write+0x1d ([kernel.kallsyms])
ffffffff81009200 x64_sys_call+0x1d30 ([kernel.kallsyms])
ffffffff82232d28 do_syscall_64+0x68 ([kernel.kallsyms])
ffffffff8240012f entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])

Fixes: 85d33df357 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241112145849.3436772-4-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12 17:13:46 -08:00
Xu Kuohai
821a3fa32b bpf: Use function pointers count as struct_ops links count
Only function pointers in a struct_ops structure can be linked to bpf
progs, so set the links count to the function pointers count, instead
of the total members count in the structure.

Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20241112145849.3436772-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12 17:13:46 -08:00
Xu Kuohai
bd9d9b48eb bpf: Remove unused member rcu from bpf_struct_ops_map
The rcu member in bpf_struct_ops_map is not used after commit
b671c2067a ("bpf: Retire the struct_ops map kvalue->refcnt.")

Remove it.

Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20241112145849.3436772-2-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12 17:13:46 -08:00
Yonghong Song
5bd36da1e3 bpf: Support private stack for struct_ops progs
For struct_ops progs, whether a particular prog uses private stack
depends on prog->aux->priv_stack_requested setting before actual
insn-level verification for that prog. One particular implementation
is to piggyback on struct_ops->check_member(). The next patch has
an example for this. The struct_ops->check_member() sets
prog->aux->priv_stack_requested to be true which enables private stack
usage.

The struct_ops prog follows the same rule as kprobe/tracing progs after
function bpf_enable_priv_stack(). For example, even a struct_ops prog
requests private stack, it could still use normal kernel stack if
the stack size is small (< 64 bytes).

Similar to tracing progs, nested same cpu same prog run will be skipped.
A field (recursion_detected()) is added to bpf_prog_aux structure.
If bpf_prog->aux->recursion_detected is implemented by the struct_ops
subsystem and nested same cpu/prog happens, the function will be
triggered to report an error, collect related info, etc.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241112163933.2224962-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12 16:26:25 -08:00
Yonghong Song
e00931c025 bpf: Enable private stack for eligible subprogs
If private stack is used by any subprog, set that subprog
prog->aux->jits_use_priv_stack to be true so later jit can allocate
private stack for that subprog properly.

Also set env->prog->aux->jits_use_priv_stack to be true if
any subprog uses private stack. This is a use case for a
single main prog (no subprogs) to use private stack, and
also a use case for later struct-ops progs where
env->prog->aux->jits_use_priv_stack will enable recursion
check if any subprog uses private stack.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241112163912.2224007-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12 16:26:24 -08:00
Yonghong Song
a76ab5731e bpf: Find eligible subprogs for private stack support
Private stack will be allocated with percpu allocator in jit time.
To avoid complexity at runtime, only one copy of private stack is
available per cpu per prog. So runtime recursion check is necessary
to avoid stack corruption.

Current private stack only supports kprobe/perf_event/tp/raw_tp
which has recursion check in the kernel, and prog types that use
bpf trampoline recursion check. For trampoline related prog types,
currently only tracing progs have recursion checking.

To avoid complexity, all async_cb subprogs use normal kernel stack
including those subprogs used by both main prog subtree and async_cb
subtree. Any prog having tail call also uses kernel stack.

To avoid jit penalty with private stack support, a subprog stack
size threshold is set such that only if the stack size is no less
than the threshold, private stack is supported. The current threshold
is 64 bytes. This avoids jit penality if the stack usage is small.

A useless 'continue' is also removed from a loop in func
check_max_stack_depth().

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20241112163907.2223839-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12 16:26:24 -08:00
Paul E. McKenney
f8ce622ac9 srcu: Check for srcu_read_lock_lite() across all CPUs
If srcu_read_lock_lite() is used on a given srcu_struct structure, then
the grace-period processing must do synchronize_rcu() instead of smp_mb()
between the scans of the ->srcu_unlock_count[] and ->srcu_lock_count[]
counters.  Currently, it does that by testing the SRCU_READ_FLAVOR_LITE
bit of the ->srcu_reader_flavor mask, which works well.  But only if
the CPU running that srcu_struct structure's grace period has previously
executed srcu_read_lock_lite(), which might not be the case, especially
just after that srcu_struct structure has been created and initialized.

This commit therefore updates the srcu_readers_unlock_idx() function
to OR together the ->srcu_reader_flavor masks from all CPUs, and
then make the srcu_readers_active_idx_check() function that test the
SRCU_READ_FLAVOR_LITE bit in the resulting mask.

Note that the srcu_readers_unlock_idx() function is already scanning all
the CPUs to sum up the ->srcu_unlock_count[] fields and that this is on
the grace-period slow path, hence no concerns about the small amount of
extra work.

Reported-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Closes: https://lore.kernel.org/all/d07e8f4a-d5ff-4c8e-8e61-50db285c57e9@amd.com/
Fixes: c0f08d6b5a61 ("srcu: Add srcu_read_lock_lite() and srcu_read_unlock_lite()")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 23:31:28 +01:00
Paul E. McKenney
80e935c8c1 rcutorture: Avoid printing cpu=-1 for no-fault RCU boost failure
If a CPU runs throughout the stalled grace period without passing
through a quiescent state, RCU priority boosting cannot help.
The rcu_torture_boost_failed() function therefore prints a message
flagging the first such CPU.  However, if the stall was instead due to
(for example) RCU's grace-period kthread being starved of CPU, there will
be no such CPU, causing rcu_check_boost_fail() to instead pass back -1
through its cpup CPU-pointer parameter.

Therefore, the current message complains about a mythical CPU -1.

This commit therefore checks for this situation, and notes that all CPUs
have passed through a quiescent state.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 23:05:11 +01:00
Paul E. McKenney
ff9ba8db87 rcuscale: Add guest_os_delay module parameter
This commit adds a guest_os_delay module parameter that extends warm-up
and cool-down the specified number of seconds before and after the series
of test runs.  This allows the data-collection intervals from any given
rcuscale guest OSes to line up with active periods in the other rcuscale
guest OSes, and also allows the thermal warm-up period required to obtain
consistent results from one test to the next.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 23:05:05 +01:00
Paul E. McKenney
046c06f5ba refscale: Correct affinity check
The current affinity check works fine until there are more reader
processes than CPUs, at which point the affinity check is looking for
non-existent CPUs.  This commit therefore applies the same modulus to
the check as is present in the set_cpus_allowed_ptr() call.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 23:04:50 +01:00
Zqiang
2996980e20 rcu/nocb: Fix missed RCU barrier on deoffloading
Currently, running rcutorture test with torture_type=rcu fwd_progress=8
n_barrier_cbs=8 nocbs_nthreads=8 nocbs_toggle=100 onoff_interval=60
test_boost=2, will trigger the following warning:

	WARNING: CPU: 19 PID: 100 at kernel/rcu/tree_nocb.h:1061 rcu_nocb_rdp_deoffload+0x292/0x2a0
	RIP: 0010:rcu_nocb_rdp_deoffload+0x292/0x2a0
	 Call Trace:
	  <TASK>
	  ? __warn+0x7e/0x120
	  ? rcu_nocb_rdp_deoffload+0x292/0x2a0
	  ? report_bug+0x18e/0x1a0
	  ? handle_bug+0x3d/0x70
	  ? exc_invalid_op+0x18/0x70
	  ? asm_exc_invalid_op+0x1a/0x20
	  ? rcu_nocb_rdp_deoffload+0x292/0x2a0
	  rcu_nocb_cpu_deoffload+0x70/0xa0
	  rcu_nocb_toggle+0x136/0x1c0
	  ? __pfx_rcu_nocb_toggle+0x10/0x10
	  kthread+0xd1/0x100
	  ? __pfx_kthread+0x10/0x10
	  ret_from_fork+0x2f/0x50
	  ? __pfx_kthread+0x10/0x10
	  ret_from_fork_asm+0x1a/0x30
	  </TASK>

CPU0                               CPU2                          CPU3
//rcu_nocb_toggle             //nocb_cb_wait                   //rcutorture

// deoffload CPU1             // process CPU1's rdp
rcu_barrier()
    rcu_segcblist_entrain()
        rcu_segcblist_add_len(1);
        // len == 2
        // enqueue barrier
        // callback to CPU1's
        // rdp->cblist
                             rcu_do_batch()
                                 // invoke CPU1's rdp->cblist
                                 // callback
                                 rcu_barrier_callback()
                                                             rcu_barrier()
                                                               mutex_lock(&rcu_state.barrier_mutex);
                                                               // still see len == 2
                                                               // enqueue barrier callback
                                                               // to CPU1's rdp->cblist
                                                               rcu_segcblist_entrain()
                                                                   rcu_segcblist_add_len(1);
                                                                   // len == 3
                                 // decrement len
                                 rcu_segcblist_add_len(-2);
                             kthread_parkme()

// CPU1's rdp->cblist len == 1
// Warn because there is
// still a pending barrier
// trigger warning
WARN_ON_ONCE(rcu_segcblist_n_cbs(&rdp->cblist));
cpus_read_unlock();

                                                                // wait CPU1 to comes online and
                                                                // invoke barrier callback on
                                                                // CPU1 rdp's->cblist
                                                                wait_for_completion(&rcu_state.barrier_completion);
// deoffload CPU4
cpus_read_lock()
  rcu_barrier()
    mutex_lock(&rcu_state.barrier_mutex);
    // block on barrier_mutex
    // wait rcu_barrier() on
    // CPU3 to unlock barrier_mutex
    // but CPU3 unlock barrier_mutex
    // need to wait CPU1 comes online
    // when CPU1 going online will block on cpus_write_lock

The above scenario will not only trigger a WARN_ON_ONCE(), but also
trigger a deadlock.

Thanks to nocb locking, a second racing rcu_barrier() on an offline CPU
will either observe the decremented callback counter down to 0 and spare
the callback enqueue, or rcuo will observe the new callback and keep
rdp->nocb_cb_sleep to false.

Therefore check rdp->nocb_cb_sleep before parking to make sure no
further rcu_barrier() is waiting on the rdp.

Fixes: 1fcb932c8b ("rcu/nocb: Simplify (de-)offloading state machine")
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 22:51:52 +01:00
Uladzislau Rezki (Sony)
a23da88c6c rcu/kvfree: Fix data-race in __mod_timer / kvfree_call_rcu
KCSAN reports a data race when access the krcp->monitor_work.timer.expires
variable in the schedule_delayed_monitor_work() function:

<snip>
BUG: KCSAN: data-race in __mod_timer / kvfree_call_rcu

read to 0xffff888237d1cce8 of 8 bytes by task 10149 on cpu 1:
 schedule_delayed_monitor_work kernel/rcu/tree.c:3520 [inline]
 kvfree_call_rcu+0x3b8/0x510 kernel/rcu/tree.c:3839
 trie_update_elem+0x47c/0x620 kernel/bpf/lpm_trie.c:441
 bpf_map_update_value+0x324/0x350 kernel/bpf/syscall.c:203
 generic_map_update_batch+0x401/0x520 kernel/bpf/syscall.c:1849
 bpf_map_do_batch+0x28c/0x3f0 kernel/bpf/syscall.c:5143
 __sys_bpf+0x2e5/0x7a0
 __do_sys_bpf kernel/bpf/syscall.c:5741 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:5739 [inline]
 __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5739
 x64_sys_call+0x2625/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:322
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

write to 0xffff888237d1cce8 of 8 bytes by task 56 on cpu 0:
 __mod_timer+0x578/0x7f0 kernel/time/timer.c:1173
 add_timer_global+0x51/0x70 kernel/time/timer.c:1330
 __queue_delayed_work+0x127/0x1a0 kernel/workqueue.c:2523
 queue_delayed_work_on+0xdf/0x190 kernel/workqueue.c:2552
 queue_delayed_work include/linux/workqueue.h:677 [inline]
 schedule_delayed_monitor_work kernel/rcu/tree.c:3525 [inline]
 kfree_rcu_monitor+0x5e8/0x660 kernel/rcu/tree.c:3643
 process_one_work kernel/workqueue.c:3229 [inline]
 process_scheduled_works+0x483/0x9a0 kernel/workqueue.c:3310
 worker_thread+0x51d/0x6f0 kernel/workqueue.c:3391
 kthread+0x1d1/0x210 kernel/kthread.c:389
 ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 56 Comm: kworker/u8:4 Not tainted 6.12.0-rc2-syzkaller-00050-g5b7c893ed5ed #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Workqueue: events_unbound kfree_rcu_monitor
<snip>

kfree_rcu_monitor() rearms the work if a "krcp" has to be still
offloaded and this is done without holding krcp->lock, whereas
the kvfree_call_rcu() holds it.

Fix it by acquiring the "krcp->lock" for kfree_rcu_monitor() so
both functions do not race anymore.

Reported-by: syzbot+061d370693bdd99f9d34@syzkaller.appspotmail.com
Link: https://lore.kernel.org/lkml/ZxZ68KmHDQYU0yfD@pc636/T/
Fixes: 8fc5494ad5 ("rcu/kvfree: Move need_offload_krc() out of krcp->lock")
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 21:51:34 +01:00
Michal Schmidt
0ea3acbc80 rcu/srcutiny: don't return before reenabling preemption
Code after the return statement is dead. Enable preemption before
returning from srcu_drive_gp().

This will be important when/if PREEMPT_AUTO (lazy resched) gets merged.

Fixes: 65b4a59557 ("srcu: Make Tiny SRCU explicitly disable preemption")
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 21:45:20 +01:00
Paul E. McKenney
d4e287d7ca rcu-tasks: Remove open-coded one-byte cmpxchg() emulation
This commit removes the open-coded one-byte cmpxchg() emulation from
rcu_trc_cmpxchg_need_qs(), replacing it with just cmpxchg() given the
latter's new-found ability to handle single-byte arguments across all
architectures.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 21:45:14 +01:00
Paul E. McKenney
de2ad0e72c rcutorture: Test start-poll primitives with interrupts disabled
This commit tests the ->start_poll() and ->start_poll_full() functions
with interrupts disabled, but only for RCU variants setting the
->start_poll_irqsoff flag.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 21:44:59 +01:00
Paul E. McKenney
a30763800b rcu: Permit start_poll_synchronize_rcu*() with interrupts disabled
The header comment for both start_poll_synchronize_rcu() and
start_poll_synchronize_rcu_full() state that interrupts must be enabled
when calling these two functions, and there is a lockdep assertion in
start_poll_synchronize_rcu_common() enforcing this restriction.  However,
there is no need for this restrictions, as can be seen in call_rcu(),
which does wakeups when interrupts are disabled.

This commit therefore removes the lockdep assertion and the comments.

Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 21:44:30 +01:00
Paul E. McKenney
481aa5fca0 rcu: Allow short-circuiting of synchronize_rcu_tasks_rude()
There are now architectures for which all deep-idle and entry-exit
functions are properly inlined or marked noinstr.  Such architectures do
not need synchronize_rcu_tasks_rude(), or will not once RCU Tasks has
been modified to pay attention to idle tasks.  This commit therefore
allows a CONFIG_ARCH_HAS_NOINSTR_MARKINGS Kconfig option to turn
synchronize_rcu_tasks_rude() into a no-op.

To facilitate testing, kernels built by rcutorture scripting will enable
RCU Tasks Trace even on systems that do not need it.

[ paulmck: Apply Peter Zijlstra feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 21:44:24 +01:00
Paul E. McKenney
f30e2582a7 rcu: Add rcuog kthreads to RCU_NOCB_CPU help text
The RCU_NOCB_CPU help text currently fails to mention rcuog kthreads,
so this commit adds this information.

Reported-by: Olivier Langlois <olivier@trillion01.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 21:41:08 +01:00
Jinjie Ruan
5d2501f42c rcu: Use the BITS_PER_LONG macro
sizeof(unsigned long) * 8 is the number of bits in an unsigned long
variable, replace it with BITS_PER_LONG macro to make it simpler.

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 21:41:04 +01:00
Hongbo Li
c329120696 rcu: Use bitwise instead of arithmetic operator for flags
This silences the following coccinelle warning:
  WARNING: sum of probable bitmasks, consider |

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 21:40:53 +01:00
Christian Loehle
70d8b6485b sched/cpufreq: Ensure sd is rebuilt for EAS check
Ensure sugov_eas_rebuild_sd() is always called when sugov_init()
succeeds. The out goto initialized sugov without forcing the rebuild.

Previously the missing call to sugov_eas_rebuild_sd() could lead to EAS
not being enabled on boot when it should have been, because it requires
all policies to be controlled by schedutil while they might not have
been initialized yet.

Fixes: e7a1b32e43 ("cpufreq: Rebuild sched-domains when removing cpufreq driver")
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/35e572d9-1152-406a-9e34-2525f7548af9@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-11-12 21:36:51 +01:00
Waiman Long
c4c9cebe2f cgroup/cpuset: Further optimize code if CONFIG_CPUSETS_V1 not set
Currently the cpuset code uses group_subsys_on_dfl() to check if we
are running with cgroup v2. If CONFIG_CPUSETS_V1 isn't set, there is
really no need to do this check and we can optimize out some of the
unneeded v1 specific code paths. Introduce a new cpuset_v2() and use it
to replace the cgroup_subsys_on_dfl() check to further optimize the
code.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-12 09:07:38 -10:00
Waiman Long
a040c35128 cgroup/cpuset: Enforce at most one rebuild_sched_domains_locked() call per operation
Since commit ff0ce721ec ("cgroup/cpuset: Eliminate unncessary
sched domains rebuilds in hotplug"), there is only one
rebuild_sched_domains_locked() call per hotplug operation. However,
writing to the various cpuset control files may still casue more than
one rebuild_sched_domains_locked() call to happen in some cases.

Juri had found that two rebuild_sched_domains_locked() calls in
update_prstate(), one from update_cpumasks_hier() and another one from
update_partition_sd_lb() could cause cpuset partition to be created
with null total_bw for DL tasks. IOW, DL tasks may not be scheduled
correctly in such a partition.

A sample command sequence that can reproduce null total_bw is as
follows.

  # echo Y >/sys/kernel/debug/sched/verbose
  # echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control
  # mkdir /sys/fs/cgroup/test
  # echo 0-7 > /sys/fs/cgroup/test/cpuset.cpus
  # echo 6-7 > /sys/fs/cgroup/test/cpuset.cpus.exclusive
  # echo root >/sys/fs/cgroup/test/cpuset.cpus.partition

Fix this double rebuild_sched_domains_locked() calls problem
by replacing existing calls with cpuset_force_rebuild() except
the rebuild_sched_domains_cpuslocked() call at the end of
cpuset_handle_hotplug(). Checking of the force_sd_rebuild flag is
now done at the end of cpuset_write_resmask() and update_prstate()
to determine if rebuild_sched_domains_locked() should be called or not.

The cpuset v1 code can still call rebuild_sched_domains_locked()
directly as double rebuild_sched_domains_locked() calls is not possible.

Reported-by: Juri Lelli <juri.lelli@redhat.com>
Closes: https://lore.kernel.org/lkml/ZyuUcJDPBln1BK1Y@jlelli-thinkpadt14gen4.remote.csb/
Signed-off-by: Waiman Long <longman@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-12 09:07:09 -10:00
Waiman Long
bcd7012afd cgroup/cpuset: Revert "Allow suppression of sched domain rebuild in update_cpumasks_hier()"
Revert commit 3ae0b77321 ("cgroup/cpuset: Allow suppression of sched
domain rebuild in update_cpumasks_hier()") to allow for an alternative
way to suppress unnecessary rebuild_sched_domains_locked() calls in
update_cpumasks_hier() and elsewhere in a following commit.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-12 09:07:01 -10:00
Colin Ian King
6371b4bc17 tracing: Remove redundant check on field->field in histograms
The check on field->field being true is handled as the first check
on the cascaded if statement, so the later checks on field->field
are redundant because this clause has already been handled. Since
this later check is redundant, just remove it.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241107120530.18728-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-12 11:36:57 -05:00
Paul E. McKenney
6a2c0255e8 refscale: Add srcu_read_lock_lite() support using "srcu-lite"
This commit creates a new srcu-lite option for the refscale.scale_type
module parameter that selects srcu_read_lock_lite() and
srcu_read_unlock_lite().

[ paulmck: Apply Dan Carpenter feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 15:45:02 +01:00
Paul E. McKenney
43349fc4d8 rcutorture: Add srcu_read_lock_lite() support to rcutorture.reader_flavor
This commit causes bit 0x4 of rcutorture.reader_flavor to select the new
srcu_read_lock_lite() and srcu_read_unlock_lite() functions.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 15:44:37 +01:00
Paul E. McKenney
95a5de2154 rcutorture: Add reader_flavor parameter for SRCU readers
This commit adds an rcutorture.reader_flavor parameter whose bits
correspond to reader flavors.  For example, SRCU's readers are 0x1 for
normal and 0x2 for NMI-safe.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 15:44:30 +01:00
Paul E. McKenney
37a1decb43 rcutorture: Expand RCUTORTURE_RDR_MASK_[12] to eight bits
This commit prepares for testing of multiple SRCU reader flavors by
expanding RCUTORTURE_RDR_MASK_1 and RCUTORTURE_RDR_MASK_2 from a single
bit to eight bits, allowing them to accommodate the return values from
multiple calls to srcu_read_lock*().  This will in turn permit better
testing coverage for these SRCU reader flavors, including testing of
the diagnostics for inproper use of mixed reader flavors.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 15:44:19 +01:00
Paul E. McKenney
bb94b12e45 srcu: Allow inlining of __srcu_read_{,un}lock_lite()
This commit moves __srcu_read_lock_lite() and __srcu_read_unlock_lite()
into include/linux/srcu.h and marks them "static inline" so that they
can be inlined into srcu_read_lock_lite() and srcu_read_unlock_lite(),
respectively.  They are not hand-inlined due to Tree SRCU and Tiny SRCU
having different implementations.

The earlier removal of smp_mb() combined with the inlining produce
significant single-percentage performance wins.

Link: https://lore.kernel.org/all/CAEf4BzYgiNmSb=ZKQ65tm6nJDi1UX2Gq26cdHSH1mPwXJYZj5g@mail.gmail.com/

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 15:44:03 +01:00
Paul E. McKenney
6364dd8191 srcu: Add srcu_read_lock_lite() and srcu_read_unlock_lite()
This patch adds srcu_read_lock_lite() and srcu_read_unlock_lite(), which
dispense with the read-side smp_mb() but also are restricted to code
regions that RCU is watching.  If a given srcu_struct structure uses
srcu_read_lock_lite() and srcu_read_unlock_lite(), it is not permitted
to use any other SRCU read-side marker, before, during, or after.

Another price of light-weight readers is heavier weight grace periods.
Such readers mean that SRCU grace periods on srcu_struct structures
used by light-weight readers will incur at least two calls to
synchronize_rcu().  In addition, normal SRCU grace periods for
light-weight-reader srcu_struct structures never auto-expedite.
Note that expedited SRCU grace periods for light-weight-reader
srcu_struct structures still invoke synchronize_rcu(), not
synchronize_srcu_expedited().  Something about wishing to keep
the IPIs down to a dull roar.

The srcu_read_lock_lite() and srcu_read_unlock_lite() functions may not
(repeat, *not*) be used from NMI handlers, but if this is needed, an
additional flavor of SRCU reader can be added by some future commit.

[ paulmck: Apply Alexei Starovoitov expediting feedback. ]
[ paulmck: Apply kernel test robot feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: kernel test robot <oliver.sang@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 15:43:34 +01:00
Paul E. McKenney
05829be27f srcu: Create CPP macros for normal and NMI-safe SRCU readers
This commit creates SRCU_READ_FLAVOR_NORMAL and SRCU_READ_FLAVOR_NMI
C-preprocessor macros for srcu_read_lock() and srcu_read_lock_nmisafe(),
respectively.  These replace the old true/false values that were
previously passed to srcu_check_read_flavor().  In addition, the
srcu_check_read_flavor() function itself requires a bit of rework to
handle bitmasks instead of true/false values.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 15:43:21 +01:00
Paul E. McKenney
9a87bda2b6 srcu: Standardize srcu_data pointers to "sdp" and similar
This commit changes a few "cpuc" variables to "sdp" to align with usage
elsewhere.

[ paulmck: Apply Neeraj Upadhyay feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 15:42:41 +01:00
Paul E. McKenney
c2f9467c77 srcu: Bit manipulation changes for additional reader flavor
Currently, there are only two flavors of readers, normal and NMI-safe.
Very straightforward state updates suffice to check for erroneous
mixing of reader flavors on a given srcu_struct structure.  This commit
upgrades the checking in preparation for the addition of light-weight
(as in memory-barrier-free) readers.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 15:42:20 +01:00
Paul E. McKenney
365f34483b srcu: Renaming in preparation for additional reader flavor
Currently, there are only two flavors of readers, normal and NMI-safe.
A number of fields, functions, and types reflect this restriction.
This renaming-only commit prepares for the addition of light-weight
(as in memory-barrier-free) readers.  OK, OK, there is also a drive-by
white-space fixeup!

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-12 15:41:30 +01:00
zhangguopeng
45dac1959b kernel/reboot: replace sprintf() with sysfs_emit()
As Documentation/filesystems/sysfs.rst suggested, show() should only use
sysfs_emit() or sysfs_emit_at() when formatting the value to be returned
to user space.

No functional change intended.

Link: https://lkml.kernel.org/r/20241105094941.33739-1-zhangguopeng@kylinos.cn
Signed-off-by: zhangguopeng <zhangguopeng@kylinos.cn>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Fabio Estevam <festevam@denx.de>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11 17:17:05 -08:00
Lance Yang
03ecb24db2 hung_task: add detect count for hung tasks
Patch series "add detect count for hung tasks", v2.

This patchset adds a counter, hung_task_detect_count, to track the number
of times hung tasks are detected.  

IHMO, hung tasks are a critical metric.  Currently, we detect them by
periodically parsing dmesg.  However, this method isn't as user-friendly
as using a counter.

Sometimes, a short-lived issue with NIC or hard drive can quickly decrease
the hung_task_warnings to zero.  Without warnings, we must directly access
the node to ensure that there are no more hung tasks and that the system
has recovered.  After all, load average alone cannot provide a clear
picture.

Once this counter is in place, in a high-density deployment pattern, we
plan to set hung_task_timeout_secs to a lower number to improve stability,
even though this might result in false positives.  And then we can set a
time-based threshold: if hung tasks last beyond this duration, we will
automatically migrate containers to other nodes.  Based on past
experience, this approach could help avoid many production disruptions.

Moreover, just like other important events such as OOM that already have
counters, having a dedicated counter for hung tasks makes sense ;)


This patch (of 2):

This commit adds a counter, hung_task_detect_count, to track the number of
times hung tasks are detected.

IHMO, hung tasks are a critical metric. Currently, we detect them by
periodically parsing dmesg. However, this method isn't as user-friendly as
using a counter.

Sometimes, a short-lived issue with NIC or hard drive can quickly decrease
the hung_task_warnings to zero. Without warnings, we must directly access
the node to ensure that there are no more hung tasks and that the system
has recovered. After all, load average alone cannot provide a clear
picture.

Once this counter is in place, in a high-density deployment pattern, we
plan to set hung_task_timeout_secs to a lower number to improve stability,
even though this might result in false positives. And then we can set a
time-based threshold: if hung tasks last beyond this duration, we will
automatically migrate containers to other nodes. Based on past experience,
this approach could help avoid many production disruptions.

Moreover, just like other important events such as OOM that already have
counters, having a dedicated counter for hung tasks makes sense.

[ioworker0@gmail.com: proc_doulongvec_minmax instead of proc_dointvec]
  Link: https://lkml.kernel.org/r/20241101114833.8377-1-ioworker0@gmail.com
Link: https://lkml.kernel.org/r/20241027120747.42833-1-ioworker0@gmail.com
Link: https://lkml.kernel.org/r/20241027120747.42833-2-ioworker0@gmail.com
Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
Signed-off-by: Lance Yang <ioworker0@gmail.com>
Cc: Bang Li <libang.li@antgroup.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Cun <cunhuang@tencent.com>
Cc: Joel Granados <j.granados@samsung.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Siddle <jsiddle@redhat.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Yongliang Gao <leonylgao@tencent.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11 17:17:03 -08:00
Linus Torvalds
3022e9d00e sched_ext: Fixes for v6.12-rc7
- The fair sched class currently has a bug where its balance() returns true
   telling the sched core that it has tasks to run but then NULL from
   pick_task(). This makes sched core call sched_ext's pick_task() without
   preceding balance() which can lead to stalls in partial mode. For now,
   work around by detecting the condition and forcing the CPU to go through
   another scheduling cycle.
 
 - Add a missing newline to an error message and fix drgn introspection tool
   which went out of sync.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZzI8sw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGb5KAP40b/o6TyAFDG+Hn6GxyxQT7rcAUMXsdB2bcEpg
 /IjmzQEAwbHU5KP5vQXV6XHv+2V7Rs7u6ZqFtDnL88N0A9hf3wk=
 =7hL8
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - The fair sched class currently has a bug where its balance() returns
   true telling the sched core that it has tasks to run but then NULL
   from pick_task(). This makes sched core call sched_ext's pick_task()
   without preceding balance() which can lead to stalls in partial mode.

   For now, work around by detecting the condition and forcing the CPU
   to go through another scheduling cycle.

 - Add a missing newline to an error message and fix drgn introspection
   tool which went out of sync.

* tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
  sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type
  sched_ext: Add a missing newline at the end of an error message
2024-11-11 14:09:57 -08:00
Oliver Upton
7ccd615bc6 Merge branch kvm-arm64/psci-1.3 into kvmarm/next
* kvm-arm64/psci-1.3:
  : PSCI v1.3 support, courtesy of David Woodhouse
  :
  : Bump KVM's PSCI implementation up to v1.3, with the added bonus of
  : implementing the SYSTEM_OFF2 call. Like other system-scoped PSCI calls,
  : this gets relayed to userspace for further processing with a new
  : KVM_SYSTEM_EVENT_SHUTDOWN flag.
  :
  : As an added bonus, implement client-side support for hibernation with
  : the SYSTEM_OFF2 call.
  arm64: Use SYSTEM_OFF2 PSCI call to power off for hibernate
  KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call
  KVM: selftests: Add test for PSCI SYSTEM_OFF2
  KVM: arm64: Add support for PSCI v1.2 and v1.3
  KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation
  firmware/psci: Add definitions for PSCI v1.3 specification

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 18:36:46 +00:00
Tejun Heo
5cbb302880 sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
In sched_ext API, a repeatedly reported pain point is the overuse of the
verb "dispatch" and confusion around "consume":

- ops.dispatch()
- scx_bpf_dispatch[_vtime]()
- scx_bpf_consume()
- scx_bpf_dispatch[_vtime]_from_dsq*()

This overloading of the term is historical. Originally, there were only
built-in DSQs and moving a task into a DSQ always dispatched it for
execution. Using the verb "dispatch" for the kfuncs to move tasks into these
DSQs made sense.

Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be
able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was
from a non-local DSQ to a local DSQ and this operation was named "consume".
This was already confusing as a task could be dispatched to a user DSQ from
ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch().
Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse
as "dispatch" in this context meant moving a task to an arbitrary DSQ from a
user DSQ.

Clean up the API with the following renames:

1. scx_bpf_dispatch[_vtime]()		-> scx_bpf_dsq_insert[_vtime]()
2. scx_bpf_consume()			-> scx_bpf_dsq_move_to_local()
3. scx_bpf_dispatch[_vtime]_from_dsq*()	-> scx_bpf_dsq_move[_vtime]*()

This patch performs the third set of renames. Compatibility is maintained
by:

- The previous kfunc names are still provided by the kernel so that old
  binaries can run. Kernel generates a warning when the old names are used.

- compat.bpf.h provides wrappers for the new names which automatically fall
  back to the old names when running on older kernels. They also trigger
  build error if old names are used for new builds.

- scx_bpf_dispatch[_vtime]_from_dsq*() were already wrapped in __COMPAT
  macros as they were introduced during v6.12 cycle. Wrap new API in
  __COMPAT macros too and trigger build errors on both __COMPAT prefixed and
  naked usages of the old names.

The compat features will be dropped after v6.15.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Johannes Bechberger <me@mostlynerdless.de>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Ming Yang <yougmark94@gmail.com>
2024-11-11 07:06:16 -10:00
Tejun Heo
5209c03c8e sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()
In sched_ext API, a repeatedly reported pain point is the overuse of the
verb "dispatch" and confusion around "consume":

- ops.dispatch()
- scx_bpf_dispatch[_vtime]()
- scx_bpf_consume()
- scx_bpf_dispatch[_vtime]_from_dsq*()

This overloading of the term is historical. Originally, there were only
built-in DSQs and moving a task into a DSQ always dispatched it for
execution. Using the verb "dispatch" for the kfuncs to move tasks into these
DSQs made sense.

Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be
able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was
from a non-local DSQ to a local DSQ and this operation was named "consume".
This was already confusing as a task could be dispatched to a user DSQ from
ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch().
Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse
as "dispatch" in this context meant moving a task to an arbitrary DSQ from a
user DSQ.

Clean up the API with the following renames:

1. scx_bpf_dispatch[_vtime]()		-> scx_bpf_dsq_insert[_vtime]()
2. scx_bpf_consume()			-> scx_bpf_dsq_move_to_local()
3. scx_bpf_dispatch[_vtime]_from_dsq*()	-> scx_bpf_dsq_move[_vtime]*()

This patch performs the second rename. Compatibility is maintained by:

- The previous kfunc names are still provided by the kernel so that old
  binaries can run. Kernel generates a warning when the old names are used.

- compat.bpf.h provides wrappers for the new names which automatically fall
  back to the old names when running on older kernels. They also trigger
  build error if old names are used for new builds.

The compat features will be dropped after v6.15.

v2: Comment and documentation updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Johannes Bechberger <me@mostlynerdless.de>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Ming Yang <yougmark94@gmail.com>
2024-11-11 07:06:16 -10:00
Tejun Heo
cc26abb1a1 sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()
In sched_ext API, a repeatedly reported pain point is the overuse of the
verb "dispatch" and confusion around "consume":

- ops.dispatch()
- scx_bpf_dispatch[_vtime]()
- scx_bpf_consume()
- scx_bpf_dispatch[_vtime]_from_dsq*()

This overloading of the term is historical. Originally, there were only
built-in DSQs and moving a task into a DSQ always dispatched it for
execution. Using the verb "dispatch" for the kfuncs to move tasks into these
DSQs made sense.

Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be
able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was
from a non-local DSQ to a local DSQ and this operation was named "consume".
This was already confusing as a task could be dispatched to a user DSQ from
ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch().
Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse
as "dispatch" in this context meant moving a task to an arbitrary DSQ from a
user DSQ.

Clean up the API with the following renames:

1. scx_bpf_dispatch[_vtime]()		-> scx_bpf_dsq_insert[_vtime]()
2. scx_bpf_consume()			-> scx_bpf_dsq_move_to_local()
3. scx_bpf_dispatch[_vtime]_from_dsq*()	-> scx_bpf_dsq_move[_vtime]*()

This patch performs the first set of renames. Compatibility is maintained
by:

- The previous kfunc names are still provided by the kernel so that old
  binaries can run. Kernel generates a warning when the old names are used.

- compat.bpf.h provides wrappers for the new names which automatically fall
  back to the old names when running on older kernels. They also trigger
  build error if old names are used for new builds.

The compat features will be dropped after v6.15.

v2: Documentation updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Johannes Bechberger <me@mostlynerdless.de>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Ming Yang <yougmark94@gmail.com>
2024-11-11 07:06:16 -10:00
Kumar Kartikeya Dwivedi
ae6e3a273f bpf: Drop special callback reference handling
Logic to prevent callbacks from acquiring new references for the program
(i.e. leaving acquired references), and releasing caller references
(i.e. those acquired in parent frames) was introduced in commit
9d9d00ac29 ("bpf: Fix reference state management for synchronous callbacks").

This was necessary because back then, the verifier simulated each
callback once (that could potentially be executed N times, where N can
be zero). This meant that callbacks that left lingering resources or
cleared caller resources could do it more than once, operating on
undefined state or leaking memory.

With the fixes to callback verification in commit
ab5cfac139 ("bpf: verify callbacks as if they are called unknown number of times"),
all of this extra logic is no longer necessary. Hence, drop it as part
of this commit.

Cc: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241109231430.2475236-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11 08:18:55 -08:00
Kumar Kartikeya Dwivedi
f6b9a69a9e bpf: Refactor active lock management
When bpf_spin_lock was introduced originally, there was deliberation on
whether to use an array of lock IDs, but since bpf_spin_lock is limited
to holding a single lock at any given time, we've been using a single ID
to identify the held lock.

In preparation for introducing spin locks that can be taken multiple
times, introduce support for acquiring multiple lock IDs. For this
purpose, reuse the acquired_refs array and store both lock and pointer
references. We tag the entry with REF_TYPE_PTR or REF_TYPE_LOCK to
disambiguate and find the relevant entry. The ptr field is used to track
the map_ptr or btf (for bpf_obj_new allocations) to ensure locks can be
matched with protected fields within the same "allocation", i.e.
bpf_obj_new object or map value.

The struct active_lock is changed to an int as the state is part of the
acquired_refs array, and we only need active_lock as a cheap way of
detecting lock presence.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241109231430.2475236-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11 08:18:51 -08:00
Hou Tao
b9e9ed90b1 bpf: Call free_htab_elem() after htab_unlock_bucket()
For htab of maps, when the map is removed from the htab, it may hold the
last reference of the map. bpf_map_fd_put_ptr() will invoke
bpf_map_free_id() to free the id of the removed map element. However,
bpf_map_fd_put_ptr() is invoked while holding a bucket lock
(raw_spin_lock_t), and bpf_map_free_id() attempts to acquire map_idr_lock
(spinlock_t), triggering the following lockdep warning:

  =============================
  [ BUG: Invalid wait context ]
  6.11.0-rc4+ #49 Not tainted
  -----------------------------
  test_maps/4881 is trying to lock:
  ffffffff84884578 (map_idr_lock){+...}-{3:3}, at: bpf_map_free_id.part.0+0x21/0x70
  other info that might help us debug this:
  context-{5:5}
  2 locks held by test_maps/4881:
   #0: ffffffff846caf60 (rcu_read_lock){....}-{1:3}, at: bpf_fd_htab_map_update_elem+0xf9/0x270
   #1: ffff888149ced148 (&htab->lockdep_key#2){....}-{2:2}, at: htab_map_update_elem+0x178/0xa80
  stack backtrace:
  CPU: 0 UID: 0 PID: 4881 Comm: test_maps Not tainted 6.11.0-rc4+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x6e/0xb0
   dump_stack+0x10/0x20
   __lock_acquire+0x73e/0x36c0
   lock_acquire+0x182/0x450
   _raw_spin_lock_irqsave+0x43/0x70
   bpf_map_free_id.part.0+0x21/0x70
   bpf_map_put+0xcf/0x110
   bpf_map_fd_put_ptr+0x9a/0xb0
   free_htab_elem+0x69/0xe0
   htab_map_update_elem+0x50f/0xa80
   bpf_fd_htab_map_update_elem+0x131/0x270
   htab_map_update_elem+0x50f/0xa80
   bpf_fd_htab_map_update_elem+0x131/0x270
   bpf_map_update_value+0x266/0x380
   __sys_bpf+0x21bb/0x36b0
   __x64_sys_bpf+0x45/0x60
   x64_sys_call+0x1b2a/0x20d0
   do_syscall_64+0x5d/0x100
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

One way to fix the lockdep warning is using raw_spinlock_t for
map_idr_lock as well. However, bpf_map_alloc_id() invokes
idr_alloc_cyclic() after acquiring map_idr_lock, it will trigger a
similar lockdep warning because the slab's lock (s->cpu_slab->lock) is
still a spinlock.

Instead of changing map_idr_lock's type, fix the issue by invoking
htab_put_fd_value() after htab_unlock_bucket(). However, only deferring
the invocation of htab_put_fd_value() is not enough, because the old map
pointers in htab of maps can not be saved during batched deletion.
Therefore, also defer the invocation of free_htab_elem(), so these
to-be-freed elements could be linked together similar to lru map.

There are four callers for ->map_fd_put_ptr:

(1) alloc_htab_elem() (through htab_put_fd_value())
It invokes ->map_fd_put_ptr() under a raw_spinlock_t. The invocation of
htab_put_fd_value() can not simply move after htab_unlock_bucket(),
because the old element has already been stashed in htab->extra_elems.
It may be reused immediately after htab_unlock_bucket() and the
invocation of htab_put_fd_value() after htab_unlock_bucket() may release
the newly-added element incorrectly. Therefore, saving the map pointer
of the old element for htab of maps before unlocking the bucket and
releasing the map_ptr after unlock. Beside the map pointer in the old
element, should do the same thing for the special fields in the old
element as well.

(2) free_htab_elem() (through htab_put_fd_value())
Its caller includes __htab_map_lookup_and_delete_elem(),
htab_map_delete_elem() and __htab_map_lookup_and_delete_batch().

For htab_map_delete_elem(), simply invoke free_htab_elem() after
htab_unlock_bucket(). For __htab_map_lookup_and_delete_batch(), just
like lru map, linking the to-be-freed element into node_to_free list
and invoking free_htab_elem() for these element after unlock. It is safe
to reuse batch_flink as the link for node_to_free, because these
elements have been removed from the hash llist.

Because htab of maps doesn't support lookup_and_delete operation,
__htab_map_lookup_and_delete_elem() doesn't have the problem, so kept
it as is.

(3) fd_htab_map_free()
It invokes ->map_fd_put_ptr without raw_spinlock_t.

(4) bpf_fd_htab_map_update_elem()
It invokes ->map_fd_put_ptr without raw_spinlock_t.

After moving free_htab_elem() outside htab bucket lock scope, using
pcpu_freelist_push() instead of __pcpu_freelist_push() to disable
the irq before freeing elements, and protecting the invocations of
bpf_mem_cache_free() with migrate_{disable|enable} pair.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241106063542.357743-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11 08:18:30 -08:00
Jiri Olsa
99b403d206 bpf: Add support for uprobe multi session context
Placing bpf_session_run_ctx layer in between bpf_run_ctx and
bpf_uprobe_multi_run_ctx, so the session data can be retrieved
from uprobe_multi link.

Plus granting session kfuncs access to uprobe session programs.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-5-jolsa@kernel.org
2024-11-11 08:18:04 -08:00
Jiri Olsa
d920179b3d bpf: Add support for uprobe multi session attach
Adding support to attach BPF program for entry and return probe
of the same function. This is common use case which at the moment
requires to create two uprobe multi links.

Adding new BPF_TRACE_UPROBE_SESSION attach type that instructs
kernel to attach single link program to both entry and exit probe.

It's possible to control execution of the BPF program on return
probe simply by returning zero or non zero from the entry BPF
program execution to execute or not the BPF program on return
probe respectively.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-4-jolsa@kernel.org
2024-11-11 08:18:03 -08:00
Jiri Olsa
f505005bc7 bpf: Force uprobe bpf program to always return 0
As suggested by Andrii make uprobe multi bpf programs to always return 0,
so they can't force uprobe removal.

Keeping the int return type for uprobe_prog_run, because it will be used
in following session changes.

Fixes: 89ae89f53d ("bpf: Add multi uprobe link")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-3-jolsa@kernel.org
2024-11-11 08:18:00 -08:00
Jiri Olsa
17c4b65a24 bpf: Allow return values 0 and 1 for kprobe session
The kprobe session program can return only 0 or 1,
instruct verifier to check for that.

Fixes: 535a3692ba ("bpf: Add support for kprobe session attach")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-2-jolsa@kernel.org
2024-11-11 08:17:57 -08:00
Marcos Paulo de Souza
ed76c07c68 printk: Introduce FORCE_CON flag
Introduce FORCE_CON flag to printk. The new flag will make it possible to
create a context where printk messages will never be suppressed.

This mechanism will be used in the next patch to create a force_con
context on sysrq handling, removing an existing workaround on the
loglevel global variable. The workaround existed to make sure that sysrq
header messages were sent to all consoles, but this doesn't work with
deferred messages because the loglevel might be restored to its original
value before a console flushes the messages.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20241105-printk-loud-con-v2-1-bd3ecdf7b0e4@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-11-11 12:53:31 +01:00
Vinicius Costa Gomes
49dffdfde4 cred: Add a light version of override/revert_creds()
Add a light version of override/revert_creds(), this should only be
used when the credentials in question will outlive the critical
section and the critical section doesn't change the ->usage of the
credentials.

Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-11 10:45:04 +01:00
Andrew Morton
2ec0859039 Merge branch 'mm-hotfixes-stable' into mm-stable
Pick up e7ac4daeed ("mm: count zeromap read and set for swapout and
swapin") in order to move

mm: define obj_cgroup_get() if CONFIG_MEMCG is not defined
mm: zswap: modify zswap_compress() to accept a page instead of a folio
mm: zswap: rename zswap_pool_get() to zswap_pool_tryget()
mm: zswap: modify zswap_stored_pages to be atomic_long_t
mm: zswap: support large folios in zswap_store()
mm: swap: count successful large folio zswap stores in hugepage zswpout stats
mm: zswap: zswap_store_page() will initialize entry after adding to xarray.
mm: add per-order mTHP swpin counters

from mm-unstable into mm-stable.
2024-11-11 00:04:10 -08:00
Linus Torvalds
28e43197c4 20 hotfixes, 14 of which are cc:stable.
Three affect DAMON.  Lorenzo's five-patch series to address the
 mmap_region error handling is here also.
 
 Apart from that, various singletons.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzBVmAAKCRDdBJ7gKXxA
 ju42AQD0EEnzW+zFyI+E7x5FwCmLL6ofmzM8Sw9YrKjaeShdZgEAhcyS2Rc/AaJq
 Uty2ZvVMDF2a9p9gqHfKKARBXEbN2w0=
 =n+lO
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "20 hotfixes, 14 of which are cc:stable.

  Three affect DAMON. Lorenzo's five-patch series to address the
  mmap_region error handling is here also.

  Apart from that, various singletons"

* tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mailmap: add entry for Thorsten Blum
  ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove()
  signal: restore the override_rlimit logic
  fs/proc: fix compile warning about variable 'vmcore_mmap_ops'
  ucounts: fix counter leak in inc_rlimit_get_ucounts()
  selftests: hugetlb_dio: check for initial conditions to skip in the start
  mm: fix docs for the kernel parameter ``thp_anon=``
  mm/damon/core: avoid overflow in damon_feed_loop_next_input()
  mm/damon/core: handle zero schemes apply interval
  mm/damon/core: handle zero {aggregation,ops_update} intervals
  mm/mlock: set the correct prev on failure
  objpool: fix to make percpu slot allocation more robust
  mm/page_alloc: keep track of free highatomic
  mm: resolve faulty mmap_region() error path behaviour
  mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
  mm: refactor map_deny_write_exec()
  mm: unconditionally close VMAs on error
  mm: avoid unsafe VMA hook invocation when error arises on mmap hook
  mm/thp: fix deferred split unqueue naming and locking
  mm/thp: fix deferred split queue not partially_mapped
2024-11-10 09:04:27 -08:00
Zicheng Qu
e45f0ab6ee padata: Clean up in padata_do_multithreaded()
In commit 24cc57d8fa ("padata: Honor the caller's alignment in case of
chunk_size 0"), the line 'ps.chunk_size = max(ps.chunk_size, 1ul)' was
added, making 'ps.chunk_size = 1U' redundant and never executed.

Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-11-10 11:50:54 +08:00
Tejun Heo
a6250aa251 sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
sched_ext dispatches tasks from the BPF scheduler from balance_scx() and
thus every pick_task_scx() call must be preceded by balance_scx(). While
this usually holds, due to a bug, there are cases where the fair class's
balance() returns true indicating that it has tasks to run on the CPU and
thus terminating balance() calls but fails to actually find the next task to
run when pick_task() is called. In such cases, pick_task_scx() can be called
without preceding balance_scx().

Detect this condition using SCX_RQ_BAL_PENDING flags. If detected, keep
running the previous task if possible and avoid stalling from entering idle
without balancing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/Ztj_h5c2LYsdXYbA@slm.duckdns.org
2024-11-09 10:43:55 -10:00
Tejun Heo
72b85bf6a7 sched_ext: scx_bpf_dispatch_from_dsq_set_*() are allowed from unlocked context
4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
added four kfuncs for dispatching while iterating. They are allowed from the
dispatch and unlocked contexts but two of the kfuncs were only added in the
dispatch section. Add missing declarations in the unlocked section.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
2024-11-09 09:40:25 -10:00
Sebastian Andrzej Siewior
4788c861ad scftorture: Use a lock-less list to free memory.
scf_handler() is used as a SMP function call. This function is always
invoked in IRQ-context even with forced-threading enabled. This function
frees memory which not allowed on PREEMPT_RT because the locking
underneath is using sleeping locks.

Add a per-CPU scf_free_pool where each SMP functions adds its memory to
be freed. This memory is then freed by scftorture_invoker() on each
iteration. On the majority of invocations the number of items is less
than five. If the thread sleeps/ gets delayed the number exceed 350 but
did not reach 400 in testing. These were the spikes during testing.
The bulk free of 64 pointers at once should improve the give-back if the
list grows. The list size is ~1.3 items per invocations.

Having one global scf_free_pool with one cleaning thread let the list
grow to over 10.000 items with 32 CPUs (again, spikes not the average)
especially if the CPU went to sleep. The per-CPU part looks like a good
compromise.

Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Closes: https://lore.kernel.org/lkml/41619255-cdc2-4573-a360-7794fc3614f7@paulmck-laptop/
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Tested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-11-09 09:00:46 -08:00
Sebastian Andrzej Siewior
64bdaf963c scftorture: Move memory allocation outside of preempt_disable region.
Memory allocations can not happen within regions with explicit disabled
preemption PREEMPT_RT. The problem is that the locking structures
underneath are sleeping locks.

Move the memory allocation outside of the preempt-disabled section. Keep
the GFP_ATOMIC for the allocation to behave like a "ememergncy
allocation".

Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Tested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-11-09 09:00:46 -08:00
Sebastian Andrzej Siewior
43082cd579 scftorture: Wait until scf_cleanup_handler() completes.
The smp_call_function() needs to be invoked with the wait flag set to
wait until scf_cleanup_handler() is done. This ensures that all SMP
function calls, that have been queued earlier, complete at this point.

Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Tested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-11-09 09:00:46 -08:00
Sebastian Andrzej Siewior
42eeb3b573 scftorture: Avoid additional div operation.
Replace "scfp->cpu % nr_cpu_ids" with "cpu". This has been computed
earlier.

Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Tested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-11-09 09:00:46 -08:00
Changwoo Min
f39489fea6 sched_ext: add a missing rcu_read_lock/unlock pair at scx_select_cpu_dfl()
When getting an LLC CPU mask in the default CPU selection policy,
scx_select_cpu_dfl(), a pointer to the sched_domain is dereferenced
using rcu_read_lock() without holding rcu_read_lock(). Such an unprotected
dereference often causes the following warning and can cause an invalid
memory access in the worst case.

Therefore, protect dereference of a sched_domain pointer using a pair
of rcu_read_lock() and unlock().

[   20.996135] =============================
[   20.996345] WARNING: suspicious RCU usage
[   20.996563] 6.11.0-virtme #17 Tainted: G        W
[   20.996576] -----------------------------
[   20.996576] kernel/sched/ext.c:3323 suspicious rcu_dereference_check() usage!
[   20.996576]
[   20.996576] other info that might help us debug this:
[   20.996576]
[   20.996576]
[   20.996576] rcu_scheduler_active = 2, debug_locks = 1
[   20.996576] 4 locks held by kworker/8:1/140:
[   20.996576]  #0: ffff8b18c00dd348 ((wq_completion)pm){+.+.}-{0:0}, at: process_one_work+0x4a0/0x590
[   20.996576]  #1: ffffb3da01f67e58 ((work_completion)(&dev->power.work)){+.+.}-{0:0}, at: process_one_work+0x1ba/0x590
[   20.996576]  #2: ffffffffa316f9f0 (&rcu_state.gp_wq){..-.}-{2:2}, at: swake_up_one+0x15/0x60
[   20.996576]  #3: ffff8b1880398a60 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x59/0x7d0
[   20.996576]
[   20.996576] stack backtrace:
[   20.996576] CPU: 8 UID: 0 PID: 140 Comm: kworker/8:1 Tainted: G        W          6.11.0-virtme #17
[   20.996576] Tainted: [W]=WARN
[   20.996576] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   20.996576] Workqueue: pm pm_runtime_work
[   20.996576] Sched_ext: simple (disabling+all), task: runnable_at=-6ms
[   20.996576] Call Trace:
[   20.996576]  <IRQ>
[   20.996576]  dump_stack_lvl+0x6f/0xb0
[   20.996576]  lockdep_rcu_suspicious.cold+0x4e/0x96
[   20.996576]  scx_select_cpu_dfl+0x234/0x260
[   20.996576]  select_task_rq_scx+0xfb/0x190
[   20.996576]  select_task_rq+0x47/0x110
[   20.996576]  try_to_wake_up+0x110/0x7d0
[   20.996576]  swake_up_one+0x39/0x60
[   20.996576]  rcu_core+0xb08/0xe50
[   20.996576]  ? srso_alias_return_thunk+0x5/0xfbef5
[   20.996576]  ? mark_held_locks+0x40/0x70
[   20.996576]  handle_softirqs+0xd3/0x410
[   20.996576]  irq_exit_rcu+0x78/0xa0
[   20.996576]  sysvec_apic_timer_interrupt+0x73/0x80
[   20.996576]  </IRQ>
[   20.996576]  <TASK>
[   20.996576]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   20.996576] RIP: 0010:_raw_spin_unlock_irqrestore+0x36/0x70
[   20.996576] Code: f5 53 48 8b 74 24 10 48 89 fb 48 83 c7 18 e8 11 b4 36 ff 48 89 df e8 99 0d 37 ff f7 c5 00 02 00 00 75 17 9c 58 f6 c4 02 75 2b <65> ff 0d 5b 55 3c 5e 74 16 5b 5d e9 95 8e 28 00 e8 a5 ee 44 ff 9c
[   20.996576] RSP: 0018:ffffb3da01f67d20 EFLAGS: 00000246
[   20.996576] RAX: 0000000000000002 RBX: ffffffffa4640220 RCX: 0000000000000040
[   20.996576] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa1c7b27b
[   20.996576] RBP: 0000000000000246 R08: 0000000000000001 R09: 0000000000000000
[   20.996576] R10: 0000000000000001 R11: 000000000000021c R12: 0000000000000246
[   20.996576] R13: ffff8b1881363958 R14: 0000000000000000 R15: ffff8b1881363800
[   20.996576]  ? _raw_spin_unlock_irqrestore+0x4b/0x70
[   20.996576]  serial_port_runtime_resume+0xd4/0x1a0
[   20.996576]  ? __pfx_serial_port_runtime_resume+0x10/0x10
[   20.996576]  __rpm_callback+0x44/0x170
[   20.996576]  ? __pfx_serial_port_runtime_resume+0x10/0x10
[   20.996576]  rpm_callback+0x55/0x60
[   20.996576]  ? __pfx_serial_port_runtime_resume+0x10/0x10
[   20.996576]  rpm_resume+0x582/0x7b0
[   20.996576]  pm_runtime_work+0x7c/0xb0
[   20.996576]  process_one_work+0x1fb/0x590
[   20.996576]  worker_thread+0x18e/0x350
[   20.996576]  ? __pfx_worker_thread+0x10/0x10
[   20.996576]  kthread+0xe2/0x110
[   20.996576]  ? __pfx_kthread+0x10/0x10
[   20.996576]  ret_from_fork+0x34/0x50
[   20.996576]  ? __pfx_kthread+0x10/0x10
[   20.996576]  ret_from_fork_asm+0x1a/0x30
[   20.996576]  </TASK>
[   21.056592] sched_ext: BPF scheduler "simple" disabled (unregistered from user space)

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-09 05:47:00 -10:00
Changwoo Min
153591f703 sched_ext: Clarify sched_ext_ops table for userland scheduler
Update the comments in sched_ext_ops to clarify this table is for
a BPF scheduler and a userland scheduler should also rely on the
sched_ext_ops table through the BPF scheduler.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-08 16:40:27 -10:00
Tejun Heo
e32c260195 sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
On 2 x Intel Sapphire Rapids machines with 224 logical CPUs, a poorly
behaving BPF scheduler can live-lock the system by making multiple CPUs bang
on the same DSQ to the point where soft-lockup detection triggers before
SCX's own watchdog can take action. It also seems possible that the machine
can be live-locked enough to prevent scx_ops_helper, which is an RT task,
from running in a timely manner.

Implement scx_softlockup() which is called when three quarters of
soft-lockup threshold has passed. The function immediately enables the ops
breather and triggers an ops error to initiate ejection of the BPF
scheduler.

The previous and this patch combined enable the kernel to reliably recover
the system from live-lock conditions that can be triggered by a poorly
behaving BPF scheduler on Intel dual socket systems.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
2024-11-08 10:42:22 -10:00
Tejun Heo
62dcbab8b0 sched_ext: Avoid live-locking bypass mode switching
A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly
banging on the same DSQ on a large NUMA system to the point where switching
to the bypass mode can take a long time. Turning on the bypass mode requires
dequeueing and re-enqueueing currently runnable tasks, if the DSQs that they
are on are live-locked, this can take tens of seconds cascading into other
failures. This was observed on 2 x Intel Sapphire Rapids machines with 224
logical CPUs.

Inject artifical delays while the bypass mode is switching to guarantee
timely completion.

While at it, move __scx_ops_bypass_lock into scx_ops_bypass() and rename it
to bypass_lock.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Valentin Andrei <vandrei@meta.com>
Reported-by: Patrick Lu <patlu@meta.com>
2024-11-08 10:42:13 -10:00
Tejun Heo
f07b806ad8 Merge branch 'for-6.12-fixes' into for-6.13
Pull sched_ext/for-6.12-fixes to receive 0e7ffff1b8 ("scx: Fix raciness in
scx_ops_bypass()"). Planned updates for scx_ops_bypass() depends on it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-08 10:40:44 -10:00
Andrea Righi
6d594af5bf sched_ext: Fix incorrect use of bitwise AND
There is no reason to use a bitwise AND when checking the conditions to
enable NUMA optimization for the built-in CPU idle selection policy, so
use a logical AND instead.

Fixes: f6ce6b9493 ("sched_ext: Do not enable LLC/NUMA optimizations when domains overlap")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/lkml/20241108181753.GA2681424@thelio-3990X/
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-08 09:56:38 -10:00
Sean Anderson
d5bbfbad58 dma-mapping: fix swapped dir/flags arguments to trace_dma_alloc_sgt_err
trace_dma_alloc_sgt_err was called with the dir and flags arguments
swapped. Fix this.

Fixes: 68b6dbf1f4 ("dma-mapping: trace more error paths")
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202410302243.1wnTlPk3-lkp@intel.com/
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-11-08 14:56:39 +01:00
Andrea Righi
f6ce6b9493 sched_ext: Do not enable LLC/NUMA optimizations when domains overlap
When the LLC and NUMA domains fully overlap, enabling both optimizations
in the built-in idle CPU selection policy is redundant, as it leads to
searching for an idle CPU within the same domain twice.

Likewise, if all online CPUs are within a single LLC domain, LLC
optimization is unnecessary.

Therefore, detect overlapping domains and enable topology optimizations
only when necessary.

Moreover, rely on the online CPUs for this detection logic, instead of
using the possible CPUs.

Fixes: 860a45219b ("sched_ext: Introduce NUMA awareness to the default idle selection policy")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-07 14:56:39 -10:00
Matthew Wilcox (Oracle)
7d3e93eca3 mm: use page_pgoff() in more places
There are several places which currently open-code page_pgoff(), convert
them to call it.

Link: https://lkml.kernel.org/r/20241005200121.3231142-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07 14:38:07 -08:00
Suren Baghdasaryan
0db6f8d782 alloc_tag: load module tags into separate contiguous memory
When a module gets unloaded there is a possibility that some of the
allocations it made are still used and therefore the allocation tags
corresponding to these allocations are still referenced.  As such, the
memory for these tags can't be freed.  This is currently handled as an
abnormal situation and module's data section is not being unloaded.  To
handle this situation without keeping module's data in memory, allow
codetags with longer lifespan than the module to be loaded into their own
separate memory.  The in-use memory areas and gaps after module unloading
in this separate memory are tracked using maple trees.  Allocation tags
arrange their separate memory so that it is virtually contiguous and that
will allow simple allocation tag indexing later on in this patchset.  The
size of this virtually contiguous memory is set to store up to 100000
allocation tags.

[surenb@google.com: fix empty codetag module section handling]
  Link: https://lkml.kernel.org/r/20241101000017.3856204-1-surenb@google.com
[akpm@linux-foundation.org: update comment, per Dan]
Link: https://lkml.kernel.org/r/20241023170759.999909-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Huth <thuth@redhat.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xiongwei Song <xiongwei.song@windriver.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07 14:25:16 -08:00
Mike Rapoport (Microsoft)
0c133b1e78 module: prepare to handle ROX allocations for text
In order to support ROX allocations for module text, it is necessary to
handle modifications to the code, such as relocations and alternatives
patching, without write access to that memory.

One option is to use text patching, but this would make module loading
extremely slow and will expose executable code that is not finally formed.

A better way is to have memory allocated with ROX permissions contain
invalid instructions and keep a writable, but not executable copy of the
module text.  The relocations and alternative patches would be done on the
writable copy using the addresses of the ROX memory.  Once the module is
completely ready, the updated text will be copied to ROX memory using text
patching in one go and the writable copy will be freed.

Add support for that to module initialization code and provide necessary
interfaces in execmem.

Link: https://lkml.kernel.org/r/20241023162711.2579610-5-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewd-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: kdevops <kdevops@lists.linux.dev>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Song Liu <song@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07 14:25:15 -08:00
Roman Gushchin
9e05e5c7ee signal: restore the override_rlimit logic
Prior to commit d646969055 ("Reimplement RLIMIT_SIGPENDING on top of
ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of
signals.  However now it's enforced unconditionally, even if
override_rlimit is set.  This behavior change caused production issues.  

For example, if the limit is reached and a process receives a SIGSEGV
signal, sigqueue_alloc fails to allocate the necessary resources for the
signal delivery, preventing the signal from being delivered with siginfo. 
This prevents the process from correctly identifying the fault address and
handling the error.  From the user-space perspective, applications are
unaware that the limit has been reached and that the siginfo is
effectively 'corrupted'.  This can lead to unpredictable behavior and
crashes, as we observed with java applications.

Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip
the comparison to max there if override_rlimit is set.  This effectively
restores the old behavior.

Link: https://lkml.kernel.org/r/20241104195419.3962584-1-roman.gushchin@linux.dev
Fixes: d646969055 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Co-developed-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07 14:14:59 -08:00
Andrei Vagin
432dc0654c ucounts: fix counter leak in inc_rlimit_get_ucounts()
The inc_rlimit_get_ucounts() increments the specified rlimit counter and
then checks its limit.  If the value exceeds the limit, the function
returns an error without decrementing the counter.

Link: https://lkml.kernel.org/r/20241101191940.3211128-1-roman.gushchin@linux.dev
Fixes: 15bc01effe ("ucounts: Fix signal ucount refcounting")
Signed-off-by: Andrei Vagin <avagin@google.com>
Co-developed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Tested-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Alexey Gladkov <legion@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Andrei Vagin <avagin@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07 14:14:59 -08:00
Jakub Kicinski
2696e451df Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.12-rc7).

Conflicts:

drivers/net/ethernet/freescale/enetc/enetc_pf.c
  e15c5506dd ("net: enetc: allocate vf_state during PF probes")
  3774409fd4 ("net: enetc: build enetc_pf_common.c as a separate module")
https://lore.kernel.org/20241105114100.118bd35e@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/ti/am65-cpsw-nuss.c
  de794169cf ("net: ethernet: ti: am65-cpsw: Fix multi queue Rx on J7")
  4a7b2ba94a ("net: ethernet: ti: am65-cpsw: Use tstats instead of open coded version")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-07 13:44:16 -08:00
Peter Zijlstra
fe9beaaa80 sched: No PREEMPT_RT=y for all{yes,mod}config
While PREEMPT_RT is undoubtedly totally awesome, it does not, at this
time, make sense to have all{yes,mod}config select it.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 35772d627b ("sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-11-07 15:25:05 +01:00
John Hubbard
afe789b736 kaslr: rename physmem_end and PHYSMEM_END to direct_map_physmem_end
For clarity.  It's increasingly hard to reason about the code, when KASLR
is moving around the boundaries.  In this case where KASLR is randomizing
the location of the kernel image within physical memory, the maximum
number of address bits for physical memory has not changed.

What has changed is the ending address of memory that is allowed to be
directly mapped by the kernel.

Let's name the variable, and the associated macro accordingly.

Also, enhance the comment above the direct_map_physmem_end definition,
to further clarify how this all works.

Link: https://lkml.kernel.org/r/20241009025024.89813-1-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jordan Niethe <jniethe@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06 20:11:11 -08:00
Nam Cao
3c2fb01521 hrtimers: Delete hrtimer_init_on_stack()
hrtimer_init_on_stack() is now unused. Delete it.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/510ce0d2944c4a382ea51e51d03dcfb73ba0f4f7.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:07 +01:00
Nam Cao
d82fadc727 alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
hrtimer_setup() and hrtimer_setup_on_stack() take the callback function
pointer as argument and initialize the timer completely.

Replace the hrtimer_init*() variants and the open coded initialization of
hrtimer::function with the new setup mechanism.

Switch to use the new functions.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/2bae912336103405adcdab96b88d3ea0353b4228.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:07 +01:00
Nam Cao
46d076af6d sched/idle: Switch to use hrtimer_setup_on_stack()
hrtimer_setup_on_stack() takes the callback function pointer as argument
and initializes the timer completely.

Replace hrtimer_init_on_stack() and the open coded initialization of
hrtimer::function with the new setup mechanism.

The conversion was done with Coccinelle.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/17f9421fed6061df4ad26a4cc91873d2c078cb0f.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:06 +01:00
Nam Cao
f3bef7aaa6 hrtimers: Delete hrtimer_init_sleeper_on_stack()
hrtimer_init_sleeper_on_stack() is now unused. Delete it.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/52549846635c0b3a2abf82101f539efdabcd9778.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:06 +01:00
Nam Cao
8fae141107 timers: Switch to use hrtimer_setup_sleeper_on_stack()
hrtimer_setup_sleeper_on_stack() replaces hrtimer_init_sleeper_on_stack()
to keep the naming convention consistent.

Convert the usage sites over to it. The conversion was done with
Coccinelle.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/299c07f0f96af8ab3a7631b47b6ca22b06b20577.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:06 +01:00
Nam Cao
9788c1f0ff futex: Switch to use hrtimer_setup_sleeper_on_stack()
hrtimer_setup_sleeper_on_stack() replaces hrtimer_init_sleeper_on_stack()
to keep the naming convention consistent.

Convert the usage site over to it. The conversion was done with Coccinelle.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/d92116a17313dee283ebc959869bea80fbf94cdb.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:06 +01:00
Nam Cao
c9bd83abfe hrtimers: Introduce hrtimer_setup_sleeper_on_stack()
The hrtimer_init*() API is replaced by hrtimer_setup*() variants to
initialize the timer including the callback function at once.

hrtimer_init_sleeper_on_stack() does not need user to setup the callback
function separately, so a new variant would not be strictly necessary.

Nonetheless, to keep the naming convention consistent, introduce
hrtimer_setup_sleeper_on_stack(). hrtimer_init_on_stack() will be removed
once all users are converted.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/7b5e18e6dd0ace9eaa211201528cb9dc23752454.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:05 +01:00
Nam Cao
444cb7db4c hrtimers: Introduce hrtimer_setup_on_stack()
To initialize hrtimer on stack, hrtimer_init_on_stack() needs to be called
and also hrtimer::function must be set. This is error-prone and awkward to
use.

Introduce hrtimer_setup_on_stack() which does both of these things, so that
users of hrtimer can be simplified.

The new setup function also has a sanity check for the provided function
pointer. If NULL, a warning is emitted and a dummy callback installed.

hrtimer_init_on_stack() will be removed as soon as all of its users have
been converted to the new function.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/4b05e2ab3a82c517adf67fabc0f0cd8fe118b97c.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:05 +01:00
Nam Cao
908a1d7754 hrtimers: Introduce hrtimer_setup() to replace hrtimer_init()
To initialize hrtimer, hrtimer_init() needs to be called and also
hrtimer::function must be set. This is error-prone and awkward to use.

Introduce hrtimer_setup() which does both of these things, so that users of
hrtimer can be simplified.

The new setup function also has a sanity check for the provided function
pointer. If NULL, a warning is emitted and a dummy callback installed.

hrtimer_init() will be removed as soon as all of its users have been
converted to the new function.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/5057c1ddbfd4b92033cd93d37fe38e6b069d5ba6.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:05 +01:00
Nam Cao
fbf920f255 hrtimers: Add missing hrtimer_init() trace points
hrtimer_init*_on_stack() is not covered by tracing when
CONFIG_DEBUG_OBJECTS_TIMERS=y.

Rework the functions similar to hrtimer_init() and hrtimer_init_sleeper()
so that the hrtimer_init() tracepoint is unconditionally available.

The rework makes hrtimer_init_sleeper() unused. Delete it.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/74528e8abf2bb96e8bee85ffacbf14e15cf89f0d.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:04 +01:00
Sebastian Andrzej Siewior
49a1763950 softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.
The timer and hrtimer soft interrupts are raised in hard interrupt
context. With threaded interrupts force enabled or on PREEMPT_RT this leads
to waking the ksoftirqd for the processing of the soft interrupt.

ksoftirqd runs as SCHED_OTHER task which means it will compete with other
tasks for CPU resources.  This can introduce long delays for timer
processing on heavy loaded systems and is not desired.

Split the TIMER_SOFTIRQ and HRTIMER_SOFTIRQ processing into a dedicated
timers thread and let it run at the lowest SCHED_FIFO priority.
Wake-ups for RT tasks happen from hardirq context so only timer_list timers
and hrtimers for "regular" tasks are processed here. The higher priority
ensures that wakeups are performed before scheduling SCHED_OTHER tasks.

Using a dedicated variable to store the pending softirq bits values ensure
that the timer are not accidentally picked up by ksoftirqd and other
threaded interrupts.

It shouldn't be picked up by ksoftirqd since it runs at lower priority.
However if ksoftirqd is already running while a timer fires, then ksoftird
will be PI-boosted due to the BH-lock to ktimer's priority.

The timer thread can pick up pending softirqs from ksoftirqd but only
if the softirq load is high. It is not be desired that the picked up
softirqs are processed at SCHED_FIFO priority under high softirq load
but this can already happen by a PI-boost by a force-threaded interrupt.

[ frederic@kernel.org: rcutorture.c fixes, storm fix by introduction of
  local_timers_pending() for tick_nohz_next_event() ]

[ junxiao.chang@intel.com: Ensure ktimersd gets woken up even if a
  softirq is currently served. ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org> [rcutorture]
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241106150419.2593080-4-bigeasy@linutronix.de
2024-11-07 02:44:38 +01:00
Sebastian Andrzej Siewior
a02976cfce timers: Use __raise_softirq_irqoff() to raise the softirq.
Raising the timer soft interrupt is always done from hard interrupt
context, so it can be reduced to just setting the TIMER soft interrupt
flag. The soft interrupt will be invoked on return from interrupt.

Use therefore __raise_softirq_irqoff() to raise the TIMER soft interrupt,
which is a trivial optimization.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241106150419.2593080-3-bigeasy@linutronix.de
2024-11-07 02:44:38 +01:00
Sebastian Andrzej Siewior
7a7f5065bc hrtimer: Use __raise_softirq_irqoff() to raise the softirq
Raising the hrtimer soft interrupt is always done from hard interrupt
context, so it can be reduced to just setting the HRTIMER soft interrupt
flag. The soft interrupt will be invoked on return from interrupt.

Use therefore __raise_softirq_irqoff() to raise the HRTIMER soft interrupt,
which is a trivial optimization.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241106150419.2593080-2-bigeasy@linutronix.de
2024-11-07 02:44:38 +01:00
Thomas Gleixner
2634303f87 alarmtimers: Remove return value from alarm functions
Now that the SIG_IGN problem is solved in the core code, the alarmtimer
callbacks do not require a return value anymore.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241105064214.318837272@linutronix.de
2024-11-07 02:14:46 +01:00
Thomas Gleixner
6b0aa14578 alarmtimers: Remove the throttle mechanism from alarm_forward_now()
Now that ignored posix timer signals are requeued and the timers are
rearmed on signal delivery the workaround to keep such timers alive and
self rearm them is not longer required.

Remove the unused alarm timer parts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064214.252443020@linutronix.de
2024-11-07 02:14:45 +01:00
Thomas Gleixner
7a66f72b09 posix-timers: Cleanup SIG_IGN workaround leftovers
Now that ignored posix timer signals are requeued and the timers are
rearmed on signal delivery the workaround to keep such timers alive and
self rearm them is not longer required.

Remove the relevant hacks and the not longer required return values from
the related functions. The alarm timer workarounds will be cleaned up in a
separate step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064214.187239060@linutronix.de
2024-11-07 02:14:45 +01:00
Thomas Gleixner
df7a996b4d signal: Queue ignored posixtimers on ignore list
Queue posixtimers which have their signal ignored on the ignored list:

   1) When the timer fires and the signal has SIG_IGN set

   2) When SIG_IGN is installed via sigaction() and a timer signal
      is already queued

This only happens when the signal is for a valid timer, which delivered the
signal in periodic mode. One-shot timer signals are correctly dropped.

Due to the lock order constraints (sighand::siglock nests inside
timer::lock) the signal code cannot access any of the timer fields which
are relevant to make this decision, e.g. timer::it_status.

This is addressed by establishing a protection scheme which requires to
lock both locks on the timer side for modifying decision fields in the
timer struct and therefore makes it possible for the signal delivery to
evaluate with only sighand:siglock being held:

  1) Move the NULLification of timer->it_signal into the sighand::siglock
     protected section of timer_delete() and check timer::it_signal in the
     code path which determines whether the signal is dropped or queued on
     the ignore list.

     This ensures that a deleted timer cannot be moved onto the ignore
     list, which would prevent it from being freed on exit() as it is not
     longer in the process' posix timer list.

     If the timer got moved to the ignored list before deletion then it is
     removed from the ignored list under sighand lock in timer_delete().

  2) Provide a new timer::it_sig_periodic flag, which gets set in the
     signal queue path with both timer and sighand locks held if the timer
     is actually in periodic mode at expiry time.

     The ignore list code checks this flag under sighand::siglock and drops
     the signal when it is not set.

     If it is set, then the signal is moved to the ignored list independent
     of the actual state of the timer.

     When the signal is un-ignored later then the signal is moved back to
     the signal queue. On signal delivery the posix timer side decides
     about dropping the signal if the timer was re-armed, dis-armed or
     deleted based on the signal sequence counter check.

     If the thread/process exits then not yet delivered signals are
     discarded which means the reference of the timer containing the
     sigqueue is dropped and frees the timer.

     This is way cheaper than requiring all code paths to lock
     sighand::siglock of the target thread/process on any modification of
     timer::it_status or going all the way and removing pending signals
     from the signal queues on every rearm, disarm or delete operation.

So the protection scheme here is that on the timer side both timer::lock
and sighand::siglock have to be held for modifying

   timer::it_signal
   timer::it_sig_periodic

which means that on the signal side holding sighand::siglock is enough to
evaluate these fields.
                                                                                                                                                                                                                                                                                                                             
In posixtimer_deliver_signal() holding timer::lock is sufficient to do the
sequence validation against timer::it_signal_seq because a concurrent
expiry is waiting on timer::lock to be released.

This completes the SIG_IGN handling and such timers are not longer self
rearmed which avoids pointless wakeups.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064214.120756416@linutronix.de
2024-11-07 02:14:45 +01:00
Thomas Gleixner
caf77435dd signal: Handle ignored signals in do_sigaction(action != SIG_IGN)
When a real handler (including SIG_DFL) is installed for a signal, which
had previously SIG_IGN set, then the list of ignored posix timers has to be
checked for timers which are affected by this change.

Add a list walk function which checks for the matching signal number and if
found requeues the timers signal, so the timer is rearmed on signal
delivery.

Rearming the timer right away is not possible because that requires to drop
sighand lock.

No functional change as the counter part which queues the timers on the
ignored list is still missing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064214.054091076@linutronix.de
2024-11-07 02:14:45 +01:00
Thomas Gleixner
0e20cd33ac posix-timers: Handle ignored list on delete and exit
To handle posix timer signals on sigaction(SIG_IGN) properly, the timers
will be queued on a separate ignored list.

Add the necessary cleanup code for timer_delete() and exit_itimers().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.987530588@linutronix.de
2024-11-07 02:14:45 +01:00
Thomas Gleixner
69f032c92c signal: Provide ignored_posix_timers list
To prepare for handling posix timer signals on sigaction(SIG_IGN) properly,
add a list to task::signal.

This list will be used to queue posix timers so their signal can be
requeued when SIG_IGN is lifted later.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.920101900@linutronix.de
2024-11-07 02:14:45 +01:00
Thomas Gleixner
647da5f709 posix-timers: Move sequence logic into struct k_itimer
The posix timer signal handling uses siginfo::si_sys_private for handling
the sequence counter check. That indirection is not longer required and the
sequence count value at signal queueing time can be stored in struct
k_itimer itself.

This removes the requirement of treating siginfo::si_sys_private special as
it's now always zero as the kernel does not touch it anymore.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/all/20241105064213.852619866@linutronix.de
2024-11-07 02:14:45 +01:00
Thomas Gleixner
c2a4796a15 signal: Cleanup unused posix-timer leftovers
Remove the leftovers of sigqueue preallocation as it's not longer used.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.786506636@linutronix.de
2024-11-07 02:14:44 +01:00
Thomas Gleixner
6017a158be posix-timers: Embed sigqueue in struct k_itimer
To cure the SIG_IGN handling for posix interval timers, the preallocated
sigqueue needs to be embedded into struct k_itimer to prevent life time
races of all sorts.

Now that the prerequisites are in place, embed the sigqueue into struct
k_itimer and fixup the relevant usage sites.

Aside of preparing for proper SIG_IGN handling, this spares an extra
allocation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.719695194@linutronix.de
2024-11-07 02:14:44 +01:00
Thomas Gleixner
11629b9808 signal: Replace resched_timer logic
In preparation for handling ignored posix timer signals correctly and
embedding the sigqueue struct into struct k_itimer, hand down a pointer to
the sigqueue struct into posix_timer_deliver_signal() instead of just
having a boolean flag.

No functional change.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/all/20241105064213.652658158@linutronix.de
2024-11-07 02:14:44 +01:00
Thomas Gleixner
0360ed14d9 signal: Refactor send_sigqueue()
To handle posix timers which have their signal ignored via SIG_IGN properly
it is required to requeue a ignored signal for delivery when SIG_IGN is
lifted so the timer gets rearmed.

Split the required code out of send_sigqueue() so it can be reused in
context of sigaction().

While at it rename send_sigqueue() to posixtimer_send_sigqueue() so its
clear what this is about.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.586453412@linutronix.de
2024-11-07 02:14:44 +01:00
Thomas Gleixner
ef1c5bcd6d posix-timers: Store PID type in the timer
instead of re-evaluating the signal delivery mode everywhere.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.519086500@linutronix.de
2024-11-07 02:14:44 +01:00
Thomas Gleixner
54f1dd642f signal: Provide posixtimer_sigqueue_init()
To cure the SIG_IGN handling for posix interval timers, the preallocated
sigqueue needs to be embedded into struct k_itimer to prevent life time
races of all sorts.

Provide a new function to initialize the embedded sigqueue to prepare for
that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.450427515@linutronix.de
2024-11-07 02:14:44 +01:00
Thomas Gleixner
5cac427f79 signal: Split up __sigqueue_alloc()
To cure the SIG_IGN handling for posix interval timers, the preallocated
sigqueue needs to be embedded into struct k_itimer to prevent life time
races of all sorts.

Reorganize __sigqueue_alloc() so the ucounts retrieval and the
initialization can be used independently.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.371410037@linutronix.de
2024-11-07 02:14:44 +01:00
Thomas Gleixner
5d916a0988 posix-timers: Add a refcount to struct k_itimer
To cure the SIG_IGN handling for posix interval timers, the preallocated
sigqueue needs to be embedded into struct k_itimer to prevent life time
races of all sorts.

To make that work correctly it needs reference counting so that timer
deletion does not free the timer prematuraly when there is a signal queued
or delivered concurrently.

Add a rcuref to the posix timer part.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.304756440@linutronix.de
2024-11-07 02:14:43 +01:00
Thomas Gleixner
4cf7bf2a2f posix-cpu-timers: Use dedicated flag for CPU timer nanosleep
POSIX CPU timer nanosleep creates a k_itimer on stack and uses the sigq
pointer to detect the nanosleep case in the expiry function.

Prepare for embedding sigqueue into struct k_itimer by using a dedicated
flag for nanosleep.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.238550394@linutronix.de
2024-11-07 02:14:43 +01:00
Thomas Gleixner
bf635681c9 posix-cpu-timers: Cleanup the firing logic
The firing flag of a posix CPU timer is tristate:

  0: when the timer is not about to deliver a signal

  1: when the timer has expired, but the signal has not been delivered yet

 -1: when the timer was queued for signal delivery and a rearm operation
     raced against it and supressed the signal delivery.

This is a pointless exercise as this can be simply expressed with a
boolean. Only if set, the signal is delivered. This makes delete and rearm
consistent with the rest of the posix timers.

Convert firing to bool and fixup the usage sites accordingly and add
comments why the timer cannot be dequeued right away.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241105064213.172848618@linutronix.de
2024-11-07 02:14:43 +01:00
Thomas Gleixner
b06b0345ff posix-timers: Make signal overrun accounting sensible
The handling of the timer overrun in the signal code is inconsistent as it
takes previous overruns into account. This is just wrong as after the
reprogramming of a timer the overrun count starts over from a clean state,
i.e. 0.

Don't touch info::si_overrun in send_sigqueue() and only store the overrun
value at signal delivery time, which is computed from the timer itself
relative to the expiry time.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.106738193@linutronix.de
2024-11-07 02:14:43 +01:00
Thomas Gleixner
513793bc6a posix-timers: Make signal delivery consistent
Signals of timers which are reprogammed, disarmed or deleted can deliver
signals related to the past. The POSIX spec is blury about this:

 - "The effect of disarming or resetting a timer with pending expiration
    notifications is unspecified."

 - "The disposition of pending signals for the deleted timer is
    unspecified."

In both cases it is reasonable to expect that pending signals are
discarded. Especially in the reprogramming case it does not make sense to
account for previous overruns or to deliver a signal for a timer which has
been disarmed. This makes the behaviour consistent and understandable.

Remove the si_sys_private check from the signal delivery code and invoke
posix_timer_deliver_signal() unconditionally for posix timer related
signals.

Change posix_timer_deliver_signal() so it controls the actual signal
delivery via the return value. It now instructs the signal code to drop the
signal when:

  1) The timer does not longer exist in the hash table

  2) The timer signal_seq value is not the same as the si_sys_private value
     which was set when the signal was queued.

This is also a preparatory change to embed the sigqueue into the k_itimer
structure, which in turn allows to remove the si_sys_private magic.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241105064213.040348644@linutronix.de
2024-11-07 02:14:43 +01:00
Thomas Gleixner
15cbfb92ef posix-cpu-timers: Correctly update timer status in posix_cpu_timer_del()
If posix_cpu_timer_del() exits early due to task not found or sighand
invalid, it fails to clear the state of the timer. That's harmless but
inconsistent.

These early exits are accounted as successful delete. Move the update of
the timer state into the success return path, so all "successful" deletions
are handled.

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241105064212.974053438@linutronix.de
2024-11-07 02:14:43 +01:00
Huang Ying
d7ce9c73da resource: avoid unnecessary resource tree walking in __region_intersects()
Currently, if __region_intersects() finds any overlapped but unmatched
resource, it walks the descendant resource tree to check for overlapped
and matched descendant resources using for_each_resource().  However, in
current kernel, for_each_resource() iterates not only the descendant tree,
but also subsequent sibling trees in certain scenarios.  While this
doesn't introduce bugs, it makes code hard to be understood and
potentially inefficient.

So, the patch revises next_resource() and for_each_resource() and makes
for_each_resource() traverse the subtree under the specified subtree root
only.  Test shows that this avoids unnecessary resource tree walking in
__region_intersects().

For the example resource tree as follows,

  X
  |
  A----D----E
  |
  B--C

if 'A' is the overlapped but unmatched resource, original kernel
iterates 'B', 'C', 'D', 'E' when it walks the descendant tree.  While
the patched kernel iterates only 'B', 'C'.

Thanks David Hildenbrand for providing a good resource tree example.

Link: https://lkml.kernel.org/r/20241029122735.79164-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06 13:36:37 -08:00
Linus Torvalds
7758b20611 Fix tracefs mount options:
The commit 78ff640819 ("vfs: Convert tracefs to use the new mount API")
 broke the gid setting when set by fstab or other mount utility.
 It is ignored when it is set. Fix the code so that it recognises the
 option again and will honor the settings on mount at boot up.
 
 Update the internal documentation and create a selftest to make sure
 it doesn't break again in the future.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZyuidRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsgQAQDuV0x4RLpCrrowDS/ITQw/eb/WjhR7
 lhkXVROLN6RK6wD+JWmbaCP82q2S4A2Vx0Rjc72gUMmTzDb1HQflhQiLhwU=
 =0dZF
 -----END PGP SIGNATURE-----

Merge tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracefs fixes from Steven Rostedt:
 "Fix tracefs mount options.

  Commit 78ff640819 ("vfs: Convert tracefs to use the new mount API")
  broke the gid setting when set by fstab or other mount utility. It is
  ignored when it is set. Fix the code so that it recognises the option
  again and will honor the settings on mount at boot up.

  Update the internal documentation and create a selftest to make sure
  it doesn't break again in the future"

* tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/selftests: Add tracefs mount options test
  tracing: Document tracefs gid mount option
  tracing: Fix tracefs mount options
2024-11-06 08:08:39 -10:00
Andrii Nakryiko
5f67329cb2 Stable tag for bpf-next's uprobe work.
-----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEv3OU3/byMaA0LqWJdkfhpEvA5LoFAmcrTRsVHHBldGVyekBp
 bmZyYWRlYWQub3JnAAoJEHZH4aRLwOS6PLoP/jL4pUgW/ZrQFwpZh71BxeDt2Ka/
 Eb6AsHe0PcKAMJYaJDfin6FRU87hp3tHIefSGdexvSttWwbnwKl8cVb+Y7gVnytu
 b2PkMfiOFShKEhu6YAJmxWIOi6MDxonjIMQgjvsVGrZmHiPgGTrh+nnmHYQ+qxFq
 wCaZXO3E65drtZKbi1HddHDYR+e1mHQU0uC+mLO44sP3lzJVxPnYGKGjaS62Z/Da
 XF+3tz6jc6jpu08FJy8ltrqLvcHPmTuDkR6f8mG3Hc8Hw0mndY/4yk0bGbbHo7Vx
 y42Aq4UUgcpvb8OUIicMRLzp3hRjsSTn8UJjsinEaCexdw6ZZiZVU/YR9Mf5ivrJ
 dlplFJvP8b6psnHrRf5xJ1SUv7+dap075A3/28MEvGErZOINoULAGa/hJIndHfuL
 NeWaZj0+of2eAX1SDePia87jX1P9xuU6AEw944i2rhI4P1J5I6XYfcaDDICBYitv
 yREafY/i6wb/Q8GhpjWmSE7p4wUIi5o3CpZsncj7B4Me9JBdHWrcnyUY55Tz05mo
 zoKnNgYC3d9DAIwXvq7x6tM2Tw183YXul/aHJSr3/rFKuuGQx0XACt6BO+yI35q3
 6max4kMyr+kUqr9YYZtb9fuBw3TPhwY/zXG0ydSxNNh7oX+boxh4/bxXljLWXmRQ
 eHgsXuuF1YgCg1R9
 =Wiky
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-for-bpf-next' from tip tree

Stable tag for bpf-next's uprobe work.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-06 08:13:03 -08:00
Kuan-Wei Chiu
083ad2871a perf/core: update min_heap_callbacks to use default builtin swap
After introducing the default builtin swap implementation, update the
min_heap_callbacks to replace the swp function pointer with NULL.  This
change allows the min heap to directly utilize the builtin swap,
simplifying the code.

Link: https://lkml.kernel.org/r/20241020040200.939973-6-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Coly Li <colyli@suse.de>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Sakai <msakai@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 17:12:35 -08:00
Kuan-Wei Chiu
92a8b224b8 lib/min_heap: introduce non-inline versions of min heap API functions
Patch series "Enhance min heap API with non-inline functions and
optimizations", v2.

Add non-inline versions of the min heap API functions in lib/min_heap.c
and updates all users outside of kernel/events/core.c to use these
non-inline versions.  To mitigate the performance impact of indirect
function calls caused by the non-inline versions of the swap and compare
functions, a builtin swap has been introduced that swaps elements based on
their size.  Additionally, it micro-optimizes the efficiency of the min
heap by pre-scaling the counter, following the same approach as in
lib/sort.c.  Documentation for the min heap API has also been added to the
core-api section.


This patch (of 10):

All current min heap API functions are marked with '__always_inline'. 
However, as the number of users increases, inlining these functions
everywhere leads to a increase in kernel size.

In performance-critical paths, such as when perf events are enabled and
min heap functions are called on every context switch, it is important to
retain the inline versions for optimal performance.  To balance this, the
original inline functions are kept, and additional non-inline versions of
the functions have been added in lib/min_heap.c.

Link: https://lkml.kernel.org/r/20241020040200.939973-1-visitorckw@gmail.com
Link: https://lore.kernel.org/20240522161048.8d8bbc7b153b4ecd92c50666@linux-foundation.org
Link: https://lkml.kernel.org/r/20241020040200.939973-2-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Coly Li <colyli@suse.de>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Kuan-Wei Chiu <visitorckw@gmail.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Sakai <msakai@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 17:12:34 -08:00
Uros Bizjak
ad8f63f935 perf/hw_breakpoint: use ERR_PTR_PCPU(), IS_ERR_PCPU() and PTR_ERR_PCPU() macros
Use ERR_PTR_PCPU() when returning error pointer in the percpu address
space.  Use IS_ERR_PCPU() and PTR_ERR_PCPU() when returning the error
pointer from the percpu address space.  These macros add intermediate cast
to unsigned long when switching named address spaces.

The patch will avoid future build errors due to pointer address space
mismatch with enabled strict percpu address space checks.

Link: https://lkml.kernel.org/r/20240924090813.1353586-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 17:12:32 -08:00
Thomas Weißschuh
f2fa0fd4e7 reboot: move reboot_notifier_list to kernel/reboot.c
All the functions related to the reboot notifier list are in
kernel/reboot.c.  Move the list itself, too.  As there are no direct users
anymore, make the declaration static.

Link: https://lkml.kernel.org/r/20241012-reboot_notifier_list-v1-1-6093bb9455ce@weissschuh.net
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 17:12:31 -08:00
Ilpo Järvinen
834b251b1d resource: correct reallocate_resource() documentation
reallocate_resource() documentation claims constraint is about "the size
and alignment" but the size is provided in another parameter.  Instead of
size, constraint has the allowed memory range (min, max) so change the
wording to reflect that.

Link: https://lkml.kernel.org/r/20241009125751.8090-1-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 17:12:31 -08:00
Yafang Shao
286d7a54c8 auditsc: replace memcpy() with strscpy()
Using strscpy() to read the task comm ensures that the name is always
NUL-terminated, regardless of the source string.  This approach also
facilitates future extensions to the task comm.

Link: https://lkml.kernel.org/r/20241007144911.27693-3-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Alejandro Colomar <alx@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matus Jokay <matus.jokay@stuba.sk>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Quentin Monnet <qmo@kernel.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 17:12:29 -08:00
Yafang Shao
4cc0473d77 get rid of __get_task_comm()
Patch series "Improve the copy of task comm", v8.

Using {memcpy,strncpy,strcpy,kstrdup} to copy the task comm relies on the
length of task comm.  Changes in the task comm could result in a
destination string that is overflow.  Therefore, we should explicitly
ensure the destination string is always NUL-terminated, regardless of the
task comm.  This approach will facilitate future extensions to the task
comm.

As suggested by Linus [0], we can identify all relevant code with the
following git grep command:

  git grep 'memcpy.*->comm\>'
  git grep 'kstrdup.*->comm\>'
  git grep 'strncpy.*->comm\>'
  git grep 'strcpy.*->comm\>'

PATCH #2~#4:   memcpy
PATCH #5~#6:   kstrdup
PATCH #7:      strcpy

Please note that strncpy() is not included in this series as it is being
tracked by another effort. [1]


This patch (of 7):

We want to eliminate the use of __get_task_comm() for the following
reasons:

- The task_lock() is unnecessary
  Quoted from Linus [0]:
  : Since user space can randomly change their names anyway, using locking
  : was always wrong for readers (for writers it probably does make sense
  : to have some lock - although practically speaking nobody cares there
  : either, but at least for a writer some kind of race could have
  : long-term mixed results

Link: https://lkml.kernel.org/r/20241007144911.27693-1-laoar.shao@gmail.com
Link: https://lkml.kernel.org/r/20241007144911.27693-2-laoar.shao@gmail.com
Link: https://lore.kernel.org/all/CAHk-=wivfrF0_zvf+oj6==Sh=-npJooP8chLPEfaFV0oNYTTBA@mail.gmail.com [0]
Link: https://lore.kernel.org/all/CAHk-=whWtUC-AjmGJveAETKOMeMFSTwKwu99v7+b6AyHMmaDFA@mail.gmail.com/
Link: https://lore.kernel.org/all/CAHk-=wjAmmHUg6vho1KjzQi2=psR30+CogFd4aXrThr2gsiS4g@mail.gmail.com/ [0]
Link: https://github.com/KSPP/linux/issues/90 [1]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Matus Jokay <matus.jokay@stuba.sk>
Cc: Alejandro Colomar <alx@kernel.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@gmail.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Quentin Monnet <qmo@kernel.org>
Cc: Simon Horman <horms@kernel.org>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 17:12:28 -08:00
Andy Shevchenko
ba1eccc114 resource: introduce is_type_match() helper and use it
There are already a couple of places where we may replace a few lines of
code by calling a helper, which increases readability while deduplicating
the code.

Introduce is_type_match() helper and use it.

Link: https://lkml.kernel.org/r/20240925154355.1170859-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 17:12:28 -08:00
Andy Shevchenko
5c1edea773 resource: replace open coded resource_intersection()
Patch series "resource: A couple of cleanups".

A couple of ad-hoc cleanups since there was a recent development of
the code in question. No functional changes intended.


This patch (of 2):

__region_intersects() uses open coded resource_intersection().  Replace it
with existing API which also make more clear what we are checking.

Link: https://lkml.kernel.org/r/20240925154355.1170859-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20240925154355.1170859-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 17:12:27 -08:00
Tio Zhang
8380101802 kernel/watchdog: always restore watchdog_softlockup(,hardlockup)_user_enabled after proc show
Otherwise when watchdog_enabled becomes 0,
watchdog_softlockup(,hardlockup)_user_enabled will changes to 0 after proc
show.

Steps to reproduce:

  step 1:
  # cat /proc/sys/kernel/*watchdog
  1
  1
  1

  | name                             | value
  |----------------------------------|--------------------------
  | watchdog_enabled                 | 1
  |----------------------------------|--------------------------
  | watchdog_hardlockup_user_enabled | 1
  |----------------------------------|--------------------------
  | watchdog_softlockup_user_enabled | 1
  |----------------------------------|--------------------------
  | watchdog_user_enabled            | 1
  |----------------------------------|--------------------------

  step 2:
  # echo 0 > /proc/sys/kernel/watchdog

  | name                             | value
  |----------------------------------|--------------------------
  | watchdog_enabled                 | 0
  |----------------------------------|--------------------------
  | watchdog_hardlockup_user_enabled | 1
  |----------------------------------|--------------------------
  | watchdog_softlockup_user_enabled | 1
  |----------------------------------|--------------------------
  | watchdog_user_enabled            | 0
  |----------------------------------|--------------------------

  step 3:
  # cat /proc/sys/kernel/*watchdog
  0
  0
  0

  | name                             | value
  |----------------------------------|--------------------------
  | watchdog_enabled                 | 0
  |----------------------------------|--------------------------
  | watchdog_hardlockup_user_enabled | 0
  |----------------------------------|--------------------------
  | watchdog_softlockup_user_enabled | 0
  |----------------------------------|--------------------------
  | watchdog_user_enabled            | 0
  |----------------------------------|--------------------------

  step 4:
  # echo 1 > /proc/sys/kernel/watchdog

  | name                             | value
  |----------------------------------|--------------------------
  | watchdog_enabled                 | 0
  |----------------------------------|--------------------------
  | watchdog_hardlockup_user_enabled | 0
  |----------------------------------|--------------------------
  | watchdog_softlockup_user_enabled | 0
  |----------------------------------|--------------------------
  | watchdog_user_enabled            | 0
  |----------------------------------|--------------------------

  step 5:
  # cat /proc/sys/kernel/*watchdog
  0
  0
  0

If we dont do "step 3", do "step 4" right after "step 2", it will be

  | name                             | value
  |----------------------------------|--------------------------
  | watchdog_enabled                 | 1
  |----------------------------------|--------------------------
  | watchdog_hardlockup_user_enabled | 1
  |----------------------------------|--------------------------
  | watchdog_softlockup_user_enabled | 1
  |----------------------------------|--------------------------
  | watchdog_user_enabled            | 1
  |----------------------------------|--------------------------

then everything works correctly.

So this patch fix "step 3"'s value into

| name                             | value
|----------------------------------|--------------------------
| watchdog_enabled                 | 0
|----------------------------------|--------------------------
| watchdog_hardlockup_user_enabled | 1
|----------------------------------|--------------------------
| watchdog_softlockup_user_enabled | 1
|----------------------------------|--------------------------
| watchdog_user_enabled            | 0
|----------------------------------|--------------------------

And still print 0 as before.

Link: https://lkml.kernel.org/r/20240906094700.GA30052@didi-ThinkCentre-M930t-N000
Signed-off-by: Tio Zhang <tiozhang@didiglobal.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Krister Johansen <kjlx@templeofstupid.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 17:12:27 -08:00
Sourabh Jain
6efbd5ddb6 kexec/crash: no crash update when kexec in progress
The following errors are observed when kexec is done with SMT=off on
powerpc.

[  358.458385] Removing IBM Power 842 compression device
[  374.795734] kexec_core: Starting new kernel
[  374.795748] kexec: Waking offline cpu 1.
[  374.875695] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[  374.935833] kexec: Waking offline cpu 2.
[  375.015664] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
snip..
[  375.515823] kexec: Waking offline cpu 6.
[  375.635667] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[  375.695836] kexec: Waking offline cpu 7.

To avoid kexec kernel boot failure on PowerPC, all the present CPUs that
are offline are brought online during kexec.  For more information, refer
to commit e8e5c2155b ("powerpc/kexec: Fix orphaned offline CPUs across
kexec").  Bringing the CPUs online triggers the crash hotplug handler,
crash_handle_hotplug_event(), to update the kdump image.  Since the system
is on the kexec kernel boot path and the kexec lock is held, the
crash_handle_hotplug_event() function fails to acquire the same lock to
update the kdump image, resulting in the error messages mentioned above.

To fix this, return from crash_handle_hotplug_event() without printing the
error message if kexec is in progress.

The same applies to the crash_check_hotplug_support() function.  Return 0
if kexec is in progress because kernel is not in a position to update the
kdump image.

Link: https://lkml.kernel.org/r/20240921103745.560430-1-sourabhjain@linux.ibm.com
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Baoquan he <bhe@redhat.com>
Reported-by: Sachin P Bappalige <sachinpb@linux.vnet.ibm.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 17:12:27 -08:00
Nanyong Sun
f2f484085e mm: move mm flags to mm_types.h
The types of mm flags are now far beyond the core dump related features. 
This patch moves mm flags from linux/sched/coredump.h to linux/mm_types.h.
The linux/sched/coredump.h has include the mm_types.h, so the C files
related to coredump does not need to change head file inclusion.  In
addition, the inclusion of sched/coredump.h now can be deleted from the C
files that irrelevant to core dump.

Link: https://lkml.kernel.org/r/20240926074922.2721274-1-sunnanyong@huawei.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:56:26 -08:00
Lorenzo Stoakes
cd3f8467af mm: refactor mm_access() to not return NULL
mm_access() can return NULL if the mm is not found, but this is handled
the same as an error in all callers, with some translating this into an
-ESRCH error.

Only proc_mem_open() returns NULL if no mm is found, however in this case
it is clearer and makes more sense to explicitly handle the error. 
Additionally we take the opportunity to refactor the function to eliminate
unnecessary nesting.

Simplify things by simply returning -ESRCH if no mm is found - this both
eliminates confusing use of the IS_ERR_OR_NULL() macro, and simplifies
callers which would return -ESRCH by returning this error directly.

[lorenzo.stoakes@oracle.com: prefer neater pointer error comparison]
  Link: https://lkml.kernel.org/r/2fae1834-749a-45e1-8594-5e5979cf7103@lucifer.local
Link: https://lkml.kernel.org/r/20240924201023.193135-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:56:23 -08:00
Tejun Heo
f7d1b585e1 sched_ext: Add a missing newline at the end of an error message
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-05 11:45:24 -10:00
Adrian Hunter
18d92bb57c perf/core: Add aux_pause, aux_resume, aux_start_paused
Hardware traces, such as instruction traces, can produce a vast amount of
trace data, so being able to reduce tracing to more specific circumstances
can be useful.

The ability to pause or resume tracing when another event happens, can do
that.

Add ability for an event to "pause" or "resume" AUX area tracing.

Add aux_pause bit to perf_event_attr to indicate that, if the event
happens, the associated AUX area tracing should be paused. Ditto
aux_resume. Do not allow aux_pause and aux_resume to be set together.

Add aux_start_paused bit to perf_event_attr to indicate to an AUX area
event that it should start in a "paused" state.

Add aux_paused to struct hw_perf_event for AUX area events to keep track of
the "paused" state. aux_paused is initialized to aux_start_paused.

Add PERF_EF_PAUSE and PERF_EF_RESUME modes for ->stop() and ->start()
callbacks. Call as needed, during __perf_event_output(). Add
aux_in_pause_resume to struct perf_buffer to prevent races with the NMI
handler. Pause/resume in NMI context will miss out if it coincides with
another pause/resume.

To use aux_pause or aux_resume, an event must be in a group with the AUX
area event as the group leader.

Example (requires Intel PT and tools patches also):

 $ perf record --kcore -e intel_pt/aux-action=start-paused/k,syscalls:sys_enter_newuname/aux-action=resume/,syscalls:sys_exit_newuname/aux-action=pause/ uname
 Linux
 [ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 0.043 MB perf.data ]
 $ perf script --call-trace
 uname   30805 [000] 24001.058782799: name: 0x7ffc9c1865b0
 uname   30805 [000] 24001.058784424:  psb offs: 0
 uname   30805 [000] 24001.058784424:  cbr: 39 freq: 3904 MHz (139%)
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])        debug_smp_processor_id
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])        __x64_sys_newuname
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])            down_read
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                __cond_resched
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                preempt_count_add
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                    in_lock_functions
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                preempt_count_sub
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])            up_read
 uname   30805 [000] 24001.058784629: ([kernel.kallsyms])                preempt_count_add
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])                    in_lock_functions
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])                preempt_count_sub
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])            _copy_to_user
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])        syscall_exit_to_user_mode
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])            syscall_exit_work
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])                perf_syscall_exit
 uname   30805 [000] 24001.058784838: ([kernel.kallsyms])                    debug_smp_processor_id
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                    perf_trace_buf_alloc
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                        perf_swevent_get_recursion_context
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                            debug_smp_processor_id
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                        debug_smp_processor_id
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                    perf_tp_event
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                        perf_trace_buf_update
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                            tracing_gen_ctx_irq_test
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                        perf_swevent_event
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                            __perf_event_account_interrupt
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                __this_cpu_preempt_check
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                            perf_event_output_forward
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                perf_event_aux_pause
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                    ring_buffer_get
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                        __rcu_read_lock
 uname   30805 [000] 24001.058785046: ([kernel.kallsyms])                                        __rcu_read_unlock
 uname   30805 [000] 24001.058785254: ([kernel.kallsyms])                                    pt_event_stop
 uname   30805 [000] 24001.058785254: ([kernel.kallsyms])                                        debug_smp_processor_id
 uname   30805 [000] 24001.058785254: ([kernel.kallsyms])                                        debug_smp_processor_id
 uname   30805 [000] 24001.058785254: ([kernel.kallsyms])                                        native_write_msr
 uname   30805 [000] 24001.058785463: ([kernel.kallsyms])                                        native_write_msr
 uname   30805 [000] 24001.058785639: 0x0

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: James Clark <james.clark@arm.com>
Link: https://lkml.kernel.org/r/20241022155920.17511-3-adrian.hunter@intel.com
2024-11-05 12:55:43 +01:00
Peter Zijlstra
35772d627b sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
In order to enable PREEMPT_DYNAMIC for PREEMPT_RT, remove PREEMPT_RT
from the 'Preemption Model' choice. Strictly speaking PREEMPT_RT is
not a change in how preemption works, but rather it makes a ton more
code preemptible.

Notably, take away NONE and VOLUNTARY options for PREEMPT_RT, they make
no sense (but are techincally possible).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20241007075055.441622332@infradead.org
2024-11-05 12:55:38 +01:00
Peter Zijlstra
7c70cb94d2 sched: Add Lazy preemption model
Change fair to use resched_curr_lazy(), which, when the lazy
preemption model is selected, will set TIF_NEED_RESCHED_LAZY.

This LAZY bit will be promoted to the full NEED_RESCHED bit on tick.
As such, the average delay between setting LAZY and actually
rescheduling will be TICK_NSEC/2.

In short, Lazy preemption will delay preemption for fair class but
will function as Full preemption for all the other classes, most
notably the realtime (RR/FIFO/DEADLINE) classes.

The goal is to bridge the performance gap with Voluntary, such that we
might eventually remove that option entirely.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20241007075055.331243614@infradead.org
2024-11-05 12:55:38 +01:00
Peter Zijlstra
26baa1f1c4 sched: Add TIF_NEED_RESCHED_LAZY infrastructure
Add the basic infrastructure to split the TIF_NEED_RESCHED bit in two.
Either bit will cause a resched on return-to-user, but only
TIF_NEED_RESCHED will drive IRQ preemption.

No behavioural change intended.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20241007075055.219540785@infradead.org
2024-11-05 12:55:37 +01:00
Thomas Gleixner
0f0d1b8e50 sched/ext: Remove sched_fork() hack
Instead of solving the underlying problem of the double invocation of
__sched_fork() for idle tasks, sched-ext decided to hack around the issue
by partially clearing out the entity struct to preserve the already
enqueued node. A provided analysis and solution has been ignored for four
months.

Now that someone else has taken care of cleaning it up, remove the
disgusting hack and clear out the full structure. Remove the comment in the
structure declaration as well, as there is no requirement for @node being
the last element anymore.

Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/87ldy82wkc.ffs@tglx
2024-11-05 12:55:37 +01:00
Thomas Gleixner
b23decf8ac sched: Initialize idle tasks only once
Idle tasks are initialized via __sched_fork() twice:

     fork_idle()
        copy_process()
	  sched_fork()
             __sched_fork()
	init_idle()
          __sched_fork()

Instead of cleaning this up, sched_ext hacked around it. Even when analyis
and solution were provided in a discussion, nobody cared to clean this up.

init_idle() is also invoked from sched_init() to initialize the boot CPU's
idle task, which requires the __sched_fork() invocation. But this can be
trivially solved by invoking __sched_fork() before init_idle() in
sched_init() and removing the __sched_fork() invocation from init_idle().

Do so and clean up the comments explaining this historical leftover.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241028103142.359584747@linutronix.de
2024-11-05 12:55:37 +01:00
Marco Elver
93190bc35d seqlock, treewide: Switch to non-raw seqcount_latch interface
Switch all instrumentable users of the seqcount_latch interface over to
the non-raw interface.

Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241104161910.780003-5-elver@google.com
2024-11-05 12:55:35 +01:00
Marco Elver
8ab40fc2b9 time/sched_clock: Broaden sched_clock()'s instrumentation coverage
Most of sched_clock()'s implementation is ineligible for instrumentation
due to relying on sched_clock_noinstr().

Split the implementation off into an __always_inline function
__sched_clock(), which is then used by the noinstr and instrumentable
version, to allow more of sched_clock() to be covered by various
instrumentation.

This will allow instrumentation with the various sanitizers (KASAN,
KCSAN, KMSAN, UBSAN). For KCSAN, we know that raw seqcount_latch usage
without annotations will result in false positive reports: tell it that
all of __sched_clock() is "atomic" for the latch reader; later changes
in this series will take care of the writers.

Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241104161910.780003-3-elver@google.com
2024-11-05 12:55:35 +01:00
Marco Elver
1139c71df5 time/sched_clock: Swap update_clock_read_data() latch writes
Swap the writes to the odd and even copies to make the writer critical
section look like all other seqcount_latch writers.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241104161910.780003-2-elver@google.com
2024-11-05 12:55:34 +01:00
Lukasz Luba
5609296750 PM: EM: Add min/max available performance state limits
On some devices there are HW dependencies for shared frequency and voltage
between devices. It will impact Energy Aware Scheduler (EAS) decision,
where CPUs share the voltage & frequency domain with other CPUs or devices
e.g.
 - Mid CPUs + Big CPU
 - Little CPU + L3 cache in DSU
 - some other device + Little CPUs

Detailed explanation of one example:
When the L3 cache frequency is increased, the affected Little CPUs might
run at higher voltage and frequency. That higher voltage causes higher CPU
power and thus more energy is used for running the tasks. This is
important for background running tasks, which try to run on energy
efficient CPUs.

Therefore, add performance state limits which are applied for the device
(in this case CPU). This is important on SoCs with HW dependencies
mentioned above so that the Energy Aware Scheduler (EAS) does not use
performance states outside the valid min-max range for energy calculation.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/20241030164126.1263793-2-lukasz.luba@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-11-04 23:00:47 +01:00
Kumar Kartikeya Dwivedi
cb4158ce8e bpf: Mark raw_tp arguments with PTR_MAYBE_NULL
Arguments to a raw tracepoint are tagged as trusted, which carries the
semantics that the pointer will be non-NULL.  However, in certain cases,
a raw tracepoint argument may end up being NULL. More context about this
issue is available in [0].

Thus, there is a discrepancy between the reality, that raw_tp arguments
can actually be NULL, and the verifier's knowledge, that they are never
NULL, causing explicit NULL checks to be deleted, and accesses to such
pointers potentially crashing the kernel.

To fix this, mark raw_tp arguments as PTR_MAYBE_NULL, and then special
case the dereference and pointer arithmetic to permit it, and allow
passing them into helpers/kfuncs; these exceptions are made for raw_tp
programs only. Ensure that we don't do this when ref_obj_id > 0, as in
that case this is an acquired object and doesn't need such adjustment.

The reason we do mask_raw_tp_trusted_reg logic is because other will
recheck in places whether the register is a trusted_reg, and then
consider our register as untrusted when detecting the presence of the
PTR_MAYBE_NULL flag.

To allow safe dereference, we enable PROBE_MEM marking when we see loads
into trusted pointers with PTR_MAYBE_NULL.

While trusted raw_tp arguments can also be passed into helpers or kfuncs
where such broken assumption may cause issues, a future patch set will
tackle their case separately, as PTR_TO_BTF_ID (without PTR_TRUSTED) can
already be passed into helpers and causes similar problems. Thus, they
are left alone for now.

It is possible that these checks also permit passing non-raw_tp args
that are trusted PTR_TO_BTF_ID with null marking. In such a case,
allowing dereference when pointer is NULL expands allowed behavior, so
won't regress existing programs, and the case of passing these into
helpers is the same as above and will be dealt with later.

Also update the failure case in tp_btf_nullable selftest to capture the
new behavior, as the verifier will no longer cause an error when
directly dereference a raw tracepoint argument marked as __nullable.

  [0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb

Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Reported-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Fixes: 3f00c52393 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241104171959.2938862-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-04 11:37:36 -08:00
Kumar Kartikeya Dwivedi
d402755ced bpf: Unify resource leak checks
There are similar checks for covering locks, references, RCU read
sections and preempt_disable sections in 3 places in the verifer, i.e.
for tail calls, bpf_ld_[abs, ind], and exit path (for BPF_EXIT and
bpf_throw). Unify all of these into a common check_resource_leak
function to avoid code duplication.

Also update the error strings in selftests to the new ones in the same
change to ensure clean bisection.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241103225940.1408302-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-03 16:52:06 -08:00
Kumar Kartikeya Dwivedi
46f7ed32f7 bpf: Tighten tail call checks for lingering locks, RCU, preempt_disable
There are three situations when a program logically exits and transfers
control to the kernel or another program: bpf_throw, BPF_EXIT, and tail
calls. The former two check for any lingering locks and references, but
tail calls currently do not. Expand the checks to check for spin locks,
RCU read sections and preempt disabled sections.

Spin locks are indirectly preventing tail calls as function calls are
disallowed, but the checks for preemption and RCU are more relaxed,
hence ensure tail calls are prevented in their presence.

Fixes: 9bb00b2895 ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
Fixes: fc7566ad0a ("bpf: Introduce bpf_preempt_[disable,enable] kfuncs")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241103225940.1408302-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-03 16:52:06 -08:00
Paul E. McKenney
9650edd9bf rcu: Finer-grained grace-period-end checks in rcu_dump_cpu_stacks()
This commit pushes the grace-period-end checks further down into
rcu_dump_cpu_stacks(), and also uses lockless checks coupled with
finer-grained locking.

The result is that the current leaf rcu_node structure's ->lock is
acquired only if a stack backtrace might be needed from the current CPU,
and is held across only that CPU's backtrace.  As a result, if there are
no stalled CPUs associated with a given rcu_node structure, then its
->lock will not be acquired at all.  On large systems, it is usually
(though not always) the case that a small number of CPUs are stalling
the current grace period, which means that the ->lock need be acquired
only for a small fraction of the rcu_node structures.

[ paulmck: Apply Dan Carpenter feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-03 21:55:35 +01:00
Linus Torvalds
b019b4a670 A single fix for posix CPU timers
When a thread is cloned, the posix CPU timers are not inherited.
 
   If the parent has a CPU timer armed the corresponding tick dependency in
   the tasks tick_dep_mask is set and copied to the new thread, which means
   the new thread and all decendants will prevent the system to go into full
   NOHZ operation.
 
   Clear the tick dependency mask in copy_process() to fix this.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmcnT/oTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZw5D/0bqjjMYiE0ws8nuuXN1gL2T1wt6P2C
 2zzEKnk6nsxnGbMfFs7XifDkqSOHNVro7F6kCkz6cH4U/VSK6R2nNONbufz4mNWk
 6uDvdlZps30ekmxN+C3fB6S/4MNVOhXFFXjsQhT/PxX/CfibVP5fATLtcRLq9Lfx
 mV3nnxKTHPjDGz2/2QRyYpA3G1fzOj/l2QeJsVAIC8GxCo7drLVe0yn5Yt+13zhU
 JMjgt9ox4PKFsjaXrqvt1yKNTtb+evjYSIVXxIg60oxUkfva6XFLXJv3rjbARUNj
 aqBHfkZ1/d2Hwc0WexDZfvhNWeCqnfUA+db7ALSYICbNd37EVxWVZA2TwfkkWKSt
 RDq3xX6NJSd71h0lxDvzv7Ph3NUq23rq3LycAkjqfhiFjPQmE6axtnioXcR5mtVp
 q9tilB/3I4zj4BIYPfd9KowkdclRSK+B3Oo0DTyuVhKLggF0UD3poDxT4HxnBCFB
 uKDV8GDsoD8Ksjsl0/X/D4oorqLYAT0tG9gxMw1Kii16gijhhu5qeqTWY+qs9ieg
 2J+Ku83QLQgljvy2s7y0AnMZMxaeKN5YMs6zNV+yGAoTFyft3CnBMKalPHfOCI0A
 fdKHi2aKm+lLrp/UIG2Yw9N1xfcrGWA2moH+9dw6zTyAei+TInP+WYndI7VR5EDf
 3KvW3OoRf0sywA==
 =83r/
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Thomas Gleixner:
 "A single fix for posix CPU timers.

  When a thread is cloned, the posix CPU timers are not inherited.

  If the parent has a CPU timer armed the corresponding tick dependency
  in the tasks tick_dep_mask is set and copied to the new thread, which
  means the new thread and all decendants will prevent the system to go
  into full NOHZ operation.

  Clear the tick dependency mask in copy_process() to fix this"

* tag 'timers-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-cpu-timers: Clear TICK_DEP_BIT_POSIX_TIMER on clone
2024-11-03 08:22:21 -10:00
Linus Torvalds
33e83ffe4c A few scheduler fixes:
- Plug a race between pick_next_task_fair() and try_to_wake_up() where
    both try to write to the same task, even though both paths hold a
    runqueue lock, but obviously from different runqueues.
 
    The problem is that the store to task::on_rq in __block_task() is
    visible to try_to_wake_up() which assumes that the task is not queued.
    Both sides then operate on the same task.
 
    Cure it by rearranging __block_task() so the the store to task::on_rq is
    the last operation on the task.
 
  - Prevent a potential NULL pointer dereference in task_numa_work()
 
    task_numa_work() iterates the VMAs of a process. A concurrent unmap of
    the address space can result in a NULL pointer return from vma_next()
    which is unchecked.
 
    Add the missing NULL pointer check to prevent this.
 
  - Operate on the correct scheduler policy in task_should_scx()
 
    task_should_scx() returns true when a task should be handled by sched
    EXT. It checks the tasks scheduling policy.
 
    This fails when the check is done before a policy has been set.
 
    Cure it by handing the policy into task_should_scx() so it operates
    on the requested value.
 
  - Add the missing handling of sched EXT in the delayed dequeue
    mechanism. This was simply forgotten.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmcnTqATHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoX/aD/4yvskeG9i7wAj2NOdDTAs1K0gLURt+
 nHDb1YkoIOXOfanaG7ZdBWb4sYnsnLX/KIhVsDQiXACFr6G0IjQ1zaN1iRtEkH79
 5BfVi98gAXdFU3y+EGqyaqiAp7MFOBTmsfJi5095fX0L+2aViSAjDEvHzvvC/hXD
 tmq47vFQEgIZPSxljEaKPaNmyDM+geusv5lX/lABH5MG0fYsT85VV6BQ2T1LsN1O
 WFBLD/uPEOSXumyZW8nV8yE2PioLDJz8W+uSnr38/HCH99mtJApqZyskaagKtr0g
 vLhOfoaYVR/j5ODUk6LExZ8zy140zDzUWzC5+RNnyb8jQf/Lx88fTNZY8/Wsm5m9
 oKtoiGzkL0LG/c05Cjh/vqReK26qILK4+ynDGaowDmTlUTS2jeNZL1ABlIwWkaLP
 5TDegJPkoUA1Z4YegxtRFROGHp1J+lfbqz537bghMaqdJXMaG84qjSszsPz9NbS9
 F7K63JKjfXAF6N8bhKvZk4jAbD97EYf3B0o8E69TjoZxaiuKf00xK7HGWmuQD3u3
 lOHkfIZzf5b7ELNgcketCYsbJvxbI4oQrp/9V425ORSr1Ih2GxCT51/x/NlFHoEH
 ujIjAe2YQyLhb26M0RG8Xao3BPT7RGMR058C8lwxtPLuPNIwB8MqCsXmU9xlEypg
 iexGnsj6zXTddg==
 =4mie
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:

 - Plug a race between pick_next_task_fair() and try_to_wake_up() where
   both try to write to the same task, even though both paths hold a
   runqueue lock, but obviously from different runqueues.

   The problem is that the store to task::on_rq in __block_task() is
   visible to try_to_wake_up() which assumes that the task is not
   queued. Both sides then operate on the same task.

   Cure it by rearranging __block_task() so the the store to task::on_rq
   is the last operation on the task.

 - Prevent a potential NULL pointer dereference in task_numa_work()

   task_numa_work() iterates the VMAs of a process. A concurrent unmap
   of the address space can result in a NULL pointer return from
   vma_next() which is unchecked.

   Add the missing NULL pointer check to prevent this.

 - Operate on the correct scheduler policy in task_should_scx()

   task_should_scx() returns true when a task should be handled by sched
   EXT. It checks the tasks scheduling policy.

   This fails when the check is done before a policy has been set.

   Cure it by handing the policy into task_should_scx() so it operates
   on the requested value.

 - Add the missing handling of sched EXT in the delayed dequeue
   mechanism. This was simply forgotten.

* tag 'sched-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/ext: Fix scx vs sched_delayed
  sched: Pass correct scheduling policy to __setscheduler_class
  sched/numa: Fix the potential null pointer dereference in task_numa_work()
  sched: Fix pick_next_task_fair() vs try_to_wake_up() race
2024-11-03 08:18:28 -10:00
Linus Torvalds
68f05b251b A single fix for perf:
perf_event_clear_cpumask() uses list_for_each_entry_rcu() without being
   in a RCU read side critical section, which triggers a "suspicious RCU
   usage" warning.
 
   It turns out that the list walk does not be RCU protected because the
   write side lock is held in this contxt.
 
   Change it to a regular list walk.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmcnSfoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVERD/0fMj1Y79mC2DoAB9XBy6d7xumHbDlN
 Jt4v3Hq0oADjqrGAro/XDbAm9y7gK9b5BhV2UH4ehvJj+WeesGRikV26Fdh3WuWo
 aTyNTh7PxsMoNRqIWWBiGbumHya9+INZKyKFAMD/WtQ3Av2emws0nnm9uv+eJzVZ
 zr1+NiofUDsu1I04E6zVXBra3aLqIbsWg5NOCsJAdW/9AKE+GQMA0/aw14Z2ftqH
 Mry4PqW4aGcTnCRNtoaHHwbP4677ZXX6pQnbUTGYZ4ywJJFKQ54YH1mUqUUP6cOo
 GWg20gVK4PTkJSt2nL/I+i1RVq7Ipw725e540XEAFDsDVj5jy/rJbmrmyUys6sr7
 Xu6cXbjAs/kV/A9TB1wBsb+iMUnHTNbRWMS1d8bsxaUWSIe9wouDJHAIumCMr3B3
 qALdXxHqppPZuccMFWHyxAClJEY8YEp9+n32BMpePASLhv3JBJHOUSn8HWr+GIgC
 N4slnJvLevETlO0HcQ3IUifwqfQBJ6O0Kyu0IXmrb3aCV9TzrbE1iZDgv6HbZBVP
 FsUaMBB/se24R/4zxSsH+u7yLFcgEJKVWVzngXzNoUvRX8xF4um6x1y89049Q0CC
 iGdRq3/fV/b/Tp7wvEuIxCr0GPUi28OCZTwjESmluUIS6ZSd83oDajBXix725hk+
 1YIwANTHMeBadQ==
 =qlTN
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Thomas Gleixner:
 "perf_event_clear_cpumask() uses list_for_each_entry_rcu() without
  being in a RCU read side critical section, which triggers a
  'suspicious RCU usage' warning.

  It turns out that the list walk does not be RCU protected because the
  write side lock is held in this context.

  Change it to a regular list walk"

* tag 'perf-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix missing RCU reader protection in perf_event_clear_cpumask()
2024-11-03 08:13:52 -10:00
Linus Torvalds
8f0b844adc Two fixes for the interrupt subsystem:
- Fix an off-by-one error in the failure path of msi_domain_alloc(),
     which causes the cleanup loop to terminate early and leaking the first
     allocated interrupt.
 
   - Handle a corner case in GIC-V4 versus a lazily mapped Virtual
     Processing Element (VPE). If the VPE has not been mapped because the
     guest has not yet emitted a mapping command, then the set_affinity()
     callback returns an error code, which causes the vCPU management to fail.
 
     Return success in this case without touching the hardware. This will be
     done later when the guest issues the mapping command.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmcnSPwTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYp6D/9VtT4/JrR8mc44OIfAPKdFfQFkAowB
 +lqyv+8PXl41/BY2gMv6kdoaHZzI/nJspbgs8nh4DfIOfhoZXo+jVe0ewn0kvgEw
 NzEsN2l5TfFWk6W3pJ/dzklPEepja1Ju9/7E9bHX0sJlZwfl43PGJqqBXQjKyhJB
 +NXlqCh66P137V6LgTBobjfO8B+gdbZn80+LHtBsA7M8dEyK7zdYINM3OwK3li0V
 umNTsvabimxY7om8xZVI03h8wedABG+/itINzfiEu3fR9Dpp4gwQrbOzQTxion7S
 4WkbVCh2OKiEJGcjstzHeaNYZatCvkEKyvSBIRDrI2+JCJlnFax8fhZn9w65ExMv
 BeU0mG/ip6tfH9ieaqm82IT7yYX4PPv+ma6L3BGmdDkM1o0z317Orm/mbcE4a6MD
 EPxnUxOEGqBKc+ylsvZiHriYRtUsyxR2y343XSuCZuYZHpdB1IN+Q1qFBoNY0MlU
 q7igpXj6FM0qD3zadz5H4Kb4Sj09oWMnhGJCUMEqknOzd1U0cBwWsIvuNUq3VWCe
 8P9arwFK4fa7B1YZ8cgLVw9JYqazpdY1GOn6k0lBDdF0tnieP4NaOvWs/imlkt4s
 kX9Qr/mFoA2EV4vBiURPsK43TlGpRm0kEJgIeElvsXVFlmiTBkVFS2CR7Ep2dFs4
 ezIo15/4GmlRsQ==
 =6Li6
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Thomas Gleixner:

 - Fix an off-by-one error in the failure path of msi_domain_alloc(),
   which causes the cleanup loop to terminate early and leaking the
   first allocated interrupt.

 - Handle a corner case in GIC-V4 versus a lazily mapped Virtual
   Processing Element (VPE). If the VPE has not been mapped because the
   guest has not yet emitted a mapping command, then the set_affinity()
   callback returns an error code, which causes the vCPU management to
   fail.

   Return success in this case without touching the hardware. This will
   be done later when the guest issues the mapping command.

* tag 'irq-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v4: Correctly deal with set_affinity on lazily-mapped VPEs
  genirq/msi: Fix off-by-one error in msi_domain_alloc()
2024-11-03 08:09:25 -10:00
Al Viro
457a654939 css_set_fork(): switch to CLASS(fd_raw, ...)
reference acquired there by fget_raw() is not stashed anywhere -
we could as well borrow instead.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:07 -05:00
Al Viro
8152f82010 fdget(), more trivial conversions
all failure exits prior to fdget() leave the scope, all matching fdput()
are immediately followed by leaving the scope.

[xfs_ioc_commit_range() chunk moved here as well]

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00