mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
synced 2025-08-26 04:47:45 +00:00
master
1109 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
61399e0c54 |
rcu: Fix racy re-initialization of irq_work causing hangs
RCU re-initializes the deferred QS irq work everytime before attempting
to queue it. However there are situations where the irq work is
attempted to be queued even though it is already queued. In that case
re-initializing messes-up with the irq work queue that is about to be
handled.
The chances for that to happen are higher when the architecture doesn't
support self-IPIs and irq work are then all lazy, such as with the
following sequence:
1) rcu_read_unlock() is called when IRQs are disabled and there is a
grace period involving blocked tasks on the node. The irq work
is then initialized and queued.
2) The related tasks are unblocked and the CPU quiescent state
is reported. rdp->defer_qs_iw_pending is reset to DEFER_QS_IDLE,
allowing the irq work to be requeued in the future (note the previous
one hasn't fired yet).
3) A new grace period starts and the node has blocked tasks.
4) rcu_read_unlock() is called when IRQs are disabled again. The irq work
is re-initialized (but it's queued! and its node is cleared) and
requeued. Which means it's requeued to itself.
5) The irq work finally fires with the tick. But since it was requeued
to itself, it loops and hangs.
Fix this with initializing the irq work only once before the CPU boots.
Fixes:
|
||
![]() |
cc1d1365f0 | Merge branches 'rcu-exp.23.07.2025', 'rcu.22.07.2025', 'torture-scripts.16.07.2025', 'srcu.19.07.2025', 'rcu.nocb.18.07.2025' and 'refscale.07.07.2025' into rcu.merge.23.07.2025 | ||
![]() |
5d71c2b53f |
rcu: Document concurrent quiescent state reporting for offline CPUs
The synchronization of CPU offlining with GP initialization is confusing to put it mildly (rightfully so as the issue it deals with is complex). Recent discussions brought up a question -- what prevents the rcu_implicit_dyntick_qs() from warning about QS reports for offline CPUs (missing QS reports for offline CPUs causing indefinite hangs). QS reporting for now-offline CPUs should only happen from: - gp_init() - rcutree_cpu_report_dead() Add some documentation on this and refer to it from comments in the code explaining how QS reporting is not missed when these functions are concurrently running. I referred heavily to this post [1] about the need for the ofl_lock. [1] https://lore.kernel.org/all/20180924164443.GF4222@linux.ibm.com/ [ Applied paulmck feedback on moving documentation to Requirements.rst ] Link: https://lore.kernel.org/all/01b4d228-9416-43f8-a62e-124b92e8741a@paulmck-laptop/ Co-developed-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org> |
||
![]() |
186779c036 |
rcu: Document separation of rcu_state and rnp's gp_seq
The details of this are subtle and was discussed recently. Add a quick-quiz about this and refer to it from the code, for more clarity. Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org> |
||
![]() |
30a7806ada |
rcu: Document GP init vs hotplug-scan ordering requirements
Add detailed comments explaining the critical ordering constraints during RCU grace period initialization, based on discussions with Frederic. Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org> Co-developed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org> |
||
![]() |
78370df5c3 |
rcu: Enable rcu_normal_wake_from_gp on small systems
Automatically enable the rcu_normal_wake_from_gp parameter on systems with a small number of CPUs. The activation threshold is set to 16 CPUs. This helps to reduce a latency of normal synchronize_rcu() API by waking up GP-waiters earlier and decoupling synchronize_rcu() callers from regular callback handling. A benchmark running 64 parallel jobs(system with 64 CPUs) invoking synchronize_rcu() demonstrates a notable latency reduction with the setting enabled. Latency distribution (microseconds): <default> 0 - 9999 : 1 10000 - 19999 : 4 20000 - 29999 : 399 30000 - 39999 : 3197 40000 - 49999 : 10428 50000 - 59999 : 17363 60000 - 69999 : 15529 70000 - 79999 : 9287 80000 - 89999 : 4249 90000 - 99999 : 1915 100000 - 109999 : 922 110000 - 119999 : 390 120000 - 129999 : 187 ... <default> <rcu_normal_wake_from_gp> 0 - 9999 : 1 10000 - 19999 : 234 20000 - 29999 : 6678 30000 - 39999 : 33463 40000 - 49999 : 20669 50000 - 59999 : 2766 60000 - 69999 : 183 ... <rcu_normal_wake_from_gp> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org> |
||
![]() |
fc39760cd0 |
rcu/exp: Warn on QS requested on dying CPU
It is not possible to send an IPI to a dying CPU that has passed the CPUHP_TEARDOWN_CPU stage. Remaining unhandled IPIs are handled later at CPUHP_AP_SMPCFD_DYING stage by stop machine. This is the last opportunity for RCU exp handler to request an expedited quiescent state. And the upcoming final context switch between stop machine and idle must have reported the requested context switch. Therefore, it should not be possible to observe a pending requested expedited quiescent state when RCU finally stops watching the outgoing CPU. Once IPIs aren't possible anymore, the QS for the target CPU will be reported on its behalf by the RCU exp kworker. Provide an assertion to verify those expectations. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org> |
||
![]() |
bf0a57445d |
rcu/exp: Remove needless CPU up quiescent state report
A CPU coming online checks for an ongoing grace period and reports
a quiescent state accordingly if needed. This special treatment that
shortcuts the expedited IPI finds its origin as an optimization purpose
on the following commit:
|
||
![]() |
3dfdfaff2d |
rcu: Robustify rcu_is_cpu_rrupt_from_idle()
RCU relies on the context tracking nesting counter in order to determine if it is running in extended quiescent state. However the context tracking nesting counter is not completely synchronized with the actual context tracking state: * The nesting counter is set to 1 or incremented further _after_ the actual state is set to RCU watching. * The nesting counter is set to 0 or decremented further _before_ the actual state is set to RCU not watching. Therefore it is safe to assume that if ct_nesting() > 0, RCU is watching. But if ct_nesting() <= 0, RCU is not watching except for tiny windows. This hasn't been a problem so far because rcu_is_cpu_rrupt_from_idle() has only been called from interrupts. However the code is confusing and abuses the role of the context tracking nesting counter while there are more accurate indicators available. Clarify and robustify accordingly. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org> |
||
![]() |
33b6a1f155 |
rcu: Return early if callback is not specified
Currently the call_rcu() API does not check whether a callback pointer is NULL. If NULL is passed, rcu_core() will try to invoke it, resulting in NULL pointer dereference and a kernel crash. To prevent this and improve debuggability, this patch adds a check for NULL and emits a kernel stack trace to help identify a faulty caller. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> |
||
![]() |
9c80e44337 | Merge branches 'rcu/misc-for-6.16', 'rcu/seq-counters-for-6.16' and 'rcu/torture-for-6.16' into rcu/for-next | ||
![]() |
aafe12f980 |
rcutorture: Perform more frequent testing of ->gpwrap
Currently, the ->gpwrap is not tested (at all per my testing) due to the requirement of a large delta between a CPU's rdp->gp_seq and its node's rnp->gpseq. This results in no testing of ->gpwrap being set. This patch by default adds 5 minutes of testing with ->gpwrap forced by lowering the delta between rdp->gp_seq and rnp->gp_seq to just 8 GPs. All of this is configurable, including the active time for the setting and a full testing cycle. By default, the first 25 minutes of a test will have the _default_ behavior there is right now (ULONG_MAX / 4) delta. Then for 5 minutes, we switch to a smaller delta causing 1-2 wraps in 5 minutes. I believe this is reasonable since we at least add a little bit of testing for usecases where ->gpwrap is set. [ Apply fix for Dan Carpenter's bug report on init path cleanup. ] [ Apply kernel doc warning fix from Akira Yokosawa. ] Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> |
||
![]() |
da6b85598a |
rcu/cpu_stall_cputime: fix the hardirq count for x86 architecture
When counting the number of hardirqs in the x86 architecture,
it is essential to add arch_irq_stat_cpu to ensure accuracy.
For example, a CPU loop within the rcu_read_lock function.
Before:
[ 70.910184] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 70.910436] rcu: 3-....: (4999 ticks this GP) idle=***
[ 70.910711] rcu: hardirqs softirqs csw/system
[ 70.910870] rcu: number: 0 657 0
[ 70.911024] rcu: cputime: 0 0 2498 ==> 2498(ms)
[ 70.911278] rcu: (t=5001 jiffies g=3677 q=29 ncpus=8)
After:
[ 68.046132] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 68.046354] rcu: 2-....: (4999 ticks this GP) idle=***
[ 68.046628] rcu: hardirqs softirqs csw/system
[ 68.046793] rcu: number: 2498 663 0
[ 68.046951] rcu: cputime: 0 0 2496 ==> 2496(ms)
[ 68.047244] rcu: (t=5000 jiffies g=3825 q=4 ncpus=8)
Fixes:
|
||
![]() |
0999f61560 |
rcu: Remove swake_up_one_online() bandaid
It's now ok to perform a wake-up from an offline CPU because the resulting armed scheduler bandwidth hrtimers are now correctly targeted by hrtimer infrastructure. Remove the obsolete hackerry. Link: https://lore.kernel.org/all/20241231170712.149394-3-frederic@kernel.org/ Reviewed-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> |
||
![]() |
4aa6e94cf9 |
rcu: Add warning to ensure rcu_seq_done_exact() is working
The previous patch improved the rcu_seq_done_exact() function by adding a meaningful constant for the guardband. Ensure that this is working for the future by a quick check during rcu_gp_init(). Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> |
||
![]() |
3ba7dfb8da |
RCU pull request for v6.15
This pull request contains the following branches: docs.2025.02.04a: - Add broken-timing possibility to stallwarn.rst. - Improve discussion of this_cpu_ptr(), add raw_cpu_ptr(). - Document self-propagating callbacks. - Point call_srcu() to call_rcu() for detailed memory ordering. - Add CONFIG_RCU_LAZY delays to call_rcu() kernel-doc header. - Clarify RCU_LAZY and RCU_LAZY_DEFAULT_OFF help text. - Remove references to old grace-period-wait primitives. srcu.2025.02.05a: - Introduce srcu_read_{un,}lock_fast(), which is similar to srcu_read_{un,}lock_lite(): avoid smp_mb()s in lock and unlock at the cost of calling synchronize_rcu() in synchronize_srcu(). Moreover, by returning the percpu offset of the counter at srcu_read_lock_fast() time, srcu_read_unlock_fast() can save extra pointer dereferencing, which makes it faster than srcu_read_{un,}lock_lite(). srcu_read_{un,}lock_fast() are intended to replace rcu_read_{un,}lock_trace() if possible. torture.2025.02.05a: - Add get_torture_init_jiffies() to return the start time of the test. - Add a test_boost_holdoff module parameter to allow delaying boosting tests when building rcutorture as built-in. - Add grace period sequence number logging at the beginning and end of failure/close-call results. - Switch to hexadecimal for the expedited grace period sequence number in the rcu_exp_grace_period trace point. - Make cur_ops->format_gp_seqs take buffer length. - Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool. - Complain when invalid SRCU reader_flavor is specified. - Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing, which forces SRCU uses atomics even when percpu ops are NMI safe, and use the Kconfig for SRCU lockdep testing. misc.2025.03.04a: - Split rcu_report_exp_cpu_mult() mask parameter and use for tracing. - Remove READ_ONCE() for rdp->gpwrap access in __note_gp_changes(). - Fix get_state_synchronize_rcu_full() GP-start detection. - Move RCU Tasks self-tests to core_initcall(). - Print segment lengths in show_rcu_nocb_gp_state(). - Make RCU watch ct_kernel_exit_state() warning. - Flush console log from kernel_power_off(). - rcutorture: Allow a negative value for nfakewriters. - rcu: Update TREE05.boot to test normal synchronize_rcu(). - rcu: Use _full() API to debug synchronize_rcu(). lazypreempt.2025.03.04a: Make RCU handle PREEMPT_LAZY better: - Fix header guard for rcu_all_qs(). - rcu: Rename PREEMPT_AUTO to PREEMPT_LAZY. - Update __cond_resched comment about RCU quiescent states. - Handle unstable rdp in rcu_read_unlock_strict(). - Handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y. - osnoise: Provide quiescent states. - Adjust rcutorture with possible PREEMPT_RCU=n && PREEMPT_COUNT=y combination. - Limit PREEMPT_RCU configurations. - Make rcutorture senario TREE07 and senario TREE10 use PREEMPT_LAZY=y. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEj5IosQTPz8XU1wRHSXnow7UH+rgFAmfeBLQACgkQSXnow7UH +rh11Qf/Rt6IZJ/YT/V9Sd+8hMx4O0BMh779pr9cD6mbAG+FDk2Yeva1m8vIdFOb qId6oc8K/ef2JfFjSn0oHMzQP2D3XUyiJWPNbBDHv/D8Os8GZgjzu8dkxVkSbdbY OxtvIflbcqFN1JDJfGKZnTEW0/YxGqfnS9b6R7iyyA7SOGQ/WubGOE5qNCqPufc9 zJiP+qTUFYQzCIiPlEJul39o9KboPogbt3QAAQjWmi3utd77ehJnm/15FvAjyau4 uhC2cnGfMY535rQaiaQeBQ/IHIowKripCq0JQFvcUNdyArZM3HOI2x79+2II6ft7 mjHskNODOIJHfW2o1RzQ0yRYAywFIg== =J+mH -----END PGP SIGNATURE----- Merge tag 'rcu-next-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Boqun Feng: "Documentation: - Add broken-timing possibility to stallwarn.rst - Improve discussion of this_cpu_ptr(), add raw_cpu_ptr() - Document self-propagating callbacks - Point call_srcu() to call_rcu() for detailed memory ordering - Add CONFIG_RCU_LAZY delays to call_rcu() kernel-doc header - Clarify RCU_LAZY and RCU_LAZY_DEFAULT_OFF help text - Remove references to old grace-period-wait primitives srcu: - Introduce srcu_read_{un,}lock_fast(), which is similar to srcu_read_{un,}lock_lite(): avoid smp_mb()s in lock and unlock at the cost of calling synchronize_rcu() in synchronize_srcu() Moreover, by returning the percpu offset of the counter at srcu_read_lock_fast() time, srcu_read_unlock_fast() can avoid extra pointer dereferencing, which makes it faster than srcu_read_{un,}lock_lite() srcu_read_{un,}lock_fast() are intended to replace rcu_read_{un,}lock_trace() if possible RCU torture: - Add get_torture_init_jiffies() to return the start time of the test - Add a test_boost_holdoff module parameter to allow delaying boosting tests when building rcutorture as built-in - Add grace period sequence number logging at the beginning and end of failure/close-call results - Switch to hexadecimal for the expedited grace period sequence number in the rcu_exp_grace_period trace point - Make cur_ops->format_gp_seqs take buffer length - Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool - Complain when invalid SRCU reader_flavor is specified - Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing, which forces SRCU uses atomics even when percpu ops are NMI safe, and use the Kconfig for SRCU lockdep testing Misc: - Split rcu_report_exp_cpu_mult() mask parameter and use for tracing - Remove READ_ONCE() for rdp->gpwrap access in __note_gp_changes() - Fix get_state_synchronize_rcu_full() GP-start detection - Move RCU Tasks self-tests to core_initcall() - Print segment lengths in show_rcu_nocb_gp_state() - Make RCU watch ct_kernel_exit_state() warning - Flush console log from kernel_power_off() - rcutorture: Allow a negative value for nfakewriters - rcu: Update TREE05.boot to test normal synchronize_rcu() - rcu: Use _full() API to debug synchronize_rcu() Make RCU handle PREEMPT_LAZY better: - Fix header guard for rcu_all_qs() - rcu: Rename PREEMPT_AUTO to PREEMPT_LAZY - Update __cond_resched comment about RCU quiescent states - Handle unstable rdp in rcu_read_unlock_strict() - Handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y - osnoise: Provide quiescent states - Adjust rcutorture with possible PREEMPT_RCU=n && PREEMPT_COUNT=y combination - Limit PREEMPT_RCU configurations - Make rcutorture senario TREE07 and senario TREE10 use PREEMPT_LAZY=y" * tag 'rcu-next-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (59 commits) rcutorture: Make scenario TREE07 build CONFIG_PREEMPT_LAZY=y rcutorture: Make scenario TREE10 build CONFIG_PREEMPT_LAZY=y rcu: limit PREEMPT_RCU configurations rcutorture: Update ->extendables check for lazy preemption rcutorture: Update rcutorture_one_extend_check() for lazy preemption osnoise: provide quiescent states rcu: Use _full() API to debug synchronize_rcu() rcu: Update TREE05.boot to test normal synchronize_rcu() rcutorture: Allow a negative value for nfakewriters Flush console log from kernel_power_off() context_tracking: Make RCU watch ct_kernel_exit_state() warning rcu/nocb: Print segment lengths in show_rcu_nocb_gp_state() rcu-tasks: Move RCU Tasks self-tests to core_initcall() rcu: Fix get_state_synchronize_rcu_full() GP-start detection torture: Make SRCU lockdep testing use srcu_read_lock_nmisafe() srcu: Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing rcutorture: Complain when invalid SRCU reader_flavor is specified rcutorture: Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool rcutorture: Make cur_ops->format_gp_seqs take buffer length rcutorture: Add ftrace-compatible timestamp to GP# failure/close-call output ... |
||
![]() |
467c890f2d | Merge branches 'docs.2025.02.04a', 'lazypreempt.2025.03.04a', 'misc.2025.03.04a', 'srcu.2025.02.05a' and 'torture.2025.02.05a' | ||
![]() |
5a562b8b3f |
rcu: Use _full() API to debug synchronize_rcu()
Switch for using of get_state_synchronize_rcu_full() and
poll_state_synchronize_rcu_full() pair to debug a normal
synchronize_rcu() call.
Just using "not" full APIs to identify if a grace period is
passed or not might lead to a false-positive kernel splat.
It can happen, because get_state_synchronize_rcu() compresses
both normal and expedited states into one single unsigned long
value, so a poll_state_synchronize_rcu() can miss GP-completion
when synchronize_rcu()/synchronize_rcu_expedited() concurrently
run.
To address this, switch to poll_state_synchronize_rcu_full() and
get_state_synchronize_rcu_full() APIs, which use separate variables
for expedited and normal states.
Reported-by: cheung wall <zzqq0103.hey@gmail.com>
Closes: https://lore.kernel.org/lkml/Z5ikQeVmVdsWQrdD@pc636/T/
Fixes:
|
||
![]() |
85aad7cc41 |
rcu: Fix get_state_synchronize_rcu_full() GP-start detection
The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full()
functions use the root rcu_node structure's ->gp_seq field to detect
the beginnings and ends of grace periods, respectively. This choice is
necessary for the poll_state_synchronize_rcu_full() function because
(give or take counter wrap), the following sequence is guaranteed not
to trigger:
get_state_synchronize_rcu_full(&rgos);
synchronize_rcu();
WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));
The RCU callbacks that awaken synchronize_rcu() instances are
guaranteed not to be invoked before the root rcu_node structure's
->gp_seq field is updated to indicate the end of the grace period.
However, these callbacks might start being invoked immediately
thereafter, in particular, before rcu_state.gp_seq has been updated.
Therefore, poll_state_synchronize_rcu_full() must refer to the
root rcu_node structure's ->gp_seq field. Because this field is
updated under this structure's ->lock, any code following a call to
poll_state_synchronize_rcu_full() will be fully ordered after the
full grace-period computation, as is required by RCU's memory-ordering
semantics.
By symmetry, the get_state_synchronize_rcu_full() function should also
use this same root rcu_node structure's ->gp_seq field. But it turns out
that symmetry is profoundly (though extremely infrequently) destructive
in this case. To see this, consider the following sequence of events:
1. CPU 0 starts a new grace period, and updates rcu_state.gp_seq
accordingly.
2. As its first step of grace-period initialization, CPU 0 examines
the current CPU hotplug state and decides that it need not wait
for CPU 1, which is currently offline.
3. CPU 1 comes online, and updates its state. But this does not
affect the current grace period, but rather the one after that.
After all, CPU 1 was offline when the current grace period
started, so all pre-existing RCU readers on CPU 1 must have
completed or been preempted before it last went offline.
The current grace period therefore has nothing it needs to wait
for on CPU 1.
4. CPU 1 switches to an rcutorture kthread which is running
rcutorture's rcu_torture_reader() function, which starts a new
RCU reader.
5. CPU 2 is running rcutorture's rcu_torture_writer() function
and collects a new polled grace-period "cookie" using
get_state_synchronize_rcu_full(). Because the newly started
grace period has not completed initialization, the root rcu_node
structure's ->gp_seq field has not yet been updated to indicate
that this new grace period has already started.
This cookie is therefore set up for the end of the current grace
period (rather than the end of the following grace period).
6. CPU 0 finishes grace-period initialization.
7. If CPU 1’s rcutorture reader is preempted, it will be added to
the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not
set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer
will not be updated. Thus, this grace period will not wait on
it. Which is only fair, given that the CPU did not come online
until after the grace period officially started.
8. CPUs 0 and 2 then detect the new grace period and then report
a quiescent state to the RCU core.
9. Because CPU 1 was offline at the start of the current grace
period, CPUs 0 and 2 are the only CPUs that this grace period
needs to wait on. So the grace period ends and post-grace-period
cleanup starts. In particular, the root rcu_node structure's
->gp_seq field is updated to indicate that this grace period
has now ended.
10. CPU 2 continues running rcu_torture_writer() and sees that,
from the viewpoint of the root rcu_node structure consulted by
the poll_state_synchronize_rcu_full() function, the grace period
has ended. It therefore updates state accordingly.
11. CPU 1 is still running the same RCU reader, which notices this
update and thus complains about the too-short grace period.
The fix is for the get_state_synchronize_rcu_full() function to use
rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq field.
With this change in place, if step 5's cookie indicates that the grace
period has not yet started, then any prior code executed by CPU 2 must
have happened before CPU 1 came online. This will in turn prevent CPU
1's code in steps 3 and 11 from spanning CPU 2's grace-period wait,
thus preventing CPU 1 from being subjected to a too-short grace period.
This commit therefore makes this change. Note that there is no change to
the poll_state_synchronize_rcu_full() function, which as noted above,
must continue to use the root rcu_node structure's ->gp_seq field.
This is of course an asymmetry between these two functions, but is an
asymmetry that is absolutely required for correct operation. It is a
common human tendency to greatly value symmetry, and sometimes symmetry
is a wonderful thing. Other times, symmetry results in poor performance.
But in this case, symmetry is just plain wrong.
Nevertheless, the asymmetry does require an additional adjustment.
It is possible for get_state_synchronize_rcu_full() to see a given
grace period as having started, but for an immediately following
poll_state_synchronize_rcu_full() to see it as having not yet started.
Given the current rcu_seq_done_exact() implementation, this will
result in a false-positive indication that the grace period is done
from poll_state_synchronize_rcu_full(). This is dealt with by making
rcu_seq_done_exact() reach back three grace periods rather than just
two of them.
However, simply changing get_state_synchronize_rcu_full() function to
use rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq
field results in a theoretical bug in kernels booted with
rcutree.rcu_normal_wake_from_gp=1 due to the following sequence of
events:
o The rcu_gp_init() function invokes rcu_seq_start() to officially
start a new grace period.
o A new RCU reader begins, referencing X from some RCU-protected
list. The new grace period is not obligated to wait for this
reader.
o An updater removes X, then calls synchronize_rcu(), which queues
a wait element.
o The grace period ends, awakening the updater, which frees X
while the reader is still referencing it.
The reason that this is theoretical is that although the grace period
has officially started, none of the CPUs are officially aware of this,
and thus will have to assume that the RCU reader pre-dated the start of
the grace period. Detailed explanation can be found at [2] and [3].
Except for kernels built with CONFIG_PROVE_RCU=y, which use the polled
grace-period APIs, which can and do complain bitterly when this sequence
of events occurs. Not only that, there might be some future RCU
grace-period mechanism that pulls this sequence of events from theory
into practice. This commit therefore also pulls the call to
rcu_sr_normal_gp_init() to precede that to rcu_seq_start().
Although this fixes commit
|
||
![]() |
7acc2d9015 |
rcutorture: Make cur_ops->format_gp_seqs take buffer length
The Tree and Tiny implementations of rcutorture_format_gp_seqs() use hard-coded constants for the length of the buffer that they format into. This is of course an accident waiting to happen, so this commit therefore makes them take a length argument. The rcutorture calling code uses ARRAY_SIZE() to safely compute this new argument. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> |
||
![]() |
2db7ab8c10 |
rcutorture: Expand failure/close-call grace-period output
With only eight bits per grace-period sequence number, wrap can happen in 64 grace periods. This commit therefore increases this to sixteen bits for normal grace-period sequence numbers and the combined short-form polling sequence numbers, thus deferring wrap for at least 16,384 grace periods. Because expedited grace periods go faster, expand these to 24 bits, deferring wrap for at least 4,194,304 expedited grace periods. These longer wrap times makes it easier to correlate these numbers to trace-event output. Note that the low-order two bits are reserved for intra-grace-period state, hence the above wrap numbers being a factor of four smaller than you might expect. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> |
||
![]() |
84ae91018a |
rcutorture: Include grace-period sequence numbers in failure/close-call
This commit includes the grace-period sequence numbers at the beginning and end of each segment in the "Failure/close-call rcutorture reader segments" list. These are in hexadecimal, and only the bottom byte. Currently, only RCU is supported, with its three sequence numbers (normal, expedited, and polled). Note that if all the grace-period sequence numbers remain the same across a given reader segment, only one copy of the number will be printed. Of course, if there is a change, both sets of values will be printed. Because the overhead of collecting this information can suppress heisenbugs, this information is collected and printed only in kernels built with CONFIG_RCU_TORTURE_TEST_LOG_GP=y. [ paulmck: Apply Nathan Chancellor feedback for IS_ENABLED(). ] [ paulmck: Apply feedback from kernel test robot. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> |
||
![]() |
7f4b19ef31 |
rcu: remove trace_rcu_kvfree_callback
Tree RCU does not handle kvfree_rcu() by queueing individual objects by call_rcu() anymore, thus the tracepoint and associated __is_kvfree_rcu_offset() check is dead code now. Remove it. Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
![]() |
764f6a8110 |
rcu: Remove READ_ONCE() for rdp->gpwrap access in __note_gp_changes()
There is one access to the per-CPU rdp->gpwrap field in the __note_gp_changes() function that does not use READ_ONCE(), but all other accesses do use READ_ONCE(). When using the 8*TREE03 and CONFIG_NR_CPUS=8 configuration, KCSAN found no data races at that point. This is because all calls to __note_gp_changes() hold rnp->lock, which excludes writes to the rdp->gpwrap fields for all CPUs associated with that same leaf rcu_node structure. This commit therefore removes READ_ONCE() from rdp->gpwrap accesses within the __note_gp_changes() function. Signed-off-by: Zilin Guan <zilinguan811@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> |
||
![]() |
053ca72554 |
rcu: Add CONFIG_RCU_LAZY delays to call_rcu() kernel-doc header
This commit adds a description of the energy-efficiency delays that call_rcu() can impose, along with a pointer to call_rcu_hurry() for latency-sensitive kernel code. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> |
||
![]() |
21ef249862 |
rcu: Document self-propagating callbacks
This commit documents the fact that a given RCU callback function can repost itself. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> |
||
![]() |
9c5968db9e |
The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec. - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones. - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest. - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code. - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups. - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code. - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c. - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator. - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading. - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/). Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED). - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled. - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL. - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests. - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size. - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic. - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated. - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated. - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed. - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic. - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy. - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic. - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions. - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed. - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting. - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface. - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior. - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi "introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors." - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram. - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal. - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance. - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation. - "Cleanup for memfd_create()" from Isaac Manjarres does that thing. - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration. - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices. - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE= =Ib2s -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ... |
||
![]() |
1d6d399223 |
Kthreads affinity follow either of 4 existing different patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never execute relevant code on any other CPU. This is currently handled by smpboot code which takes care of CPU-hotplug operations. Affinity here is a correctness constraint. 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't run anywhere else. The affinity is set through kthread_bind_mask() and the subsystem takes care by itself to handle CPU-hotplug operations. Affinity here is assumed to be a correctness constraint. 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This is not a correctness constraint but merely a preference in terms of memory locality. kswapd and kcompactd both fall into this category. The affinity is set manually like for any other task and CPU-hotplug is supposed to be handled by the relevant subsystem so that the task is properly reaffined whenever a given CPU from the node comes up. Also care should be taken so that the node affinity doesn't cross isolated (nohz_full) cpumask boundaries. 4) Similar to the previous point except kthreads have a _preferred_ affinity different than a node. Both RCU boost kthreads and RCU exp kworkers fall into this category as they refer to "RCU nodes" from a distinctly distributed tree. Currently the preferred affinity patterns (3 and 4) have at least 4 identified users, with more or less success when it comes to handle CPU-hotplug operations and CPU isolation. Each of which do it in its own ad-hoc way. This is an infrastructure proposal to handle this with the following API changes: _ kthread_create_on_node() automatically affines the created kthread to its target node unless it has been set as per-cpu or bound with kthread_bind[_mask]() before the first wake-up. - kthread_affine_preferred() is a new function that can be called right after kthread_create_on_node() to specify a preferred affinity different than the specified node. When the preferred affinity can't be applied because the possible targets are offline or isolated (nohz_full), the kthread is affine to the housekeeping CPUs (which means to all online CPUs most of the time or only the non-nohz_full CPUs when nohz_full= is set). kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been converted, along with a few old drivers. Summary of the changes: * Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu() * Introduce task_cpu_fallback_mask() that defines the default last resort affinity of a task to become nohz_full aware * Add some correctness check to ensure kthread_bind() is always called before the first kthread wake up. * Default affine kthread to its preferred node. * Convert kswapd / kcompactd and remove their halfway working ad-hoc affinity implementation * Implement kthreads preferred affinity * Unify kthread worker and kthread API's style * Convert RCU kthreads to the new API and remove the ad-hoc affinity implementation. -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmeNf8gACgkQhSRUR1CO jHedQQ/+IxTjjqQiItzrq41TES2S0desHDq8lNJFb7rsR/DtKFyLx3s67cOYV+cM Yx54QHg2m/Fz4nXMQ7Po5ygOtJGCKBc5C5QQy7y0lVKeTQK+daDfEtBSa3oG7j3C u+E3tTY6qxkbCzymUyaKkHN4/ay2vLvjFS50luV7KMyI3x47Aji+t7VdCX4LCPP2 eAwOALWD0+7qLJ/VF6gsmQLKA4Qx7PQAzBa3KSBmUN9UcN8Gk1bQHCTIQKDHP9LQ v8BXrNZtYX1o2+snNYpX2z6/ECjxkdwriOgqqZY5306hd9RAQ1u46Dx3byrIqjGn ULG/XQ2istPyhTqb/h+RbrobdOcwEUIeqk8hRRbBXE8bPpqUz9EMuaCMxWDbQjgH NTuKG4ifKJ/IqstkkuDkdOiByE/ysMmwqrTXgSnu2ITNL9yY3BEgFbvA95hgo42s f7QCxEfZb1MHcNEMENSMwM3xw5lLMGMpxVZcMQ3gLwyotMBRrhFZm1qZJG7TITYW IDIeCbH4JOMdQwLs3CcWTXio0N5/85NhRNFV+IDn96OrgxObgnMtV8QwNgjXBAJ5 wGeJWt8s34W1Zo3qS9gEuVzEhW4XaxISQQMkHe8faKkK6iHmIB/VjSQikDwwUNQ/ AspYj82RyWBCDZsqhiYh71kpxjvS6Xp0bj39Ce1sNsOnuksxKkQ= =g8In -----END PGP SIGNATURE----- Merge tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull kthread updates from Frederic Weisbecker: "Kthreads affinity follow either of 4 existing different patterns: 1) Per-CPU kthreads must stay affine to a single CPU and never execute relevant code on any other CPU. This is currently handled by smpboot code which takes care of CPU-hotplug operations. Affinity here is a correctness constraint. 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't run anywhere else. The affinity is set through kthread_bind_mask() and the subsystem takes care by itself to handle CPU-hotplug operations. Affinity here is assumed to be a correctness constraint. 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This is not a correctness constraint but merely a preference in terms of memory locality. kswapd and kcompactd both fall into this category. The affinity is set manually like for any other task and CPU-hotplug is supposed to be handled by the relevant subsystem so that the task is properly reaffined whenever a given CPU from the node comes up. Also care should be taken so that the node affinity doesn't cross isolated (nohz_full) cpumask boundaries. 4) Similar to the previous point except kthreads have a _preferred_ affinity different than a node. Both RCU boost kthreads and RCU exp kworkers fall into this category as they refer to "RCU nodes" from a distinctly distributed tree. Currently the preferred affinity patterns (3 and 4) have at least 4 identified users, with more or less success when it comes to handle CPU-hotplug operations and CPU isolation. Each of which do it in its own ad-hoc way. This is an infrastructure proposal to handle this with the following API changes: - kthread_create_on_node() automatically affines the created kthread to its target node unless it has been set as per-cpu or bound with kthread_bind[_mask]() before the first wake-up. - kthread_affine_preferred() is a new function that can be called right after kthread_create_on_node() to specify a preferred affinity different than the specified node. When the preferred affinity can't be applied because the possible targets are offline or isolated (nohz_full), the kthread is affine to the housekeeping CPUs (which means to all online CPUs most of the time or only the non-nohz_full CPUs when nohz_full= is set). kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been converted, along with a few old drivers. Summary of the changes: - Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu() - Introduce task_cpu_fallback_mask() that defines the default last resort affinity of a task to become nohz_full aware - Add some correctness check to ensure kthread_bind() is always called before the first kthread wake up. - Default affine kthread to its preferred node. - Convert kswapd / kcompactd and remove their halfway working ad-hoc affinity implementation - Implement kthreads preferred affinity - Unify kthread worker and kthread API's style - Convert RCU kthreads to the new API and remove the ad-hoc affinity implementation" * tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: kthread: modify kernel-doc function name to match code rcu: Use kthread preferred affinity for RCU exp kworkers treewide: Introduce kthread_run_worker[_on_cpu]() kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format rcu: Use kthread preferred affinity for RCU boost kthread: Implement preferred affinity mm: Create/affine kswapd to its preferred node mm: Create/affine kcompactd to its preferred node kthread: Default affine kthread to its preferred NUMA node kthread: Make sure kthread hasn't started while binding it sched,arm64: Handle CPU isolation on last resort fallback rq selection arm64: Exclude nohz_full CPUs from 32bits el0 support lib: test_objpool: Use kthread_run_on_cpu() kallsyms: Use kthread_run_on_cpu() soc/qman: test: Use kthread_run_on_cpu() arm/bL_switcher: Use kthread_run_on_cpu() |
||
![]() |
9f3ee94e70 |
RCU pull request for v6.14
This pull request contains the following branches: fixes.2024.12.14a: Misc fixes, check if IRQs are disabled in rcu_exp_need_qs(), instrument KCSAN exclusive-writer assertions, add extra WARN_ON_ONCE() check, set the cpu_no_qs.b.exp under lock, warn if callback enqueued on offline CPU. rcutorture.2024.12.14a: Torture-test updates, add rcutorture.preempt_duration kernel module parameter, make the TREE03 scenario do preemption, improve pooling timeouts for rcu_torture_writer(), improve output of "Failure/close-call rcutorture reader segments", add some reader-state debugging checks, update doc of polled APIs, add extra diagnostics for per-reader-segment preemption. srcu.2024.12.14a: SRCU updates, improve doc for srcu_read_lock() in terms of return value, fix typo in comments, remove redundant GP sequence checks in the srcu_funnel_gp_start. torture-test.2024.12.14a: Add an extra test for sched_clock(), improve testing on unresponsive systems. -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEEu6QRe/mAUYNn5U0PBYqkjnKWLM8FAmeGprwACgkQBYqkjnKW LM+QUwv/VVwYKI3f9eH4AcjijIFufmsP3I5REfY3s7+a6BItwZrulwhGK+mWWE+9 nKjsrrjw3sv3dEvdaUfZxwiLQBfJmWdUUGTZ748qyCpidlo4wB7OW/BR+Pn4ZiB/ rWkn28fHdDxUJV+nQGbGC82EiGrLC9XYlbTbnh9VzGEjcyKIIU3Dw5tGXzEVMn5w Tc6H6jWg8+fXxIdmhdEkjpH+rS9H160Lt1bGeGadI3LMdmMj89x5u+i6gheT83WL FBBwgNVITWPTwfQFyK4wuRcKzi/UIrRdQIU+2xqJKs6NeWhwqhFDfW8FP5brBI2o f7fFQA+CbP/oRCqXCaZKmB3i/xGeJUsJ/IJ992jq61TCLNoc3LDxovYdaHfUJgph W/8KHUc8oZMQU4CjGJkj30jnpBLPwZeZuJvuTfQZl5QuBeUivIMg89cXLpE9Vnny yof3pm++Fru8wJ0rooq3ef2A5vblpoBQcnYelQV2EJisMCOkd+P5PNVA2viTrG8F QGfmDzm1 =EwPt -----END PGP SIGNATURE----- Merge tag 'rcu.release.v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Uladzislau Rezki: "Misc fixes: - check if IRQs are disabled in rcu_exp_need_qs() - instrument KCSAN exclusive-writer assertions - add extra WARN_ON_ONCE() check - set the cpu_no_qs.b.exp under lock - warn if callback enqueued on offline CPU Torture-test updates: - add rcutorture.preempt_duration kernel module parameter - make the TREE03 scenario do preemption - improve pooling timeouts for rcu_torture_writer() - improve output of "Failure/close-call rcutorture reader segments" - add some reader-state debugging checks - update doc of polled APIs - add extra diagnostics for per-reader-segment preemption - add an extra test for sched_clock() - improve testing on unresponsive systems SRCU updates: - improve doc for srcu_read_lock() in terms of return value - fix typo in comments - remove redundant GP sequence checks in the srcu_funnel_gp_start" * tag 'rcu.release.v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (31 commits) srcu: Remove redundant GP sequence checks in srcu_funnel_gp_start srcu: Fix typo s/srcu_check_read_flavor()/__srcu_check_read_flavor()/ srcu: Guarantee non-negative return value from srcu_read_lock() MAINTAINERS: Update RCU git tree rcu: Add lockdep_assert_irqs_disabled() to rcu_exp_need_qs() rcu: Add KCSAN exclusive-writer assertions for rdp->cpu_no_qs.b.exp rcu: Make preemptible rcu_exp_handler() check idempotency rcu: Replace open-coded rcu_exp_need_qs() from rcu_exp_handler() with call rcu: Move rcu_report_exp_rdp() setting of ->cpu_no_qs.b.exp under lock rcu: Make rcu_report_exp_cpu_mult() caller acquire lock rcu: Report callbacks enqueued on offline CPU blind spot rcutorture: Use symbols for SRCU reader flavors rcutorture: Add per-reader-segment preemption diagnostics rcutorture: Read CPU ID for decoration protected by both reader types rcutorture: Add preempt_count() to rcutorture_one_extend_check() diagnostics rcutorture: Add parameters to control polled/conditional wait interval rcutorture: Add documentation for recent conditional and polled APIs rcutorture: Ignore attempts to test preemption and forward progress rcutorture: Make rcutorture_one_extend() check reader state rcutorture: Pretty-print rcutorture reader segments ... |
||
![]() |
d40797d672 |
kasan: make kasan_record_aux_stack_noalloc() the default behaviour
kasan_record_aux_stack_noalloc() was introduced to record a stack trace
without allocating memory in the process. It has been added to callers
which were invoked while a raw_spinlock_t was held. More and more callers
were identified and changed over time. Is it a good thing to have this
while functions try their best to do a locklessly setup? The only
downside of having kasan_record_aux_stack() not allocate any memory is
that we end up without a stacktrace if stackdepot runs out of memory and
at the same stacktrace was not recorded before To quote Marco Elver from
https://lore.kernel.org/all/CANpmjNPmQYJ7pv1N3cuU8cP18u7PP_uoZD8YxwZd4jtbof9nVQ@mail.gmail.com/
| I'd be in favor, it simplifies things. And stack depot should be
| able to replenish its pool sufficiently in the "non-aux" cases
| i.e. regular allocations. Worst case we fail to record some
| aux stacks, but I think that's only really bad if there's a bug
| around one of these allocations. In general the probabilities
| of this being a regression are extremely small [...]
Make the kasan_record_aux_stack_noalloc() behaviour default as
kasan_record_aux_stack().
[bigeasy@linutronix.de: dressed the diff as patch]
Link: https://lkml.kernel.org/r/20241122155451.Mb2pmeyJ@linutronix.de
Fixes:
|
||
![]() |
bbe658d658 |
mm/slab: Move kvfree_rcu() into SLAB
Move kvfree_rcu() functionality to the slab_common.c file. The reason to have kvfree_rcu() functionality as part of SLAB is that there is a clear trend and need of closer integration. One of the recent example is creating a barrier function for SLAB caches. Another reason is to prevent of having several implementations of RCU machinery for reclaiming objects after a GP. As future steps, it can be more integrated(easier) with SLAB internals. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Tested-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
![]() |
c18bcd85ce |
rcu/kvfree: Adjust a shrinker name
Rename "rcu-kfree" to "slab-kvfree-rcu" since it goes to the slab_common.c file soon. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Tested-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
![]() |
ba5cac52d0 |
rcu/kvfree: Adjust names passed into trace functions
Currently trace functions are supplied with "rcu_state.name" member which is located in the structure. The problem is that the "rcu_state" structure variable is local and can not be accessed from another place. To address this, this preparation patch passes "slab" string as a first argument. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Tested-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
![]() |
d824ed707b |
rcu/kvfree: Move some functions under CONFIG_TINY_RCU
Currently when a tiny RCU is enabled, the tree.c file is not compiled, thus duplicating function names do not conflict with each other. Because of moving of kvfree_rcu() functionality to the SLAB, we have to reorder some functions and place them together under CONFIG_TINY_RCU macro definition. Therefore, those functions name will not conflict when a kernel is compiled for CONFIG_TINY_RCU flavor. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Tested-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
![]() |
0f52b4db4f |
rcu/kvfree: Initialize kvfree_rcu() separately
Introduce a separate initialization of kvfree_rcu() functionality. For such purpose a kfree_rcu_batch_init() is renamed to a kvfree_rcu_init() and it is invoked from the main.c right after rcu_init() is done. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Tested-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
![]() |
8044c58976 |
rcu: Use kthread preferred affinity for RCU exp kworkers
Now that kthreads have an infrastructure to handle preferred affinity against CPU hotplug and housekeeping cpumask, convert RCU exp workers to use it instead of handling all the constraints by itself. Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> |
||
![]() |
b04e317b52 |
treewide: Introduce kthread_run_worker[_on_cpu]()
kthread_create() creates a kthread without running it yet. kthread_run() creates a kthread and runs it. On the other hand, kthread_create_worker() creates a kthread worker and runs it. This difference in behaviours is confusing. Also there is no way to create a kthread worker and affine it using kthread_bind_mask() or kthread_affine_preferred() before starting it. Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]() that behaves just like kthread_run(). kthread_create_worker[_on_cpu]() will now only create a kthread worker without starting it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> |
||
![]() |
db7ee3cb62 |
rcu: Use kthread preferred affinity for RCU boost
Now that kthreads have an infrastructure to handle preferred affinity against CPU hotplug and housekeeping cpumask, convert RCU boost to use it instead of handling all the constraints by itself. Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> |
||
![]() |
049dfe96ba |
rcu: Report callbacks enqueued on offline CPU blind spot
Callbacks enqueued after rcutree_report_cpu_dead() fall into RCU barrier blind spot. Report any potential misuse. Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> |
||
![]() |
a23da88c6c |
rcu/kvfree: Fix data-race in __mod_timer / kvfree_call_rcu
KCSAN reports a data race when access the krcp->monitor_work.timer.expires
variable in the schedule_delayed_monitor_work() function:
<snip>
BUG: KCSAN: data-race in __mod_timer / kvfree_call_rcu
read to 0xffff888237d1cce8 of 8 bytes by task 10149 on cpu 1:
schedule_delayed_monitor_work kernel/rcu/tree.c:3520 [inline]
kvfree_call_rcu+0x3b8/0x510 kernel/rcu/tree.c:3839
trie_update_elem+0x47c/0x620 kernel/bpf/lpm_trie.c:441
bpf_map_update_value+0x324/0x350 kernel/bpf/syscall.c:203
generic_map_update_batch+0x401/0x520 kernel/bpf/syscall.c:1849
bpf_map_do_batch+0x28c/0x3f0 kernel/bpf/syscall.c:5143
__sys_bpf+0x2e5/0x7a0
__do_sys_bpf kernel/bpf/syscall.c:5741 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5739 [inline]
__x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5739
x64_sys_call+0x2625/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:322
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
write to 0xffff888237d1cce8 of 8 bytes by task 56 on cpu 0:
__mod_timer+0x578/0x7f0 kernel/time/timer.c:1173
add_timer_global+0x51/0x70 kernel/time/timer.c:1330
__queue_delayed_work+0x127/0x1a0 kernel/workqueue.c:2523
queue_delayed_work_on+0xdf/0x190 kernel/workqueue.c:2552
queue_delayed_work include/linux/workqueue.h:677 [inline]
schedule_delayed_monitor_work kernel/rcu/tree.c:3525 [inline]
kfree_rcu_monitor+0x5e8/0x660 kernel/rcu/tree.c:3643
process_one_work kernel/workqueue.c:3229 [inline]
process_scheduled_works+0x483/0x9a0 kernel/workqueue.c:3310
worker_thread+0x51d/0x6f0 kernel/workqueue.c:3391
kthread+0x1d1/0x210 kernel/kthread.c:389
ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 56 Comm: kworker/u8:4 Not tainted 6.12.0-rc2-syzkaller-00050-g5b7c893ed5ed #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Workqueue: events_unbound kfree_rcu_monitor
<snip>
kfree_rcu_monitor() rearms the work if a "krcp" has to be still
offloaded and this is done without holding krcp->lock, whereas
the kvfree_call_rcu() holds it.
Fix it by acquiring the "krcp->lock" for kfree_rcu_monitor() so
both functions do not race anymore.
Reported-by: syzbot+061d370693bdd99f9d34@syzkaller.appspotmail.com
Link: https://lore.kernel.org/lkml/ZxZ68KmHDQYU0yfD@pc636/T/
Fixes:
|
||
![]() |
a30763800b |
rcu: Permit start_poll_synchronize_rcu*() with interrupts disabled
The header comment for both start_poll_synchronize_rcu() and start_poll_synchronize_rcu_full() state that interrupts must be enabled when calling these two functions, and there is a lockdep assertion in start_poll_synchronize_rcu_common() enforcing this restriction. However, there is no need for this restrictions, as can be seen in call_rcu(), which does wakeups when interrupts are disabled. This commit therefore removes the lockdep assertion and the comments. Reported-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> |
||
![]() |
5d2501f42c |
rcu: Use the BITS_PER_LONG macro
sizeof(unsigned long) * 8 is the number of bits in an unsigned long variable, replace it with BITS_PER_LONG macro to make it simpler. Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> |
||
![]() |
3c5d61ae91 |
rcu/kvfree: Refactor kvfree_rcu_queue_batch()
Improve readability of kvfree_rcu_queue_batch() function
in away that, after a first batch queuing, the loop is break
and success value is returned to a caller.
There is no reason to loop and check batches further as all
outstanding objects have already been picked and attached to
a certain batch to complete an offloading.
Fixes:
|
||
![]() |
bdf56c7580 |
slab updates for 6.12
-----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmbn5g0ACgkQu+CwddJF iJq+Uwf/aqnLNEpjUBzwUUhSojCpPnTtiyjv+AILTxoSTHmbu8OvN0W79+Rpbdmk O4QapAK+BCs+VL2VATwCCufcJ75Z78txO+buQE0DgwluFTIYZ+IwpUMPsK04ln6A FD1/uvP1QFx60heqcp2c4zWFBUpg4DE6ufx2A5kieO268lFcWLxyVlcdgRU79ZCt uAcV2yDLk3GvPGfxZwPKEmZUo/FmuSoBv0XgT+eWxmTu/R7hcpFse49OyjBH8Tvb 8d/RCIFgXOr8dTIjtds7eenwB/is4TkRlctezEQ0jO9/JwL/BVOgXZjD1qCtNWqz is4TWK7VV+vdq1RD+0xC2hV/+uGEwQ== =+WAm -----END PGP SIGNATURE----- Merge tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: "This time it's mostly refactoring and improving APIs for slab users in the kernel, along with some debugging improvements. - kmem_cache_create() refactoring (Christian Brauner) Over the years have been growing new parameters to kmem_cache_create() where most of them are needed only for a small number of caches - most recently the rcu_freeptr_offset parameter. To avoid adding new parameters to kmem_cache_create() and adjusting all its callers, or creating new wrappers such as kmem_cache_create_rcu(), we can now pass extra parameters using the new struct kmem_cache_args. Not explicitly initialized fields default to values interpreted as unused. kmem_cache_create() is for now a wrapper that works both with the new form: kmem_cache_create(name, object_size, args, flags) and the legacy form: kmem_cache_create(name, object_size, align, flags, ctor) - kmem_cache_destroy() waits for kfree_rcu()'s in flight (Vlastimil Babka, Uladislau Rezki) Since SLOB removal, kfree() is allowed for freeing objects allocated by kmem_cache_create(). By extension kfree_rcu() as allowed as well, which can allow converting simple call_rcu() callbacks that only do kmem_cache_free(), as there was never a kmem_cache_free_rcu() variant. However, for caches that can be destroyed e.g. on module removal, the cache owners knew to issue rcu_barrier() first to wait for the pending call_rcu()'s, and this is not sufficient for pending kfree_rcu()'s due to its internal batching optimizations. Ulad has provided a new kvfree_rcu_barrier() and to make the usage less error-prone, kmem_cache_destroy() calls it. Additionally, destroying SLAB_TYPESAFE_BY_RCU caches now again issues rcu_barrier() synchronously instead of using an async work, because the past motivation for async work no longer applies. Users of custom call_rcu() callbacks should however keep calling rcu_barrier() before cache destruction. - Debugging use-after-free in SLAB_TYPESAFE_BY_RCU caches (Jann Horn) Currently, KASAN cannot catch UAFs in such caches as it is legal to access them within a grace period, and we only track the grace period when trying to free the underlying slab page. The new CONFIG_SLUB_RCU_DEBUG option changes the freeing of individual object to be RCU-delayed, after which KASAN can poison them. - Delayed memcg charging (Shakeel Butt) In some cases, the memcg is uknown at allocation time, such as receiving network packets in softirq context. With kmem_cache_charge() these may be now charged later when the user and its memcg is known. - Misc fixes and improvements (Pedro Falcato, Axel Rasmussen, Christoph Lameter, Yan Zhen, Peng Fan, Xavier)" * tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits) mm, slab: restore kerneldoc for kmem_cache_create() io_uring: port to struct kmem_cache_args slab: make __kmem_cache_create() static inline slab: make kmem_cache_create_usercopy() static inline slab: remove kmem_cache_create_rcu() file: port to struct kmem_cache_args slab: create kmem_cache_create() compatibility layer slab: port KMEM_CACHE_USERCOPY() to struct kmem_cache_args slab: port KMEM_CACHE() to struct kmem_cache_args slab: remove rcu_freeptr_offset from struct kmem_cache slab: pass struct kmem_cache_args to do_kmem_cache_create() slab: pull kmem_cache_open() into do_kmem_cache_create() slab: pass struct kmem_cache_args to create_cache() slab: port kmem_cache_create_usercopy() to struct kmem_cache_args slab: port kmem_cache_create_rcu() to struct kmem_cache_args slab: port kmem_cache_create() to struct kmem_cache_args slab: add struct kmem_cache_args slab: s/__kmem_cache_create/do_kmem_cache_create/g memcg: add charging of already allocated slab objects mm/slab: Optimize the code logic in find_mergeable() ... |
||
![]() |
067610ebaa |
RCU pull request for v6.12
This pull request contains the following branches: context_tracking.15.08.24a: Rename context tracking state related symbols and remove references to "dynticks" in various context tracking state variables and related helpers; force context_tracking_enabled_this_cpu() to be inlined to avoid leaving a noinstr section. csd.lock.15.08.24a: Enhance CSD-lock diagnostic reports; add an API to provide an indication of ongoing CSD-lock stall. nocb.09.09.24a: Update and simplify RCU nocb code to handle (de-)offloading of callbacks only for offline CPUs; fix RT throttling hrtimer being armed from offline CPU. rcutorture.14.08.24a: Remove redundant rcu_torture_ops get_gp_completed fields; add SRCU ->same_gp_state and ->get_comp_state functions; add generic test for NUM_ACTIVE_*RCU_POLL* for testing RCU and SRCU polled grace periods; add CFcommon.arch for arch-specific Kconfig options; print number of update types in rcu_torture_write_types(); add rcutree.nohz_full_patience_delay testing to the TREE07 scenario; add a stall_cpu_repeat module parameter to test repeated CPU stalls; add argument to limit number of CPUs a guest OS can use in torture.sh; rcustall.09.09.24a: Abbreviate RCU CPU stall warnings during CSD-lock stalls; Allow dump_cpu_task() to be called without disabling preemption; defer printing stall-warning backtrace when holding rcu_node lock. srcu.12.08.24a: Make SRCU gp seq wrap-around faster; add KCSAN checks for concurrent updates to ->srcu_n_exp_nodelay and ->reschedule_count which are used in heuristics governing auto-expediting of normal SRCU grace periods and grace-period-state-machine delays; mark idle SRCU-barrier callbacks to help identify stuck SRCU-barrier callback. rcu.tasks.14.08.24a: Remove RCU Tasks Rude asynchronous APIs as they are no longer used; stop testing RCU Tasks Rude asynchronous APIs; fix access to non-existent percpu regions; check processor-ID assumptions during chosen CPU calculation for callback enqueuing; update description of rtp->tasks_gp_seq grace-period sequence number; add rcu_barrier_cb_is_done() to identify whether a given rcu_barrier callback is stuck; mark idle Tasks-RCU-barrier callbacks; add *torture_stats_print() functions to print detailed diagnostics for Tasks-RCU variants; capture start time of rcu_barrier_tasks*() operation to help distinguish a hung barrier operation from a long series of barrier operations. rcu_scaling_tests.15.08.24a: refscale: Add a TINY scenario to support tests of Tiny RCU and Tiny SRCU; Optimize process_durations() operation; rcuscale: Dump stacks of stalled rcu_scale_writer() instances; dump grace-period statistics when rcu_scale_writer() stalls; mark idle RCU-barrier callbacks to identify stuck RCU-barrier callbacks; print detailed grace-period and barrier diagnostics on rcu_scale_writer() hangs for Tasks-RCU variants; warn if async module parameter is specified for RCU implementations that do not have async primitives such as RCU Tasks Rude; make all writer tasks report upon hang; tolerate repeated GFP_KERNEL failure in rcu_scale_writer(); use special allocator for rcu_scale_writer(); NULL out top-level pointers to heap memory to avoid double-free bugs on modprobe failures; maintain per-task instead of per-CPU callbacks count to avoid any issues with migration of either tasks or callbacks; constify struct ref_scale_ops. fixes.12.08.24a: Use system_unbound_wq for kfree_rcu work to avoid disturbing isolated CPUs. misc.11.08.24a: Warn on unexpected rcu_state.srs_done_tail state; Better define "atomic" for list_replace_rcu() and hlist_replace_rcu() routines; annotate struct kvfree_rcu_bulk_data with __counted_by(). -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSi2tPIQIc2VEtjarIAHS7/6Z0wpQUCZt8+8wAKCRAAHS7/6Z0w pTqoAPwPN//tlEoJx2PRs6t0q+nD1YNvnZawPaRmdzgdM8zJogD+PiSN+XhqRr80 jzyvMDU4Aa0wjUNP3XsCoaCxo7L/lQk= =bZ9z -----END PGP SIGNATURE----- Merge tag 'rcu.release.v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Neeraj Upadhyay: "Context tracking: - rename context tracking state related symbols and remove references to "dynticks" in various context tracking state variables and related helpers - force context_tracking_enabled_this_cpu() to be inlined to avoid leaving a noinstr section CSD lock: - enhance CSD-lock diagnostic reports - add an API to provide an indication of ongoing CSD-lock stall nocb: - update and simplify RCU nocb code to handle (de-)offloading of callbacks only for offline CPUs - fix RT throttling hrtimer being armed from offline CPU rcutorture: - remove redundant rcu_torture_ops get_gp_completed fields - add SRCU ->same_gp_state and ->get_comp_state functions - add generic test for NUM_ACTIVE_*RCU_POLL* for testing RCU and SRCU polled grace periods - add CFcommon.arch for arch-specific Kconfig options - print number of update types in rcu_torture_write_types() - add rcutree.nohz_full_patience_delay testing to the TREE07 scenario - add a stall_cpu_repeat module parameter to test repeated CPU stalls - add argument to limit number of CPUs a guest OS can use in torture.sh rcustall: - abbreviate RCU CPU stall warnings during CSD-lock stalls - Allow dump_cpu_task() to be called without disabling preemption - defer printing stall-warning backtrace when holding rcu_node lock srcu: - make SRCU gp seq wrap-around faster - add KCSAN checks for concurrent updates to ->srcu_n_exp_nodelay and ->reschedule_count which are used in heuristics governing auto-expediting of normal SRCU grace periods and grace-period-state-machine delays - mark idle SRCU-barrier callbacks to help identify stuck SRCU-barrier callback rcu tasks: - remove RCU Tasks Rude asynchronous APIs as they are no longer used - stop testing RCU Tasks Rude asynchronous APIs - fix access to non-existent percpu regions - check processor-ID assumptions during chosen CPU calculation for callback enqueuing - update description of rtp->tasks_gp_seq grace-period sequence number - add rcu_barrier_cb_is_done() to identify whether a given rcu_barrier callback is stuck - mark idle Tasks-RCU-barrier callbacks - add *torture_stats_print() functions to print detailed diagnostics for Tasks-RCU variants - capture start time of rcu_barrier_tasks*() operation to help distinguish a hung barrier operation from a long series of barrier operations refscale: - add a TINY scenario to support tests of Tiny RCU and Tiny SRCU - optimize process_durations() operation rcuscale: - dump stacks of stalled rcu_scale_writer() instances and grace-period statistics when rcu_scale_writer() stalls - mark idle RCU-barrier callbacks to identify stuck RCU-barrier callbacks - print detailed grace-period and barrier diagnostics on rcu_scale_writer() hangs for Tasks-RCU variants - warn if async module parameter is specified for RCU implementations that do not have async primitives such as RCU Tasks Rude - make all writer tasks report upon hang - tolerate repeated GFP_KERNEL failure in rcu_scale_writer() - use special allocator for rcu_scale_writer() - NULL out top-level pointers to heap memory to avoid double-free bugs on modprobe failures - maintain per-task instead of per-CPU callbacks count to avoid any issues with migration of either tasks or callbacks - constify struct ref_scale_ops Fixes: - use system_unbound_wq for kfree_rcu work to avoid disturbing isolated CPUs Misc: - warn on unexpected rcu_state.srs_done_tail state - better define "atomic" for list_replace_rcu() and hlist_replace_rcu() routines - annotate struct kvfree_rcu_bulk_data with __counted_by()" * tag 'rcu.release.v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (90 commits) rcu: Defer printing stall-warning backtrace when holding rcu_node lock rcu/nocb: Remove superfluous memory barrier after bypass enqueue rcu/nocb: Conditionally wake up rcuo if not already waiting on GP rcu/nocb: Fix RT throttling hrtimer armed from offline CPU rcu/nocb: Simplify (de-)offloading state machine context_tracking: Tag context_tracking_enabled_this_cpu() __always_inline context_tracking, rcu: Rename rcu_dyntick trace event into rcu_watching rcu: Update stray documentation references to rcu_dynticks_eqs_{enter, exit}() rcu: Rename rcu_momentary_dyntick_idle() into rcu_momentary_eqs() rcu: Rename rcu_implicit_dynticks_qs() into rcu_watching_snap_recheck() rcu: Rename dyntick_save_progress_counter() into rcu_watching_snap_save() rcu: Rename struct rcu_data .exp_dynticks_snap into .exp_watching_snap rcu: Rename struct rcu_data .dynticks_snap into .watching_snap rcu: Rename rcu_dynticks_zero_in_eqs() into rcu_watching_zero_in_eqs() rcu: Rename rcu_dynticks_in_eqs_since() into rcu_watching_snap_stopped_since() rcu: Rename rcu_dynticks_in_eqs() into rcu_watching_snap_in_eqs() rcu: Rename rcu_dynticks_eqs_online() into rcu_watching_online() context_tracking, rcu: Rename rcu_dynticks_curr_cpu_in_eqs() into rcu_is_watching_curr_cpu() context_tracking, rcu: Rename rcu_dynticks_task*() into rcu_task*() refscale: Constify struct ref_scale_ops ... |
||
![]() |
355debb83b | Merge branches 'context_tracking.15.08.24a', 'csd.lock.15.08.24a', 'nocb.09.09.24a', 'rcutorture.14.08.24a', 'rcustall.09.09.24a', 'srcu.12.08.24a', 'rcu.tasks.14.08.24a', 'rcu_scaling_tests.15.08.24a', 'fixes.12.08.24a' and 'misc.11.08.24a' into next.09.09.24a | ||
![]() |
2b55d6a42d |
rcu/kvfree: Add kvfree_rcu_barrier() API
Add a kvfree_rcu_barrier() function. It waits until all in-flight pointers are freed over RCU machinery. It does not wait any GP completion and it is within its right to return immediately if there are no outstanding pointers. This function is useful when there is a need to guarantee that a memory is fully freed before destroying memory caches. For example, during unloading a kernel module. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
![]() |
e68ac2b488 |
softirq: Remove unused 'action' parameter from action callback
When soft interrupt actions are called, they are passed a pointer to the struct softirq action which contains the action's function pointer. This pointer isn't useful, as the action callback already knows what function it is. And since each callback handles a specific soft interrupt, the callback also knows which soft interrupt number is running. No soft interrupt action callback actually uses this parameter, so remove it from the function pointer signature. This clarifies that soft interrupt actions are global routines and makes it slightly cheaper to call them. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/all/20240815171549.3260003-1-csander@purestorage.com |
||
![]() |
32a9f26e5e |
rcu: Rename rcu_momentary_dyntick_idle() into rcu_momentary_eqs()
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, replace "dyntick_idle" into "eqs" to drop the dyntick reference. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> |
||
![]() |
3b18eb3f9f |
rcu: Rename rcu_implicit_dynticks_qs() into rcu_watching_snap_recheck()
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, drop the dyntick reference and update the name of this helper to express that it rechecks rdp->watching_snap after an earlier rcu_watching_snap_save(). Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> |