The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the snapshot helpers are now prefixed with
"rcu_watching". Reflect that change in the storage variables for these
snapshots.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, so the 'dynticks' prefix can be dropped.
While at it: this helper is only meant to be called after a failed
earlier call to rcu_watching_snap_in_eqs(), so document this in the
comments and add a WARN_ON_ONCE() for good measure.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
While at it, update a comment that still refers to rcu_dynticks_snap(),
which was removed by commit:
7be2e6323b9b ("rcu: Remove full memory barrier on RCU stall printout")
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Note that "watching" is the opposite of "in EQS", so the negation is lifted
out of the helper and into the callsites.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
RCU keeps a count of the number of callbacks that the current
rcu_barrier() is waiting on, but there is currently no easy way to
work out which callback is stuck. One way to do this is to mark idle
RCU-barrier callbacks by making the ->next pointer point to the callback
itself, and this commit does just that.
Later commits will use this for debug output.
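A minimal sketch of the marking scheme described above (the helper names are illustrative, not the actual tree.c code):
    /* Mark an idle (not-yet-entrained) rcu_barrier() callback. */
    static void rcu_barrier_cb_set_idle(struct rcu_head *rhp)
    {
            rhp->next = rhp;        /* ->next pointing to itself means "idle". */
    }
    static bool rcu_barrier_cb_is_idle(struct rcu_head *rhp)
    {
            return rhp->next == rhp;
    }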
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
It was discovered that isolated CPUs could sometimes be disturbed by
kworkers processing kfree_rcu() works causing higher than expected
latency. This is because the RCU core uses "system_wq", which doesn't have
the WQ_UNBOUND flag to handle all its work items. Fix this violation of
latency limits by using "system_unbound_wq" in the RCU core instead.
This will ensure that those work items will not be run on CPUs marked
as isolated.
Beside the WQ_UNBOUND flag, the other major difference between system_wq
and system_unbound_wq is their max_active count. The system_unbound_wq
has a max_active of WQ_MAX_ACTIVE (512) while system_wq's max_active
is WQ_DFL_ACTIVE (256) which is half of WQ_MAX_ACTIVE.
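The change itself boils down to switching the workqueue used when queueing the kfree_rcu() work items, roughly as follows (the work item name is illustrative):
    /* Before: the handler may be scheduled on an isolated CPU. */
    queue_work(system_wq, &krwp_work);
    /* After: the unbound workqueue keeps the handler off isolated CPUs. */
    queue_work(system_unbound_wq, &krwp_work);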
Reported-by: Vratislav Bendel <vbendel@redhat.com>
Closes: https://issues.redhat.com/browse/RHEL-50220
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Tested-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Add the __counted_by compiler attribute to the flexible array member
records to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Increment nr_records before adding a new pointer to the records array.
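A minimal sketch of the annotation and of the ordering requirement it creates (struct layout abridged and illustrative):
    struct bulk_data_example {
            unsigned long nr_records;                   /* number of valid entries */
            void *records[] __counted_by(nr_records);   /* bounds tied to nr_records */
    };
    /* Bump the counter first so the new slot is considered in bounds by
     * CONFIG_UBSAN_BOUNDS/CONFIG_FORTIFY_SOURCE at store time. */
    bnode->nr_records++;
    bnode->records[bnode->nr_records - 1] = ptr;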
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
[ neeraj.upadhyay: Fix htmldocs build error reported by Stephen Rothwell ]
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
[ neeraj.upadhyay: Fix htmldocs build error reported by Stephen Rothwell ]
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Within the rcu_sr_normal_gp_cleanup_work() function, there is an acquire
load from rcu_state.srs_done_tail, which is expected to be non-NULL.
This commit adds a WARN_ON_ONCE() to check this expectation.
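In essence (a sketch of the check, not the literal diff):
    done = smp_load_acquire(&rcu_state.srs_done_tail);  /* Acquire load of srs_done_tail. */
    if (WARN_ON_ONCE(!done))
            return;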
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
RCU core can't be running anymore while in the middle of (de-)offloading
since this sort of transition now only applies to offline CPUs.
The SEGCBLIST_RCU_CORE state can therefore be removed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
RCU core can't be running anymore while in the middle of (de-)offloading
since this sort of transition now only applies to offline CPUs.
The locked callback acceleration handling during the transition can
therefore be removed, along with concurrent batch execution.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
RCU core can't be running anymore while in the middle of (de-)offloading
since this sort of transition now only applies to offline CPUs.
The locked callback acceleration handling during the transition can
therefore be removed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The barrier_mutex is currently used to protect (de-)offloading
operations and to prevent nocb_lock locking imbalance in rcu_barrier()
and the shrinker, as well as misordered RCU barrier invocation.
Now that RCU (de-)offloading is going to happen on offline CPUs, an RCU
barrier will have to be executed while transitioning from the offloaded
to the de-offloaded state. And this can't happen while holding the
barrier_mutex.
Introduce a NOCB mutex to protect (de-)offloading transitions. The
barrier_mutex is still held for now when necessary to avoid barrier
callbacks reordering and nocb_lock imbalance.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The symbols relating to the CT_STATE part of context_tracking.state are now
all prefixed with CT_STATE.
The RCU dynticks counter part of that atomic variable still involves
symbols with different prefixes; align them all to be prefixed with
CT_RCU_WATCHING.
Suggested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
When rcu_barrier() calls rcu_rdp_cpu_online() and observes a CPU off
rnp->qsmaskinitnext, it means that all accesses from the offline CPU
preceding the CPUHP_TEARDOWN_CPU are visible to RCU barrier, including
callbacks expiration and counter updates.
However interrupts can still fire after stop_machine() re-enables
interrupts and before rcutree_report_cpu_dead(). The related accesses
happening between CPUHP_TEARDOWN_CPU and rnp->qsmaskinitnext clearing
are _NOT_ guaranteed to be seen by rcu_barrier() without proper
ordering, especially when callbacks are invoked there through to the end,
making rcutree_migrate_callbacks() bypass the barrier_lock.
The following theoretical race example can make rcu_barrier() hang:
CPU 0                                            CPU 1
-----                                            -----
//cpu_down()
smpboot_park_threads()
//ksoftirqd is parked now
<IRQ>
rcu_sched_clock_irq()
   invoke_rcu_core()
do_softirq()
   rcu_core()
      rcu_do_batch()
         // callback storm
         // rcu_do_batch() returns
         // before completing all
         // of them
   // do_softirq also returns early because of
   // timeout. It defers to ksoftirqd but
   // it's parked
</IRQ>
stop_machine()
   take_cpu_down()
                                                 rcu_barrier()
                                                    spin_lock(barrier_lock)
                                                    // observes rcu_segcblist_n_cbs(&rdp->cblist) != 0
<IRQ>
do_softirq()
   rcu_core()
      rcu_do_batch()
         //completes all pending callbacks
         //smp_mb() implied _after_ callback number dec
</IRQ>
rcutree_report_cpu_dead()
   rnp->qsmaskinitnext &= ~rdp->grpmask;
rcutree_migrate_callbacks()
   // no callback, early return without locking
   // barrier_lock
                                                    //observes !rcu_rdp_cpu_online(rdp)
                                                    rcu_barrier_entrain()
                                                       rcu_segcblist_entrain()
                                                          // Observe rcu_segcblist_n_cbs(rsclp) == 0
                                                          // because no barrier between reading
                                                          // rnp->qsmaskinitnext and rsclp->len
                                                          rcu_segcblist_add_len()
                                                             smp_mb__before_atomic()
                                                             // will now observe the 0 count and empty
                                                             // list, but too late, we enqueue regardless
                                                             WRITE_ONCE(rsclp->len, rsclp->len + v);
                                                    // ignored barrier callback
                                                    // rcu barrier stall...
This could be solved with a read memory barrier, enforcing the message
passing between rnp->qsmaskinitnext and rsclp->len, matching the full
memory barrier after rsclp->len addition in rcu_segcblist_add_len()
performed at the end of rcu_do_batch().
However, rcu_barrier() is complicated enough and probably doesn't
need many more subtleties. CPU down is a slowpath and the
barrier_lock is seldom contended. Solve the issue by unconditionally
locking the barrier_lock in rcutree_migrate_callbacks(). This makes sure
that either rcu_barrier() sees the empty queue or its entrained
callback will be migrated.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If a CPU is running either a userspace application or a guest OS in
nohz_full mode, it is possible for a system call to occur just as an
RCU grace period is starting. If that CPU also has the scheduling-clock
tick enabled for any reason (such as a second runnable task), and if the
system was booted with rcutree.use_softirq=0, then RCU can add insult to
injury by awakening that CPU's rcuc kthread, resulting in yet another
task and yet more OS jitter due to switching to that task, running it,
and switching back.
In addition, in the common case where that system call is not of
excessively long duration, awakening the rcuc task is pointless.
This pointlessness is due to the fact that the CPU will enter an extended
quiescent state upon returning to the userspace application or guest OS.
In this case, the rcuc kthread cannot do anything that the main RCU
grace-period kthread cannot do on its behalf, at least if it is given
a few additional milliseconds (for example, given the time duration
specified by rcutree.jiffies_till_first_fqs, give or take scheduling
delays).
This commit therefore adds a rcutree.nohz_full_patience_delay kernel
boot parameter that specifies the grace period age (in milliseconds,
rounded to jiffies) before which RCU will refrain from awakening the
rcuc kthread. Preliminary experimentation suggests a value of 1000,
that is, one second. Increasing rcutree.nohz_full_patience_delay will
increase grace-period latency and in turn increase memory footprint,
so systems with constrained memory might choose a smaller value.
Systems with less-aggressive OS-jitter requirements might choose the
default value of zero, which keeps the traditional immediate-wakeup
behavior, thus avoiding increases in grace-period latency.
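For example, a kernel booted for nohz_full operation with rcuc kthreads might add something like the following to its command line (CPU list and value are illustrative):
    nohz_full=1-7 rcutree.use_softirq=0 rcutree.nohz_full_patience_delay=1000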
[ paulmck: Apply Leonardo Bras feedback. ]
Link: https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
Reported-by: Leonardo Bras <leobras@redhat.com>
Suggested-by: Leonardo Bras <leobras@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Leonardo Bras <leobras@redhat.com>
RCU stall printout fetches the EQS state of a CPU with a preceding full
memory barrier. However there is nothing to order this read against at
this debugging stage. It is inherently racy when performed remotely.
Do a plain read instead.
This was the last user of rcu_dynticks_snap().
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
When the boot CPU initializes the per-CPU data on behalf of all possible
CPUs, a sanity check is performed on each of them to make sure none is
initialized in an extended quiescent state.
This check involves a full memory barrier which is useless at this early
boot stage.
Do a plain access instead.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
When the grace period kthread checks the extended quiescent state
counter of a CPU, full ordering is necessary to ensure that either:
* If the GP kthread observes the remote target in an extended quiescent
state, then that target must observe all accesses prior to the current
grace period, including the current grace period sequence number, once
it exits that extended quiescent state.
or:
* If the GP kthread observes the remote target NOT in an extended
quiescent state, then the target must observe all accesses prior to
the current grace period, including the current grace period sequence
number, once it later enters an extended quiescent state.
This ordering is enforced through a full memory barrier placed right
before taking the first EQS snapshot. However this is superfluous
because the snapshot is taken while holding the target's rnp lock which
provides the necessary ordering through its chain of
smp_mb__after_unlock_lock().
Remove the needless explicit barrier before the snapshot and put a
comment about the implicit barrier newly relied upon here.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When the grace period kthread checks the extended quiescent state
counter of a CPU, full ordering is necessary to ensure that either:
* If the GP kthread observes the remote target in an extended quiescent
state, then that target must observe all accesses prior to the current
grace period, including the current grace period sequence number, once
it exits that extended quiescent state. Also the GP kthread must
observe all accesses performed by the target prior to it entering
EQS.
or:
* If the GP kthread observes the remote target NOT in an extended
quiescent state, then the target must observe all accesses prior to
the current grace period, including the current grace period sequence
number, once it later enters an extended quiescent state. Also the GP
kthread, upon later observing that EQS, must also observe all accesses
performed by the target prior to it entering EQS.
This ordering is explicitly enforced on both the first EQS snapshot
and the second one through the combination of a preceding
full barrier followed by an acquire read. However the second snapshot's
full memory barrier is redundant and not needed to enforce the above
guarantees:
GP kthread                             Remote target
----------                             -------------
// Access prior GP
WRITE_ONCE(A, 1)
// first snapshot
smp_mb()
x = smp_load_acquire(EQS)
                                       // Access prior GP
                                       WRITE_ONCE(B, 1)
                                       // EQS enter
                                       // implied full barrier by atomic_add_return()
                                       atomic_add_return(RCU_DYNTICKS_IDX, EQS)
                                       // implied full barrier by atomic_add_return()
                                       READ_ONCE(A)
// second snapshot
y = smp_load_acquire(EQS)
z = READ_ONCE(B)
If the GP kthread above fails to observe the remote target in EQS
(x not in EQS), the remote target will observe A == 1 after later
entering EQS. Then the second snapshot taken by the GP kthread only
needs to be an acquire read in order to observe z == 1.
Therefore remove the needless full memory barrier on second snapshot.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Interrupts are enabled in rcu_gp_init(), so this commit switches from
local_irq_save() and local_irq_restore() to local_irq_disable() and
local_irq_enable().
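That is, a sketch of the pattern (not the literal diff):
    local_irq_save(flags);      /* Before: saving flags is superfluous here. */
    local_irq_restore(flags);
    local_irq_disable();        /* After: IRQs are known to be enabled in rcu_gp_init(). */
    local_irq_enable();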
Link: https://lore.kernel.org/all/febb13ab-a4bb-48b4-8e97-7e9f7749e6da@moroto.mountain/
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In the synchronize_rcu() common case, we will have fewer than
SR_MAX_USERS_WAKE_FROM_GP users per GP. Waking up the kworker
is pointless just to free the last injected wait head since at that point,
all the users have already been awakened.
Introduce a new counter to track this and prevent the wakeup in the
common case.
[ paulmck: Remove atomic_dec_return_release in cannot-happen state. ]
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit makes the rcu-tasks related rcutorture tests support
rcu-tasks gp state printing when a writer stall occurs or at the end
of the rcutorture test, and adds an rcu_ops->get_gp_data() operation
to simplify the acquisition of gp state for the different types of
rcutorture tests.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
synchronize_rcu() users have to be processed regardless
of memory pressure, so our private workqueue needs to have at least
one execution context, which is what the WQ_MEM_RECLAIM flag guarantees.
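Concretely, WQ_MEM_RECLAIM is passed at workqueue allocation time so that a rescuer thread guarantees at least one execution context (the workqueue name below is illustrative):
    sr_normal_wq = alloc_workqueue("sr_normal_wq", WQ_MEM_RECLAIM, 0);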
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
This patch introduces a small enhancement which allows a direct
wake-up of synchronize_rcu() callers. It occurs after the completion
of a grace period, thus from the gp-kthread.
The number of clients is limited by a hard-coded maximum allowed
threshold. The remaining part, if any, is deferred to
the main worker.
Link: https://lore.kernel.org/lkml/Zd0ZtNu+Rt0qXkfS@lothringen/
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Add an rcu_sr_normal() trace event. It takes three arguments:
the first is the name of the RCU flavour, the second is a user id
which triggers synchronize_rcu_normal(), and the last is an event.
There are two traces in synchronize_rcu_normal(): on entry,
when a new request is registered, and at the exit point, when the
request is completed.
Please note that CONFIG_RCU_TRACE=y is required to activate the traces.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
A call to synchronize_rcu() can be optimized from a latency
point of view. Workloads which depend on this can benefit from it.
The delay of the wakeme_after_rcu() callback, which unblocks a waiter,
depends on several factors:
- how fast the offloading process is started. A combination of:
    - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
    - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
    - other.
- once started, whether the invoking path is interrupted due to:
    - the time limit;
    - need_resched();
    - the callback limit being reached.
- where in the nocb list it is located;
- how fast previous callbacks completed;
Example:
1. On our embedded devices I can easily trigger the scenario where
it is the last in a list of ~3600 callbacks:
<snip>
<...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
...
<...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
<...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
<...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
<snip>
2. We use cpuset/cgroup to classify tasks and assign them into
different cgroups. For example, the "background" group binds tasks
only to little CPUs, whereas "foreground" makes use of all CPUs.
Tasks can be migrated between groups by request if an acceleration
is needed.
See below an example of how the "surfaceflinger" task gets migrated.
Initially it is located in the "system-background" cgroup, which
allows it to run only on little cores. In order to speed it up, it
can be temporarily moved into the "foreground" cgroup, which allows
it to use big/all CPUs:
cgroup_attach_task():
-> cgroup_migrate_execute()
-> cpuset_can_attach()
-> percpu_down_write()
-> rcu_sync_enter()
-> synchronize_rcu()
-> now move tasks to the new cgroup.
-> cgroup_migrate_finish()
<snip>
rcuop/1-29 [000] ..... 7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
PERFD-SERVER-1855 [000] d..1. 7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
TimerDispatch-2768 [002] d..5. 7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
<snip>
"Boosting a task" depends on synchronize_rcu() latency:
- first trace shows a completion of synchronize_rcu();
- second shows attaching a task to a new group;
- last shows a final step when migration occurs.
3. To address this drawback, maintain a separate track that consists
of synchronize_rcu() callers only. After completion of a grace period
users are deferred to a dedicated worker to process requests.
4. This patch reduces the latency of synchronize_rcu() approximately
by ~30-40% on synthetic tests. The real test case, camera launch time,
shows (time is in milliseconds):
1-run 542 vs 489 improvement 9%
2-run 540 vs 466 improvement 13%
3-run 518 vs 468 improvement 9%
4-run 531 vs 457 improvement 13%
5-run 548 vs 475 improvement 13%
6-run 509 vs 484 improvement 4%
Synthetic test(no "noise" from other callbacks):
Hardware: x86_64 64 CPUs, 64GB of memory
Linux-6.6
- 10K tasks(simultaneous);
- each task does(1000 loops)
synchronize_rcu();
kfree(p);
default: CONFIG_RCU_NOCB_CPU: takes 54 seconds to complete all users;
patch: CONFIG_RCU_NOCB_CPU: takes 35 seconds to complete all users.
Running 60K gives approximately the same results on my setup. Please
note this is without any interaction with other types of callbacks;
otherwise the default case would be impacted much more.
5. By default it is disabled. To enable this, perform one of the
sequences below:
echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Currently, there are rcu_data structure fields named ->rcu_onl_gp_flags
and ->rcu_ofl_gp_flags that track the rcu_state.gp_flags field at the
time of the corresponding CPU's last online or offline operation,
respectively. However, this information is not particularly useful.
It would be better to instead track the grace period state kept
in rcu_state.gp_state. This would also be consistent with the
initialization in rcu_boot_init_percpu_data(), which is to RCU_GP_CLEANED
(an rcu_state.gp_state value), and also with the diagnostics in
rcu_implicit_dynticks_qs(), whose format is consistent with an integer,
not a bitmask.
This commit therefore makes this change and changes the names to
->rcu_onl_gp_state and ->rcu_ofl_gp_state, respectively.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
The rcu_state.n_online_cpus value is only ever updated by CPU-hotplug
operations, which are serialized. However, this value is read locklessly.
This commit therefore marks those reads. While in the area, it also
adds ASSERT_EXCLUSIVE_WRITER() calls just in case parallel CPU hotplug
becomes a thing.
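The marking pattern is roughly (a sketch, not the literal diff):
    /* Lockless reader side: */
    ncpus = READ_ONCE(rcu_state.n_online_cpus);
    /* Serialized CPU-hotplug updater side: */
    ASSERT_EXCLUSIVE_WRITER(rcu_state.n_online_cpus);
    WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus + 1);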
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
There are some indications that rcu_softirq_qs() might be more generally
used than anticipated. This commit therefore adds some lockdep assertions
and some cautionary tales in a new kernel-doc header.
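The kind of assertion meant here is a lockdep check that rcu_softirq_qs() is not invoked from within an RCU read-side critical section, along these lines (a sketch under that assumption, not the exact set of checks added):
    RCU_LOCKDEP_WARN(lock_is_held(&rcu_lock_map),
                     "Illegal rcu_softirq_qs() in RCU read-side critical section");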
Link: https://lore.kernel.org/all/Zd4DXTyCf17lcTfq@debian.debian/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Yan Zhai <yan@cloudflare.com>
Cc: <netdev@vger.kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
In order for RCU Tasks to reliably maintain per-CPU lists of exiting
tasks, those lists must be initialized before it is possible for tasks
to exit, especially given that the boot CPU is not necessarily CPU 0
(an example being powerpc kexec() kernels). And at the time that
rcu_init_tasks_generic() is called, a task could potentially exit,
unconventional though that sort of thing might be.
This commit therefore moves the calls to cblist_init_generic() from
functions called from rcu_init_tasks_generic() to a new function named
tasks_cblist_init_generic() that is invoked from rcu_init().
This constituted a bug in a commit that never went to mainline, so
there is no need for any backporting to -stable.
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
To allow more flexible arrangements while still providing a single kernel
for distros, provide a boot-time parameter to enable/disable lazy RCU.
Specify:
rcutree.enable_rcu_lazy=[y|1|n|0]
Which also requires
rcu_nocbs=all
at boot time to enable/disable lazy RCU.
To disable it by default at build time when CONFIG_RCU_LAZY=y, the new
CONFIG_RCU_LAZY_DEFAULT_OFF can be used.
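A plausible wiring of the boot parameter and its Kconfig-driven default looks like this (the variable name is an assumption, not necessarily the in-tree one):
    static bool enable_rcu_lazy __read_mostly = !IS_ENABLED(CONFIG_RCU_LAZY_DEFAULT_OFF);
    module_param(enable_rcu_lazy, bool, 0444);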
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
TREE04 running on short iterations can produce writer stalls of the
following kind:
??? Writer stall state RTWS_EXP_SYNC(4) g3968 f0x0 ->state 0x2 cpu 0
task:rcu_torture_wri state:D stack:14568 pid:83 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2de/0x850
? trace_event_raw_event_rcu_exp_funnel_lock+0x6d/0xb0
schedule+0x4f/0x90
synchronize_rcu_expedited+0x430/0x670
? __pfx_autoremove_wake_function+0x10/0x10
? __pfx_synchronize_rcu_expedited+0x10/0x10
do_rtws_sync.constprop.0+0xde/0x230
rcu_torture_writer+0x4b4/0xcd0
? __pfx_rcu_torture_writer+0x10/0x10
kthread+0xc7/0xf0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
Waiting for an expedited grace period and polling for an expedited
grace period are both operations that internally rely on the same
workqueue performing the necessary asynchronous work.
However, a dependency chain is involved between those two operations,
as depicted below:
====== CPU 0 =======                        ====== CPU 1 =======
synchronize_rcu_expedited()
   exp_funnel_lock()
      mutex_lock(&rcu_state.exp_mutex);
                                            start_poll_synchronize_rcu_expedited
                                               queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
   synchronize_rcu_expedited_queue_work()
      queue_work(rcu_gp_wq, &rew->rew_work);
   wait_event() // A, wait for &rew->rew_work completion
   mutex_unlock() // B
                                            //======> switch to kworker
                                            sync_rcu_do_polled_gp() {
                                               synchronize_rcu_expedited()
                                                  exp_funnel_lock()
                                                     mutex_lock(&rcu_state.exp_mutex); // C, wait B
                                                  ....
                                            } // D
Since workqueues are usually implemented on top of several kworkers
handling the queue concurrently, the above situation wouldn't deadlock
most of the time because A then doesn't depend on D. But in case of
memory stress, a single kworker may end up handling all the work
items alone in a serialized way. In that case the above layout becomes a problem
because A then waits for D, closing a circular dependency:
A -> D -> C -> B -> A
This however only happens when CONFIG_RCU_EXP_KTHREAD=n. Indeed
synchronize_rcu_expedited() is otherwise implemented on top of a kthread
worker while polling still relies on rcu_gp_wq workqueue, breaking the
above circular dependency chain.
Fix this by making the expedited grace period always rely on kthread
workers. The workqueue based implementation is essentially a duplicate
anyway now that the per-node initialization is performed by per-node
kthread workers.
Meanwhile the CONFIG_RCU_EXP_KTHREAD switch is still kept around to
manage the scheduler policy of these kthread workers.
Reported-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Suggested-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Affine the parallel expedited gp kworkers to their respective RCU node
in order to make them close to the cache they are playing with.
This reuses the boost kthreads machinery that probes into CPU hotplug
operations such that the kthreads become/stay affine to their respective
node as soon/long as they contain online CPUs. Otherwise and if the
current CPU going down was the last online on the leaf node, the related
kthread is affine to the housekeeping CPUs.
In the long run, this affinity VS CPU hotplug operation game should
probably be implemented at the generic kthread level.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
[boqun: s/* rcu_boost_task/*rcu_boost_task as reported by checkpatch]
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
When CONFIG_RCU_EXP_KTHREAD=n, the expedited grace period per node
initialization is performed in parallel via workqueues (one work per
node).
However in CONFIG_RCU_EXP_KTHREAD=y, this per node initialization is
performed by a single kworker serializing each node initialization (one
work for all nodes).
The latter is certainly less scalable and efficient beyond a single
leaf node.
To improve this, expand this single kworker into per-node kworkers. This
new layout is eventually intended to remove the workqueues based
implementation since it will essentially now become duplicate code.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
The expedited kthread worker performing the per node initialization is
going to be split into per node kthreads. As such, the future per node
kthread creation will need to be called from CPU hotplug callbacks
instead of an initcall, right beside the per node boost kthread
creation.
To prepare for that, move the kthread worker creation above
rcutree_prepare_cpu() as a first step to make the review smoother for
the upcoming modifications.
No intended functional change.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
This mutex is currently protecting per node boost kthreads creation and
affinity setting across CPU hotplug operations.
Since the expedited kworkers will soon be split per node as well, they
will be subject to the same concurrency constraints against hotplug.
Therefore their creation and affinity tuning operations will be grouped
with those of boost kthreads and then rely on the same mutex.
To prepare for that, generalize its name.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Just as is done for the kworker performing the nodes initialization,
gracefully handle the possible allocation failure of the RCU expedited
grace period main kworker.
While at it, rename the related checking functions to better
reflect the expedited specifics.
Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9621fbee44 ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Under CONFIG_RCU_EXP_KTHREAD=y, the nodes initialization for expedited
grace periods is queued to a kworker. However if the allocation of that
kworker failed, the nodes initialization is performed synchronously by
the caller instead.
However, the check for kworker initialization failure relies on the
kworker pointer being NULL, while its value might actually encapsulate
an allocation failure error.
Make sure to handle this case.
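kthread_create_worker() reports failure via ERR_PTR() rather than NULL, so the check needs to cover both cases, roughly (a sketch, not the literal fix):
    kworker = kthread_create_worker(0, "rcu_exp_gp_kthread_worker");
    if (IS_ERR_OR_NULL(kworker)) {
            pr_err("Failed to create RCU expedited GP kworker!\n");
            return;    /* Fall back to synchronous node initialization. */
    }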
Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9621fbee44 ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Currently the call_rcu() function interleaves NOCB and !NOCB enqueue
code in a complicated way such that:
* The bypass enqueue code may or may not have enqueued and may or may
not have locked the ->nocb_lock. Everything that follows is in a
Schrödinger locking state for the unwary reviewer's eyes.
* The was_alldone variable is always set but only used in NOCB-related code.
* The NOCB wake up is distantly related to the locking hopefully
performed by the bypass enqueue code that did not enqueue on the
bypass list.
Unconfuse the whole and gather the NOCB and !NOCB specific enqueue code
into their own functions.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Currently IRQs are disabled in call_rcu() and then, depending on the
context:
* If the CPU is in nocb mode:
  - If the callback is enqueued in the bypass list, IRQs are re-enabled
    implicitly by rcu_nocb_try_bypass()
  - If the callback is enqueued in the normal list, IRQs are re-enabled
    implicitly by __call_rcu_nocb_wake()
* If the CPU is NOT in nocb mode, IRQs are re-enabled explicitly from
  call_rcu()
This makes the code a bit hard to follow, especially as it interleaves
with nocb locking.
To make the IRQ flags coverage clearer and also in order to prepare for
moving all the nocb enqueue code to its own function, always re-enable
the IRQ flags explicitly from call_rcu().
Reviewed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
A full barrier is issued from nocb_gp_wait() upon callbacks advancing
to order grace period completion with callbacks execution.
However these two events are already ordered by the
smp_mb__after_unlock_lock() barrier within the call to
raw_spin_lock_rcu_node() that is necessary for callbacks advancing to
happen.
The following litmus test shows the kind of guarantee that this barrier
provides:
C smp_mb__after_unlock_lock

{}

// rcu_gp_cleanup()
P0(spinlock_t *rnp_lock, int *gpnum)
{
        // Grace period cleanup increase gp sequence number
        spin_lock(rnp_lock);
        WRITE_ONCE(*gpnum, 1);
        spin_unlock(rnp_lock);
}

// nocb_gp_wait()
P1(spinlock_t *rnp_lock, spinlock_t *nocb_lock, int *gpnum, int *cb_ready)
{
        int r1;

        // Call rcu_advance_cbs() from nocb_gp_wait()
        spin_lock(nocb_lock);
        spin_lock(rnp_lock);
        smp_mb__after_unlock_lock();
        r1 = READ_ONCE(*gpnum);
        WRITE_ONCE(*cb_ready, 1);
        spin_unlock(rnp_lock);
        spin_unlock(nocb_lock);
}

// nocb_cb_wait()
P2(spinlock_t *nocb_lock, int *cb_ready, int *cb_executed)
{
        int r2;

        // rcu_do_batch() -> rcu_segcblist_extract_done_cbs()
        spin_lock(nocb_lock);
        r2 = READ_ONCE(*cb_ready);
        spin_unlock(nocb_lock);

        // Actual callback execution
        WRITE_ONCE(*cb_executed, 1);
}

P3(int *cb_executed, int *gpnum)
{
        int r3;

        WRITE_ONCE(*cb_executed, 2);
        smp_mb();
        r3 = READ_ONCE(*gpnum);
}

exists (1:r1=1 /\ 2:r2=1 /\ cb_executed=2 /\ 3:r3=0) (* Bad outcome. *)
Here the bad outcome only occurs if the smp_mb__after_unlock_lock() is
removed. This barrier orders the grace period completion against
callbacks advancing and even later callbacks invocation, thanks to the
opportunistic propagation via the ->nocb_lock to nocb_cb_wait().
Therefore the smp_mb() placed after callbacks advancing can be safely
removed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
When the CPU goes idle for the last time during the CPU down hotplug
process, RCU reports a final quiescent state for the current CPU. If
this quiescent state propagates up to the top, some tasks may then be
woken up to complete the grace period: the main grace period kthread
and/or the expedited main workqueue (or kworker).
If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwidth timer on the local offline CPU. Since this happens
after hrtimers have been migrated at the CPUHP_AP_HRTIMERS_DYING stage,
the timer gets ignored. Therefore if the RCU kthreads are waiting for RT
bandwidth to be available, they may never actually be scheduled.
This triggers TREE03 rcutorture hangs:
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6)
rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2eb/0xa80
schedule+0x1f/0x90
schedule_timeout+0x163/0x270
? __pfx_process_timeout+0x10/0x10
rcu_gp_fqs_loop+0x37c/0x5b0
? __pfx_rcu_gp_kthread+0x10/0x10
rcu_gp_kthread+0x17c/0x200
kthread+0xde/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2b/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
The situation can't be solved with just unpinning the timer. The hrtimer
infrastructure and the nohz heuristics involved in finding the best
remote target for an unpinned timer would then also need to handle
enqueues from an offline CPU in the most horrendous way.
So fix this on the RCU side instead and defer the wake up to an online
CPU if it's too late for the local one.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
If an rcutorture test scenario creates an fqs_task kthread, it will
periodically invoke rcu_force_quiescent_state() in order to start
force-quiescent-state (FQS) operations. However, an FQS operation
will be started even if there is no RCU grace period in progress.
Although testing FQS operations startup when there is no grace period in
progress is necessary, it need not happen all that often. This commit
therefore causes rcu_force_quiescent_state() to take an early exit
if there is no grace period in progress.
Note that there will still be attempts to start an FQS scan in the
absence of a grace period because the grace period might end right
after the rcu_force_quiescent_state() function's check. In actual
testing, this happens about once every ten minutes, which should
provide adequate testing.
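In other words, the added fast path amounts to (sketch):
    /* Bail out of rcu_force_quiescent_state() when no GP is in progress. */
    if (!rcu_gp_in_progress())
            return;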
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Merge tag 'rcu-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull RCU fixes from Frederic Weisbecker:
- Fix a lock inversion between scheduler and RCU introduced in
v6.2-rc4. The scenario could trigger on any user of RCU_NOCB
(mostly Android but also nohz_full)
- Fix PF_IDLE semantic changes introduced in v6.6-rc3 breaking
some RCU-Tasks and RCU-Tasks-Trace expectations as to what
exactly is an idle task. This resulted in potential spurious
stalls and warnings.
* tag 'rcu-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
rcu/tasks-trace: Handle new PF_IDLE semantics
rcu/tasks: Handle new PF_IDLE semantics
rcu: Introduce rcu_cpu_online()
rcu: Break rcu_node_0 --> &rq->__lock order
Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Kemeng Shi has contributed some compaction maintenance work in the
series 'Fixes and cleanups to compaction'
- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested
- More DAMON/DAMOS maintenance and feature work from SeongJae Park in
the following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature. To increase the feature's checking coverage. 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'
- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'
- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'
- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'
- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames
- In the series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmemmap optimization code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use
- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code
- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'
- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'
- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says
- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes it possible for a process to propagate KSM treatment
across exec()
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'
- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'
- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU
- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code
- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result
- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions
- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements
- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things
- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'
- In the series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults
- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code
- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series 'Fix page_owner's use of free timestamps'
- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'
- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'
- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark
- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'
- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'
- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'
- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"
[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in
https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/
with help from Qi Zheng.
The clone3 test filtering conflict was half-arsed by yours truly ]
* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...
Export the RCU point of view as to when a CPU is considered offline
(i.e., when does RCU consider that a CPU is sufficiently far into the
hotplug process to no longer have any possible read side).
This will be used by RCU-tasks whose vision of an offline CPU should
reasonably match the one of RCU core.
Fixes: cff9b2332a ("kernel/sched: Modify initial boot task idle setup")
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
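If the export follows the obvious shape, it is little more than a thin
wrapper around RCU's existing per-CPU online test. A minimal sketch,
assuming the rcu_cpu_online() name and the internal rcu_rdp_cpu_online()
helper:

/* Expose RCU's notion of "CPU is online" for use by RCU-tasks. */
bool rcu_cpu_online(int cpu)
{
        struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

        return rcu_rdp_cpu_online(rdp);
}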
Commit 851a723e45 ("sched: Always clear user_cpus_ptr in
do_set_cpus_allowed()") added a kfree() call to free any user
provided affinity mask, if present. It was changed later to use
kfree_rcu() in commit 9a5418bc48 ("sched/core: Use kfree_rcu()
in do_set_cpus_allowed()") to avoid a circular locking dependency
problem.
It turns out that even kfree_rcu() isn't safe for avoiding
circular locking problem. As reported by kernel test robot,
the following circular locking dependency now exists:
&rdp->nocb_lock --> rcu_node_0 --> &rq->__lock
Solve this by breaking the rcu_node_0 --> &rq->__lock chain by moving
the resched_cpu() out from under rcu_node lock.
[peterz: heavily borrowed from Waiman's Changelog]
[paulmck: applied Zqiang feedback]
Fixes: 851a723e45 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/oe-lkp/202310302207.a25f1a30-oliver.sang@intel.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
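A minimal sketch of the locking pattern described above, not the exact
upstream hunk: CPUs needing a nudge are collected into a local mask
while the rcu_node lock is held, and resched_cpu() (which acquires
rq->__lock) runs only after that lock has been dropped. The function and
mask names are illustrative.

static void nudge_cpus_sketch(struct rcu_node *rnp,
                              int (*f)(struct rcu_data *rdp))
{
        unsigned long flags;
        unsigned long rsmask = 0;       /* CPUs to resched, built under the lock. */
        int cpu;

        raw_spin_lock_irqsave_rcu_node(rnp, flags);
        for_each_leaf_node_cpu_mask(rnp, cpu, rnp->qsmask) {
                struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

                if (f(rdp))
                        rsmask |= rdp->grpmask;
        }
        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

        /* rq->__lock is now taken without rcu_node held, breaking the cycle. */
        for_each_leaf_node_cpu_mask(rnp, cpu, rsmask)
                resched_cpu(cpu);
}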
rcu_report_dead() and rcutree_migrate_callbacks() have their headers in
rcupdate.h while those are pure rcutree calls, like the other CPU-hotplug
functions.
Also rcu_cpu_starting() and rcu_report_dead() have different naming
conventions while they mirror each other's effects.
Fix the headers and propose a naming that relates both functions and
aligns with the prefix of other rcutree CPU-hotplug functions.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Among the three CPU-hotplug teardown RCU callbacks, two of them early
exit if CONFIG_HOTPLUG_CPU=n, and one is left unchanged. In any case
all of them have an implementation when CONFIG_HOTPLUG_CPU=n.
Align instead with the common way to deal with CPU-hotplug teardown
callbacks and provide a proper stub when they are not supported.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
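The common pattern, sketched below against the rcutree teardown
callbacks; whether the !CONFIG_HOTPLUG_CPU stubs end up as NULL defines
(as shown) or as static inline functions is an assumption of this
sketch.

#ifdef CONFIG_HOTPLUG_CPU
int rcutree_dying_cpu(unsigned int cpu);
int rcutree_dead_cpu(unsigned int cpu);
int rcutree_offline_cpu(unsigned int cpu);
#else
/* The cpuhp core treats a NULL teardown callback as a no-op. */
#define rcutree_dying_cpu       NULL
#define rcutree_dead_cpu        NULL
#define rcutree_offline_cpu     NULL
#endif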
rcu_report_dead() has to be called locally by the CPU that is going to
exit the RCU state machine. Passing a cpu argument here is error-prone
and leaves the possibility for a racy remote call.
Use local access instead.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
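An illustrative before/after of the lookup, with the function body
elided:

-void rcu_report_dead(unsigned int cpu)
+void rcu_report_dead(void)
 {
-       struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+       struct rcu_data *rdp = this_cpu_ptr(&rcu_data);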
rcu_report_dead() is the last RCU word from the CPU down through the
hotplug path. It is called in the idle loop right before the CPU shuts
down for good. Because it removes the CPU from the grace period state
machine and reports an ultimate quiescent state if necessary, no further
use of RCU is allowed. Therefore it is expected that IRQs are disabled
upon calling this function and are not to be re-enabled again until the
CPU shuts down.
Remove the IRQs disablement from that function and verify instead that
it is actually called with IRQs disabled as it is expected at that
special point in the idle path.
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
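In sketch form; whether the verification is WARN_ON_ONCE(!irqs_disabled())
or a lockdep assertion is an assumption here:

void rcu_report_dead(void)
{
        /* This late in the hotplug path, IRQs must already be off. */
        WARN_ON_ONCE(!irqs_disabled());

        /* ... report the quiescent state; no local_irq_save()/restore(). */
}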
Since the actual slab freeing is deferred when calling kvfree_rcu(), so
is the kmemleak_free() callback informing kmemleak of the object
deletion. From the perspective of the kvfree_rcu() caller, the object is
freed and it may remove any references to it. Since kmemleak does not
scan RCU internal data storing the pointer, it will report such objects
as leaks during the grace period.
Tell kmemleak to ignore such objects on the kvfree_call_rcu() path. Note
that the tiny RCU implementation does not have this issue since the
objects can be tracked from the rcu_ctrlblk structure.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://lore.kernel.org/all/F903A825-F05F-4B77-A2B5-7356282FBA2C@apple.com/
Cc: <stable@vger.kernel.org>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
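The gist of the fix, as a sketch of the kvfree_call_rcu() path with the
surrounding queuing code reduced to a comment:

void kvfree_call_rcu(struct rcu_head *head, void *ptr)
{
        /* ... queue ptr on the per-CPU kfree_rcu_cpu machinery ... */

        /*
         * The pointer now lives only in RCU-internal storage that kmemleak
         * does not scan, so tell kmemleak not to report it as a leak while
         * the grace period elapses.
         */
        kmemleak_ignore(ptr);
}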
When using rcutorture as a module, there are a number of conditions that
can abort the modprobe operation, for example, when attempting to run
both RCU CPU stall warning tests and forward-progress tests. This can
cause rcu_torture_cleanup() to be invoked on the unwind path out of
rcu_torture_init(), which will mean that rcu_gp_slow_unregister()
is invoked without a matching rcu_gp_slow_register(). This will cause
a splat because rcu_gp_slow_unregister() is passed rcu_fwd_cb_nodelay,
which does not match a NULL pointer.
This commit therefore forgives a mismatch involving a NULL pointer, thus
avoiding this false-positive splat.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
When a structure containing an RCU callback rhp is (incorrectly) freed
and reallocated after rhp is passed to call_rcu(), it is not unusual for
rhp->func to be set to NULL. This defeats the debugging prints used by
__call_rcu_common() in kernels built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y,
which expect to identify the offending code using the identity of this
function.
And in kernels built without CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, things
are even worse, as can be seen from this splat:
Unable to handle kernel NULL pointer dereference at virtual address 0
... ...
PC is at 0x0
LR is at rcu_do_batch+0x1c0/0x3b8
... ...
(rcu_do_batch) from (rcu_core+0x1d4/0x284)
(rcu_core) from (__do_softirq+0x24c/0x344)
(__do_softirq) from (__irq_exit_rcu+0x64/0x108)
(__irq_exit_rcu) from (irq_exit+0x8/0x10)
(irq_exit) from (__handle_domain_irq+0x74/0x9c)
(__handle_domain_irq) from (gic_handle_irq+0x8c/0x98)
(gic_handle_irq) from (__irq_svc+0x5c/0x94)
(__irq_svc) from (arch_cpu_idle+0x20/0x3c)
(arch_cpu_idle) from (default_idle_call+0x4c/0x78)
(default_idle_call) from (do_idle+0xf8/0x150)
(do_idle) from (cpu_startup_entry+0x18/0x20)
(cpu_startup_entry) from (0xc01530)
This commit therefore adds calls to mem_dump_obj(rhp) to output some
information, for example:
slab kmalloc-256 start ffff410c45019900 pointer offset 0 size 256
This provides the rough size of the memory block and the offset of the
rcu_head structure, which at least provides a few clues to help
locate the problem. If the problem is reproducible, additional slab
debugging can be enabled, for example, CONFIG_DEBUG_SLAB=y, which can
provide significantly more information.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
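A sketch of where such a check can live; the helper name is
hypothetical, and only the mem_dump_obj() call is taken from the
description above:

/* Just before invoking a callback in rcu_do_batch(). */
static inline void debug_bad_rcu_head(struct rcu_head *rhp)
{
        if (unlikely(!READ_ONCE(rhp->func)))
                mem_dump_obj(rhp);      /* e.g. "slab kmalloc-256 start ... offset 0 size 256" */
}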
When running a series of stress tests all making heavy use of RCU,
it is all too possible to OOM the system when the prior test's RCU
callbacks don't get invoked until after the subsequent test starts.
One way of handling this is just a timed wait, but this fails when a
given CPU has so many callbacks queued that they take longer to invoke
than allowed for by that timed wait.
This commit therefore adds an rcutree.do_rcu_barrier module parameter that
is accessible from sysfs. Writing one of the many synonyms for boolean
"true" will cause an rcu_barrier() to be invoked, but will guarantee that
no more than one rcu_barrier() will be invoked per sixteenth of a second
via this mechanism. The flip side is that a given request might wait a
second or three longer than absolutely necessary, but only when there are
multiple uses of rcutree.do_rcu_barrier within a one-second time interval.
This commit unnecessarily serializes the rcu_barrier() machinery, given
that serialization is already provided by procfs. This has the advantage
of allowing throttled rcu_barrier() from other sources within the kernel.
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
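A sketch of how such a throttled boolean parameter can be wired up; the
kernel_param_ops plumbing below is illustrative rather than the exact
upstream code:

static bool do_rcu_barrier;

static int param_set_do_rcu_barrier(const char *val, const struct kernel_param *kp)
{
        static unsigned long last_jiffies;      /* Serialized by the param write path. */
        bool b;
        int ret;

        ret = kstrtobool(val, &b);
        if (ret || !b)
                return ret;
        /* At most one rcu_barrier() per sixteenth of a second. */
        if (last_jiffies && time_before(jiffies, last_jiffies + HZ / 16))
                return 0;
        last_jiffies = jiffies;
        rcu_barrier();
        return 0;
}

static const struct kernel_param_ops do_rcu_barrier_ops = {
        .set = param_set_do_rcu_barrier,
        .get = param_get_bool,
};
module_param_cb(do_rcu_barrier, &do_rcu_barrier_ops, &do_rcu_barrier, 0644);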
The return keyword is not needed here.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
There are instances where rcu_cpu_stall_reset() is called when jiffies
did not get a chance to update for a long time. Before jiffies is
updated, the CPU stall detector can go off, triggering false positives
where a just-started grace period appears to be ages old. In the past,
we disabled stall detection in rcu_cpu_stall_reset(), however this got
changed [1]. This is resulting in false positives in the KGDB use case [2].
Fix this by deferring the update of jiffies to the third run of the FQS
loop. This is more robust, as, even if rcu_cpu_stall_reset() is called
just before jiffies is read, we would end up pushing out the jiffies
read by 3 more FQS loops. Meanwhile the CPU stall detection will be
delayed and we will not get any false positives.
[1] https://lore.kernel.org/all/20210521155624.174524-2-senozhatsky@chromium.org/
[2] https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/
Tested with rcutorture.cpu_stall option as well to verify stall behavior
with/without patch.
Tested-by: Huacai Chen <chenhuacai@loongson.cn>
Reported-by: Binbin Zhou <zhoubinbin@loongson.cn>
Closes: https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/
Suggested-by: Paul McKenney <paulmck@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: a80be428fb ("rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
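A heavily simplified sketch of the mechanism; the countdown field name
and its exact placement are assumptions drawn from the description
above:

/* rcu_cpu_stall_reset(): jiffies may be stale, so do not trust it yet. */
void rcu_cpu_stall_reset(void)
{
        /* Re-arm the stall timeout only after three FQS-loop passes. */
        WRITE_ONCE(rcu_state.nr_fqs_jiffies_stall, 3);
}

/* In the grace-period kthread's FQS loop, once per pass: */
int n = READ_ONCE(rcu_state.nr_fqs_jiffies_stall);

if (n > 0) {
        WRITE_ONCE(rcu_state.nr_fqs_jiffies_stall, n - 1);
        if (n == 1)     /* Third pass: jiffies can be trusted again. */
                WRITE_ONCE(rcu_state.jiffies_stall,
                           jiffies + rcu_jiffies_till_stall_check());
}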
The rcu_request_urgent_qs_task() function does a cross-CPU store
to ->rcu_urgent_qs, so this commit therefore marks the load in
__rcu_irq_enter_check_tick() with READ_ONCE().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
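That is, a sketch of the marked load in __rcu_irq_enter_check_tick():

struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

/* ->rcu_urgent_qs may be stored to from another CPU, so mark the load. */
if (!READ_ONCE(rdp->rcu_urgent_qs))
        return;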
Make it clear that this function always returns either true or false
without other planned failure modes.
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
The rcu_tasks_invoke_cbs() function relies on queue_work_on() to silently
fall back to WORK_CPU_UNBOUND when the specified CPU is offline. However,
the queue_work_on() function's silent fallback mechanism relies on that
CPU having been online at some time in the past. When queue_work_on()
is passed a CPU that has never been online, workqueue lockups ensue,
which can be bad for your kernel's general health and well-being.
This commit therefore checks whether a given CPU has ever been online,
and, if not, substitutes WORK_CPU_UNBOUND in the subsequent call to
queue_work_on(). Why not simply omit the queue_work_on() call entirely?
Because this function is flooding callback-invocation notifications
to all CPUs, and must deal with possibilities that include a sparse
cpu_possible_mask.
This commit also moves the setting of the rcu_data structure's
->beenonline field to rcu_cpu_starting(), which executes on the
incoming CPU before that CPU has ever enabled interrupts. This ensures
that the required workqueues are present. In addition, because the
incoming CPU has not yet enabled its interrupts, there cannot yet have
been any softirq handlers running on this CPU, which means that the
WARN_ON_ONCE(!rdp->beenonline) within the RCU_SOFTIRQ handler cannot
have triggered yet.
Fixes: d363f833c6 ("rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
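A sketch of the two pieces; the helper name and the workqueue details
are taken as assumptions:

/* Has this CPU completed its early RCU bringup at least once? */
bool rcu_cpu_beenfullyonline(int cpu)
{
        struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

        return READ_ONCE(rdp->beenonline);
}

/* In rcu_tasks_invoke_cbs(), when flooding callback-invocation work: */
queue_work_on(rcu_cpu_beenfullyonline(cpu) ? cpu : WORK_CPU_UNBOUND,
              system_wq, &rtpcp->rtp_work);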
Currently, rcu_cpu_starting() is written so that it might be invoked
with interrupts enabled. However, it is always called when interrupts
are disabled, either by rcu_init(), notify_cpu_starting(), or from a
call point prior to the call to notify_cpu_starting().
But why bother requiring that interrupts be disabled? The purpose is
to allow the rcu_data structure's ->beenonline flag to be set after all
early processing has completed for the incoming CPU, thus allowing this
flag to be used to determine when workqueues have been set up for the
incoming CPU, while still allowing this flag to be used as a diagnostic
within rcu_core().
This commit therefore makes rcu_cpu_starting() rely on interrupts being
disabled.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_data structure's ->rcu_cpu_has_work field can be modified by
any CPU attempting to wake up the rcuc kthread. Therefore, this commit
marks accesses to this field from the rcu_cpu_kthread() function.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, if there are more than 100 ready-to-invoke RCU callbacks queued
on a given CPU, the rcu_do_batch() function sets a timeout for invocation
of the series. This timeout defaults to three milliseconds, and may
be adjusted using the rcutree.rcu_resched_ns kernel boot parameter.
This timeout is checked using local_clock(), but the overhead of this
function combined with the common-case very small callback-invocation
overhead means that local_clock() is checked only every 32nd invocation.
This works well except for longer-than-average callbacks. For example,
a series of 500-microsecond-duration callbacks means that local_clock()
is checked only once every 16 milliseconds, which makes it difficult to
enforce a three-millisecond timeout.
This commit therefore adds a Kconfig option RCU_DOUBLE_CHECK_CB_TIME
that enables backup timeout checking using the coarser grained but
lighter weight jiffies. If the jiffies counter detects a timeout,
then local_clock() is consulted even if this is not the 32nd callback.
This prevents the aforementioned 16-millisecond latency blow-up.
Reported-by: Domas Mituzas <dmituzas@meta.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
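A sketch of the double check inside the callback loop; jlimit (a
jiffies-based backstop computed before the loop) and the exact placement
are illustrative:

/* Once per callback in rcu_do_batch()'s invocation loop. */
count++;
if (!(count & 31) ||
    (IS_ENABLED(CONFIG_RCU_DOUBLE_CHECK_CB_TIME) &&
     unlikely(time_after(jiffies, jlimit)))) {
        if (tlimit && local_clock() >= tlimit)
                break;  /* Over budget: leave the rest for later. */
}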
Currently, a callback-invocation time limit is enforced only for
callbacks invoked from the softirq environment, the rationale being
that when callbacks are instead invoked from rcuc and rcuoc kthreads,
these callbacks cannot be holding up other softirq vectors.
Which is in fact true. However, if an rcuc kthread spends too much time
invoking callbacks, it can delay quiescent-state reports from its CPU,
which can also be a problem.
This commit therefore applies the callback-invocation time limit to
callback invocation from the rcuc kthreads as well as from softirq.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If the rcutree.rcu_min_cached_objs kernel boot parameter is set to zero,
then krcp->page_cache_work will never be triggered to fill page cache.
In addition, the put_cached_bnode() will not fill page cache. As a
result krcp->bkvcache will always be empty, so there is no need to acquire
krcp->lock to get a page from krcp->bkvcache. This commit therefore makes
drain_page_cache() return immediately if the rcu_min_cached_objs is zero.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When the fill_page_cache_func() function is invoked, it assumes that
the cache of pages is completely empty. However, there can be some time
between triggering execution of this function and its actual invocation.
During this time, kfree_rcu_work() might run, and might fill in part or
all of this cache of pages, thus invalidating the fill_page_cache_func()
function's assumption.
This will not overfill the cache because put_cached_bnode() will reject
the extra page. However, it will result in a needless allocation and
freeing of one extra page, which might not be helpful under lowish-memory
conditions.
This commit therefore causes the fill_page_cache_func() to explicitly
account for pages that have been placed into the cache shortly before
it starts running.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
By default the cache size is 5 pages per CPU, but it can be disabled at
boot time by setting the rcu_min_cached_objs to zero. When that happens,
the current code will uselessly set an hrtimer to schedule refilling this
cache with zero pages. This commit therefore streamlines this process
by simply refusing to set the hrtimer when rcu_min_cached_objs is zero.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The add_ptr_to_bulk_krc_lock() function is invoked to allocate a new
kfree_rcu() page, also known as a kvfree_rcu_bulk_data structure.
The kfree_rcu_cpu structure's lock is used to protect this operation,
except that this lock must be momentarily dropped when allocating memory.
It is clearly important that the lock that is reacquired be the same
lock that was acquired initially via krc_this_cpu_lock().
Unfortunately, this same krc_this_cpu_lock() function is used to
re-acquire this lock, and if the task migrated to some other CPU during
the memory allocation, this will result in the kvfree_rcu_bulk_data
structure being added to the wrong CPU's kfree_rcu_cpu structure.
This commit therefore replaces that second call to krc_this_cpu_lock()
with raw_spin_lock_irqsave() in order to explicitly acquire the lock on
the correct kfree_rcu_cpu structure, thus keeping things straight even
when the task migrates.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
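The shape of the fix, sketched with simplified locals and illustrative
GFP flags:

/* krcp was returned by the initial krc_this_cpu_lock(&flags). */
krc_this_cpu_unlock(krcp, flags);

bnode = (struct kvfree_rcu_bulk_data *)
        __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);

/*
 * Do NOT call krc_this_cpu_lock() again: the task may have migrated while
 * allocating, which would hand back some other CPU's kfree_rcu_cpu.
 * Reacquire the lock of the structure we started with instead.
 */
raw_spin_lock_irqsave(&krcp->lock, flags);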
If kvfree_rcu_bulk() sees that the required grace period has failed to
elapse, it leaks the memory because readers might still be using it.
But in that case, the debug-objects subsystem still marks the relevant
structures as having been freed, even though they are instead being
leaked.
This commit fixes this mismatch by invoking debug_rcu_bhead_unqueue()
only when we are actually going to free the objects.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Under low-memory conditions, kvfree_rcu() will use each object's
rcu_head structure to queue objects in a singly linked list headed by
the kfree_rcu_cpu structure's ->head field. This list is passed to
call_rcu() as a unit, but there is no indication of which grace period
this list needs to wait for. This in turn prevents adding debug checks
in the kfree_rcu_work() as was done for the two page-of-pointers channels
in the kfree_rcu_cpu structure.
This commit therefore adds a ->head_free_gp_snap field to the
kfree_rcu_cpu_work structure to record this grace-period number. It also
adds a WARN_ON_ONCE() to kfree_rcu_monitor() that checks to make sure
that the required grace period has in fact elapsed.
[ paulmck: Fix kerneldoc issue raised by Stephen Rothwell. ]
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
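A sketch of the snapshot and the check; exactly where each lives in the
monitor/work path is simplified here:

/* When handing the ->head list off for a grace period in kfree_rcu_monitor(): */
krwp->head_free_gp_snap = get_state_synchronize_rcu();

/* Before actually freeing that list, the grace period must have elapsed: */
WARN_ON_ONCE(!poll_state_synchronize_rcu(krwp->head_free_gp_snap));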
This commit adds debugging checks to verify that the required RCU
grace period has elapsed for each kvfree_rcu_bulk_data structure that
arrives at the kvfree_rcu_bulk() function. These checks make use
of that structure's ->gp_snap field, which has been upgraded from an
unsigned long to an rcu_gp_oldstate structure. This upgrade reduces
the chances of false positives to nearly zero, even on 32-bit systems,
for which this structure carries 64 bits of state.
Cc: Ziwei Dai <ziwei.dai@unisoc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
o MAINTAINERS files additions and changes.
o Fix hotplug warning in nohz code.
o Tick dependency changes by Zqiang.
o Lazy-RCU shrinker fixes by Zqiang.
o rcu-tasks stall reporting improvements by Neeraj.
o Initial changes for renaming of k[v]free_rcu() to its new k[v]free_rcu_mightsleep()
name for robustness.
o Documentation Updates:
o Significant changes to srcu_struct size.
o Deadlock detection for srcu_read_lock() vs synchronize_srcu() from Boqun.
o rcutorture and rcu-related tool, which are targeted for v6.4 from Boqun's tree.
o Other misc changes.
Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux
Pull RCU updates from Joel Fernandes:
- Updates and additions to MAINTAINERS files, with Boqun being added to
the RCU entry and Zqiang being added as an RCU reviewer.
I have also transitioned from reviewer to maintainer; however, Paul
will be taking over sending RCU pull-requests for the next merge
window.
- Resolution of hotplug warning in nohz code, achieved by fixing
cpu_is_hotpluggable() through interaction with the nohz subsystem.
Tick dependency modifications by Zqiang, focusing on fixing usage of
the TICK_DEP_BIT_RCU_EXP bitmask.
- Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
kernels, fixed by Zqiang.
- Improvements to rcu-tasks stall reporting by Neeraj.
- Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
increased robustness, affecting several components like mac802154,
drbd, vmw_vmci, tracing, and more.
A report by Eric Dumazet showed that the API could be unknowingly
used in an atomic context, so we'd rather make sure they know what
they're asking for by being explicit:
https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/
- Documentation updates, including corrections to spelling,
clarifications in comments, and improvements to the srcu_size_state
comments.
- Better srcu_struct cache locality for readers, by adjusting the size
of srcu_struct in support of SRCU usage by Christoph Hellwig.
- Teach lockdep to detect deadlocks between srcu_read_lock() vs
synchronize_srcu() contributed by Boqun.
Previously lockdep could not detect such deadlocks, now it can.
- Integration of rcutorture and rcu-related tools, targeted for v6.4
from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
module parameter, and more
- Miscellaneous changes, various code cleanups and comment improvements
* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
checkpatch: Error out if deprecated RCU API used
mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
rcu/trace: use strscpy() to instead of strncpy()
tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
...
Memory passed to kvfree_rcu() that is to be freed is tracked by a
per-CPU kfree_rcu_cpu structure, which in turn contains pointers
to kvfree_rcu_bulk_data structures that contain pointers to memory
that has not yet been handed to RCU, along with a kfree_rcu_cpu_work
structure that tracks the memory that has already been handed to RCU.
These structures track three categories of memory: (1) Memory for
kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived
during an OOM episode. The first two categories are tracked in a
cache-friendly manner involving a dynamically allocated page of pointers
(the aforementioned kvfree_rcu_bulk_data structures), while the third
uses a simple (but decidedly cache-unfriendly) linked list through the
rcu_head structures in each block of memory.
On a given CPU, these three categories are handled as a unit, with that
CPU's kfree_rcu_cpu_work structure having one pointer for each of the
three categories. Clearly, new memory for a given category cannot be
placed in the corresponding kfree_rcu_cpu_work structure until any old
memory has had its grace period elapse and thus has been removed. And
the kfree_rcu_monitor() function does in fact check for this.
Except that the kfree_rcu_monitor() function checks these pointers one
at a time. This means that if the previous kfree_rcu() memory passed
to RCU had only category 1 and the current one has only category 2, the
kfree_rcu_monitor() function will send that current category-2 memory
along immediately. This can result in memory being freed too soon,
that is, out from under unsuspecting RCU readers.
To see this, consider the following sequence of events, in which:
o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset",
then is preempted.
o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset"
after a later grace period. Except that "from_cset" is instead freed
right after the previous grace period ended, that is, immediately.
Task A resumes and references "from_cset"'s
member, after which nothing good happens.
In full detail:
CPU 0 CPU 1
---------------------- ----------------------
count_memcg_event_mm()
|rcu_read_lock() <---
|mem_cgroup_from_task()
|// css_set_ptr is the "from_cset" mentioned on CPU 1
|css_set_ptr = rcu_dereference((task)->cgroups)
|// Hard irq comes, current task is scheduled out.
cgroup_attach_task()
|cgroup_migrate()
|cgroup_migrate_execute()
|css_set_move_task(task, from_cset, to_cset, true)
|cgroup_move_task(task, to_cset)
|rcu_assign_pointer(.., to_cset)
|...
|cgroup_migrate_finish()
|put_css_set_locked(from_cset)
|from_cset->refcount return 0
|kfree_rcu(cset, rcu_head) // free from_cset after new gp
|add_ptr_to_bulk_krc_lock()
|schedule_delayed_work(&krcp->monitor_work, ..)
kfree_rcu_monitor()
|krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
|queue_rcu_work(system_wq, &krwp->rcu_work)
|if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
|call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp
// There is a previous call_rcu(.., rcu_work_rcufn)
// gp end, rcu_work_rcufn() is called.
rcu_work_rcufn()
|__queue_work(.., rwork->wq, &rwork->work);
|kfree_rcu_work()
|krwp->bulk_head_free[0] bulk is freed before new gp end!!!
|The "from_cset" is freed before new gp end.
// the task resumes some time later.
|css_set_ptr->subsys[subsys_id] <--- Caused kernel crash, because css_set_ptr is freed.
This commit therefore causes kfree_rcu_monitor() to refrain from moving
kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU
grace period has completed for all three categories.
v2: Use helper function instead of inserted code block at kfree_rcu_monitor().
Fixes: 34c8817455 ("rcu: Support kfree_bulk() interface in kfree_rcu()")
Fixes: 5f3c8d6204 ("rcu/tree: Maintain separate array for vmalloc ptrs")
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
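A sketch of the helper approach mentioned in the v2 note above: the
monitor skips a kfree_rcu_cpu_work slot until all three channels of its
previous batch have been freed. The helper name is hypothetical and the
struct details are assumptions based on the diagram above.

/* Is this kfree_rcu_cpu_work still holding a not-yet-freed batch? */
static bool krwp_channels_busy(struct kfree_rcu_cpu_work *krwp)
{
        int i;

        for (i = 0; i < FREE_N_CHANNELS; i++)
                if (!list_empty(&krwp->bulk_head_free[i]))
                        return true;
        return !!krwp->head_free;
}

/* In kfree_rcu_monitor(): do not attach new memory to a busy krwp. */
if (krwp_channels_busy(krwp))
        continue;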
According to the commit log of the patch that added it to the kernel,
start_poll_synchronize_rcu_expedited() can be invoked very early, as
in long before rcu_init() has been invoked. But before rcu_init(),
the rcu_data structure's ->mynode field has not yet been initialized.
This means that the start_poll_synchronize_rcu_expedited() function's
attempt to set the CPU's leaf rcu_node structure's ->exp_seq_poll_rq
field will result in a segmentation fault.
This commit therefore causes start_poll_synchronize_rcu_expedited() to
set ->exp_seq_poll_rq only after rcu_init() has initialized all CPUs'
rcu_data structures' ->mynode fields. It also removes the check from
the rcu_init() function so that start_poll_synchronize_rcu_expedited()
is unconditionally invoked. Yes, this might result in an unnecessary
boot-time grace period, but this is down in the noise.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
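A sketch of the guard; whether the predicate is literally named
rcu_init_invoked(), and the exact locking around the update, are
assumptions here:

/* In start_poll_synchronize_rcu_expedited(): */
if (rcu_init_invoked()) {       /* Assumed name of the "rcu_init() has run" test. */
        struct rcu_node *rnp = this_cpu_ptr(&rcu_data)->mynode;

        WRITE_ONCE(rnp->exp_seq_poll_rq, s);    /* s: the polled-GP cookie. */
}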
The rcu_accelerate_cbs() function is invoked by rcu_report_qs_rdp()
only if there is a grace period in progress that is still blocked
by at least one CPU on this rcu_node structure. This means that
rcu_accelerate_cbs() should never return the value true, and thus that
this function should never set the needwake variable and in turn never
invoke rcu_gp_kthread_wake().
This commit therefore removes the needwake variable and the invocation
of rcu_gp_kthread_wake() in favor of a WARN_ON_ONCE() on the call to
rcu_accelerate_cbs(). The purpose of this new WARN_ON_ONCE() is to
detect situations where the system's opinion differs from ours.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
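In essence, as an illustrative diff:

-       needwake = rcu_accelerate_cbs(rnp, rdp);
+       /* A true return would mean a new GP is needed, which cannot happen here. */
+       WARN_ON_ONCE(rcu_accelerate_cbs(rnp, rdp));
        ...
-       if (needwake)
-               rcu_gp_kthread_wake();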
This commit adds a comment to help explain why the "else" clause of the
in_serving_softirq() "if" statement does not need to enforce a time limit.
The reason is that this "else" clause handles rcuoc kthreads that do not
block handlers for other softirq vectors.
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
During suspend, we see failures to suspend 1 in 300-500 suspends.
Looking closer, it appears that asynchronous RCU callbacks are being
queued as lazy even though synchronous callbacks are expedited. These
delays appear to not be very welcome by the suspend/resume code as
evidenced by these occasional suspend failures.
This commit modifies call_rcu() to check if rcu_async_should_hurry(),
which will return true if we are in suspend or in-kernel boot.
[ paulmck: Alphabetize local variables. ]
Ignoring the lazy hint makes the 3000 suspend/resume cycles pass
reliably on a 12th gen 12-core Intel CPU, and there is some evidence
that it also slightly speeds up boot performance.
Fixes: 3cb278e73b ("rcu: Make call_rcu() lazy to save power")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
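A sketch of the idea, assuming the internal __call_rcu_common(head,
func, lazy) signature introduced by the lazy-RCU work:

void call_rcu(struct rcu_head *head, rcu_callback_t func)
{
        /* Do not be lazy while suspend/resume or in-kernel boot is in flight. */
        __call_rcu_common(head, func,
                          IS_ENABLED(CONFIG_RCU_LAZY) && !rcu_async_should_hurry());
}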
Boot and suspend/resume should not be slowed down in kernels built with
CONFIG_RCU_LAZY=y. In particular, suspend can sometimes fail in such
kernels.
This commit therefore adds rcu_async_hurry(), rcu_async_relax(), and
rcu_async_should_hurry() functions that track whether or not either
a boot or a suspend/resume operation is in progress. This will
enable a later commit to refrain from laziness during those times.
Export rcu_async_should_hurry(), rcu_async_hurry(), and rcu_async_relax()
for later use by rcutorture.
[ paulmck: Apply feedback from Steve Rostedt. ]
Fixes: 3cb278e73b ("rcu: Make call_rcu() lazy to save power")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
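A minimal sketch of how such tracking can look: a simple nesting counter
that starts at 1 so that early boot counts as "hurry" until something
like rcu_end_inkernel_boot() relaxes it. Treat the details (and the
CONFIG_RCU_LAZY gating) as assumptions.

static atomic_t rcu_async_hurry_nesting = ATOMIC_INIT(1);       /* Boot counts as "hurry". */

void rcu_async_hurry(void)
{
        if (IS_ENABLED(CONFIG_RCU_LAZY))
                atomic_inc(&rcu_async_hurry_nesting);
}
EXPORT_SYMBOL_GPL(rcu_async_hurry);

void rcu_async_relax(void)
{
        if (IS_ENABLED(CONFIG_RCU_LAZY))
                atomic_dec(&rcu_async_hurry_nesting);
}
EXPORT_SYMBOL_GPL(rcu_async_relax);

bool rcu_async_should_hurry(void)
{
        return !IS_ENABLED(CONFIG_RCU_LAZY) ||
               atomic_read(&rcu_async_hurry_nesting);
}
EXPORT_SYMBOL_GPL(rcu_async_should_hurry);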