Commit 13fb0f74d7 ("printk: Avoid livelock with heavy printk
during panic") introduced a mechanism to silence non-panic CPUs
if too many messages are being dropped. Aside from trying to
workaround the livelock bugs of legacy consoles, it was also
intended to avoid losing panic messages. However, if non-panic
CPUs are writing to the ringbuffer, then reacting to dropped
messages is too late.
Another motivation is that non-finalized messages already might
be skipped in panic(). In other words, random messages from
non-panic CPUs might already get lost. It is better to ignore
all to avoid confusion.
To avoid losing panic CPU messages, silence non-panic CPUs
immediately on panic.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-13-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
The commit d51507098f ("printk: disable optimistic spin
during panic") added checks to avoid becoming a console waiter
if a panic is in progress.
However, the transition to panic can occur while there is
already a waiter. The current owner should not pass the lock to
the waiter because it might get stopped or blocked anytime.
Also the panic context might pass the console lock owner to an
already stopped waiter by mistake. It might happen when
console_flush_on_panic() ignores the current lock owner, for
example:
CPU0 CPU1
---- ----
console_lock_spinning_enable()
console_trylock_spinning()
[CPU1 now console waiter]
NMI: panic()
panic_other_cpus_shutdown()
[stopped as console waiter]
console_flush_on_panic()
console_lock_spinning_enable()
[print 1 record]
console_lock_spinning_disable_and_check()
[handover to stopped CPU1]
This results in panic() not flushing the panic messages.
Fix these problems by disabling all spinning operations
completely during panic().
Another advantage is that it prevents possible deadlocks caused
by "console_owner_lock". The panic() context does not need to
take it any longer. The lockless checks are safe because the
functions become NOPs when they see the panic in progress. All
operations manipulating the state are still synchronized by the
lock even when non-panic CPUs would notice the panic
synchronously.
The current owner might stay spinning. But non-panic() CPUs
would get stopped anyway and the panic context will never start
spinning.
Fixes: dbdda842fe ("printk: Add console owner and waiter logic to load balance console writes")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20240207134103.1357162-12-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Normally a reader will stop once reaching a non-finalized
record. However, when a panic happens, writers from other CPUs
(or an interrupted context on the panic CPU) may have been
writing a record and were unable to finalize it. The panic CPU
will reserve/commit/finalize its panic records, but these will
be located after the non-finalized records. This results in
panic() not flushing the panic messages.
Extend _prb_read_valid() to skip over non-finalized records if
on the panic CPU.
Fixes: 896fbe20b4 ("printk: use the lockless ringbuffer")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-11-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Currently pr_flush() will only wait for records that were
available to readers at the time of the call (using
prb_next_seq()). But there may be more records (non-finalized)
that have following finalized records. pr_flush() should wait
for these to print as well. Particularly because any trailing
finalized records may be the messages that the calling context
wants to ensure are printed.
Add a new ringbuffer function prb_next_reserve_seq() to return
the sequence number following the most recently reserved record.
This guarantees that pr_flush() will wait until all current
printk() messages (completed or in progress) have been printed.
Fixes: 3b604ca812 ("printk: add pr_flush()")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-10-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
With the lockless ringbuffer, it is allowed that multiple
CPUs/contexts write simultaneously into the buffer. This creates
an ambiguity as some writers will finalize sooner.
The documentation for the prb_read functions is not clear as it
refers to "not yet written" and "no data available". Clarify the
return values and language to be in terms of the reader: records
available for reading.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-9-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
There is already panic_in_progress() and other_cpu_in_panic(),
but checking if the current CPU is the panic CPU must still be
open coded.
Add this_cpu_in_panic() to complete the set.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-8-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Currently @suppress_panic_printk is checked along with
non-matching @panic_cpu and current CPU. This works
because @suppress_panic_printk is only set when
panic_in_progress() is true.
Rather than relying on the @suppress_panic_printk semantics,
use the concise helper function other_cpu_in_progress(). The
helper function exists to avoid open coding such tests.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-7-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
For empty line records, no data blocks are created. Instead,
these valid records are identified by special logical position
values (in fields of @prb_desc.text_blk_lpos).
Currently the macro NO_LPOS is used for empty line records.
This name is confusing because it does not imply _why_ there is
no data block.
Rename NO_LPOS to EMPTY_LINE_LPOS so that it is clear why there
is no data block.
Also add comments explaining the use of EMPTY_LINE_LPOS as well
as clarification to the values used to represent data-less
blocks.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-6-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Commit f244b4dc53 ("printk: ringbuffer: Improve
prb_next_seq() performance") introduced an optimization for
prb_next_seq() by using best-effort to track recently finalized
records. However, the order of finalization does not
necessarily match the order of the records. The optimization
changed prb_next_seq() to return inconsistent results, possibly
yielding sequence numbers that are not available to readers
because they are preceded by non-finalized records or they are
not yet visible to the reader CPU.
Rather than simply best-effort tracking recently finalized
records, force the committing writer to read records and
increment the last "contiguous block" of finalized records. In
order to do this, the sequence number instead of ID must be
stored because ID's cannot be directly compared.
A new memory barrier pair is introduced to guarantee that a
reader can always read the records up until the sequence number
returned by prb_next_seq() (unless the records have since
been overwritten in the ringbuffer).
This restores the original functionality of prb_next_seq()
while also keeping the optimization.
For 32bit systems, only the lower 32 bits of the sequence
number are stored. When reading the value, it is expanded to
the full 64bit sequence number using the 32bit seq macros,
which fold in the value returned by prb_first_seq().
Fixes: f244b4dc53 ("printk: ringbuffer: Improve prb_next_seq() performance")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-5-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Note: This change only applies to 32bit architectures. On 64bit
architectures the macros are NOPs.
Currently prb_next_seq() is used as the base for the 32bit seq
macros __u64seq_to_ulseq() and __ulseq_to_u64seq(). However, in
a follow-up commit, prb_next_seq() will need to make use of the
32bit seq macros.
Use prb_first_seq() as the base for the 32bit seq macros instead
because it is guaranteed to return 64bit sequence numbers without
relying on any 32bit seq macros.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-4-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Note: This change only applies to 32bit architectures. On 64bit
architectures the macros are NOPs.
__ulseq_to_u64seq() computes the upper 32 bits of the passed
argument value (@ulseq). The upper bits are derived from a base
value (@rb_next_seq) in a way that assumes @ulseq represents a
64bit number that is less than or equal to @rb_next_seq.
Until now this mapping has been correct for all call sites. However,
in a follow-up commit, values of @ulseq will be passed in that are
higher than the base value. This requires a change to how the 32bit
value is mapped to a 64bit sequence number.
Rather than mapping @ulseq such that the base value is the end of a
32bit block, map @ulseq such that the base value is in the middle of
a 32bit block. This allows supporting 31 bits before and after the
base value, which is deemed acceptable for the console sequence
number during runtime.
Here is an example to illustrate the previous and new mappings.
For a base value (@rb_next_seq) of 2 2000 0000...
Before this change the range of possible return values was:
1 2000 0001 to 2 2000 0000
__ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
__ulseq_to_u64seq(2000 0000) => 2 2000 0000
__ulseq_to_u64seq(2000 0001) => 1 2000 0001
__ulseq_to_u64seq(9fff ffff) => 1 9fff ffff
__ulseq_to_u64seq(a000 0000) => 1 a000 0000
__ulseq_to_u64seq(a000 0001) => 1 a000 0001
After this change the range of possible return values are:
1 a000 0001 to 2 a000 0000
__ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
__ulseq_to_u64seq(2000 0000) => 2 2000 0000
__ulseq_to_u64seq(2000 0001) => 2 2000 0001
__ulseq_to_u64seq(9fff ffff) => 2 9fff ffff
__ulseq_to_u64seq(a000 0000) => 2 a000 0000
__ulseq_to_u64seq(a000 0001) => 1 a000 0001
[ john.ogness: Rewrite commit message. ]
Reported-by: Francesco Dolcini <francesco@dolcini.it>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-3-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
The macros __seq_to_nbcon_seq() and __nbcon_seq_to_seq() are
used to provide support for atomic handling of sequence numbers
on 32bit systems. Until now this was only used by nbcon.c,
which is why they were located in nbcon.c and include nbcon in
the name.
In a follow-up commit this functionality is also needed by
printk_ringbuffer. Rather than duplicating the functionality,
relocate the macros to printk_ringbuffer.h.
Also, since the macros will be no longer nbcon-specific, rename
them to __u64seq_to_ulseq() and __ulseq_to_u64seq().
This does not result in any functional change.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240207134103.1357162-2-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Clocksource pointers can be problematic to obtain for drivers which are not
clocksource drivers themselves. In particular, the RFC virtio_rtc driver
[1] would require a new helper function to obtain a pointer to the ARM
Generic Timer clocksource. The ptp_kvm driver also required a similar
workaround.
Address this by evaluating the clocksource ID, rather than the clocksource
pointer, of struct system_counterval_t. By this, setting the clocksource
pointer becomes unneeded, and get_device_system_crosststamp() callers will
no longer need to supply clocksource pointers.
All relevant clocksource drivers provide the ID, so this change is not
changing the behaviour.
[1] https://lore.kernel.org/lkml/20231218073849.35294-1-peter.hilber@opensynergy.com/
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240201010453.2212371-7-peter.hilber@opensynergy.com
Now that the driver core can properly handle constant struct bus_type,
move the clockevents_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-2-207ec18e24b8@marliere.net
Now that the driver core can properly handle constant struct bus_type,
move the clocksource_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-1-207ec18e24b8@marliere.net
We can get EBADF from pidfd_getfd() if a task is currently exiting,
which might be confusing. Let's check PF_EXITING, and just report ESRCH
if so.
I chose PF_EXITING, because it is set in exit_signals(), which is called
before exit_files(). Since ->exit_status is mostly set after
exit_files() in exit_notify(), using that still leaves a window open for
the race.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tycho Andersen <tandersen@netflix.com>
Link: https://lore.kernel.org/r/20240206192357.81942-1-tycho@tycho.pizza
Signed-off-by: Christian Brauner <brauner@kernel.org>
copy_process() just needs to pass PIDFD_THREAD to __pidfd_prepare()
if clone_flags & CLONE_THREAD.
We can also add another CLONE_ flag (or perhaps reuse CLONE_DETACHED)
to enforce PIDFD_THREAD without CLONE_THREAD.
Originally-from: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240205145532.GA28823@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
It was used by pidfd_poll() but now it has no callers.
If it finally finds a modular user we can revert this change, but note
that the comment above this helper and the changelog in 38fd525a4c
("exit: Factor thread_group_exited out of pidfd_poll") are not accurate,
thread_group_exited() won't return true if all other threads have passed
exit_notify() and are zombies, it returns true only when all other threads
are completely gone. Not to mention that it can only work if the task
identified by @pid is a thread-group leader.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240205174347.GA31461@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
rather than wake_up_all(). This way do_notify_pidfd() won't wakeup the
POLLHUP-only waiters which wait for pid_task() == NULL.
TODO:
- as Christian pointed out, this asks for the new wake_up_all_poll()
helper, it can already have other users.
- we can probably discriminate the PIDFD_THREAD and non-PIDFD_THREAD
waiters, but this needs more work. See
https://lore.kernel.org/all/20240205140848.GA15853@redhat.com/
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240205141348.GA16539@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The hrtimers migration on CPU-down hotplug process has been moved
earlier, before the CPU actually goes to die. This leaves a small window
of opportunity to queue an hrtimer in a blind spot, leaving it ignored.
For example a practical case has been reported with RCU waking up a
SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, queuing that
way a sched/rt timer to the local offline CPU.
Make sure such situations never go unnoticed and warn when that happens.
Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240129235646.3171983-4-boqun.feng@gmail.com
Allow transferring an imbalanced RCU lock state between subprog calls
during verification. This allows patterns where a subprog call returns
with an RCU lock held, or a subprog call releases an RCU lock held by
the caller. Currently, the verifier would end up complaining if the RCU
lock is not released when processing an exit from a subprog, which is
non-ideal if its execution is supposed to be enclosed in an RCU read
section of the caller.
Instead, simply only check whether we are processing exit for frame#0
and do not complain on an active RCU lock otherwise. We only need to
update the check when processing BPF_EXIT insn, as copy_verifier_state
is already set up to do the right thing.
Suggested-by: David Vernet <void@manifault.com>
Tested-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20240205055646.1112186-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, calling any helpers, kfuncs, or subprogs except the graph
data structure (lists, rbtrees) API kfuncs while holding a bpf_spin_lock
is not allowed. One of the original motivations of this decision was to
force the BPF programmer's hand into keeping the bpf_spin_lock critical
section small, and to ensure the execution time of the program does not
increase due to lock waiting times. In addition to this, some of the
helpers and kfuncs may be unsafe to call while holding a bpf_spin_lock.
However, when it comes to subprog calls, atleast for static subprogs,
the verifier is able to explore their instructions during verification.
Therefore, it is similar in effect to having the same code inlined into
the critical section. Hence, not allowing static subprog calls in the
bpf_spin_lock critical section is mostly an annoyance that needs to be
worked around, without providing any tangible benefit.
Unlike static subprog calls, global subprog calls are not safe to permit
within the critical section, as the verifier does not explore them
during verification, therefore whether the same lock will be taken
again, or unlocked, cannot be ascertained.
Therefore, allow calling static subprogs within a bpf_spin_lock critical
section, and only reject it in case the subprog linkage is global.
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240204222349.938118-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The for-6.8-fixes commit ae9cc8956944 ("Revert "workqueue: Override implicit
ordered attribute in workqueue_apply_unbound_cpumask()") also fixes build for
Signed-off-by: Tejun Heo <tj@kernel.org>
This reverts commit ca10d851b9.
The commit allowed workqueue_apply_unbound_cpumask() to clear __WQ_ORDERED
on now removed implicitly ordered workqueues. This was incorrect in that
system-wide config change shouldn't break ordering properties of all
workqueues. The reason why apply_workqueue_attrs() path was allowed to do so
was because it was targeting the specific workqueue - either the workqueue
had WQ_SYSFS set or the workqueue user specifically tried to change
max_active, both of which indicate that the workqueue doesn't need to be
ordered.
The implicitly ordered workqueue promotion was removed by the previous
commit 3bc1e711c2 ("workqueue: Don't implicitly make UNBOUND workqueues w/
@max_active==1 ordered"). However, it didn't update this path and broke
build. Let's revert the commit which was incorrect in the first place which
also fixes build.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3bc1e711c2 ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered")
Fixes: ca10d851b9 ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()")
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Tejun Heo <tj@kernel.org>
5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
automoatically promoted UNBOUND workqueues w/ @max_active==1 to ordered
workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way
to create ordered workqueues and the new NUMA support broke it. These
problems can be subtle and the fact that they can only trigger on NUMA
machines made them even more difficult to debug.
However, overloading the UNBOUND allocation interface this way creates other
issues. It's difficult to tell whether a given workqueue actually needs to
be ordered and users that legitimately want a min concurrency level wq
unexpectedly gets an ordered one instead. With planned UNBOUND workqueue
udpates to improve execution locality and more prevalence of chiplet designs
which can benefit from such improvements, this isn't a state we wanna be in
forever.
There aren't that many UNBOUND w/ @max_active==1 users in the tree and the
preceding patches audited all and converted them to
alloc_ordered_workqueue() as appropriate. This patch removes the implicit
promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones.
v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in
apply_workqueue_attrs_locked() which spuriously triggers WARNING and
fails workqueue creation. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.
This patch converts backtracetest from tasklet to BH workqueue.
- Replace "irq" with "bh" in names and message to better reflect what's
happening.
- Replace completion usage with a flush_work() call.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
The "i" here is always equal to "btf_type_vlen(t)" since
the "for_each_member()" loop never breaks.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240203055119.2235598-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Skip updating workqueues with __WQ_DESTROYING bit set when updating
global unbound cpumask to avoid unnecessary work and other complications.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This reverts commit d412ace111. This leads to
build failures as it depends on a driver-core commit 32f78abe59 ("driver
core: bus: constantify subsys_register() calls"). Let's drop it from wq tree
and route it through driver-core tree.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202402051505.kM9Rr3CJ-lkp@intel.com/
Mark the task as having a cached timestamp when set assign it, so we
can efficiently check if it needs updating post being scheduled back in.
This covers both the actual schedule out case, which would've flushed
the plug, and the preemption case which doesn't touch the plugged
requests (for many reasons, one of them being then we'd need to have
preemption disabled around plug state manipulation).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Extend the support for LZ4 compression to be used with hibernation.
The main idea is that different compression algorithms
have different characteristics and hibernation may benefit when it uses
any of these algorithms: a default algorithm, having higher
compression rate but is slower(compression/decompression) and a
secondary algorithm, that is faster(compression/decompression) but has
lower compression rate.
LZ4 algorithm has better decompression speeds over LZO. This reduces
the hibernation image restore time.
As per test results:
LZO LZ4
Size before Compression(bytes) 682696704 682393600
Size after Compression(bytes) 146502402 155993547
Decompression Rate 335.02 MB/s 501.05 MB/s
Restore time 4.4s 3.8s
LZO is the default compression algorithm used for hibernation. Enable
CONFIG_HIBERNATION_COMP_LZ4 to set the default compressor as LZ4.
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently for hibernation, LZO is the only compression algorithm
available and uses the existing LZO library calls. However, there
is no flexibility to switch to other algorithms which provides better
results. The main idea is that different compression algorithms have
different characteristics and hibernation may benefit when it uses
alternate algorithms.
By moving to crypto based APIs, it lays a foundation to use other
compression algorithms for hibernation. There are no functional changes
introduced by this approach.
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Renaming lzo* to generic names, except for lzo_xxx() APIs. This is
used in the next patch where we move to crypto based APIs for
compression. There are no functional changes introduced by this
approach.
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Because dpm_save_failed_dev() may be called simultaneously by multiple
failing device PM functions, the state of the suspend_stats fields
updated by it may become inconsistent.
Prevent that from happening by using a lock in dpm_save_failed_dev().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
It is not necessary to define struct suspend_stats in a header file and the
suspend_stats variable in the core device system-wide PM code. They both
can be defined in kernel/power/main.c, next to the sysfs and debugfs code
accessing suspend_stats, which can be static.
Modify the code in question in accordance with the above observation and
replace the static inline functions manipulating suspend_stats with
regular ones defined in kernel/power/main.c.
While at it, move the enum suspend_stat_step to the end of suspend.h which
is a more suitable place for it.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Change the type of the "success" and "fail" fields in struct
suspend_stats to unsigned int, because they cannot be negative.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Instead of using a set of individual struct suspend_stats fields
representing suspend step failure counters, use an array of counters
indexed by enum suspend_stat_step for this purpose, which allows
dpm_save_failed_step() to increment the appropriate counter
automatically, so that its callers don't need to do that directly.
It also allows suspend_stats_show() to carry out a loop over the
counters array to print their values.
Because the counters cannot become negative, use unsigned int for
representing them.
The only user-observable impact of this change is a different
ordering of entries in the suspend_stats debugfs file which is not
expected to matter.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Replace suspend_step_name() in the suspend statistics code with an array
of suspend step names which has fewer lines of code and less overhead.
While at it, remove two unnecessary line breaks in suspend_stats_show()
and adjust some white space in there to the kernel coding style for a
more consistent code layout.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws such as
the execution code accessing the tasklet item after the execution is
complete which can lead to subtle use-after-free in certain usage scenarios
and less-developed flush and cancel mechanisms.
This patch implements BH workqueues which share the same semantics and
features of regular workqueues but execute their work items in the softirq
context. As there is always only one BH execution context per CPU, none of
the concurrency management mechanisms applies and a BH workqueue can be
thought of as a convenience wrapper around softirq.
Except for the inability to sleep while executing and lack of max_active
adjustments, BH workqueues and work items should behave the same as regular
workqueues and work items.
Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
convert all tasklet users over to BH workqueues. Once the conversion is
complete, tasklet can be removed and BH workqueues can directly take over
the tasklet softirqs.
system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
tasklet, all existing tasklet users should be able to use the system BH
workqueues without creating their own workqueues.
v3: - Add missing interrupt.h include.
v2: - Instead of using tasklets, hook directly into its softirq action
functions - tasklet[_hi]_action(). This is slightly cheaper and closer
to the eventual code structure we want to arrive at. Suggested by Lai.
- Lai also pointed out several places which need NULL worker->task
handling or can use clarification. Updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
Tested-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Factor out init_cpu_worker_pool() from workqueue_init_early(). This is pure
reorganization in preparation of BH workqueue support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Allen Pais <allen.lkml@gmail.com>
These changes are in preparation of BH workqueue which will execute work
items from BH context.
- Update lock and RCU depth checks in process_one_work() so that it
remembers and checks against the starting depths and prints out the depth
changes.
- Factor out lockdep annotations in the flush paths into
touch_{wq|work}_lockdep_map(). The work->lockdep_map touching is moved
from __flush_work() to its callee - start_flush_work(). This brings it
closer to the wq counterpart and will allow testing the associated wq's
flags which will be needed to support BH workqueues. This is not expected
to cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Allen Pais <allen.lkml@gmail.com>
Now that the driver core can properly handle constant struct bus_type,
move the wq_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-and-reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
dd6c3c5441 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work
item handling") relocated pwq_dec_nr_in_flight() after
set_work_pool_and_keep_pending(). However, the latter destroys information
contained in work->data that's needed by pwq_dec_nr_in_flight() including
the flush color. With flush color destroyed, flush_workqueue() can stall
easily when mixed with cancel_work*() usages.
This is easily triggered by running xfstests generic/001 test on xfs:
INFO: task umount:6305 blocked for more than 122 seconds.
...
task:umount state:D stack:13008 pid:6305 tgid:6305 ppid:6301 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2f6/0xa20
schedule+0x36/0xb0
schedule_timeout+0x20b/0x280
wait_for_completion+0x8a/0x140
__flush_workqueue+0x11a/0x3b0
xfs_inodegc_flush+0x24/0xf0
xfs_unmountfs+0x14/0x180
xfs_fs_put_super+0x3d/0x90
generic_shutdown_super+0x7c/0x160
kill_block_super+0x1b/0x40
xfs_kill_sb+0x12/0x30
deactivate_locked_super+0x35/0x90
deactivate_super+0x42/0x50
cleanup_mnt+0x109/0x170
__cleanup_mnt+0x12/0x20
task_work_run+0x60/0x90
syscall_exit_to_user_mode+0x146/0x150
do_syscall_64+0x5d/0x110
entry_SYSCALL_64_after_hwframe+0x6c/0x74
Fix it by stashing work_data before calling set_work_pool_and_keep_pending()
and using the stashed value for pwq_dec_nr_in_flight().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Chandan Babu R <chandanbabu@kernel.org>
Link: http://lkml.kernel.org/r/87o7cxeehy.fsf@debian-BULLSEYE-live-builder-AMD64
Fixes: dd6c3c5441 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling")
When btf_prepare_func_args() was generalized to handle both static and
global subprogs, a few warnings/errors that are meant only for global
subprog cases started to be emitted for static subprogs, where they are
sort of expected and irrelavant.
Stop polutting verifier logs with irrelevant scary-looking messages.
Fixes: e26080d0da ("bpf: prepare btf_prepare_func_args() for handling static subprogs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240202190529.2374377-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add PTR_TRUSTED | PTR_MAYBE_NULL modifiers for PTR_TO_BTF_ID to
check_reg_type() to support passing trusted nullable PTR_TO_BTF_ID
registers into global functions accepting `__arg_trusted __arg_nullable`
arguments. This hasn't been caught earlier because tests were either
passing known non-NULL PTR_TO_BTF_ID registers or known NULL (SCALAR)
registers.
When utilizing this functionality in complicated real-world BPF
application that passes around PTR_TO_BTF_ID_OR_NULL, it became apparent
that verifier rejects valid case because check_reg_type() doesn't handle
this case explicitly. Existing check_reg_type() logic is already
anticipating this combination, so we just need to explicitly list this
combo in the switch statement.
Fixes: e2b3c4ff5d ("bpf: add __arg_trusted global func arg tag")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240202190529.2374377-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Fix the return code for ring_buffer_poll_wait()
It was returing a -EINVAL instead of EPOLLERR.
- Zero out the tracefs_inode so that all fields are initialized.
The ti->private could have had stale data, but instead of
just initializing it to NULL, clear out the entire structure
when it is allocated.
- Fix a crash in timerlat
The hrtimer was initialized at read and not open, but is
canceled at close. If the file was opened and never read
the close will pass a NULL pointer to hrtime_cancel().
- Rewrite of eventfs.
Linus wrote a patch series to remove the dentry references in the
eventfs_inode and to use ref counting and more of proper VFS
interfaces to make it work.
- Add warning to put_ei() if ei is not set to free. That means
something is about to free it when it shouldn't.
- Restructure the eventfs_inode to make it more compact, and remove
the unused llist field.
- Remove the fsnotify*() funtions for when the inodes were being created
in the lookup code. It doesn't make sense to notify about creation
just because something is being looked up.
- The inode hard link count was not accurate. It was being updated
when a file was looked up. The inodes of directories were updating
their parent inode hard link count every time the inode was created.
That means if memory reclaim cleaned a stale directory inode and
the inode was lookup up again, it would increment the parent inode
again as well. Al Viro said to just have all eventfs directories
have a hard link count of 1. That tells user space not to trust it.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZb1l/RQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qk6jAQDmecDOnx+j/Rm5krbX/meVPYXFj2CU
1wO7w1HBzopsBwEA5AjTKm9IGrl/eVG/+jViS165b+sJfwEcblHEFPWcIwo=
=uUzb
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing and eventfs fixes from Steven Rostedt:
- Fix the return code for ring_buffer_poll_wait()
It was returing a -EINVAL instead of EPOLLERR.
- Zero out the tracefs_inode so that all fields are initialized.
The ti->private could have had stale data, but instead of just
initializing it to NULL, clear out the entire structure when it is
allocated.
- Fix a crash in timerlat
The hrtimer was initialized at read and not open, but is canceled at
close. If the file was opened and never read the close will pass a
NULL pointer to hrtime_cancel().
- Rewrite of eventfs.
Linus wrote a patch series to remove the dentry references in the
eventfs_inode and to use ref counting and more of proper VFS
interfaces to make it work.
- Add warning to put_ei() if ei is not set to free. That means
something is about to free it when it shouldn't.
- Restructure the eventfs_inode to make it more compact, and remove the
unused llist field.
- Remove the fsnotify*() funtions for when the inodes were being
created in the lookup code. It doesn't make sense to notify about
creation just because something is being looked up.
- The inode hard link count was not accurate.
It was being updated when a file was looked up. The inodes of
directories were updating their parent inode hard link count every
time the inode was created. That means if memory reclaim cleaned a
stale directory inode and the inode was lookup up again, it would
increment the parent inode again as well. Al Viro said to just have
all eventfs directories have a hard link count of 1. That tells user
space not to trust it.
* tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Keep all directory links at 1
eventfs: Remove fsnotify*() functions from lookup()
eventfs: Restructure eventfs_inode structure to be more condensed
eventfs: Warn if an eventfs_inode is freed without is_freed being set
tracing/timerlat: Move hrtimer_init to timerlat_fd open()
eventfs: Get rid of dentry pointers without refcounts
eventfs: Clean up dentry ops and add revalidate function
eventfs: Remove unused d_parent pointer field
tracefs: dentry lookup crapectomy
tracefs: Avoid using the ei->dentry pointer unnecessarily
eventfs: Initialize the tracefs inode properly
tracefs: Zero out the tracefs_inode when allocating it
ring-buffer: Clean ring_buffer_poll_wait() error return
When check_stack_read_fixed_off() reads value from an spi
all stack slots of which are set to STACK_{MISC,INVALID},
the destination register is set to unbound SCALAR_VALUE.
Exploit this fact by allowing stacksafe() to use a fake
unbound scalar register to compare 'mmmm mmmm' stack value
in old state vs spilled 64-bit scalar in current state
and vice versa.
Veristat results after this patch show some gains:
./veristat -C -e file,prog,states -f 'states_pct>10' not-opt after
File Program States (DIFF)
----------------------- --------------------- ---------------
bpf_overlay.o tail_rev_nodeport_lb4 -45 (-15.85%)
bpf_xdp.o tail_lb_ipv4 -541 (-19.57%)
pyperf100.bpf.o on_event -680 (-10.42%)
pyperf180.bpf.o on_event -2164 (-19.62%)
pyperf600.bpf.o on_event -9799 (-24.84%)
strobemeta.bpf.o on_event -9157 (-65.28%)
xdp_synproxy_kern.bpf.o syncookie_tc -54 (-19.29%)
xdp_synproxy_kern.bpf.o syncookie_xdp -74 (-24.50%)
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127175237.526726-6-maxtram95@gmail.com
When the width of a fill is smaller than the width of the preceding
spill, the information about scalar boundaries can still be preserved,
as long as it's coerced to the right width (done by coerce_reg_to_size).
Even further, if the actual value fits into the fill width, the ID can
be preserved as well for further tracking of equal scalars.
Implement the above improvements, which makes narrowing fills behave the
same as narrowing spills and MOVs between registers.
Two tests are adjusted to accommodate for endianness differences and to
take into account that it's now allowed to do a narrowing fill from the
least significant bits.
reg_bounds_sync is added to coerce_reg_to_size to correctly adjust
umin/umax boundaries after the var_off truncation, for example, a 64-bit
value 0xXXXXXXXX00000000, when read as a 32-bit, gets umin = 0, umax =
0xFFFFFFFF, var_off = (0x0; 0xffffffff00000000), which needs to be
synced down to umax = 0, otherwise reg_bounds_sanity_check doesn't pass.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127175237.526726-4-maxtram95@gmail.com
Support the pattern where an unbounded scalar is spilled to the stack,
then boundary checks are performed on the src register, after which the
stack frame slot is refilled into a register.
Before this commit, the verifier didn't treat the src register and the
stack slot as related if the src register was an unbounded scalar. The
register state wasn't copied, the id wasn't preserved, and the stack
slot was marked as STACK_MISC. Subsequent boundary checks on the src
register wouldn't result in updating the boundaries of the spilled
variable on the stack.
After this commit, the verifier will preserve the bond between src and
dst even if src is unbounded, which permits to do boundary checks on src
and refill dst later, still remembering its boundaries. Such a pattern
is sometimes generated by clang when compiling complex long functions.
One test is adjusted to reflect that now unbounded scalars are tracked.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240127175237.526726-2-maxtram95@gmail.com
Now that rodata_enabled is declared at all time, the #ifdef
CONFIG_STRICT_MODULE_RWX can be removed.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
transfer_pid() must be never called with pid == PIDTYPE_PID,
new_leader->thread_pid should be changed by exchange_tids().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240202131255.GA26025@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add another wake_up_all(wait_pidfd) into __change_pid() and change
pidfd_poll() to include EPOLLHUP if task == NULL.
This allows to wait until the target process/thread is reaped.
TODO: change do_notify_pidfd() to use the keyed wakeups.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240202131226.GA26018@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
With this flag:
- pidfd_open() doesn't require that the target task must be
a thread-group leader
- pidfd_poll() succeeds when the task exits and becomes a
zombie (iow, passes exit_notify()), even if it is a leader
and thread-group is not empty.
This means that the behaviour of pidfd_poll(PIDFD_THREAD,
pid-of-group-leader) is not well defined if it races with
exec() from its sub-thread; pidfd_poll() can succeed or not
depending on whether pidfd_task_exited() is called before
or after exchange_tids().
Perhaps we can improve this behaviour later, pidfd_poll()
can probably take sig->group_exec_task into account. But
this doesn't really differ from the case when the leader
exits before other threads (so pidfd_poll() succeeds) and
then another thread execs and pidfd_poll() will block again.
thread_group_exited() is no longer used, perhaps it can die.
Co-developed-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240131132602.GA23641@redhat.com
Tested-by: Tycho Andersen <tandersen@netflix.com>
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
do_notify_pidfd() makes no sense until the whole thread group exits, change
do_notify_parent() to check thread_group_empty().
This avoids the unnecessary do_notify_pidfd() when tsk is not a leader, or
it exits before other threads, or it has a ptraced EXIT_ZOMBIE sub-thread.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240127132407.GA29136@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
- make pidfd_create() static.
- Don't pass O_RDWR | O_CLOEXEC to __pidfd_prepare() in copy_process(),
__pidfd_prepare() adds these flags unconditionally.
- Kill the flags check in __pidfd_prepare(). sys_pidfd_open() checks the
flags itself, all other users of pidfd_prepare() pass flags = 0.
If we need a sanity check for those other in kernel users then
WARN_ON_ONCE(flags & ~PIDFD_NONBLOCK) makes more sense.
- Don't pass O_RDWR to get_unused_fd_flags(), it ignores everything except
O_CLOEXEC.
- Don't pass O_CLOEXEC to anon_inode_getfile(), it ignores everything except
O_ACCMODE | O_NONBLOCK.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240125161734.GA778@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
In the current implementation of clone(), there is a line that
initializes `u64 clone_flags = args->flags` at the top.
This means that there is no longer a need to use args->flags
for the legacy clone check.
Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Link: https://lore.kernel.org/r/202401311054+0800-wangjinchao@xfusion.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
CAP_SYSLOG was separated from CAP_SYS_ADMIN and introduced in Linux
2.6.37 (2010-11). For a long time, certain syslog actions required
CAP_SYS_ADMIN or CAP_SYSLOG. Maybe it’s time to officially remove
CAP_SYS_ADMIN for more fine-grained control.
CAP_SYS_ADMIN was once removed but added back for backwards
compatibility reasons. In commit 38ef4c2e43 ("syslog: check cap_syslog
when dmesg_restrict") (2010-12), CAP_SYS_ADMIN was no longer needed. And
in commit ee24aebffb ("cap_syslog: accept CAP_SYS_ADMIN for now")
(2011-02), it was accepted again. Since then, CAP_SYS_ADMIN has been
preserved.
Now that almost 13 years have passed, the legacy application may have
had enough time to be updated.
Signed-off-by: Jingzi Meng <mengjingzi@iie.ac.cn>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20240105062007.26965-1-mengjingzi@iie.ac.cn
Signed-off-by: Kees Cook <keescook@chromium.org>
There's already one main CONFIG_SECURITY_NETWORK ifdef block within
the sleepable_lsm_hooks BTF set. Consolidate this duplicated ifdef
block as there's no need for it and all things guarded by it should
remain in one place in this specific context.
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/Zbt1smz43GDMbVU3@google.com
This commit marks kfuncs as such inside the .BTF_ids section. The upshot
of these annotations is that we'll be able to automatically generate
kfunc prototypes for downstream users. The process is as follows:
1. In source, use BTF_KFUNCS_START/END macro pair to mark kfuncs
2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for
each function inside BTF_KFUNCS sets
3. At runtime, vmlinux or module BTF is made available in sysfs
4. At runtime, bpftool (or similar) can look at provided BTF and
generate appropriate prototypes for functions with "bpf_kfunc" tag
To ensure future kfunc are similarly tagged, we now also return error
inside kfunc registration for untagged kfuncs. For vmlinux kfuncs,
we also WARN(), as initcall machinery does not handle errors.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.1706491398.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The return type for ring_buffer_poll_wait() is __poll_t. This is behind
the scenes an unsigned where we can set event bits. In case of a
non-allocated CPU, we do return instead -EINVAL (0xffffffea). Lucky us,
this ends up setting few error bits (EPOLLERR | EPOLLHUP | EPOLLNVAL), so
user-space at least is aware something went wrong.
Nonetheless, this is an incorrect code. Replace that -EINVAL with a
proper EPOLLERR to clean that output. As this doesn't change the
behaviour, there's no need to treat this change as a bug fix.
Link: https://lore.kernel.org/linux-trace-kernel/20240131140955.3322792-1-vdonnefort@google.com
Cc: stable@vger.kernel.org
Fixes: 6721cb6002 ("ring-buffer: Do not poll non allocated cpu buffers")
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
System workqueues are allocated early during boot from
workqueue_init_early(). While allocating unbound workqueues,
wq_update_node_max_active() is invoked from apply_workqueue_attrs() and
accesses NUMA topology to initialize wq->node_nr_active[].max.
However, topology information may not be set up at this point.
wq_update_node_max_active() is explicitly invoked from
workqueue_init_topology() later when topology information is known to be
available.
This doesn't seem to crash anything but it's doing useless work with dubious
data. Let's skip the premature and duplicate node_max_active updates by
initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making
wq_update_node_max_active() noop until workqueue_init_topology().
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/workqueue.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9221a4c57ae1..a65081ec6780 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -386,6 +386,8 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
[WQ_AFFN_SYSTEM] = "system",
};
+static bool wq_topo_initialized = false;
+
/*
* Per-cpu work items which run for longer than the following threshold are
* automatically considered CPU intensive and excluded from concurrency
@@ -1510,6 +1512,9 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)
lockdep_assert_held(&wq->mutex);
+ if (!wq_topo_initialized)
+ return;
+
if (!cpumask_test_cpu(off_cpu, effective))
off_cpu = -1;
@@ -4356,6 +4361,7 @@ static void free_node_nr_active(struct wq_node_nr_active **nna_ar)
static void init_node_nr_active(struct wq_node_nr_active *nna)
{
+ nna->max = WQ_DFL_MIN_ACTIVE;
atomic_set(&nna->nr, 0);
raw_spin_lock_init(&nna->lock);
INIT_LIST_HEAD(&nna->pending_pwqs);
@@ -7400,6 +7406,8 @@ void __init workqueue_init_topology(void)
init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
+ wq_topo_initialized = true;
+
mutex_lock(&wq_pool_mutex);
/*
For wq_update_node_max_active(), @off_cpu of -1 indicates that no CPU is
going down. The function was incorrectly calling cpumask_test_cpu() with -1
CPU leading to oopses like the following on some archs:
Unable to handle kernel paging request at virtual address ffff0002100296e0
..
pc : wq_update_node_max_active+0x50/0x1fc
lr : wq_update_node_max_active+0x1f0/0x1fc
...
Call trace:
wq_update_node_max_active+0x50/0x1fc
apply_wqattrs_commit+0xf0/0x114
apply_workqueue_attrs_locked+0x58/0xa0
alloc_workqueue+0x5ac/0x774
workqueue_init_early+0x460/0x540
start_kernel+0x258/0x684
__primary_switched+0xb8/0xc0
Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: http://lkml.kernel.org/r/91eacde0-df99-4d5c-a980-91046f66e612@samsung.com
Fixes: 5797b1c189 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
Add ability to mark arg:trusted arguments with optional arg:nullable
tag to mark it as PTR_TO_BTF_ID_OR_NULL variant, which will allow
callers to pass NULL, and subsequently will force global subprog's code
to do NULL check. This allows to have "optional" PTR_TO_BTF_ID values
passed into global subprogs.
For now arg:nullable cannot be combined with anything else.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240130000648.2144827-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add support for passing PTR_TO_BTF_ID registers to global subprogs.
Currently only PTR_TRUSTED flavor of PTR_TO_BTF_ID is supported.
Non-NULL semantics is assumed, so caller will be forced to prove
PTR_TO_BTF_ID can't be NULL.
Note, we disallow global subprogs to destroy passed in PTR_TO_BTF_ID
arguments, even the trusted one. We achieve that by not setting
ref_obj_id when validating subprog code. This basically enforces (in
Rust terms) borrowing semantics vs move semantics. Borrowing semantics
seems to be a better fit for isolated global subprog validation
approach.
Implementation-wise, we utilize existing logic for matching
user-provided BTF type to kernel-side BTF type, used by BPF CO-RE logic
and following same matching rules. We enforce a unique match for types.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240130000648.2144827-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Fix register_snapshot_trigger() on allocation error
If the snapshot fails to allocate, the register_snapshot_trigger() can
still return success. If the call to tracing_alloc_snapshot_instance()
returned anything but 0, it returned 0, but it should have been returning
the error code from that allocation function.
- Remove leftover code from tracefs doing a dentry walk on remount.
The update_gid() function was called by the tracefs code on remount
to update the gid of eventfs, but that is no longer the case, but that
code wasn't deleted. Nothing calls it. Remove it.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZbcbZBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qqEsAQCWTIQ/yAktP2doLvuccZbZSc5NFHZt
hZNRHep/SMQqKwD6AlesgiXlXM39HiRTiV17jPzrL65brcTEYMJmqCaWugk=
=8nia
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two small fixes for tracefs and eventfs:
- Fix register_snapshot_trigger() on allocation error
If the snapshot fails to allocate, the register_snapshot_trigger()
can still return success. If the call to
tracing_alloc_snapshot_instance() returned anything but 0, it
returned 0, but it should have been returning the error code from
that allocation function.
- Remove leftover code from tracefs doing a dentry walk on remount.
The update_gid() function was called by the tracefs code on remount
to update the gid of eventfs, but that is no longer the case, but
that code wasn't deleted. Nothing calls it. Remove it"
* tag 'trace-v6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracefs: remove stale 'update_gid' code
tracing/trigger: Fix to return error if failed to alloc snapshot
When __queue_delayed_work() is called, it chooses a cpu for handling the
timer interrupt. As of today, it will pick either the cpu passed as
parameter or the last cpu used for this.
This is not good if a system does use CPU isolation, because it can take
away some valuable cpu time to:
1 - deal with the timer interrupt,
2 - schedule-out the desired task,
3 - queue work on a random workqueue, and
4 - schedule the desired task back to the cpu.
So to fix this, during __queue_delayed_work(), if cpu isolation is in
place, pick a random non-isolated cpu to handle the timer interrupt.
As an optimization, if the current cpu is not isolated, use it instead
of looking for another candidate.
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
or aren't considered appropriate for backporting.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZbdSnwAKCRDdBJ7gKXxA
jv49AQCY8eLOgE0L+25HZm99HleBwapbKJozcmsXgMPlgeFZHgEA8saExeL+Nzae
6ktxmGXoVw2t3FJ67Zr66VE3EyHVKAY=
=HWuo
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-01-28-23-21' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"22 hotfixes. 11 are cc:stable and the remainder address post-6.7
issues or aren't considered appropriate for backporting"
* tag 'mm-hotfixes-stable-2024-01-28-23-21' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
mm: thp_get_unmapped_area must honour topdown preference
mm: huge_memory: don't force huge page alignment on 32 bit
userfaultfd: fix mmap_changing checking in mfill_atomic_hugetlb
selftests/mm: ksm_tests should only MADV_HUGEPAGE valid memory
scs: add CONFIG_MMU dependency for vfree_atomic()
mm/memory: fix folio_set_dirty() vs. folio_mark_dirty() in zap_pte_range()
mm/huge_memory: fix folio_set_dirty() vs. folio_mark_dirty()
selftests/mm: Update va_high_addr_switch.sh to check CPU for la57 flag
selftests: mm: fix map_hugetlb failure on 64K page size systems
MAINTAINERS: supplement of zswap maintainers update
stackdepot: make fast paths lock-less again
stackdepot: add stats counters exported via debugfs
mm, kmsan: fix infinite recursion due to RCU critical section
mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again
selftests/mm: switch to bash from sh
MAINTAINERS: add man-pages git trees
mm: memcontrol: don't throttle dying tasks on memory.high
mm: mmap: map MAP_STACK to VM_NOHUGEPAGE
uprobes: use pagesize-aligned virtual address when replacing pages
selftests/mm: mremap_test: fix build warning
...
Remove the duplicate check on type and unify result.
Signed-off-by: Florian Lehner <dev@der-flo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20240120150920.3370-1-dev@der-flo.net
Now that bpf and bpf-next trees converged and we don't run the risk of
merge conflicts, move btf_validate_prog_ctx_type() into its most logical
place inside the main logic loop.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240125205510.3642094-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It's a bit puzzling to see a call to module_enable_nx() followed by a
call to module_enable_x(). This is because one applies on text while
the other applies on data.
Change name to make that more clear.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
A couple of architectures seem concerned about calling set_memory_ro()
and set_memory_x() too frequently and have implemented a version of
set_memory_rox(), see commit 60463628c9 ("x86/mm: Implement native
set_memory_rox()") and commit 22e99fa564 ("s390/mm: implement
set_memory_rox()")
Use set_memory_rox() in modules when STRICT_MODULES_RWX is set.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulates concurrency level which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. ie. The acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- One max_active enforcement decouples from pool boundaires, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice more
online effective CPUs will get twice higher portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_weight than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Currently, for both percpu and unbound workqueues, max_active applies
per-cpu, which is a recent change for unbound workqueues. The change for
unbound workqueues was a significant departure from the previous behavior of
per-node application. It made some use cases create undesirable number of
concurrent work items and left no good way of fixing them. To address the
problem, workqueue is implementing a NUMA node segmented global nr_active
mechanism, which will be explained further in the next patch.
As a preparation, this patch introduces struct wq_node_nr_active. It's a
data structured allocated for each workqueue and NUMA node pair and
currently only tracks the workqueue's number of active work items on the
node. This is split out from the next patch to make it easier to understand
and review.
Note that there is an extra wq_node_nr_active allocated for the invalid node
nr_node_ids which is used to track nr_active for pools which don't have NUMA
node associated such as the default fallback system-wide pool.
This doesn't cause any behavior changes visible to userland yet. The next
patch will expand to implement the control mechanism on top.
v4: - Fixed out-of-bound access when freeing per-cpu workqueues.
v3: - Use flexible array for wq->node_nr_active as suggested by Lai.
v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
- Lai pointed out that pwq_tryinc_nr_active() incorrectly dropped
pwq->max_active check. Restored. As the next patch replaces the
max_active enforcement mechanism, this doesn't change the end result.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
The planned shared nr_active handling for unbound workqueues will make
pwq_dec_nr_active() sometimes drop the pool lock temporarily to acquire
other pool locks, which is necessary as retirement of an nr_active count
from one pool may need kick off an inactive work item in another pool.
This patch moves pwq_dec_nr_in_flight() call in try_to_grab_pending() to the
end of work item handling so that work item state changes stay atomic.
process_one_work() which is the other user of pwq_dec_nr_in_flight() already
calls it at the end of work item handling. Comments are added to both call
sites and pwq_dec_nr_in_flight().
This shouldn't cause any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
wq->cpu_pwq is RCU protected but wq->dfl_pwq isn't. This is okay because
currently wq->dfl_pwq is used only accessed to install it into wq->cpu_pwq
which doesn't require RCU access. However, we want to be able to access
wq->dfl_pwq under RCU in the future to access its __pod_cpumask and the code
can be made easier to read by making the two pwq fields behave in the same
way.
- Make wq->dfl_pwq RCU protected.
- Add unbound_pwq_slot() and unbound_pwq() which can access both ->dfl_pwq
and ->cpu_pwq. The former returns the double pointer that can be used
access and update the pwqs. The latter performs locking check and
dereferences the double pointer.
- pwq accesses and updates are converted to use unbound_pwq[_slot]().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
wq_adjust_max_active() needs to activate work items after max_active is
increased. Previously, it did that by visiting each pwq once activating all
that could be activated. While this makes sense with per-pwq nr_active,
nr_active will be shared across multiple pwqs for unbound wqs. Then, we'd
want to round-robin through pwqs to be fairer.
In preparation, this patch makes wq_adjust_max_active() round-robin pwqs
while activating. While the activation ordering changes, this shouldn't
cause user-noticeable behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
__queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were
open-coding nr_active handling, which is fine given that the operations are
trivial. However, the planned unbound nr_active update will make them more
complicated, so let's move them into helpers.
- pwq_tryinc_nr_active() is added. It increments nr_active if under
max_active limit and return a boolean indicating whether inc was
successful. Note that the function is structured to accommodate future
changes. __queue_work() is updated to use the new helper.
- pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and
thus no longer assumes that nr_active is under max_active and returns a
boolean to indicate whether a work item has been activated.
- wq_adjust_max_active() no longer tests directly whether a work item can be
activated. Instead, it's updated to use the return value of
pwq_activate_first_inactive() to tell whether a work item has been
activated.
- nr_active decrement and activating the first inactive work item is
factored into pwq_dec_nr_active().
v3: - WARN_ON_ONCE(!WORK_STRUCT_INACTIVE) added to __pwq_activate_work() as
now we're calling the function unconditionally from
pwq_activate_first_inactive().
v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
To prepare for unbound nr_active handling improvements, move work activation
part of pwq_activate_inactive_work() into __pwq_activate_work() and add
pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates nr_active.
pwq_activate_first_inactive() and try_to_grab_pending() are updated to use
pwq_activate_work(). The latter conversion is functionally identical. For
the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE
testing. This is temporary and will be removed by the next patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
"!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated
multiple times. Let's factor it out into pwq_is_empty().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
max_active is a workqueue-wide setting and the configured value is stored in
wq->saved_max_active; however, the effective value was stored in
pwq->max_active. While this is harmless, it makes max_active update process
more complicated and gets in the way of the planned max_active semantic
updates for unbound workqueues.
This patches moves pwq->max_active to wq->max_active. This simplifies the
code and makes freezing and noop max_active updates cheaper too. No
user-visible behavior change is intended.
As wq->max_active is updated while holding wq mutex but read without any
locking, it now uses WRITE/READ_ONCE(). A new locking locking rule WO is
added for it.
v2: wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Use the new __free() mechanism to remove all gotos and simplify the error
paths.
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240122124243.44002-5-brgl@bgdev.pl
exposure
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmW2LAgACgkQEsHwGGHe
VUqowBAAiW9aPQmp401DSXLX+bX0oS5IQVEZnAEE3hQTWxdvDoIdmX+SBReSutXy
PDm8mZgVtIiUg3V5bu7/9Dgpu7ovRuChJPjjkYFUDcEmzmsMI11W6u8+8eyt8yRd
X9LuGUeXPJSI1kadYudhFUhl6X6KcXj4Y+XUqNcyp8yClSEcLriYeiumNApSEzj6
BneO5VBbXTpJq1b7GOlC4MNhNXhx+WlUdJUb3VPLlxy/akxrNs9x0ASdOuqslCq8
X9SJPnKeRh0mpezmWDgU72eQ/3vpvWQzwyXvp2pQGbjArCx7IwwD765NDu0P6651
C/+4ruXmcd+Jp3wuobdHG8/J2NlZQy8tZQm284YkS5vyBQDi4s17hycXw/aeUFpu
/3LR1Hppl//u7hkaHszE8vE5l6in4a2XAbk9EozChVj/aHRJqIaLn8TGQRquK4Tg
uRjIC3O2ubJCsIlNIczysjCobSCO+cELwUuFVHh7cdmQAgUwF3efDab0+pJ7MHFb
ZEcqQbIt4FGea4BGzvRYCYj6W9bkhzttnH+68ef+mDA3BcdGoYnHcQ143M8duNhe
0inWCibQXMFC9EGPjC8Sz8WvzF/L5KL9bPQmO1sitIzH6kbU3o7PBk2Fe4V6+KP9
THK865SJ/9QirjXrGmp9Sle6dqJRUylmt1ts8reOWACZ98LKeWU=
=ibuM
-----END PGP SIGNATURE-----
Merge tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:
- Prevent an inconsistent futex operation leading to stale state
exposure
* tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Prevent the reuse of stale pi_state
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmW2KrEACgkQEsHwGGHe
VUr1WA//Qsi2JkxO1lyUQgnyuXqs0+oVZJxFyH2dFYzWkfSaxgsyPZ0H+wsweDfP
OgoNzwwDf1IaNbVz2voV6lSM/30ujJMx4aAucT5WTEXa12cJsvipxRiNd8WU8GqQ
buBz+vnS9IJ2WfM7UxhIVevYFU8H/ERcSO9WCII0YjlcVxmlwMK3B7kFdpBPdT5Q
m5hvBorZzIa9wD3TI7e+VvEVbCx0WjsYYEpXDXM/yCf1Juc9952pjjunzx3YmJES
5JG2WpnEvmNdWwIPO0NAjs7Shw/MNViXTy5Ls5jcbswiAcBoUxNHQlUsvNVDaVyv
8eMCkPzuSipY8HSoetQSTJl+mr3LyYRvevKahuTgwbS8K+kxgClqHFoLZVqolWnk
2IDo63R6Ex6lb1Xpb/Rpg/4j4NqUVWcvPHf6Z2CmMRq/XbSk2DIFl1Wxjgy/Cjnu
+nNLw2FYayEBrKF3VlYgERGoCfBrEsksxzljjeHFn5XWr+G2x1ykF37xaWjQ2+oV
sFl6UYwIsdqPCjHmpT6R1lwCdeEC3o3Zc2Kf5uEVj+pXacKJkxZU0L6ZneO8UiEc
rtc0gTgm9ZNd8oDsjsaBU1A3KxH9lOfVz82ZV0tipz94dcN4zrB9Qag0Yw+64YOC
cQ9cRKiFiCVCeD1ksDLZe1IUX++T2Y9O8MZv06ZDrbaFC56+lU8=
=IgON
-----END PGP SIGNATURE-----
Merge tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Borislav Petkov:
- Initialize the resend node of each IRQ descriptor, not only the first
one
* tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Initialize resend_node hlist for all interrupt descriptors
events in order to be able to compute correct averages
- Limit the duration of the clocksource watchdog checking interval as
too long intervals lead to wrongly marking the TSC as unstable
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmW2KYEACgkQEsHwGGHe
VUoX7w//Ulls1tp3m1oiejTBtUmkewSmnhNAfkHJv3MlKNe+ttG6LvQVh9g2bf1Z
FOu2M2se0ge8G5xf3+I5E4rpqlJZSuhPmNmIET+aj+2a61UJq/6zE1Zw6mxjrJOK
emjOYKTxZ/HxvKJJGO6NiH8Iv5Aj3nQR3Y6oyb/FyP5TLJ6MCT21iEaqyqU7P+Ix
AHIS3cL97M5R/tFtP2CY3PV2M6hJ0lqapSi9t75hT8DfJN1TNQ5SvFkKgmOIrGFw
2WxPTSTEZAnXlvI4cC3Nru9i64QQRw9S05FFelX2pwxE/7wVzBvfh8cjuGZJBve/
KQhNnQ4/fzv6E/hUcavKuOyk1lx5XonfCuG4RFoLl67LjLbLh+Q55RBdXflBPF4T
Ow9BSyQNFu391C2Bl5gJUYVd2JMv+IVpi2wUiwrXJ/Mxj+A2J7Fj0jz7hMbNCmsU
EaA+QyfkAGsoa99xP3UDhPzxoCr2s5YTAxH+IUeSWeI25PMq9f+6fifXBwG+GaVa
FS6Ei1VI0GCNmcYFYawHJbdM2ui5h7lZ96aEpOBSVcAv/2yBNgqxuYZ+icO/wI6N
JM0DSEEOrWcytfxftl7LmglJauhXSKZH4UTG4RCz0IkDgR72Wn0QF4cqm4wWQ5yh
n5/xO+SbkzE57bltsnAkvpu0a110fdK5ec+vkFIy4PrxyO83XhY=
=9RXx
-----END PGP SIGNATURE-----
Merge tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Borislav Petkov:
- Preserve the number of idle calls and sleep entries across CPU
hotplug events in order to be able to compute correct averages
- Limit the duration of the clocksource watchdog checking interval as
too long intervals lead to wrongly marking the TSC as unstable
* tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/sched: Preserve number of idle sleeps across CPU hotplug events
clocksource: Skip watchdog check for large watchdog intervals
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZbQV+gAKCRDbK58LschI
g2OeAP0VvhZS9SPiS+/AMAFuw2W1BkMrFNbfBTc3nzRnyJSmNAD+NG4CLLJvsKI9
olu7VC20B8pLTGLUGIUSwqnjOC+Kkgc=
=wVMl
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-01-26
We've added 107 non-merge commits during the last 4 day(s) which contain
a total of 101 files changed, 6009 insertions(+), 1260 deletions(-).
The main changes are:
1) Add BPF token support to delegate a subset of BPF subsystem
functionality from privileged system-wide daemons such as systemd
through special mount options for userns-bound BPF fs to a trusted
& unprivileged application. With addressed changes from Christian
and Linus' reviews, from Andrii Nakryiko.
2) Support registration of struct_ops types from modules which helps
projects like fuse-bpf that seeks to implement a new struct_ops type,
from Kui-Feng Lee.
3) Add support for retrieval of cookies for perf/kprobe multi links,
from Jiri Olsa.
4) Bigger batch of prep-work for the BPF verifier to eventually support
preserving boundaries and tracking scalars on narrowing fills,
from Maxim Mikityanskiy.
5) Extend the tc BPF flavor to support arbitrary TCP SYN cookies to help
with the scenario of SYN floods, from Kuniyuki Iwashima.
6) Add code generation to inline the bpf_kptr_xchg() helper which
improves performance when stashing/popping the allocated BPF objects,
from Hou Tao.
7) Extend BPF verifier to track aligned ST stores as imprecise spilled
registers, from Yonghong Song.
8) Several fixes to BPF selftests around inline asm constraints and
unsupported VLA code generation, from Jose E. Marchesi.
9) Various updates to the BPF IETF instruction set draft document such
as the introduction of conformance groups for instructions,
from Dave Thaler.
10) Fix BPF verifier to make infinite loop detection in is_state_visited()
exact to catch some too lax spill/fill corner cases,
from Eduard Zingerman.
11) Refactor the BPF verifier pointer ALU check to allow ALU explicitly
instead of implicitly for various register types, from Hao Sun.
12) Fix the flaky tc_redirect_dtime BPF selftest due to slowness
in neighbor advertisement at setup time, from Martin KaFai Lau.
13) Change BPF selftests to skip callback tests for the case when the
JIT is disabled, from Tiezhu Yang.
14) Add a small extension to libbpf which allows to auto create
a map-in-map's inner map, from Andrey Grafin.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (107 commits)
selftests/bpf: Add missing line break in test_verifier
bpf, docs: Clarify definitions of various instructions
bpf: Fix error checks against bpf_get_btf_vmlinux().
bpf: One more maintainer for libbpf and BPF selftests
selftests/bpf: Incorporate LSM policy to token-based tests
selftests/bpf: Add tests for LIBBPF_BPF_TOKEN_PATH envvar
libbpf: Support BPF token path setting through LIBBPF_BPF_TOKEN_PATH envvar
selftests/bpf: Add tests for BPF object load with implicit token
selftests/bpf: Add BPF object loading tests with explicit token passing
libbpf: Wire up BPF token support at BPF object level
libbpf: Wire up token_fd into feature probing logic
libbpf: Move feature detection code into its own file
libbpf: Further decouple feature checking logic from bpf_object
libbpf: Split feature detectors definitions from cached results
selftests/bpf: Utilize string values for delegate_xxx mount options
bpf: Support symbolic BPF FS delegation mount options
bpf: Fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS
bpf,selinux: Allocate bpf_security_struct per BPF token
selftests/bpf: Add BPF token-enabled tests
libbpf: Add BPF token support to bpf_prog_load() API
...
====================
Link: https://lore.kernel.org/r/20240126215710.19855-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
workqueue is collecting different sorts of enums into a single unnamed enum
type which can increase confusion around enum width. Also, unnamed enums
can't be accessed from BPF. Let's break up enum definitions according to
their purposes and give them type names.
Signed-off-by: Tejun Heo <tj@kernel.org>
After creating a new worker, create_worker() is calling kick_pool() to wake
up the new worker task. However, as kick_pool() doesn't do anything if there
is no work pending, it also calls wake_up_process() explicitly. There's no
reason to call kick_pool() at all. wake_up_process() is enough by itself.
Drop the unnecessary kick_pool() call.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix register_snapshot_trigger() to return error code if it failed to
allocate a snapshot instead of 0 (success). Unless that, it will register
snapshot trigger without an error.
Link: https://lore.kernel.org/linux-trace-kernel/170622977792.270660.2789298642759362200.stgit@devnote2
Fixes: 0bbe7f7199 ("tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation")
Cc: stable@vger.kernel.org
Cc: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Per filesystems/sysfs.rst, show() should only use sysfs_emit()
or sysfs_emit_at() when formatting the value to be returned to user space.
coccinelle complains that there are still a couple of functions that use
snprintf(). Convert them to sysfs_emit().
No functional change intended.
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240116045151.3940401-40-lizhijian@fujitsu.com
struct cpuhp_cpu_state has an extraneous kernel-doc comment for @cpu.
There is no struct member by that name, so remove the comment to
prevent the kernel-doc warning:
kernel/cpu.c:85: warning: Excess struct member 'cpu' description in 'cpuhp_cpu_state'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240114030615.30441-1-rdunlap@infradead.org
For better readability and maintenance keep headers in alphabetical
order.
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240122124243.44002-4-brgl@bgdev.pl
uprobes passes an unaligned page mapping address to
folio_add_new_anon_rmap(), which ends up triggering a VM_BUG_ON() we
recently extended in commit 372cbd4d5a ("mm: non-pmd-mappable, large
folios for folio_add_new_anon_rmap()").
Arguably, this is uprobes code doing something wrong; however, for the
time being it would have likely worked in rmap code because
__folio_set_anon() would set folio->index to the same value.
Looking at __replace_page(), we'd also pass slightly wrong values to
mmu_notifier_range_init(), page_vma_mapped_walk(), flush_cache_page(),
ptep_clear_flush() and set_pte_at_notify(). I suspect most of them are
fine, but let's just mark the introducing commit as the one needed fixing.
I don't think CC stable is warranted.
We'll add more sanity checks in rmap code separately, to make sure that we
always get properly aligned addresses.
Link: https://lkml.kernel.org/r/20240115100731.91007-1-david@redhat.com
Fixes: c517ee744b ("uprobes: __replace_page() should not use page_address_in_vma()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Olsa <jolsa@kernel.org>
Closes: https://lkml.kernel.org/r/ZaMR2EWN-HvlCfUl@krava
Tested-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In bpf_struct_ops_map_alloc, it needs to check for NULL in the returned
pointer of bpf_get_btf_vmlinux() when CONFIG_DEBUG_INFO_BTF is not set.
ENOTSUPP is used to preserve the same behavior before the
struct_ops kmod support.
In the function check_struct_ops_btf_id(), instead of redoing the
bpf_get_btf_vmlinux() that has already been done in syscall.c, the fix
here is to check for prog->aux->attach_btf_id.
BPF_PROG_TYPE_STRUCT_OPS must require attach_btf_id and syscall.c
guarantees a valid attach_btf as long as attach_btf_id is set.
When attach_btf_id is not set, this patch returns -ENOTSUPP
because it is what the selftest in test_libbpf_probe_prog_types()
and libbpf_probes.c are expecting for feature probing purpose.
Changes from v1:
- Remove an unnecessary NULL check in check_struct_ops_btf_id()
Reported-by: syzbot+88f0aafe5f950d7489d7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/00000000000040d68a060fc8db8c@google.com/
Reported-by: syzbot+1336f3d4b10bcda75b89@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/00000000000026353b060fc21c07@google.com/
Fixes: fcc2c1fb06 ("bpf: pass attached BTF to the bpf_struct_ops subsystem")
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240126023113.1379504-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Since we have set the WQ_NAME_LEN to 32, decrease the name of
events_freezable_power_efficient so that it does not trip the name length
warning when the workqueue is created.
Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This commit fixes RCU grace period stalls, which are observed when
an outgoing CPU's quiescent state reporting results in wakeup of
one of the grace period kthreads, to complete the grace period. If
those kthreads have SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwith timer to the local offline CPU. Earlier migration
of the hrtimers from the CPU introduced in commit 5c0930ccaa
("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
results in this timer getting ignored. If the RCU grace period
kthreads are waiting for RT bandwidth to be available, they may
never be actually scheduled, resulting in RCU stall warnings.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSi2tPIQIc2VEtjarIAHS7/6Z0wpQUCZbFOUgAKCRAAHS7/6Z0w
pQjcAQCg/tJYRjwGUPebKLUgkmXlR+IIANzEgvES/RgWTOld5gEAklZVTjf3J0qt
QeU9WC3My2cVPKvv6kqnuQ9rrqMQ3g0=
=U5an
-----END PGP SIGNATURE-----
Merge tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux
Pull RCU fix from Neeraj Upadhyay:
"This fixes RCU grace period stalls, which are observed when an
outgoing CPU's quiescent state reporting results in wakeup of one of
the grace period kthreads, to complete the grace period.
If those kthreads have SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwith timer to the local offline CPU.
Earlier migration of the hrtimers from the CPU introduced in commit
5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU
earlier") results in this timer getting ignored.
If the RCU grace period kthreads are waiting for RT bandwidth to be
available, they may never be actually scheduled, resulting in RCU
stall warnings"
* tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux:
rcu: Defer RCU kthreads wakeup when CPU is dying
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
[PM: subject line tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
The ret variable is assigned when it does not need to be defined, as it
has already been assigned before use.
Signed-off-by: Li zeming <zeming@nfschina.com>
[PM: rewrite subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Commit 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs
CPU hotplug") preserved total idle sleep time and iowait sleeptime across
CPU hotplug events.
Similar reasoning applies to the number of idle calls and idle sleeps to
get the proper average of sleep time per idle invocation.
Preserve those fields too.
Fixes: 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug")
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122233534.3094238-1-tim.c.chen@linux.intel.com
There have been reports of the watchdog marking clocksources unstable on
machines with 8 NUMA nodes:
clocksource: timekeeping watchdog on CPU373:
Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource: 'hpet' wd_nsec: 14523447520
clocksource: 'tsc' cs_nsec: 14524115132
The measured clocksource skew - the absolute difference between cs_nsec
and wd_nsec - was 668 microseconds:
cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
The kernel used 200 microseconds for the uncertainty_margin of both the
clocksource and watchdog, resulting in a threshold of 400 microseconds (the
md variable). Both the cs_nsec and the wd_nsec value indicate that the
readout interval was circa 14.5 seconds. The observed behaviour is that
watchdog checks failed for large readout intervals on 8 NUMA node
machines. This indicates that the size of the skew was directly proportinal
to the length of the readout interval on those machines. The measured
clocksource skew, 668 microseconds, was evaluated against a threshold (the
md variable) that is suited for readout intervals of roughly
WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
The intention of 2e27e793e2 ("clocksource: Reduce clocksource-skew
threshold") was to tighten the threshold for evaluating skew and set the
lower bound for the uncertainty_margin of clocksources to twice
WATCHDOG_MAX_SKEW. Later in c37e85c135 ("clocksource: Loosen clocksource
watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
125 microseconds to fit the limit of NTP, which is able to use a
clocksource that suffers from up to 500 microseconds of skew per second.
Both the TSC and the HPET use default uncertainty_margin. When the
readout interval gets stretched the default uncertainty_margin is no
longer a suitable lower bound for evaluating skew - it imposes a limit
that is far stricter than the skew with which NTP can deal.
The root causes of the skew being directly proportinal to the length of
the readout interval are:
* the inaccuracy of the shift/mult pairs of clocksources and the watchdog
* the conversion to nanoseconds is imprecise for large readout intervals
Prevent this by skipping the current watchdog check if the readout
interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout
interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin
(of the TSC and HPET) corresponds to a limit on clocksource skew of 250
ppm (microseconds of skew per second). To keep the limit imposed by NTP
(500 microseconds of skew per second) for all possible readout intervals,
the margins would have to be scaled so that the threshold value is
proportional to the length of the actual readout interval.
As for why the readout interval may get stretched: Since the watchdog is
executed in softirq context the expiration of the watchdog timer can get
severely delayed on account of a ksoftirqd thread not getting to run in a
timely manner. Surely, a system with such belated softirq execution is not
working well and the scheduling issue should be looked into but the
clocksource watchdog should be able to deal with it accordingly.
Fixes: 2e27e793e2 ("clocksource: Reduce clocksource-skew threshold")
Suggested-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122172350.GA740@incl
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.
Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.
There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.
Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:
12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
$ mount | grep token
bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
It's quite confusing in practice when it's possible to successfully
create a BPF token from BPF FS that didn't have any of delegate_xxx
mount options set up. While it's not wrong, it's actually more
meaningful to reject BPF_TOKEN_CREATE with specific error code (-ENOENT)
to let user-space know that no token delegation is setup up.
So, instead of creating empty BPF token that will be always ignored
because it doesn't have any of the allow_xxx bits set, reject it with
-ENOENT. If we ever need empty BPF token to be possible, we can support
that with extra flag passed into BPF_TOKEN_CREATE.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-19-andrii@kernel.org
Wire up bpf_token_create and bpf_token_free LSM hooks, which allow to
allocate LSM security blob (we add `void *security` field to struct
bpf_token for that), but also control who can instantiate BPF token.
This follows existing pattern for BPF map and BPF prog.
Also add security_bpf_token_allow_cmd() and security_bpf_token_capable()
LSM hooks that allow LSM implementation to control and negate (if
necessary) BPF token's delegation of a specific bpf_cmd and capability,
respectively.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-12-andrii@kernel.org
Similarly to bpf_prog_alloc LSM hook, rename and extend bpf_map_alloc
hook into bpf_map_create, taking not just struct bpf_map, but also
bpf_attr and bpf_token, to give a fuller context to LSMs.
Unlike bpf_prog_alloc, there is no need to move the hook around, as it
currently is firing right before allocating BPF map ID and FD, which
seems to be a sweet spot.
But like bpf_prog_alloc/bpf_prog_free combo, make sure that bpf_map_free
LSM hook is called even if bpf_map_create hook returned error, as if few
LSMs are combined together it could be that one LSM successfully
allocated security blob for its needs, while subsequent LSM rejected BPF
map creation. The former LSM would still need to free up LSM blob, so we
need to ensure security_bpf_map_free() is called regardless of the
outcome.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-11-andrii@kernel.org
Based on upstream discussion ([0]), rework existing
bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead
of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF
program struct. Also, we pass bpf_attr union with all the user-provided
arguments for BPF_PROG_LOAD command. This will give LSMs as much
information as we can basically provide.
The hook is also BPF token-aware now, and optional bpf_token struct is
passed as a third argument. bpf_prog_load LSM hook is called after
a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were
allocated and filled out, but right before performing full-fledged BPF
verification step.
bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for
consistency. SELinux code is adjusted to all new names, types, and
signatures.
Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be
used by some LSMs to allocate extra security blob, but also by other
LSMs to reject BPF program loading, we need to make sure that
bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one
*even* if the hook itself returned error. If we don't do that, we run
the risk of leaking memory. This seems to be possible today when
combining SELinux and BPF LSM, as one example, depending on their
relative ordering.
Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to
sleepable LSM hooks list, as they are both executed in sleepable
context. Also drop bpf_prog_load hook from untrusted, as there is no
issue with refcount or anything else anymore, that originally forced us
to add it to untrusted list in c0c852dd18 ("bpf: Do not mark certain LSM
hook arguments as trusted"). We now trigger this hook much later and it
should not be an issue anymore.
[0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-10-andrii@kernel.org
Remove remaining direct queries to perfmon_capable() and bpf_capable()
in BPF verifier logic and instead use BPF token (if available) to make
decisions about privileges.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-9-andrii@kernel.org
Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-8-andrii@kernel.org
Add basic support of BPF token to BPF_PROG_LOAD. BPF_F_TOKEN_FD flag
should be set in prog_flags field when providing prog_token_fd.
Wire through a set of allowed BPF program types and attach types,
derived from BPF FS at BPF token creation time. Then make sure we
perform bpf_token_capable() checks everywhere where it's relevant.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-7-andrii@kernel.org
Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
through delegated BPF token. BPF_F_TOKEN_FD flag has to be specified
when passing BPF token FD. Given BPF_BTF_LOAD command didn't have flags
field before, we also add btf_flags field.
BTF loading is a pretty straightforward operation, so as long as BPF
token is created with allow_cmds granting BPF_BTF_LOAD command, kernel
proceeds to parsing BTF data and creating BTF object.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-6-andrii@kernel.org
Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
BPF map creation from unprivileged process through delegated BPF token.
New BPF_F_TOKEN_FD flag is added to specify together with BPF token FD
for BPF_MAP_CREATE command.
Wire through a set of allowed BPF map types to BPF token, derived from
BPF FS at BPF token creation time. This, in combination with allowed_cmds
allows to create a narrowly-focused BPF token (controlled by privileged
agent) with a restrictive set of BPF maps that application can attempt
to create.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-5-andrii@kernel.org
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.
This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).
BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having an BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.
When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.
Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).
Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding up the ns_capable()
story of BPF token. Also creating BPF token in init user namespace is
currently not supported, given BPF token doesn't have any effect in init
user namespace anyways.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-4-andrii@kernel.org
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
takes over and grants whatever part of BPF syscall is required.
Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
BPF_PROG_LOAD, if the type of BPF program is "network-related" either
CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.
But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
is set:
- when creating DEVMAP/XDKMAP/CPU_MAP maps;
- when attaching CGROUP_SKB programs;
- when handling BPF_PROG_QUERY command.
This patch is changing the latter three cases to follow BPF_PROG_LOAD
model, that is allowing to proceed under either CAP_NET_ADMIN or
CAP_SYS_ADMIN.
This also makes it cleaner in subsequent BPF token patches to switch
wholesomely to a generic bpf_token_capable(int cap) check, that always
falls back to CAP_SYS_ADMIN if requested capability is missing.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-2-andrii@kernel.org
Just to help distinguish the fs->in_exec flag from the current->in_execve
flag, add comments in check_unsafe_exec() and copy_fs() for more
context. Also note that in_execve is only used by TOMOYO now.
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
When the CPU goes idle for the last time during the CPU down hotplug
process, RCU reports a final quiescent state for the current CPU. If
this quiescent state propagates up to the top, some tasks may then be
woken up to complete the grace period: the main grace period kthread
and/or the expedited main workqueue (or kworker).
If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwith timer to the local offline CPU. Since this happens
after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
timer gets ignored. Therefore if the RCU kthreads are waiting for RT
bandwidth to be available, they may never be actually scheduled.
This triggers TREE03 rcutorture hangs:
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6)
rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2eb/0xa80
schedule+0x1f/0x90
schedule_timeout+0x163/0x270
? __pfx_process_timeout+0x10/0x10
rcu_gp_fqs_loop+0x37c/0x5b0
? __pfx_rcu_gp_kthread+0x10/0x10
rcu_gp_kthread+0x17c/0x200
kthread+0xde/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2b/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
The situation can't be solved with just unpinning the timer. The hrtimer
infrastructure and the nohz heuristics involved in finding the best
remote target for an unpinned timer would then also need to handle
enqueues from an offline CPU in the most horrendous way.
So fix this on the RCU side instead and defer the wake up to an online
CPU if it's too late for the local one.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
alloc_desc() and early_irq_init() contain duplicated code to initialize
interrupt descriptors.
Replace that with a helper function.
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240122085716.2999875-6-dawei.li@shingroup.cn
For a CONFIG_SPARSE_IRQ=n kernel, early_irq_init() is supposed to
initialize all interrupt descriptors.
It does except for irq_desc::resend_node, which ia only initialized for the
first descriptor.
Use the indexed decriptor and not the base pointer to address that.
Fixes: bc06a9e087 ("genirq: Use hlist for managing resend handlers")
Signed-off-by: Dawei Li <dawei.li@shingroup.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122085716.2999875-5-dawei.li@shingroup.cn
The module requires the use of btf_ctx_access() to invoke
bpf_tracing_btf_ctx_access() from a module. This function is valuable for
implementing validation functions that ensure proper access to ctx.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-14-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Replace the static list of struct_ops types with per-btf struct_ops_tab to
enable dynamic registration.
Both bpf_dummy_ops and bpf_tcp_ca now utilize the registration function
instead of being listed in bpf_struct_ops_types.h.
Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-12-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
A value_type should consist of three components: refcnt, state, and data.
refcnt and state has been move to struct bpf_struct_ops_common_value to
make it easier to check the value type.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-11-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
To ensure that a module remains accessible whenever a struct_ops object of
a struct_ops type provided by the module is still in use.
struct bpf_struct_ops_map doesn't hold a refcnt to btf anymore since a
module will hold a refcnt to it's btf already. But, struct_ops programs are
different. They hold their associated btf, not the module since they need
only btf to assure their types (signatures).
However, verifier holds the refcnt of the associated module of a struct_ops
type temporarily when verify a struct_ops prog. Verifier needs the help
from the verifier operators (struct bpf_verifier_ops) provided by the owner
module to verify data access of a prog, provide information, and generate
code.
This patch also add a count of links (links_cnt) to bpf_struct_ops_map. It
avoids bpf_struct_ops_map_put_progs() from accessing btf after calling
module_put() in bpf_struct_ops_map_free().
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-10-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Pass the fd of a btf from the userspace to the bpf() syscall, and then
convert the fd into a btf. The btf is generated from the module that
defines the target BPF struct_ops type.
In order to inform the kernel about the module that defines the target
struct_ops type, the userspace program needs to provide a btf fd for the
respective module's btf. This btf contains essential information on the
types defined within the module, including the target struct_ops type.
A btf fd must be provided to the kernel for struct_ops maps and for the bpf
programs attached to those maps.
In the case of the bpf programs, the attach_btf_obj_fd parameter is passed
as part of the bpf_attr and is converted into a btf. This btf is then
stored in the prog->aux->attach_btf field. Here, it just let the verifier
access attach_btf directly.
In the case of struct_ops maps, a btf fd is passed as value_type_btf_obj_fd
of bpf_attr. The bpf_struct_ops_map_alloc() function converts the fd to a
btf and stores it as st_map->btf. A flag BPF_F_VTYPE_BTF_OBJ_FD is added
for map_flags to indicate that the value of value_type_btf_obj_fd is set.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-9-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This is a preparation for searching for struct_ops types from a specified
module. BTF is always btf_vmlinux now. This patch passes a pointer of BTF
to bpf_struct_ops_find_value() and bpf_struct_ops_find(). Once the new
registration API of struct_ops types is used, other BTFs besides
btf_vmlinux can also be passed to them.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-8-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Include btf object id (btf_obj_id) in bpf_map_info so that tools (ex:
bpftools struct_ops dump) know the correct btf from the kernel to look up
type information of struct_ops types.
Since struct_ops types can be defined and registered in a module. The
type information of a struct_ops type are defined in the btf of the
module defining it. The userspace tools need to know which btf is for
the module defining a struct_ops type.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-7-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Once new struct_ops can be registered from modules, btf_vmlinux is no
longer the only btf that struct_ops_map would face. st_map should remember
what btf it should use to get type information.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-6-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Maintain a registry of registered struct_ops types in the per-btf (module)
struct_ops_tab. This registry allows for easy lookup of struct_ops types
that are registered by a specific module.
It is a preparation work for supporting kernel module struct_ops in a
latter patch. Each struct_ops will be registered under its own kernel
module btf and will be stored in the newly added btf->struct_ops_tab. The
bpf verifier and bpf syscall (e.g. prog and map cmd) can find the
struct_ops and its btf type/size/id... information from
btf->struct_ops_tab.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-5-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Move some of members of bpf_struct_ops to bpf_struct_ops_desc. type_id is
unavailabe in bpf_struct_ops anymore. Modules should get it from the btf
received by kmod's init function.
Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-4-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Get ready to remove bpf_struct_ops_init() in the future. By using
BTF_ID_LIST, it is possible to gather type information while building
instead of runtime.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-3-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Move the majority of the code to bpf_struct_ops_init_one(), which can then
be utilized for the initialization of newly registered dynamically
allocated struct_ops types in the following patches.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-2-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Storing cookies in kprobe_multi bpf_link_info data. The cookies
field is optional and if provided it needs to be an array of
__u64 with kprobe_multi.count length.
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240119110505.400573-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
At the moment we don't store cookie for perf_event probes,
while we do that for the rest of the probes.
Adding cookie fields to struct bpf_link_info perf event
probe records:
perf_event.uprobe
perf_event.kprobe
perf_event.tracepoint
perf_event.perf_event
And the code to store that in bpf_link_info struct.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20240119110505.400573-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Current checking rules are structured to disallow alu on particular ptr
types explicitly, so default cases are allowed implicitly. This may lead
to newly added ptr types being allowed unexpectedly. So restruture it to
allow alu explicitly. The tradeoff is mainly a bit more cases added in
the switch. The following table from Eduard summarizes the rules:
| Pointer type | Arithmetics allowed |
|---------------------+---------------------|
| PTR_TO_CTX | yes |
| CONST_PTR_TO_MAP | conditionally |
| PTR_TO_MAP_VALUE | yes |
| PTR_TO_MAP_KEY | yes |
| PTR_TO_STACK | yes |
| PTR_TO_PACKET_META | yes |
| PTR_TO_PACKET | yes |
| PTR_TO_PACKET_END | no |
| PTR_TO_FLOW_KEYS | conditionally |
| PTR_TO_SOCKET | no |
| PTR_TO_SOCK_COMMON | no |
| PTR_TO_TCP_SOCK | no |
| PTR_TO_TP_BUFFER | yes |
| PTR_TO_XDP_SOCK | no |
| PTR_TO_BTF_ID | yes |
| PTR_TO_MEM | yes |
| PTR_TO_BUF | yes |
| PTR_TO_FUNC | yes |
| CONST_PTR_TO_DYNPTR | yes |
The refactored rules are equivalent to the original one. Note that
PTR_TO_FUNC and CONST_PTR_TO_DYNPTR are not reject here because: (1)
check_mem_access() rejects load/store on those ptrs, and those ptrs
with offset passing to calls are rejected check_func_arg_reg_off();
(2) someone may rely on the verifier not rejecting programs earily.
Signed-off-by: Hao Sun <sunhao.th@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240117094012.36798-1-sunhao.th@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With patch set [1], precision backtracing supports register spill/fill
to/from the stack. The patch [2] allows initial imprecise register spill
with content 0. This is a common case for cpuv3 and lower for
initializing the stack variables with pattern
r1 = 0
*(u64 *)(r10 - 8) = r1
and the [2] has demonstrated good verification improvement.
For cpuv4, the initialization could be
*(u64 *)(r10 - 8) = 0
The current verifier marks the r10-8 contents with STACK_ZERO.
Similar to [2], let us permit the above insn to behave like
imprecise register spill which can reduce number of verified states.
The change is in function check_stack_write_fixed_off().
Before this patch, spilled zero will be marked as STACK_ZERO
which can provide precise values. In check_stack_write_var_off(),
STACK_ZERO will be maintained if writing a const zero
so later it can provide precise values if needed.
The above handling of '*(u64 *)(r10 - 8) = 0' as a spill
will have issues in check_stack_write_var_off() as the spill
will be converted to STACK_MISC and the precise value 0
is lost. To fix this issue, if the spill slots with const
zero and the BPF_ST write also with const zero, the spill slots
are preserved, which can later provide precise values
if needed. Without the change in check_stack_write_var_off(),
the test_verifier subtest 'BPF_ST_MEM stack imm zero, variable offset'
will fail.
I checked cpuv3 and cpuv4 with and without this patch with veristat.
There is no state change for cpuv3 since '*(u64 *)(r10 - 8) = 0'
is only generated with cpuv4.
For cpuv4:
$ ../veristat -C old.cpuv4.csv new.cpuv4.csv -e file,prog,insns,states -f 'insns_diff!=0'
File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF)
------------------------------------------ ------------------- --------- --------- --------------- ---------- ---------- -------------
local_storage_bench.bpf.linked3.o get_local 228 168 -60 (-26.32%) 17 14 -3 (-17.65%)
pyperf600_bpf_loop.bpf.linked3.o on_event 6066 4889 -1177 (-19.40%) 403 321 -82 (-20.35%)
test_cls_redirect.bpf.linked3.o cls_redirect 35483 35387 -96 (-0.27%) 2179 2177 -2 (-0.09%)
test_l4lb_noinline.bpf.linked3.o balancer_ingress 4494 4522 +28 (+0.62%) 217 219 +2 (+0.92%)
test_l4lb_noinline_dynptr.bpf.linked3.o balancer_ingress 1432 1455 +23 (+1.61%) 92 94 +2 (+2.17%)
test_xdp_noinline.bpf.linked3.o balancer_ingress_v6 3462 3458 -4 (-0.12%) 216 216 +0 (+0.00%)
verifier_iterating_callbacks.bpf.linked3.o widening 52 41 -11 (-21.15%) 4 3 -1 (-25.00%)
xdp_synproxy_kern.bpf.linked3.o syncookie_tc 12412 11719 -693 (-5.58%) 345 330 -15 (-4.35%)
xdp_synproxy_kern.bpf.linked3.o syncookie_xdp 12478 11794 -684 (-5.48%) 346 331 -15 (-4.34%)
test_l4lb_noinline and test_l4lb_noinline_dynptr has minor regression, but
pyperf600_bpf_loop and local_storage_bench gets pretty good improvement.
[1] https://lore.kernel.org/all/20231205184248.1502704-1-andrii@kernel.org/
[2] https://lore.kernel.org/all/20231205184248.1502704-9-andrii@kernel.org/
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240110051348.2737007-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, when a scalar bounded register is spilled to the stack, its
ID is preserved, but only if was already assigned, i.e. if this register
was MOVed before.
Assign an ID on spill if none is set, so that equal scalars could be
tracked if a register is spilled to the stack and filled into another
register.
One test is adjusted to reflect the change in register IDs.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-9-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Put calculation of the register value width into a dedicated function.
This function will also be used in a following commit.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Link: https://lore.kernel.org/r/20240108205209.838365-8-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Extract the common code that generates a register ID for src_reg before
MOV if needed into a new function. This function will also be used in
a following commit.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-7-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Current infinite loops detection mechanism is speculative:
- first, states_maybe_looping() check is done which simply does memcmp
for R1-R10 in current frame;
- second, states_equal(..., exact=false) is called. With exact=false
states_equal() would compare scalars for equality only if in old
state scalar has precision mark.
Such logic might be problematic if compiler makes some unlucky stack
spill/fill decisions. An artificial example of a false positive looks
as follows:
r0 = ... unknown scalar ...
r0 &= 0xff;
*(u64 *)(r10 - 8) = r0;
r0 = 0;
loop:
r0 = *(u64 *)(r10 - 8);
if r0 > 10 goto exit_;
r0 += 1;
*(u64 *)(r10 - 8) = r0;
r0 = 0;
goto loop;
This commit updates call to states_equal to use exact=true, forcing
all scalar comparisons to be exact.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-3-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add ability to iterate multiple decl_tag types pointed to the same
function argument. Use this to support multiple __arg_xxx tags per
global subprog argument.
We leave btf_find_decl_tag_value() intact, but change its implementation
to use a new btf_find_next_decl_tag() which can be straightforwardly
used to find next BTF type ID of a matching btf_decl_tag type.
btf_prepare_func_args() is switched from btf_find_decl_tag_value() to
btf_find_next_decl_tag() to gain multiple tags per argument support.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add btf_arg_tag flags enum to be able to record multiple tags per
argument. Also streamline pointer argument processing some more.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Move scalar arg processing in btf_prepare_func_args() after all pointer
arg processing is done. This makes it easier to do validation. One
example of unintended behavior right now is ability to specify
__arg_nonnull for integer/enum arguments. This patch fixes this.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240105000909.2818934-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The motivation of inlining bpf_kptr_xchg() comes from the performance
profiling of bpf memory allocator benchmark. The benchmark uses
bpf_kptr_xchg() to stash the allocated objects and to pop the stashed
objects for free. After inling bpf_kptr_xchg(), the performance for
object free on 8-CPUs VM increases about 2%~10%. The inline also has
downside: both the kasan and kcsan checks on the pointer will be
unavailable.
bpf_kptr_xchg() can be inlined by converting the calling of
bpf_kptr_xchg() into an atomic_xchg() instruction. But the conversion
depends on two conditions:
1) JIT backend supports atomic_xchg() on pointer-sized word
2) For the specific arch, the implementation of xchg is the same as
atomic_xchg() on pointer-sized words.
It seems most 64-bit JIT backends satisfies these two conditions. But
as a precaution, defining a weak function bpf_jit_supports_ptr_xchg()
to state whether such conversion is safe and only supporting inline for
64-bit host.
For x86-64, it supports BPF_XCHG atomic operation and both xchg() and
atomic_xchg() use arch_xchg() to implement the exchange, so enabling the
inline of bpf_kptr_xchg() on x86-64 first.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20240105104819.3916743-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Running the following two commands in parallel on a multi-processor
AArch64 machine can sporadically produce an unexpected warning about
duplicate histogram entries:
$ while true; do
echo hist:key=id.syscall:val=hitcount > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
sleep 0.001
done
$ stress-ng --sysbadaddr $(nproc)
The warning looks as follows:
[ 2911.172474] ------------[ cut here ]------------
[ 2911.173111] Duplicates detected: 1
[ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408
[ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E)
[ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1
[ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G E 6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01
[ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018
[ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408
[ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408
[ 2911.185310] sp : ffff8000a1513900
[ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001
[ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008
[ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180
[ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff
[ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8
[ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731
[ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c
[ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8
[ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000
[ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480
[ 2911.194259] Call trace:
[ 2911.194626] tracing_map_sort_entries+0x3e0/0x408
[ 2911.195220] hist_show+0x124/0x800
[ 2911.195692] seq_read_iter+0x1d4/0x4e8
[ 2911.196193] seq_read+0xe8/0x138
[ 2911.196638] vfs_read+0xc8/0x300
[ 2911.197078] ksys_read+0x70/0x108
[ 2911.197534] __arm64_sys_read+0x24/0x38
[ 2911.198046] invoke_syscall+0x78/0x108
[ 2911.198553] el0_svc_common.constprop.0+0xd0/0xf8
[ 2911.199157] do_el0_svc+0x28/0x40
[ 2911.199613] el0_svc+0x40/0x178
[ 2911.200048] el0t_64_sync_handler+0x13c/0x158
[ 2911.200621] el0t_64_sync+0x1a8/0x1b0
[ 2911.201115] ---[ end trace 0000000000000000 ]---
The problem appears to be caused by CPU reordering of writes issued from
__tracing_map_insert().
The check for the presence of an element with a given key in this
function is:
val = READ_ONCE(entry->val);
if (val && keys_match(key, val->key, map->key_size)) ...
The write of a new entry is:
elt = get_free_elt(map);
memcpy(elt->key, key, map->key_size);
entry->val = elt;
The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;"
stores may become visible in the reversed order on another CPU. This
second CPU might then incorrectly determine that a new key doesn't match
an already present val->key and subsequently insert a new element,
resulting in a duplicate.
Fix the problem by adding a write barrier between
"memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for
good measure, also use WRITE_ONCE(entry->val, elt) for publishing the
element. The sequence pairs with the mentioned "READ_ONCE(entry->val);"
and the "val->key" check which has an address dependency.
The barrier is placed on a path executed when adding an element for
a new key. Subsequent updates targeting the same key remain unaffected.
From the user's perspective, the issue was introduced by commit
c193707dde ("tracing: Remove code which merges duplicates"), which
followed commit cbf4100efb ("tracing: Add support to detect and avoid
duplicates"). The previous code operated differently; it inherently
expected potential races which result in duplicates but merged them
later when they occurred.
Link: https://lore.kernel.org/linux-trace-kernel/20240122150928.27725-1-petr.pavlu@suse.com
Fixes: c193707dde ("tracing: Remove code which merges duplicates")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- A fix for the idle and iowait time accounting vs. CPU hotplug.
The time is reset on CPU hotplug which makes the accumulated
systemwide time jump backwards.
- Assorted fixes and improvements for clocksource/event drivers
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmWtTLgTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUXiD/4uN4Ntps8TwxSdg1X11M6++rizg9q9
EmIfwWcfQQJDM5Ss5FE88ye55NxIOwJ1brYo08+yTAXjnnZ/yNP1BBegHbMNiGil
NCHye7tYKZle25+hErdgfBB9n6brPz7dPOvV04/wRRWW+9p2ejt/5nEvojkyco9Y
S9KgBCxkvUqScMbdKKFW1UsThWh2euxwQXRGiWhTPPkbKcVynPvQJjvVyRxn01NS
eEhTn8YUNcAPT+1YApouGXrSCxo/IzBJ36CxOoCoUfaXcJ6FG1LLeAjNxKZ26Dfs
Ah0e3Hhyv6KOsBvBNwwabXDwryd6L8rZd8yL2KakI1vIC51uS2wneFy8GCieDVGh
xmy3U/tfkS0L7pmN+dQW2l4k9PHRNrwvbISKhs0UAHSOgGIMHZcjE6aFbYKru5i4
1W+dEjiktlceZ94mrEHbLpKmxWH2z5P8m0BzUs4kt3nkaOf6CTUKqa/qdAiU5dv+
lovKT26L8HBrMXf48I70UpgW/bYzOUGk55sR6hiLTXAelz1z02D1uYHFkshc0NCO
/O4wvHcgvMM46CtWVbim42AlRcyyWCr+FrY+jvfiG2icOcHPLqc81iHL8EKj7pJl
IxLgyPHVckgnE5gx+GQ8aDkg/qwCZnj4rFWgub8QMYtjI+pO+9T9kPAYPCxFhP7J
gmcJxZAB2RnKXA==
=RD6E
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for time and clocksources:
- A fix for the idle and iowait time accounting vs CPU hotplug.
The time is reset on CPU hotplug which makes the accumulated
systemwide time jump backwards.
- Assorted fixes and improvements for clocksource/event drivers"
* tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
clocksource/drivers/ep93xx: Fix error handling during probe
clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
clocksource/timer-riscv: Add riscv_clock_shutdown callback
dt-bindings: timer: Add StarFive JH8100 clint
dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
When offlining and onlining CPUs the overall reported idle and iowait
times as reported by /proc/stat jump backward and forward:
cpu 132 0 176 225249 47 6 6 21 0 0
cpu0 80 0 115 112575 33 3 4 18 0 0
cpu1 52 0 60 112673 13 3 1 2 0 0
cpu 133 0 177 226681 47 6 6 21 0 0
cpu0 80 0 116 113387 33 3 4 18 0 0
cpu 133 0 178 114431 33 6 6 21 0 0 <---- jump backward
cpu0 80 0 116 114247 33 3 4 18 0 0
cpu1 52 0 61 183 0 3 1 2 0 0 <---- idle + iowait start with 0
cpu 133 0 178 228956 47 6 6 21 0 0 <---- jump forward
cpu0 81 0 117 114929 33 3 4 18 0 0
Reason for this is that get_idle_time() in fs/proc/stat.c has different
sources for both values depending on if a CPU is online or offline:
- if a CPU is online the values may be taken from its per cpu
tick_cpu_sched structure
- if a CPU is offline the values are taken from its per cpu cpustat
structure
The problem is that the per cpu tick_cpu_sched structure is set to zero on
CPU offline. See tick_cancel_sched_timer() in kernel/time/tick-sched.c.
Therefore when a CPU is brought offline and online afterwards both its idle
and iowait sleeptime will be zero, causing a jump backward in total system
idle and iowait sleeptime. In a similar way if a CPU is then brought
offline again the total idle and iowait sleeptimes will jump forward.
It looks like this behavior was introduced with commit 4b0c0f294f
("tick: Cleanup NOHZ per cpu data on cpu down").
This was only noticed now on s390, since we switched to generic idle time
reporting with commit be76ea6144 ("s390/idle: remove arch_cpu_idle_time()
and corresponding code").
Fix this by preserving the values of idle_sleeptime and iowait_sleeptime
members of the per-cpu tick_sched structure on CPU hotplug.
Fixes: 4b0c0f294f ("tick: Cleanup NOHZ per cpu data on cpu down")
Reported-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240115163555.1004144-1-hca@linux.ibm.com
Jiri Slaby reported a futex state inconsistency resulting in -EINVAL during
a lock operation for a PI futex. It requires that the a lock process is
interrupted by a timeout or signal:
T1 Owns the futex in user space.
T2 Tries to acquire the futex in kernel (futex_lock_pi()). Allocates a
pi_state and attaches itself to it.
T2 Times out and removes its rt_waiter from the rt_mutex. Drops the
rtmutex lock and tries to acquire the hash bucket lock to remove
the futex_q. The lock is contended and T2 schedules out.
T1 Unlocks the futex (futex_unlock_pi()). Finds a futex_q but no
rt_waiter. Unlocks the futex (do_uncontended) and makes it available
to user space.
T3 Acquires the futex in user space.
T4 Tries to acquire the futex in kernel (futex_lock_pi()). Finds the
existing futex_q of T2 and tries to attach itself to the existing
pi_state. This (attach_to_pi_state()) fails with -EINVAL because uval
contains the TID of T3 but pi_state points to T1.
It's incorrect to unlock the futex and make it available for user space to
acquire as long as there is still an existing state attached to it in the
kernel.
T1 cannot hand over the futex to T2 because T2 already gave up and started
to clean up and is blocked on the hash bucket lock, so T2's futex_q with
the pi_state pointing to T1 is still queued.
T2 observes the futex_q, but ignores it as there is no waiter on the
corresponding rt_mutex and takes the uncontended path which allows the
subsequent caller of futex_lock_pi() (T4) to observe that stale state.
To prevent this the unlock path must dequeue all futex_q entries which
point to the same pi_state when there is no waiter on the rt mutex. This
requires obviously to make the dequeue conditional in the locking path to
prevent a double dequeue. With that it's guaranteed that user space cannot
observe an uncontended futex which has kernel state attached.
Fixes: fbeb558b0d ("futex/pi: Fix recursive rt_mutex waiter state")
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20240118115451.0TkD_ZhB@linutronix.de
Closes: https://lore.kernel.org/all/4611bcf2-44d0-4c34-9b84-17406f881003@kernel.org
The entire changeset for kgdb this cycle is a single two-line change to
remove some deadcode that, had it not been dead, would have called
strncpy() in an unsafe manner.
To be fair there were other modest clean ups were discussed this cycle
but they are not finalized and will have to wait until next time.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmWpMp8ACgkQfOMlXTn3
iKE2YQ/9HGgROWCSQPHXmly/VFbcvp/0J77+XA4yNLqTKMfyV1ZHQ5xWoiDp2Q0Q
o689iYQ6V6YWl1/KNR5c7Xct82zgKaSksypTvIZBvk7vPVCElkQM9tpvE9VahMr7
/YB59GFr83Rks7a0tfQ6SMUFWyFFtaU3YdV1374CfGTmlFiqMBbfCz40izjgT7so
yLbbDCZxIaNBlRnRore1kPTt/3KAih6udai74H+OmlifWR0saS57Hwywlfd2H3yv
Q3bFK4NkYLkptJZVmAC8B68YK5c4W24Y+lQ9fouKiFMajym5ONHLFsmdbZe6MrPk
CUU2cGgDvei8nyAXAucocr5ndvJMi8R/JNunBNbnI6oYnT7K1MW3u+v1R+SRYm22
Ixza37hcFkR+7wWnXsPB/POQo74C5ylZt5Of20ogzBEm8qjY7zAGoVZhNh3Ny4t2
NGHwm6MR9613dIh/cVuD4waZdWS2TWYkaTA1E8PwIIqSnKfmMo0pJQU7ddy5V8Kb
F6un3/+9IA6mBQLOqPuFpHZNzIMkDdgE4GvZIrF/DIzlHeWxnk6hDOO2ftbGgsIY
BTelXfE1L8UoHlhixnjnFrTxRNjsZE8CGLY/Zd++b1mhY6tCwSGUS+cInUiIf/3k
X9bN+rwzxV+MVx2V5j/g2i2sEP2YELgCGDKtb7WbXDv43T+Q6/I=
=dLyK
-----END PGP SIGNATURE-----
Merge tag 'kgdb-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb update from Daniel Thompson:
"The entire changeset for kgdb this cycle is a single two-line change
to remove some deadcode that, had it not been dead, would have called
strncpy() in an unsafe manner.
To be fair there were other modest clean ups were discussed this cycle
but they are not finalized and will have to wait until next time"
* tag 'kgdb-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: Fix a potential buffer overflow in kdb_local()
Previous releases - regressions:
- Revert "net: rtnetlink: Enslave device before bringing it up",
breaks the case inverse to the one it was trying to fix
- net: dsa: fix oob access in DSA's netdevice event handler
dereference netdev_priv() before check its a DSA port
- sched: track device in tcf_block_get/put_ext() only for clsact
binder types
- net: tls, fix WARNING in __sk_msg_free when record becomes full
during splice and MORE hint set
- sfp-bus: fix SFP mode detect from bitrate
- drv: stmmac: prevent DSA tags from breaking COE
Previous releases - always broken:
- bpf: fix no forward progress in in bpf_iter_udp if output
buffer is too small
- bpf: reject variable offset alu on registers with a type
of PTR_TO_FLOW_KEYS to prevent oob access
- netfilter: tighten input validation
- net: add more sanity check in virtio_net_hdr_to_skb()
- rxrpc: fix use of Don't Fragment flag on RESPONSE packets,
avoid infinite loop
- amt: do not use the portion of skb->cb area which may get clobbered
- mptcp: improve validation of the MPTCPOPT_MP_JOIN MCTCP option
Misc:
- spring cleanup of inactive maintainers
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmWpnvoACgkQMUZtbf5S
Irvskg/+Or5tETxOmpQXxnj6ECZyrSp0Jcyd7+TIcos/7JfPdn3Kebl004SG4h/s
bwKDOIIP1iSjQ+0NFsPjyYIVd6wFuCElSB7npV5uQAT6ptXx7A4Ym68/rVxodI8T
6hiYV/mlPuZF8JjRhtp/VJL8sY1qnG7RIUB4oH3y9HQNfwZX0lIWChuUilHuWfbq
zQ2Iu97tMkoIBjXrkIT3Qaj0aFxYbjCOrg9zy+FZ69a7Rmrswr//7amlCH6saNTx
Ku7Wl8FXhe7O23OiM6GSl7AechSM1aJ5kOS3orseej0+aSp9eH3ekYGmbsQr6sjz
ix/eZ7V7SUkJK3bEH5haeymk4TDV3lHE8SziMbosK4wVbHOyPwEmqCxppADYJLZs
WycHZKcTBluFBOxknAofH7m5Hh0ToXkeTfpptSSGtRB4WncAOMsMapr3yS4WXg/q
AnOo/tzCBgMrnSJtD/kjqgUiCk8vYoLc8lBR9K74l0zqI1sf13OfuTHvEgqIS6z1
Ir/ewlAV6fCH8gQbyzjKUVlyjZS4+vFv19xg/2GgLf+LdyzcCOxUZkND3/DE6+OA
Dgf9gtABYU4hGXMUfTfml3KCBTF65QmY8dIh17zraNylYUHEJ2lI4D+sdiqWUrXb
mXPBJh4nOPwIV5t2gT80skNwF3aWPr6l4ieY2codSbP04rO74S8=
=YhQQ
-----END PGP SIGNATURE-----
Merge tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and netfilter.
Previous releases - regressions:
- Revert "net: rtnetlink: Enslave device before bringing it up",
breaks the case inverse to the one it was trying to fix
- net: dsa: fix oob access in DSA's netdevice event handler
dereference netdev_priv() before check its a DSA port
- sched: track device in tcf_block_get/put_ext() only for clsact
binder types
- net: tls, fix WARNING in __sk_msg_free when record becomes full
during splice and MORE hint set
- sfp-bus: fix SFP mode detect from bitrate
- drv: stmmac: prevent DSA tags from breaking COE
Previous releases - always broken:
- bpf: fix no forward progress in in bpf_iter_udp if output buffer is
too small
- bpf: reject variable offset alu on registers with a type of
PTR_TO_FLOW_KEYS to prevent oob access
- netfilter: tighten input validation
- net: add more sanity check in virtio_net_hdr_to_skb()
- rxrpc: fix use of Don't Fragment flag on RESPONSE packets, avoid
infinite loop
- amt: do not use the portion of skb->cb area which may get clobbered
- mptcp: improve validation of the MPTCPOPT_MP_JOIN MCTCP option
Misc:
- spring cleanup of inactive maintainers"
* tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
i40e: Include types.h to some headers
ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_work
selftests: mlxsw: qos_pfc: Adjust the test to support 8 lanes
selftests: mlxsw: qos_pfc: Remove wrong description
mlxsw: spectrum_router: Register netdevice notifier before nexthop
mlxsw: spectrum_acl_tcam: Fix stack corruption
mlxsw: spectrum_acl_tcam: Fix NULL pointer dereference in error path
mlxsw: spectrum_acl_erp: Fix error flow of pool allocation failure
ethtool: netlink: Add missing ethnl_ops_begin/complete
selftests: bonding: Add more missing config options
selftests: netdevsim: add a config file
libbpf: warn on unexpected __arg_ctx type when rewriting BTF
selftests/bpf: add tests confirming type logic in kernel for __arg_ctx
bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
bpf: extract bpf_ctx_convert_map logic and make it more reusable
libbpf: feature-detect arg:ctx tag support in kernel
ipvs: avoid stat macros calls from preemptible context
netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description
netfilter: nf_tables: skip dead set elements in netlink dump
netfilter: nf_tables: do not allow mismatch field size and set key length
...
Including:
- Core changes:
- Fix race conditions in device probe path
- Retire IOMMU bus_ops
- Support for passing custom allocators to page table drivers
- Clean up Kconfig around IOMMU_SVA
- Support for sharing SVA domains with all devices bound to
a mm
- Firmware data parsing cleanup
- Tracing improvements for iommu-dma code
- Some smaller fixes and cleanups
- ARM-SMMU drivers:
- Device-tree binding updates:
- Add additional compatible strings for Qualcomm SoCs
- Document Adreno clocks for Qualcomm's SM8350 SoC
- SMMUv2:
- Implement support for the ->domain_alloc_paging() callback
- Ensure Secure context is restored following suspend of Qualcomm SMMU
implementation
- SMMUv3:
- Disable stalling mode for the "quiet" context descriptor
- Minor refactoring and driver cleanups
- Intel VT-d driver:
- Cleanup and refactoring
- AMD IOMMU driver:
- Improve IO TLB invalidation logic
- Small cleanups and improvements
- Rockchip IOMMU driver:
- DT binding update to add Rockchip RK3588
- Apple DART driver:
- Apple M1 USB4/Thunderbolt DART support
- Cleanups
- Virtio IOMMU driver:
- Add support for iotlb_sync_map
- Enable deferred IO TLB flushes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmWecQoACgkQK/BELZcB
GuN5ZxAAzC5QUKAzANx0puk7QhPpKKlbSvj6Q7iRgCLk00KJO1+VQh9v4ouCmXqF
kn3Ko8gddjhtrgwN0OQ54F39cLUrp1SBemy71K5YOR+vu8VKtwtmawZGeeRZ+k+B
Eohw58oaXTiR1maYvoLixLYczLrjklqyJOQ1vZ0GxFGxDqrFByAryHDgG/3OCpJx
C9e6PsLbbfhfqA8Kv97iKcBqniGbXxAMuodqSUG0buQ3oZgfpIP6Bt3EgUzFGPGk
3BTlYxowS/gkjUWd3fgjQFIFLTA01u9FhpA2Jb0a4v67pUCR64YxHN7rBQ6ZChtG
kB9laQfU9re79RsHhqQzr0JT9x/eyq7pzGzjp5TV5TPW6IW+sqjMIPhzd9P08Ef7
BclkCVobx0jSAHOhnnG4QJiKANr2Y2oM3HfsAJccMMY45RRhUKmVqM7jxMPfGn3A
i+inlee73xTjZXJse1EWG1fmKKMLvX9LDEp4DyOfn9CqVT+7hpZvzPjfbGr937Rm
JlwXhF3rQXEpOCagEsbt1vOf+V0e9QiCLf1Y2KpkIkDbE5wwSD/2qLm3tFhJG3oF
fkW+J14Cid0pj+hY0afGe0kOUOIYlimu0nFmSf0pzMH+UktZdKogSfyb1gSDsy+S
rsZRGPFhMJ832ExqhlDfxqBebqh+jsfKynlskui6Td5C9ZULaHA=
=q751
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"Core changes:
- Fix race conditions in device probe path
- Retire IOMMU bus_ops
- Support for passing custom allocators to page table drivers
- Clean up Kconfig around IOMMU_SVA
- Support for sharing SVA domains with all devices bound to a mm
- Firmware data parsing cleanup
- Tracing improvements for iommu-dma code
- Some smaller fixes and cleanups
ARM-SMMU drivers:
- Device-tree binding updates:
- Add additional compatible strings for Qualcomm SoCs
- Document Adreno clocks for Qualcomm's SM8350 SoC
- SMMUv2:
- Implement support for the ->domain_alloc_paging() callback
- Ensure Secure context is restored following suspend of Qualcomm
SMMU implementation
- SMMUv3:
- Disable stalling mode for the "quiet" context descriptor
- Minor refactoring and driver cleanups
Intel VT-d driver:
- Cleanup and refactoring
AMD IOMMU driver:
- Improve IO TLB invalidation logic
- Small cleanups and improvements
Rockchip IOMMU driver:
- DT binding update to add Rockchip RK3588
Apple DART driver:
- Apple M1 USB4/Thunderbolt DART support
- Cleanups
Virtio IOMMU driver:
- Add support for iotlb_sync_map
- Enable deferred IO TLB flushes"
* tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (66 commits)
iommu: Don't reserve 0-length IOVA region
iommu/vt-d: Move inline helpers to header files
iommu/vt-d: Remove unused vcmd interfaces
iommu/vt-d: Remove unused parameter of intel_pasid_setup_pass_through()
iommu/vt-d: Refactor device_to_iommu() to retrieve iommu directly
iommu/sva: Fix memory leak in iommu_sva_bind_device()
dt-bindings: iommu: rockchip: Add Rockchip RK3588
iommu/dma: Trace bounce buffer usage when mapping buffers
iommu/arm-smmu: Convert to domain_alloc_paging()
iommu/arm-smmu: Pass arm_smmu_domain to internal functions
iommu/arm-smmu: Implement IOMMU_DOMAIN_BLOCKED
iommu/arm-smmu: Convert to a global static identity domain
iommu/arm-smmu: Reorganize arm_smmu_domain_add_master()
iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Master cannot be NULL in arm_smmu_write_strtab_ent()
iommu/arm-smmu-v3: Add a type for the STE
iommu/arm-smmu-v3: disable stall for quiet_cd
iommu/qcom: restore IOMMU state if needed
iommu/arm-smmu-qcom: Add QCM2290 MDSS compatible
iommu/arm-smmu-qcom: Add missing GMU entry to match table
...
- Allow kernel trace instance creation to specify what events are created
Inside the kernel, a subsystem may create a tracing instance that it can
use to send events to user space. This sub-system may not care about the
thousands of events that exist in eventfs. Allow the sub-system to specify
what sub-systems of events it cares about, and only those events are exposed
to this instance.
- Allow the ring buffer to be broken up into bigger sub-buffers than just the
architecture page size. A new tracefs file called "buffer_subbuf_size_kb"
is created. The user can now specify a minimum size the sub-buffer may be
in kilobytes. Note, that the implementation currently make the sub-buffer
size a power of 2 pages (1, 2, 4, 8, 16, ...) but the user only writes in
kilobyte size, and the sub-buffer will be updated to the next size that
it will can accommodate it. If the user writes in 10, it will change the
size to be 4 pages on x86 (16K), as that is the next available size that
can hold 10K pages.
- Update the debug output when a corrupt time is detected in the ring buffer.
If the ring buffer detects inconsistent timestamps, there's a debug config
options that will dump the contents of the meta data of the sub-buffer that
is used for debugging. Add some more information to this dump that helps
with debugging.
- Add more timestamp debugging checks (only triggers when the config is enabled)
- Increase the trace_seq iterator to 2 page sizes.
- Allow strings written into tracefs_marker to be larger. Up to just under
2 page sizes (based on what trace_seq can hold).
- Increase the trace_maker_raw write to be as big as a sub-buffer can hold.
- Remove 32 bit time stamp logic, now that the rb_time_cmpxchg() has been
removed.
- More selftests were added.
- Some code clean ups as well.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZZ8p3BQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6ql2GAQDZg/zlFEiJHyTfWbCIE8pA3T5xbzKo
26TNxIZAxJJZpQEAvGFU5Smy14pG6soEoVMp8B6ZOANbqU8VVamhOL+r+Qw=
=0OYG
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Allow kernel trace instance creation to specify what events are
created
Inside the kernel, a subsystem may create a tracing instance that it
can use to send events to user space. This sub-system may not care
about the thousands of events that exist in eventfs. Allow the
sub-system to specify what sub-systems of events it cares about, and
only those events are exposed to this instance.
- Allow the ring buffer to be broken up into bigger sub-buffers than
just the architecture page size.
A new tracefs file called "buffer_subbuf_size_kb" is created. The
user can now specify a minimum size the sub-buffer may be in
kilobytes. Note, that the implementation currently make the
sub-buffer size a power of 2 pages (1, 2, 4, 8, 16, ...) but the user
only writes in kilobyte size, and the sub-buffer will be updated to
the next size that it will can accommodate it. If the user writes in
10, it will change the size to be 4 pages on x86 (16K), as that is
the next available size that can hold 10K pages.
- Update the debug output when a corrupt time is detected in the ring
buffer. If the ring buffer detects inconsistent timestamps, there's a
debug config options that will dump the contents of the meta data of
the sub-buffer that is used for debugging. Add some more information
to this dump that helps with debugging.
- Add more timestamp debugging checks (only triggers when the config is
enabled)
- Increase the trace_seq iterator to 2 page sizes.
- Allow strings written into tracefs_marker to be larger. Up to just
under 2 page sizes (based on what trace_seq can hold).
- Increase the trace_maker_raw write to be as big as a sub-buffer can
hold.
- Remove 32 bit time stamp logic, now that the rb_time_cmpxchg() has
been removed.
- More selftests were added.
- Some code clean ups as well.
* tag 'trace-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits)
ring-buffer: Remove stale comment from ring_buffer_size()
tracing histograms: Simplify parse_actions() function
tracing/selftests: Remove exec permissions from trace_marker.tc test
ring-buffer: Use subbuf_order for buffer page masking
tracing: Update subbuffer with kilobytes not page order
ringbuffer/selftest: Add basic selftest to test changing subbuf order
ring-buffer: Add documentation on the buffer_subbuf_order file
ring-buffer: Just update the subbuffers when changing their allocation order
ring-buffer: Keep the same size when updating the order
tracing: Stop the tracing while changing the ring buffer subbuf size
tracing: Update snapshot order along with main buffer order
ring-buffer: Make sure the spare sub buffer used for reads has same size
ring-buffer: Do no swap cpu buffers if order is different
ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure
ring-buffer: Read and write to ring buffers with custom sub buffer size
ring-buffer: Set new size of the ring buffer sub page
ring-buffer: Add interface for configuring trace sub buffer size
ring-buffer: Page size per ring buffer
ring-buffer: Have ring_buffer_print_page_header() be able to access ring_buffer_iter
ring-buffer: Check if absolute timestamp goes backwards
...
- Kprobes trace event to show the actual function name in notrace-symbol
warning. Instead of using user specified symbol name, use "%ps" printk
format to show the actual symbol at the probe address. Since kprobe
event accepts the offset from symbol which is bigger than the symbol
size, user specified symbol may not be the actual probed symbol.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmWdZB0bHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b9AcH/R8mNbgAbKlxSXUm0NAG
xrUcN9vyb9yaLgvoIEvW+XF6EMaCM6G2kG+wSaJB6xFiPlJgf9FhILjDjHAtV2x1
wXL8r3eLyKvkU3HXfS7RphUTPecgblI16FHZ12x2TkQ41KoRzQf2c7cSQs4B8SHP
W5LPqvxxqjbV84iqZPScez99S0ZS0Of3ubmepVEm2LDshfhUVMIUH1vfvEn3vQI7
k5PoNiVRem+rjduERM3I7Zd51K7Lz/5hN56q6ok2vY8hVoRdp0j83Ly36h21ClS9
CtvlzPX0YjaogVd8Gyc3z+vqy61YiNA1q0fRqIhagmfIy/26s1ORaq/0S2gywxXn
piA=
=mfU0
-----END PGP SIGNATURE-----
Merge tag 'probes-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes update from Masami Hiramatsu:
- Update the Kprobes trace event to show the actual function name in
notrace-symbol warning.
Instead of using the user specified symbol name, use "%ps" printk
format to show the actual symbol at the probe address. Since kprobe
event accepts the offset from symbol which is bigger than the symbol
size, the user specified symbol may not be the actual probed symbol.
* tag 'probes-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
trace/kprobe: Display the actual notrace function when rejecting a probe
where the CPU would remain at the lowest frequency, degrading
performance substantially.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWpM0sRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1giEg/+Mn9hdLqgE7xPPvCa8UWoJzFGTIYgTT3O
gma5Ras/kqB6cJTb1zn/HocAIj1Y2gZAsRU/U3IpOfPzklwIKQLBID1PE+d0izAc
NC9N0LuPau+XbMY5U+G0YNQZzDW+Zioe/9I6uDRKRTtLTdZAk8Plk9yh+tRtpSG8
aEswyoDOJfvkLbl7kJGymHgxDiDtmXEcz6j2pNlFtcEdHFjiSHo2Jq09DMia9sHr
W563FSvO7DVBMOosKH8sq7sSPdCBi0zshaWDiyz2M7Ry2uBsqJvx+9qxDnloafTp
Yqp5rkSVzOxtQwxjtYD+WWy+AgwQqo+O5FHsm0JmoiGVkmpB95bdhQxk2gtshSCo
IwUt2Gqsndd0JM4v5gOn4G/qCPxFUA/Tx1OMWM89nQUVp3OmIwm8z99f5gFxoSYa
DFn2P2Ku/A/fiKfWcNDOCyMgYcJNmqRKSjWEh+mfFeexiuWR3jPrQ4GKbSl9Gusw
vLmBM9pMSyGvivptu+ALXERDDm95wEVVkULgxlcUgpuT8jjpmovbtFj2xYcnzvc4
EKOgJ0FmXCM/B6QFnnbzgMzu2IThoQpL8Ud3JlMeGDRLGDvZip9AA+0RsnirURwX
+EuE7fHcDzfAA+Fv9sGosaFmxD1dUh1EJL41XrFZSYfMsZzzzlj+k9PWf9ABCE4R
6gEHuRza+rU=
=c7Ib
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2024-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix a cpufreq related performance regression on certain systems, where
the CPU would remain at the lowest frequency, degrading performance
substantially"
* tag 'sched-urgent-2024-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix frequency selection for non-invariant case
Here are the set of driver core and kernfs changes for 6.8-rc1. Nothing
major in here this release cycle, just lots of small cleanups and some
tweaks on kernfs that in the very end, got reverted and will come back
in a safer way next release cycle.
Included in here are:
- more driver core 'const' cleanups and fixes
- fw_devlink=rpm is now the default behavior
- kernfs tiny changes to remove some string functions
- cpu handling in the driver core is updated to work better on many
systems that add topologies and cpus after booting
- other minor changes and cleanups
All of the cpu handling patches have been acked by the respective
maintainers and are coming in here in one series. Everything has been
in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZaeOrg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymtcwCffzvKKkSY9qAp6+0v2WQNkZm1JWoAoJCPYUwF
If6wEoPLWvRfKx4gIoq9
=D96r
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here are the set of driver core and kernfs changes for 6.8-rc1.
Nothing major in here this release cycle, just lots of small cleanups
and some tweaks on kernfs that in the very end, got reverted and will
come back in a safer way next release cycle.
Included in here are:
- more driver core 'const' cleanups and fixes
- fw_devlink=rpm is now the default behavior
- kernfs tiny changes to remove some string functions
- cpu handling in the driver core is updated to work better on many
systems that add topologies and cpus after booting
- other minor changes and cleanups
All of the cpu handling patches have been acked by the respective
maintainers and are coming in here in one series. Everything has been
in linux-next for a while with no reported issues"
* tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (51 commits)
Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"
kernfs: convert kernfs_idr_lock to an irq safe raw spinlock
class: fix use-after-free in class_register()
PM: clk: make pm_clk_add_notifier() take a const pointer
EDAC: constantify the struct bus_type usage
kernfs: fix reference to renamed function
driver core: device.h: fix Excess kernel-doc description warning
driver core: class: fix Excess kernel-doc description warning
driver core: mark remaining local bus_type variables as const
driver core: container: make container_subsys const
driver core: bus: constantify subsys_register() calls
driver core: bus: make bus_sort_breadthfirst() take a const pointer
kernfs: d_obtain_alias(NULL) will do the right thing...
driver core: Better advertise dev_err_probe()
kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_name_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_walk_ns() from strlcpy() to strscpy()
initramfs: Expose retained initrd as sysfs file
fs/kernfs/dir: obey S_ISGID
kernel/cgroup: use kernfs_create_dir_ns()
...
Add enforcement of expected types for context arguments tagged with
arg:ctx (__arg_ctx) tag.
First, any program type will accept generic `void *` context type when
combined with __arg_ctx tag.
Besides accepting "canonical" struct names and `void *`, for a bunch of
program types for which program context is actually a named struct, we
allows a bunch of pragmatic exceptions to match real-world and expected
usage:
- for both kprobes and perf_event we allow `bpf_user_pt_regs_t *` as
canonical context argument type, where `bpf_user_pt_regs_t` is a
*typedef*, not a struct;
- for kprobes, we also always accept `struct pt_regs *`, as that's what
actually is passed as a context to any kprobe program;
- for perf_event, we resolve typedefs (unless it's `bpf_user_pt_regs_t`)
down to actual struct type and accept `struct pt_regs *`, or
`struct user_pt_regs *`, or `struct user_regs_struct *`, depending
on the actual struct type kernel architecture points `bpf_user_pt_regs_t`
typedef to; otherwise, canonical `struct bpf_perf_event_data *` is
expected;
- for raw_tp/raw_tp.w programs, `u64/long *` are accepted, as that's
what's expected with BPF_PROG() usage; otherwise, canonical
`struct bpf_raw_tracepoint_args *` is expected;
- tp_btf supports both `struct bpf_raw_tracepoint_args *` and `u64 *`
formats, both are coded as expections as tp_btf is actually a TRACING
program type, which has no canonical context type;
- iterator programs accept `struct bpf_iter__xxx *` structs, currently
with no further iterator-type specific enforcement;
- fentry/fexit/fmod_ret/lsm/struct_ops all accept `u64 *`;
- classic tracepoint programs, as well as syscall and freplace
programs allow any user-provided type.
In all other cases kernel will enforce exact match of struct name to
expected canonical type. And if user-provided type doesn't match that
expectation, verifier will emit helpful message with expected type name.
Note a bit unnatural way the check is done after processing all the
arguments. This is done to avoid conflict between bpf and bpf-next
trees. Once trees converge, a small follow up patch will place a simple
btf_validate_prog_ctx_type() check into a proper ARG_PTR_TO_CTX branch
(which bpf-next tree patch refactored already), removing duplicated
arg:ctx detection logic.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Refactor btf_get_prog_ctx_type() a bit to allow reuse of
bpf_ctx_convert_map logic in more than one places. Simplify interface by
returning btf_type instead of btf_member (field reference in BTF).
To do the above we need to touch and start untangling
btf_translate_to_vmlinux() implementation. We do the bare minimum to
not regress anything for btf_translate_to_vmlinux(), but its
implementation is very questionable for what it claims to be doing.
Mapping kfunc argument types to kernel corresponding types conceptually
is quite different from recognizing program context types. Fixing this
is out of scope for this change though.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Here is the big set of char/misc and other driver subsystem changes for
6.8-rc1. Lots of stuff in here, but first off, you will get a merge
conflict in drivers/android/binder_alloc.c when merging this tree due to
changing coming in through the -mm tree.
The resolution of the merge issue can be found here:
https://lore.kernel.org/r/20231207134213.25631ae9@canb.auug.org.au
or in a simpler patch form in that thread:
https://lore.kernel.org/r/ZXHzooF07LfQQYiE@google.com
If there are issues with the merge of this file, please let me know.
Other than lots of binder driver changes (as you can see by the merge
conflicts) included in here are:
- lots of iio driver updates and additions
- spmi driver updates
- eeprom driver updates
- firmware driver updates
- ocxl driver updates
- mhi driver updates
- w1 driver updates
- nvmem driver updates
- coresight driver updates
- platform driver remove callback api changes
- tags.sh script updates
- bus_type constant marking cleanups
- lots of other small driver updates
All of these have been in linux-next for a while with no reported issues
(other than the binder merge conflict.)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZaeMMQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynWNgCfQ/Yz7QO6EMLDwHO5LRsb3YMhjL4AoNVdanjP
YoI7f1I4GBcC0GKNfK6s
=+Kyv
-----END PGP SIGNATURE-----
Merge tag 'char-misc-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc and other driver updates from Greg KH:
"Here is the big set of char/misc and other driver subsystem changes
for 6.8-rc1.
Other than lots of binder driver changes (as you can see by the merge
conflicts) included in here are:
- lots of iio driver updates and additions
- spmi driver updates
- eeprom driver updates
- firmware driver updates
- ocxl driver updates
- mhi driver updates
- w1 driver updates
- nvmem driver updates
- coresight driver updates
- platform driver remove callback api changes
- tags.sh script updates
- bus_type constant marking cleanups
- lots of other small driver updates
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (341 commits)
android: removed duplicate linux/errno
uio: Fix use-after-free in uio_open
drivers: soc: xilinx: add check for platform
firmware: xilinx: Export function to use in other module
scripts/tags.sh: remove find_sources
scripts/tags.sh: use -n to test archinclude
scripts/tags.sh: add local annotation
scripts/tags.sh: use more portable -path instead of -wholename
scripts/tags.sh: Update comment (addition of gtags)
firmware: zynqmp: Convert to platform remove callback returning void
firmware: turris-mox-rwtm: Convert to platform remove callback returning void
firmware: stratix10-svc: Convert to platform remove callback returning void
firmware: stratix10-rsu: Convert to platform remove callback returning void
firmware: raspberrypi: Convert to platform remove callback returning void
firmware: qemu_fw_cfg: Convert to platform remove callback returning void
firmware: mtk-adsp-ipc: Convert to platform remove callback returning void
firmware: imx-dsp: Convert to platform remove callback returning void
firmware: coreboot_table: Convert to platform remove callback returning void
firmware: arm_scpi: Convert to platform remove callback returning void
firmware: arm_scmi: Convert to platform remove callback returning void
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZaHe5gAKCRDdBJ7gKXxA
jrAiAQCYZQuwsNVyGJUuPD/GGQzqVUZNpWcuYwMXXAi6dO5rSAD+LDeFviun2K52
uHCz4iRq5EwNLA+MbdHtAnQzr+e5CQ8=
=Jjkw
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-01-12-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"For once not mostly MM-related.
17 hotfixes. 10 address post-6.7 issues and the other 7 are cc:stable"
* tag 'mm-hotfixes-stable-2024-01-12-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
userfaultfd: avoid huge_zero_page in UFFDIO_MOVE
MAINTAINERS: add entry for shrinker
selftests: mm: hugepage-vmemmap fails on 64K page size systems
mm/memory_hotplug: fix memmap_on_memory sysfs value retrieval
mailmap: switch email for Tanzir Hasan
mailmap: add old address mappings for Randy
kernel/crash_core.c: make __crash_hotplug_lock static
efi: disable mirror feature during crashkernel
kexec: do syscore_shutdown() in kernel_kexec
mailmap: update entry for Manivannan Sadhasivam
fs/proc/task_mmu: move mmu notification mechanism inside mm lock
mm: zswap: switch maintainers to recently active developers and reviewers
scripts/decode_stacktrace.sh: optionally use LLVM utilities
kasan: avoid resetting aux_lock
lib/Kconfig.debug: disable CONFIG_DEBUG_INFO_BTF for Hexagon
MAINTAINERS: update LTP maintainers
kdump: defer the insertion of crashkernel resources
When appending "[defcmd]" to 'kdb_prompt_str', the size of the string
already in the buffer should be taken into account.
An option could be to switch from strncat() to strlcat() which does the
correct test to avoid such an overflow.
However, this actually looks as dead code, because 'defcmd_in_progress'
can't be true here.
See a more detailed explanation at [1].
[1]: https://lore.kernel.org/all/CAD=FV=WSh7wKN7Yp-3wWiDgX4E3isQ8uh0LCzTmd1v9Cg9j+nQ@mail.gmail.com/
Fixes: 5d5314d679 ("kdb: core for kgdb back end (1 of 2)")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Currently the workqueue just checks the atomic and locking states after work
execution ends. However, sometimes, a work item may not unlock rcu after
acquiring rcu_read_lock(). And as a result, it would cause rcu stall, but
the rcu stall warning can not dump the work func, because the work has
finished.
In order to quickly discover those works that do not call rcu_read_unlock()
after rcu_read_lock(), add the rcu lock check.
Use rcu_preempt_depth() to check the work's rcu status. Normally, this value
is 0. If this value is bigger than 0, it means the work are still holding
rcu lock. If so, print err info and the work func.
tj: Reworded the description for clarity. Minor formatting tweak.
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
At the time they are created unbound workqueues rescuers currently use
cpu_possible_mask as their affinity, but this can be too wide in case a
workqueue unbound mask has been set as a subset of cpu_possible_mask.
Make new rescuers use their associated workqueue unbound cpumask from
the start.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently we limit the size of the workqueue name to 24 characters due to
commit ecf6881ff3 ("workqueue: make workqueue->name[] fixed len")
Increase the size to 32 characters and print a warning in the event
the requested name is larger than the limit of 32 characters.
Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Linus reported a ~50% performance regression on single-threaded
workloads on his AMD Ryzen system, and bisected it to:
9c0b4bb7f6 ("sched/cpufreq: Rework schedutil governor performance estimation")
When frequency invariance is not enabled, get_capacity_ref_freq(policy)
is supposed to return the current frequency and the performance margin
applied by map_util_perf(), enabling the utilization to go above the
maximum compute capacity and to select a higher frequency than the current one.
After the changes in 9c0b4bb7f6, the performance margin was applied
earlier in the path to take into account utilization clampings and
we couldn't get a utilization higher than the maximum compute capacity,
and the CPU remained 'stuck' at lower frequencies.
To fix this, we must use a frequency above the current frequency to
get a chance to select a higher OPP when the current one becomes fully used.
Apply the same margin and return a frequency 25% higher than the current
one in order to switch to the next OPP before we fully use the CPU
at the current one.
[ mingo: Clarified the changelog. ]
Fixes: 9c0b4bb7f6 ("sched/cpufreq: Rework schedutil governor performance estimation")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Wyes Karny <wkarny@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Wyes Karny <wkarny@gmail.com>
Link: https://lore.kernel.org/r/20240114183600.135316-1-vincent.guittot@linaro.org
Update the kernel-doc comments to catch up with the code changes and
fix the kernel-doc warnings:
debug.c:83: warning: Excess struct member 'stacktrace' description in 'dma_debug_entry'
debug.c:83: warning: Function parameter or struct member 'stack_len' not described in 'dma_debug_entry'
debug.c:83: warning: Function parameter or struct member 'stack_entries' not described in 'dma_debug_entry'
Fixes: 746017ed8d ("dma/debug: Simplify stracktrace retrieval")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: iommu@lists.linux.dev
Signed-off-by: Christoph Hellwig <hch@lst.de>
This pull request contains the following branches:
doc.2023.12.13a: Documentation and comment updates.
torture.2023.11.23a: RCU torture, locktorture updates that include
cleanups; nolibc init build support for mips, ppc and rv64;
testing of mid stall duration scenario and fixing fqs task
creation conditions.
fixes.2023.12.13a: Misc fixes, most notably restricting usage of
RCU CPU stall notifiers, to confine their usage primarily
to debug kernels.
rcu-tasks.2023.12.12b: RCU tasks minor fixes.
srcu.2023.12.13a: lockdep annotation fix for NMI-safe accesses,
callback advancing/acceleration cleanup and documentation
improvements.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSi2tPIQIc2VEtjarIAHS7/6Z0wpQUCZYUS0AAKCRAAHS7/6Z0w
pRXgAQD+k8oqjvKL6la61ppWm5Y7NLjdj/IbV+cOd42jKnM6PAEAyavNhX0n7zGx
o9cDlvIDxJfHnFrOTc5WLH9yEs3IiQQ=
=8rdu
-----END PGP SIGNATURE-----
Merge tag 'rcu.release.v6.8' of https://github.com/neeraju/linux
Pull RCU updates from Neeraj Upadhyay:
- Documentation and comment updates
- RCU torture, locktorture updates that include cleanups; nolibc init
build support for mips, ppc and rv64; testing of mid stall duration
scenario and fixing fqs task creation conditions
- Misc fixes, most notably restricting usage of RCU CPU stall
notifiers, to confine their usage primarily to debug kernels
- RCU tasks minor fixes
- lockdep annotation fix for NMI-safe accesses, callback
advancing/acceleration cleanup and documentation improvements
* tag 'rcu.release.v6.8' of https://github.com/neeraju/linux:
rcu: Force quiescent states only for ongoing grace period
doc: Clarify historical disclaimers in memory-barriers.txt
doc: Mention address and data dependencies in rcu_dereference.rst
doc: Clarify RCU Tasks reader/updater checklist
rculist.h: docs: Fix wrong function summary
Documentation: RCU: Remove repeated word in comments
srcu: Use try-lock lockdep annotation for NMI-safe access.
srcu: Explain why callbacks invocations can't run concurrently
srcu: No need to advance/accelerate if no callback enqueued
srcu: Remove superfluous callbacks advancing from srcu_gp_start()
rcu: Remove unused macros from rcupdate.h
rcu: Restrict access to RCU CPU stall notifiers
rcu-tasks: Mark RCU Tasks accesses to current->rcu_tasks_idle_cpu
rcutorture: Add fqs_holdoff check before fqs_task is created
rcutorture: Add mid-sized stall to TREE07
rcutorture: add nolibc init support for mips, ppc and rv64
locktorture: Increase Hamming distance between call_rcu_chain and rcu_call_chains
sparse warnings:
kernel/crash_core.c:749:1: sparse: sparse: symbol '__crash_hotplug_lock' was not declared. Should it be static?
Fixes: e2a8f20dd8 ("Crash: add lock to serialize crash hotplug handling")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202401080654.IjjU5oK7-lkp@intel.com/
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
syscore_shutdown() runs driver and module callbacks to get the system into
a state where it can be correctly shut down. In commit 6f389a8f1d ("PM
/ reboot: call syscore_shutdown() after disable_nonboot_cpus()")
syscore_shutdown() was removed from kernel_restart_prepare() and hence got
(incorrectly?) removed from the kexec flow. This was innocuous until
commit 6735150b69 ("KVM: Use syscore_ops instead of reboot_notifier to
hook restart/shutdown") changed the way that KVM registered its shutdown
callbacks, switching from reboot notifiers to syscore_ops.shutdown. As
syscore_shutdown() is missing from kexec, KVM's shutdown hook is not run
and virtualisation is left enabled on the boot CPU which results in triple
faults when switching to the new kernel on Intel x86 VT-x with VMXE
enabled.
Fix this by adding syscore_shutdown() to the kexec sequence. In terms of
where to add it, it is being added after migrating the kexec task to the
boot CPU, but before APs are shut down. It is not totally clear if this
is the best place: in commit 6f389a8f1d ("PM / reboot: call
syscore_shutdown() after disable_nonboot_cpus()") it is stated that
"syscore_ops operations should be carried with one CPU on-line and
interrupts disabled." APs are only offlined later in machine_shutdown(),
so this syscore_shutdown() is being run while APs are still online. This
seems to be the correct place as it matches where syscore_shutdown() is
run in the reboot and halt flows - they also run it before APs are shut
down. The assumption is that the commit message in commit 6f389a8f1d
("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()") is
no longer valid.
KVM has been discussed here as it is what broke loudly by not having
syscore_shutdown() in kexec, but this change impacts more than just KVM;
all drivers/modules which register a syscore_ops.shutdown callback will
now be invoked in the kexec flow. Looking at some of them like x86 MCE it
is probably more correct to also shut these down during kexec.
Maintainers of all drivers which use syscore_ops.shutdown are added on CC
for visibility. They are:
arch/powerpc/platforms/cell/spu_base.c .shutdown = spu_shutdown,
arch/x86/kernel/cpu/mce/core.c .shutdown = mce_syscore_shutdown,
arch/x86/kernel/i8259.c .shutdown = i8259A_shutdown,
drivers/irqchip/irq-i8259.c .shutdown = i8259A_shutdown,
drivers/irqchip/irq-sun6i-r.c .shutdown = sun6i_r_intc_shutdown,
drivers/leds/trigger/ledtrig-cpu.c .shutdown = ledtrig_cpu_syscore_shutdown,
drivers/power/reset/sc27xx-poweroff.c .shutdown = sc27xx_poweroff_shutdown,
kernel/irq/generic-chip.c .shutdown = irq_gc_shutdown,
virt/kvm/kvm_main.c .shutdown = kvm_shutdown,
This has been tested by doing a kexec on x86_64 and aarch64.
Link: https://lkml.kernel.org/r/20231213064004.2419447-1-jgowans@amazon.com
Fixes: 6735150b69 ("KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown")
Signed-off-by: James Gowans <jgowans@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chen-Yu Tsai <wens@csie.org>
Cc: Jernej Skrabec <jernej.skrabec@gmail.com>
Cc: Samuel Holland <samuel@sholland.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Orson Zhai <orsonzhai@gmail.com>
Cc: Alexander Graf <graf@amazon.de>
Cc: Jan H. Schoenherr <jschoenh@amazon.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Core & protocols
----------------
- Analyze and reorganize core networking structs (socks, netdev,
netns, mibs) to optimize cacheline consumption and set up
build time warnings to safeguard against future header changes.
This improves TCP performances with many concurrent connections
up to 40%.
- Add page-pool netlink-based introspection, exposing the
memory usage and recycling stats. This helps indentify
bad PP users and possible leaks.
- Refine TCP/DCCP source port selection to no longer favor even
source port at connect() time when IP_LOCAL_PORT_RANGE is set.
This lowers the time taken by connect() for hosts having
many active connections to the same destination.
- Refactor the TCP bind conflict code, shrinking related socket
structs.
- Refactor TCP SYN-Cookie handling, as a preparation step to
allow arbitrary SYN-Cookie processing via eBPF.
- Tune optmem_max for 0-copy usage, increasing the default value
to 128KB and namespecifying it.
- Allow coalescing for cloned skbs coming from page pools, improving
RX performances with some common configurations.
- Reduce extension header parsing overhead at GRO time.
- Add bridge MDB bulk deletion support, allowing user-space to
request the deletion of matching entries.
- Reorder nftables struct members, to keep data accessed by the
datapath first.
- Introduce TC block ports tracking and use. This allows supporting
multicast-like behavior at the TC layer.
- Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
classifiers (RSVP and tcindex).
- More data-race annotations.
- Extend the diag interface to dump TCP bound-only sockets.
- Conditional notification of events for TC qdisc class and actions.
- Support for WPAN dynamic associations with nearby devices, to form
a sub-network using a specific PAN ID.
- Implement SMCv2.1 virtual ISM device support.
- Add support for Batman-avd mulicast packet type.
BPF
---
- Tons of verifier improvements:
- BPF register bounds logic and range support along with a large
test suite
- log improvements
- complete precision tracking support for register spills
- track aligned STACK_ZERO cases as imprecise spilled registers. It
improves the verifier "instructions processed" metric from single
digit to 50-60% for some programs
- support for user's global BPF subprogram arguments with few
commonly requested annotations for a better developer experience
- support tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
like
- several fixes
- Add initial TX metadata implementation for AF_XDP with support in
mlx5 and stmmac drivers. Two types of offloads are supported right
now, that is, TX timestamp and TX checksum offload.
- Fix kCFI bugs in BPF all forms of indirect calls from BPF into
kernel and from kernel into BPF work with CFI enabled. This allows
BPF to work with CONFIG_FINEIBT=y.
- Change BPF verifier logic to validate global subprograms lazily
instead of unconditionally before the main program, so they can be
guarded using BPF CO-RE techniques.
- Support uid/gid options when mounting bpffs.
- Add a new kfunc which acquires the associated cgroup of a task
within a specific cgroup v1 hierarchy where the latter is identified
by its id.
- Extend verifier to allow bpf_refcount_acquire() of a map value field
obtained via direct load which is a use-case needed in sched_ext.
- Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter.
- Support for VLAN tag in XDP hints.
- Remove deprecated bpfilter kernel leftovers given the project
is developed in user-space (https://github.com/facebook/bpfilter).
Misc
----
- Support for parellel TC self-tests execution.
- Increase MPTCP self-tests coverage.
- Updated the bridge documentation, including several so-far
undocumented features.
- Convert all the net self-tests to run in unique netns, to
avoid random failures due to conflict and allow concurrent
runs.
- Add TCP-AO self-tests.
- Add kunit tests for both cfg80211 and mac80211.
- Autogenerate Netlink families documentation from YAML spec.
- Add yml-gen support for fixed headers and recursive nests, the
tool can now generate user-space code for all genetlink families
for which we have specs.
- A bunch of additional module descriptions fixes.
- Catch incorrect freeing of pages belonging to a page pool.
Driver API
----------
- Rust abstractions for network PHY drivers; do not cover yet the
full C API, but already allow implementing functional PHY drivers
in rust.
- Introduce queue and NAPI support in the netdev Netlink interface,
allowing complete access to the device <> NAPIs <> queues
relationship.
- Introduce notifications filtering for devlink to allow control
application scale to thousands of instances.
- Improve PHY validation, requesting rate matching information for
each ethtool link mode supported by both the PHY and host.
- Add support for ethtool symmetric-xor RSS hash.
- ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
platform.
- Expose pin fractional frequency offset value over new DPLL generic
netlink attribute.
- Convert older drivers to platform remove callback returning void.
- Add support for PHY package MMD read/write.
New hardware / drivers
----------------------
- Ethernet:
- Octeon CN10K devices
- Broadcom 5760X P7
- Qualcomm SM8550 SoC
- Texas Instrument DP83TG720S PHY
- Bluetooth:
- IMC Networks Bluetooth radio
Removed
-------
- WiFi:
- libertas 16-bit PCMCIA support
- Atmel at76c50x drivers
- HostAP ISA/PCMCIA style 802.11b driver
- zd1201 802.11b USB dongles
- Orinoco ISA/PCMCIA 802.11b driver
- Aviator/Raytheon driver
- Planet WL3501 driver
- RNDIS USB 802.11b driver
Drivers
-------
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- allow one by one port representors creation and removal
- add temperature and clock information reporting
- add get/set for ethtool's header split ringparam
- add again FW logging
- adds support switchdev hardware packet mirroring
- iavf: implement symmetric-xor RSS hash
- igc: add support for concurrent physical and free-running timers
- i40e: increase the allowable descriptors
- nVidia/Mellanox:
- Preparation for Socket-Direct multi-dev netdev. That will allow
in future releases combining multiple PFs devices attached to
different NUMA nodes under the same netdev
- Broadcom (bnxt):
- TX completion handling improvements
- add basic ntuple filter support
- reduce MSIX vectors usage for MQPRIO offload
- add VXLAN support, USO offload and TX coalesce completion for P7
- Marvell Octeon EP:
- xmit-more support
- add PF-VF mailbox support and use it for FW notifications for VFs
- Wangxun (ngbe/txgbe):
- implement ethtool functions to operate pause param, ring param,
coalesce channel number and msglevel
- Netronome/Corigine (nfp):
- add flow-steering support
- support UDP segmentation offload
- Ethernet NICs embedded, slower, virtual:
- Xilinx AXI: remove duplicate DMA code adopting the dma engine driver
- stmmac: add support for HW-accelerated VLAN stripping
- TI AM654x sw: add mqprio, frame preemption & coalescing
- gve: add support for non-4k page sizes.
- virtio-net: support dynamic coalescing moderation
- nVidia/Mellanox Ethernet datacenter switches:
- allow firmware upgrade without a reboot
- more flexible support for bridge flooding via the compressed
FID flooding mode
- Ethernet embedded switches:
- Microchip:
- fine-tune flow control and speed configurations in KSZ8xxx
- KSZ88X3: enable setting rmii reference
- Renesas:
- add jumbo frames support
- Marvell:
- 88E6xxx: add "eth-mac" and "rmon" stats support
- Ethernet PHYs:
- aquantia: add firmware load support
- at803x: refactor the driver to simplify adding support for more
chip variants
- NXP C45 TJA11xx: Add MACsec offload support
- Wifi:
- MediaTek (mt76):
- NVMEM EEPROM improvements
- mt7996 Extremely High Throughput (EHT) improvements
- mt7996 Wireless Ethernet Dispatcher (WED) support
- mt7996 36-bit DMA support
- Qualcomm (ath12k):
- support for a single MSI vector
- WCN7850: support AP mode
- Intel (iwlwifi):
- new debugfs file fw_dbg_clear
- allow concurrent P2P operation on DFS channels
- Bluetooth:
- QCA2066: support HFP offload
- ISO: more broadcast-related improvements
- NXP: better recovery in case receiver/transmitter get out of sync
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmWdamsSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkGC4P/2xjLzdw22ckSssuE9ORbGko9SNjnqHk
PQh1E+26BHiCg5KB8VvzMsL78E79MRNXEattSW+1g7dhCvln3oi+Vd0WkdRkgt35
98Iv18zLbbwFAJeyKvmLAPAkQkMLtVj19QILBBRrugF+egEZgVSE3JBcTAiKv2ZQ
HzkabA171Ri6LpCcEEtY5XuaKvimGnGzF8YMFf8rX0wtqd2p5kbY9aMe47WAGxvU
Vf9548XvH+A5yVH2/4/gujtUOpA/RHuhuCMb+oo0cZ+VCC1x9MGzoXzj6r87OTkf
k2W1whNzcGoin92f+9Lk1JYMuiGKBH4QVaDdNXJnYFSJWPTE7RvRsPzYTSD4/GzK
yEZbzSJXpy/2vDQm16NoAxl7evRs8Sorzkw4LQRviZHI/5SAkK2ZQiCK5CO8QSYy
C1LELcV5kn6Foe24xWnrWLjAGug9oJnYoGPMU5gvPmFJMvUMXqm5rmbBgUWL5Rxw
q1M6gVzabCyWUy6z2G2vaqW2ZntNVvCkdsLtIX0XZkcTzNoP0MA+TuhyGz4wbiuo
PeyQp/mbGnDgCYggqKIA0YWrTVxkhFrKN520cbO8qXBQytV9oFbM/0/+C0/r/5WX
pL1JVzLrh6l5ME7EIQfha8UOF9j8q4ueSwb40P3AR2NaZiDABM0zfUZ6+sx+91WF
ucqPEcZB5cRE
=1bW6
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"The most interesting thing is probably the networking structs
reorganization and a significant amount of changes is around
self-tests.
Core & protocols:
- Analyze and reorganize core networking structs (socks, netdev,
netns, mibs) to optimize cacheline consumption and set up build
time warnings to safeguard against future header changes
This improves TCP performances with many concurrent connections up
to 40%
- Add page-pool netlink-based introspection, exposing the memory
usage and recycling stats. This helps indentify bad PP users and
possible leaks
- Refine TCP/DCCP source port selection to no longer favor even
source port at connect() time when IP_LOCAL_PORT_RANGE is set. This
lowers the time taken by connect() for hosts having many active
connections to the same destination
- Refactor the TCP bind conflict code, shrinking related socket
structs
- Refactor TCP SYN-Cookie handling, as a preparation step to allow
arbitrary SYN-Cookie processing via eBPF
- Tune optmem_max for 0-copy usage, increasing the default value to
128KB and namespecifying it
- Allow coalescing for cloned skbs coming from page pools, improving
RX performances with some common configurations
- Reduce extension header parsing overhead at GRO time
- Add bridge MDB bulk deletion support, allowing user-space to
request the deletion of matching entries
- Reorder nftables struct members, to keep data accessed by the
datapath first
- Introduce TC block ports tracking and use. This allows supporting
multicast-like behavior at the TC layer
- Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
classifiers (RSVP and tcindex)
- More data-race annotations
- Extend the diag interface to dump TCP bound-only sockets
- Conditional notification of events for TC qdisc class and actions
- Support for WPAN dynamic associations with nearby devices, to form
a sub-network using a specific PAN ID
- Implement SMCv2.1 virtual ISM device support
- Add support for Batman-avd mulicast packet type
BPF:
- Tons of verifier improvements:
- BPF register bounds logic and range support along with a large
test suite
- log improvements
- complete precision tracking support for register spills
- track aligned STACK_ZERO cases as imprecise spilled registers.
This improves the verifier "instructions processed" metric from
single digit to 50-60% for some programs
- support for user's global BPF subprogram arguments with few
commonly requested annotations for a better developer
experience
- support tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
like
- several fixes
- Add initial TX metadata implementation for AF_XDP with support in
mlx5 and stmmac drivers. Two types of offloads are supported right
now, that is, TX timestamp and TX checksum offload
- Fix kCFI bugs in BPF all forms of indirect calls from BPF into
kernel and from kernel into BPF work with CFI enabled. This allows
BPF to work with CONFIG_FINEIBT=y
- Change BPF verifier logic to validate global subprograms lazily
instead of unconditionally before the main program, so they can be
guarded using BPF CO-RE techniques
- Support uid/gid options when mounting bpffs
- Add a new kfunc which acquires the associated cgroup of a task
within a specific cgroup v1 hierarchy where the latter is
identified by its id
- Extend verifier to allow bpf_refcount_acquire() of a map value
field obtained via direct load which is a use-case needed in
sched_ext
- Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter
- Support for VLAN tag in XDP hints
- Remove deprecated bpfilter kernel leftovers given the project is
developed in user-space (https://github.com/facebook/bpfilter)
Misc:
- Support for parellel TC self-tests execution
- Increase MPTCP self-tests coverage
- Updated the bridge documentation, including several so-far
undocumented features
- Convert all the net self-tests to run in unique netns, to avoid
random failures due to conflict and allow concurrent runs
- Add TCP-AO self-tests
- Add kunit tests for both cfg80211 and mac80211
- Autogenerate Netlink families documentation from YAML spec
- Add yml-gen support for fixed headers and recursive nests, the tool
can now generate user-space code for all genetlink families for
which we have specs
- A bunch of additional module descriptions fixes
- Catch incorrect freeing of pages belonging to a page pool
Driver API:
- Rust abstractions for network PHY drivers; do not cover yet the
full C API, but already allow implementing functional PHY drivers
in rust
- Introduce queue and NAPI support in the netdev Netlink interface,
allowing complete access to the device <> NAPIs <> queues
relationship
- Introduce notifications filtering for devlink to allow control
application scale to thousands of instances
- Improve PHY validation, requesting rate matching information for
each ethtool link mode supported by both the PHY and host
- Add support for ethtool symmetric-xor RSS hash
- ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
platform
- Expose pin fractional frequency offset value over new DPLL generic
netlink attribute
- Convert older drivers to platform remove callback returning void
- Add support for PHY package MMD read/write
New hardware / drivers:
- Ethernet:
- Octeon CN10K devices
- Broadcom 5760X P7
- Qualcomm SM8550 SoC
- Texas Instrument DP83TG720S PHY
- Bluetooth:
- IMC Networks Bluetooth radio
Removed:
- WiFi:
- libertas 16-bit PCMCIA support
- Atmel at76c50x drivers
- HostAP ISA/PCMCIA style 802.11b driver
- zd1201 802.11b USB dongles
- Orinoco ISA/PCMCIA 802.11b driver
- Aviator/Raytheon driver
- Planet WL3501 driver
- RNDIS USB 802.11b driver
Driver updates:
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- allow one by one port representors creation and removal
- add temperature and clock information reporting
- add get/set for ethtool's header split ringparam
- add again FW logging
- adds support switchdev hardware packet mirroring
- iavf: implement symmetric-xor RSS hash
- igc: add support for concurrent physical and free-running
timers
- i40e: increase the allowable descriptors
- nVidia/Mellanox:
- Preparation for Socket-Direct multi-dev netdev. That will
allow in future releases combining multiple PFs devices
attached to different NUMA nodes under the same netdev
- Broadcom (bnxt):
- TX completion handling improvements
- add basic ntuple filter support
- reduce MSIX vectors usage for MQPRIO offload
- add VXLAN support, USO offload and TX coalesce completion
for P7
- Marvell Octeon EP:
- xmit-more support
- add PF-VF mailbox support and use it for FW notifications
for VFs
- Wangxun (ngbe/txgbe):
- implement ethtool functions to operate pause param, ring
param, coalesce channel number and msglevel
- Netronome/Corigine (nfp):
- add flow-steering support
- support UDP segmentation offload
- Ethernet NICs embedded, slower, virtual:
- Xilinx AXI: remove duplicate DMA code adopting the dma engine
driver
- stmmac: add support for HW-accelerated VLAN stripping
- TI AM654x sw: add mqprio, frame preemption & coalescing
- gve: add support for non-4k page sizes.
- virtio-net: support dynamic coalescing moderation
- nVidia/Mellanox Ethernet datacenter switches:
- allow firmware upgrade without a reboot
- more flexible support for bridge flooding via the compressed
FID flooding mode
- Ethernet embedded switches:
- Microchip:
- fine-tune flow control and speed configurations in KSZ8xxx
- KSZ88X3: enable setting rmii reference
- Renesas:
- add jumbo frames support
- Marvell:
- 88E6xxx: add "eth-mac" and "rmon" stats support
- Ethernet PHYs:
- aquantia: add firmware load support
- at803x: refactor the driver to simplify adding support for more
chip variants
- NXP C45 TJA11xx: Add MACsec offload support
- Wifi:
- MediaTek (mt76):
- NVMEM EEPROM improvements
- mt7996 Extremely High Throughput (EHT) improvements
- mt7996 Wireless Ethernet Dispatcher (WED) support
- mt7996 36-bit DMA support
- Qualcomm (ath12k):
- support for a single MSI vector
- WCN7850: support AP mode
- Intel (iwlwifi):
- new debugfs file fw_dbg_clear
- allow concurrent P2P operation on DFS channels
- Bluetooth:
- QCA2066: support HFP offload
- ISO: more broadcast-related improvements
- NXP: better recovery in case receiver/transmitter get out of sync"
* tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1714 commits)
lan78xx: remove redundant statement in lan78xx_get_eee
lan743x: remove redundant statement in lan743x_ethtool_get_eee
bnxt_en: Fix RCU locking for ntuple filters in bnxt_rx_flow_steer()
bnxt_en: Fix RCU locking for ntuple filters in bnxt_srxclsrldel()
bnxt_en: Remove unneeded variable in bnxt_hwrm_clear_vnic_filter()
tcp: Revert no longer abort SYN_SENT when receiving some ICMP
Revert "mlx5 updates 2023-12-20"
Revert "net: stmmac: Enable Per DMA Channel interrupt"
ipvlan: Remove usage of the deprecated ida_simple_xx() API
ipvlan: Fix a typo in a comment
net/sched: Remove ipt action tests
net: stmmac: Use interrupt mode INTM=1 for per channel irq
net: stmmac: Add support for TX/RX channel interrupt
net: stmmac: Make MSI interrupt routine generic
dt-bindings: net: snps,dwmac: per channel irq
net: phy: at803x: make read_status more generic
net: phy: at803x: add support for cdt cross short test for qca808x
net: phy: at803x: refactor qca808x cable test get status function
net: phy: at803x: generalize cdt fault length function
net: ethernet: cortina: Drop TSO support
...
Just one cleanup and one documentation improvement change. No functional
changes. However, this has been tested on linux-next for over 1 month.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmWdW0USHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinL5gQAMUtU4nv+HahLwJK3ha5fRmcHm2kV9Q4
g9X7JUscEe+mNYLNy2Kjpl/BWatEHWn8jQD2nsMQJOcEkX2Mf5tfqBR3561wrfSZ
dnMmNDG7Ym6Y9kOSDz6cpCxsu8Xm5Dj9MLKJ51qPfyXgeobD8IiKBe2oCDqcKzGY
ZpDnmpUaYCOIloNhJNK9ybrNLsVDQwdPiC8vVQXULC4ePBw3i+mnh8c1wr442wEO
G6xfi0wNXIoB9S8ynzakW2lJPD1XeMQYu/SJR0nz61KhJxAjs/LVAt9k7itcnUgG
Zbc1Fn944oWoe7/ywwDIxstR56NYVpcXoTJexeHxrLe6PiEmRbh6phrBMUCOpUsh
0DrHJE8z4dsHovo6w6m1zvMF6FphLHhUU6L/opBwrUJ5CGrYfegG95v8d3jRSphe
GSoMo9iGHvr0PgY7OcG77m5NJjrFwwbPO88Fe3IAXmOXIrKYuzBoUnt3Te1dw8vX
6vYPUUZ3HLDOACWtKf2Tjhr7pM6b1C72vrg7uYc2560BH6ERqjAtuOWKrx8QOeeo
SUT4ACs4qa3DrPz7zpNhwwnfwpFoVuJd+fooMV/WhOU9KsCVCokC7zZub+J8UD7m
1j1nhWkKaF08c0BKW4zRMmUVSV6Dh/AO0YybetBq9b4o7NPEcij2HlIk+jJa0VMJ
XSx5UcuxRyf7
=hvRa
-----END PGP SIGNATURE-----
Merge tag 'modules-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module updates from Luis Chamberlain:
"Just one cleanup and one documentation improvement change. No
functional changes"
* tag 'modules-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
kernel/module: improve documentation for try_module_get()
module: Remove redundant TASK_UNINTERRUPTIBLE
The goal is to get sched.h down to a type only header, so the main thing
happening in this patchset is splitting out various _types.h headers and
dependency fixups, as well as moving some things out of sched.h to
better locations.
This is prep work for the memory allocation profiling patchset which
adds new sched.h interdepencencies.
Testing - it's been in -next, and fixes from pretty much all
architectures have percolated in - nothing major.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmWfBwwACgkQE6szbY3K
bnZPwBAAmuRojXaeWxi01IPIOehSGDe68vw44PR9glEMZvxdnZuPOdvE4/+245/L
bRKU2WBCjBUokUbV9msIShwRkFTZAmEMPNfPAAsFMA+VXeDYHKB+ZRdwTggNAQ+I
SG6fZgh5m0HsewCDxU8oqVHkjVq4fXn0cy+aL6xLEd9gu67GoBzX2pDieS2Kvy6j
jnyoKTxFwb+LTQgph0P4EIpq5I2umAsdLwdSR8EJ+8e9NiNvMo1pI00Lx/ntAnFZ
JftWUJcMy3TQ5u1GkyfQN9y/yThX1bZK5GvmHS9SJ2Dkacaus5d+xaKCHtRuFS1I
7C6b8PsNgRczUMumBXus44HdlNfNs1yU3lvVxFvBIPE1qC9pYRHrkWIXXIocXLLC
oxTEJ6B2G3BQZVQgLIA4fOaxMVhmvKffi/aEZLi9vN9VVosd1a6XNKI6KbyRnXFp
GSs9qDqszhn5I3GYNlDNQTc/8UsRlhPFgS6nS0By6QnvxtGi9QkU2tBRBsXvqwCy
cLoCYIhc2tvugHvld70dz26umiJ4rnmxGlobStNoigDvIKAIUt1UmIdr1so8P8eH
xehnL9ZcOX6xnANDL0AqMFFHV6I58CJynhFdUoXfVQf/DWLGX48mpi9LVNsYBzsI
CAwVOAQ0UjGrpdWmJ9ueY/ABYqg9vRjzaDEXQ+MhAYO55CLaVsg=
=3tyT
-----END PGP SIGNATURE-----
Merge tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs
Pull header cleanups from Kent Overstreet:
"The goal is to get sched.h down to a type only header, so the main
thing happening in this patchset is splitting out various _types.h
headers and dependency fixups, as well as moving some things out of
sched.h to better locations.
This is prep work for the memory allocation profiling patchset which
adds new sched.h interdepencencies"
* tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (51 commits)
Kill sched.h dependency on rcupdate.h
kill unnecessary thread_info.h include
Kill unnecessary kernel.h include
preempt.h: Kill dependency on list.h
rseq: Split out rseq.h from sched.h
LoongArch: signal.c: add header file to fix build error
restart_block: Trim includes
lockdep: move held_lock to lockdep_types.h
sem: Split out sem_types.h
uidgid: Split out uidgid_types.h
seccomp: Split out seccomp_types.h
refcount: Split out refcount_types.h
uapi/linux/resource.h: fix include
x86/signal: kill dependency on time.h
syscall_user_dispatch.h: split out *_types.h
mm_types_task.h: Trim dependencies
Split out irqflags_types.h
ipc: Kill bogus dependency on spinlock.h
shm: Slim down dependencies
workqueue: Split out workqueue_types.h
...
- Introduce the param_unknown_fn type and other clean ups (Andy Shevchenko)
- Various __counted_by annotations (Christophe JAILLET, Gustavo A. R. Silva,
Kees Cook)
- Add KFENCE test to LKDTM (Stephen Boyd)
- Various strncpy() refactorings (Justin Stitt)
- Fix qnx4 to avoid writing into the smaller of two overlapping buffers
- Various strlcpy() refactorings
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmWcOsQWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJoiDD/9gNhalNG+6MNF5TDwSvO9X7pvL
bQ6D3clByRxYjnJ4dMQ7p3s+rJ937uQt9PezIWHgRoldjQy3x7AJ5BxkhjeMlD2B
YLbfdVYPy09X0Ewk1Efvfm/ta6tJpBGYF7Bc7LIneZrdQ6gemBpLW1PNZAFYzcWX
oDjV+M1NytxaiF0aebxPZvZ1W+NGQ105Sxvj5MheDoezyO/j0CTe+ZYtCzFguFY0
8SPpR5FG4AFidb8GHd5Ndv0trVWjF1jat0FUFgEFOCE0fJNWLVR0Bbr2MtXiG7wL
LF7IZ/Mn+mi+O3BmcD6JiaYf9EPlMUXCyqc8NvsnoWGqhWhWmQPCInZVrpplMUNK
V/UHVMkmjDs4f/lAHBJoJHDK6fmOD+cAFaNMOltfErcjV4s+lEo6vHoiKl8hfPnH
EzpQaK3funGroVYwTc35e07NrJJHCzqIUhZ0FJO7ByuOE2tIomiVo9Xy9gy54iCT
qzC7zkrZ0MKqui4qiUY9FWayRRYLX4qNxELm4yie6Pzmk8943hNOaDofcyKWuZFC
eqvhIkvqb4LasLrzCBk+ehA2KWSRmTrR6E9IygwbBXUTsvn2yj2RRYeAlGQNBTBZ
adgSXQpRBmtKYqyihWLhP4QcunknEiQdDS3lS2qJmPH33Iv3jGH4yS6BNIBufMGL
PoC2UxSfGd+YT079fw==
=1Wxx
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
- Introduce the param_unknown_fn type and other clean ups (Andy
Shevchenko)
- Various __counted_by annotations (Christophe JAILLET, Gustavo A. R.
Silva, Kees Cook)
- Add KFENCE test to LKDTM (Stephen Boyd)
- Various strncpy() refactorings (Justin Stitt)
- Fix qnx4 to avoid writing into the smaller of two overlapping buffers
- Various strlcpy() refactorings
* tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
qnx4: Use get_directory_fname() in qnx4_match()
qnx4: Extract dir entry filename processing into helper
atags_proc: Add __counted_by for struct buffer and use struct_size()
tracing/uprobe: Replace strlcpy() with strscpy()
params: Fix multi-line comment style
params: Sort headers
params: Use size_add() for kmalloc()
params: Do not go over the limit when getting the string length
params: Introduce the param_unknown_fn type
lkdtm: Add kfence read after free crash type
nvme-fc: replace deprecated strncpy with strscpy
nvdimm/btt: replace deprecated strncpy with strscpy
nvme-fabrics: replace deprecated strncpy with strscpy
drm/modes: replace deprecated strncpy with strscpy_pad
afs: Add __counted_by for struct afs_acl and use struct_size()
VMCI: Annotate struct vmci_handle_arr with __counted_by
i40e: Annotate struct i40e_qvlist_info with __counted_by
HID: uhid: replace deprecated strncpy with strscpy
samples: Replace strlcpy() with strscpy()
SUNRPC: Replace strlcpy() with strscpy()
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options.
[ mingo: Converted a few more uses in comments/messages as well. ]
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ariel Miculas <amiculas@cisco.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org
It's been 11 years since the ring_buffer_size() function was updated to
use the nr_pages from the buffer->buffers[cpu] structure instead of using
the buffer->nr_pages that no longer exists.
The comment in the code is more of what a change log should have and is
pretty much useless for development. It's saying how things worked back in
2012 that bares no purpose on today's code. Remove it.
Link: https://lore.kernel.org/linux-trace-kernel/84d3b41a72bd43dbb9d44921ef535c92@AcuMS.aculab.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231220081028.7cd7e8e2@gandalf.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reported-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
This KUnit update for Linux 6.8-rc1 consists of:
- a new feature that adds APIs for managing devices introducing
a set of helper functions which allow devices (internally a
struct kunit_device) to be created and managed by KUnit.
These devices will be automatically unregistered on
test exit. These helpers can either use a user-provided
struct device_driver, or have one automatically created and
managed by KUnit. In both cases, the device lives on a new
kunit_bus.
- changes to switch drm/tests to use kunit devices
- several fixes and enhancements to attribute feature
- changes to reorganize deferred action function introducing
KUNIT_DEFINE_ACTION_WRAPPER
- new feature adds ability to run tests after boot using debugfs
- fixes and enhancements to string-stream-test:
- parse ERR_PTR in string_stream_destroy()
- unchecked dereference in bug fix in debugfs_print_results()
- handling errors from alloc_string_stream()
- NULL-dereference bug fix in kunit_init_suite()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmWdiHIACgkQCwJExA0N
QxxzCxAAmhn+rkKV4DfuGXxUAJbO109H7LSP1Y7FKMYCVp83msWKASziujb2IQR9
87jnmgeJMbmQaPcc9m//NHuFhZmJwQZwAdGZryoDiz7XK+1MwLxYeUj92HI7FPaD
o5Jz6tlqFdehx5jCOymgwbvhI5kJMkQCTTtnEaiHCByfaA02UqmTtt3bXK5OeJkZ
UG0HqdvI/6Xo01i+BnerRBZFcQV49GMhl4acw1k+dJnPLkqusL6txftRBoKtxuVd
mXQHKS1SmNgiNA+nqs4d/8qERoMJWuwj6wV4pldVBXhgZwOHXbBxBf67i7hTakE/
TkEURCkOb5X0QrT6akJj6phJ2xqXsF7xwzBJh9G4jF2Pdwwo8GGuAXW+ol0TRrm8
ZEQ4eMBGIK07Lb9FeBMLO2bZ0Ox+oiN+YNGY/bs1d6Ibf4PnBUfy7IlmMjKL9h/V
A/EpYdaq5q72IZZQ2pu1rYkBRPbnP7vHmjLXVYIq7Pq8bLA9/ycKO/0jnGHdo1oz
rBK/6t7yB+ATi4KeKQpjpmUTX/vdEenUQI/QDn9ngXIEwYQfNrEUZitEvBXR1Kw+
T8iKDIPFkvb/yEZgjWgNpxETooDx3yBkeeC29gKMj4QoN38wEdfy0Xltp8eqq9cS
6lijRoipUypHRAuXeSJMW2dflLnFIt4mtC25hBNF+DmyNVT+IF4=
=79+u
-----END PGP SIGNATURE-----
Merge tag 'linux_kselftest-kunit-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit updates from Shuah Khan:
- a new feature that adds APIs for managing devices introducing a set
of helper functions which allow devices (internally a struct
kunit_device) to be created and managed by KUnit.
These devices will be automatically unregistered on test exit. These
helpers can either use a user-provided struct device_driver, or have
one automatically created and managed by KUnit. In both cases, the
device lives on a new kunit_bus.
- changes to switch drm/tests to use kunit devices
- several fixes and enhancements to attribute feature
- changes to reorganize deferred action function introducing
KUNIT_DEFINE_ACTION_WRAPPER
- new feature adds ability to run tests after boot using debugfs
- fixes and enhancements to string-stream-test:
- parse ERR_PTR in string_stream_destroy()
- unchecked dereference in bug fix in debugfs_print_results()
- handling errors from alloc_string_stream()
- NULL-dereference bug fix in kunit_init_suite()
* tag 'linux_kselftest-kunit-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (27 commits)
kunit: Fix some comments which were mistakenly kerneldoc
kunit: Protect string comparisons against NULL
kunit: Add example of kunit_activate_static_stub() with pointer-to-function
kunit: Allow passing function pointer to kunit_activate_static_stub()
kunit: Fix NULL-dereference in kunit_init_suite() if suite->log is NULL
kunit: Reset test->priv after each param iteration
kunit: Add example for using test->priv
drm/tests: Switch to kunit devices
ASoC: topology: Replace fake root_device with kunit_device in tests
overflow: Replace fake root_device with kunit_device
fortify: test: Use kunit_device
kunit: Add APIs for managing devices
Documentation: Add debugfs docs with run after boot
kunit: add ability to run tests after boot using debugfs
kunit: add is_init test attribute
kunit: add example suite to test init suites
kunit: add KUNIT_INIT_TABLE to init linker section
kunit: move KUNIT_TABLE out of INIT_DATA
kunit: tool: add test for parsing attributes
kunit: tool: fix parsing of test attributes
...
- Add support for the Sierra Forest, Grand Ridge and Meteorlake SoCs to
the intel_idle cpuidle driver (Artem Bityutskiy, Zhang Rui).
- Do not enable interrupts when entering idle in the haltpoll cpuidle
driver (Borislav Petkov).
- Add Emerald Rapids support in no-HWP mode to the intel_pstate cpufreq
driver (Zhenguo Yao).
- Use EPP values programmed by the platform firmware as balanced
performance ones by default in intel_pstate (Srinivas Pandruvada).
- Add a missing function return value check to the SCMI cpufreq driver
to avoid unexpected behavior (Alexandra Diupina).
- Fix parameter type warning in the armada-8k cpufreq driver (Gregory
CLEMENT).
- Rework trans_stat_show() in the devfreq core code to avoid buffer
overflows (Christian Marangi).
- Synchronize devfreq_monitor_[start/stop] so as to prevent a timer
list corruption from occurring when devfreq governors are switched
frequently (Mukesh Ojha).
- Fix possible deadlocks in the core system-wide PM code that occur if
device-handling functions cannot be executed asynchronously during
resume from system-wide suspend (Rafael J. Wysocki).
- Clean up unnecessary local variable initializations in multiple
places in the hibernation code (Wang chaodong, Li zeming).
- Adjust core hibernation code to avoid missing wakeup events that
occur after saving an image to persistent storage (Chris Feng).
- Update hibernation code to enforce correct ordering during image
compression and decompression (Hongchen Zhang).
- Use kmap_local_page() instead of kmap_atomic() in copy_data_page()
during hibernation and restore (Chen Haonan).
- Adjust documentation and code comments to reflect recent tasks freezer
changes (Kevin Hao).
- Repair excess function parameter description warning in the
hibernation image-saving code (Randy Dunlap).
- Fix _set_required_opps when opp is NULL (Bryan O'Donoghue).
- Use device_get_match_data() in the OPP code for TI (Rob Herring).
- Clean up OPP level and other parts and call dev_pm_opp_set_opp()
recursively for required OPPs (Viresh Kumar).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmWb8o8SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxbOQP/2D+YGyd9R1awYqxQoIQSGbpIr7a9oTR
BOIcOn2PvWLXH8w9wVbIUJsSL9Nx90D9T4S5QcyHrS2qHmR+1Gb0gX3D4QAmEBcG
+wFCLt2//5PwqShtPcJEUcGdL274aVEpmnEAmzKnk20MkLQM3twxe6FKSkwWMLYb
u9OKgdN8Vah0iSBUCpyT52O0x4d65MD/tka0QaGjLg64TtyqhTSKi+XgWtZkSZ7H
lRgn9qMoMXq/h1aeK4MKp5UtJKRxBWRdMijIFFXAfgO8dwbDbyXTo0d2LMR6DiEM
VsvRIjEePoRcGf7bwAbrUeSoNb5Ec32RW3v9GSNn2sWutW+vhD//frZq48zAR6lm
i8Xlf2Ar63Z+qNcFpCZjlNwAbfEuZ1vIr0Pu3oDd0GkOXjxiVMgAwtatTp1nSW7/
wWFuMA5G+wdzU/Z5KcV1p7S8CP1gC8S05LHGwtKKGm9pLbzhauF8GK6Xpa4711T8
oI3uDFIgxaxW8B/ymsM5cNa2QbfYUuQbOFTwXvBcy4gizrbZwwXRSpfaKoDIYAXZ
2kfwmFbu3IbrRypboY58lG3SzbnN94oEMANtsVYuxMimGz2x3ZmHBAFm2l9YPYRz
dBq/RUM7sMIvM1SwqR4tG8rt206L7KpPyW99pUa2AhEdof4iV2bpyujHFdkm83MK
nJ0OF/xcc98Z
=zh4c
-----END PGP SIGNATURE-----
Merge tag 'pm-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add support for new processors (Sierra Forest, Grand Ridge and
Meteor Lake) to the intel_idle driver, make intel_pstate run on
Emerald Rapids without HWP support and adjust it to utilize EPP values
supplied by the platform firmware, fix issues, clean up code and
improve documentation.
The most significant fix addresses deadlocks in the core system-wide
resume code that occur if async_schedule_dev() attempts to run its
argument function synchronously (for example, due to a memory
allocation failure). It rearranges the code in question which may
increase the system resume time in some cases, but this basically is a
removal of a premature optimization. That optimization will be added
back later, but properly this time.
Specifics:
- Add support for the Sierra Forest, Grand Ridge and Meteorlake SoCs
to the intel_idle cpuidle driver (Artem Bityutskiy, Zhang Rui)
- Do not enable interrupts when entering idle in the haltpoll cpuidle
driver (Borislav Petkov)
- Add Emerald Rapids support in no-HWP mode to the intel_pstate
cpufreq driver (Zhenguo Yao)
- Use EPP values programmed by the platform firmware as balanced
performance ones by default in intel_pstate (Srinivas Pandruvada)
- Add a missing function return value check to the SCMI cpufreq
driver to avoid unexpected behavior (Alexandra Diupina)
- Fix parameter type warning in the armada-8k cpufreq driver (Gregory
CLEMENT)
- Rework trans_stat_show() in the devfreq core code to avoid buffer
overflows (Christian Marangi)
- Synchronize devfreq_monitor_[start/stop] so as to prevent a timer
list corruption from occurring when devfreq governors are switched
frequently (Mukesh Ojha)
- Fix possible deadlocks in the core system-wide PM code that occur
if device-handling functions cannot be executed asynchronously
during resume from system-wide suspend (Rafael J. Wysocki)
- Clean up unnecessary local variable initializations in multiple
places in the hibernation code (Wang chaodong, Li zeming)
- Adjust core hibernation code to avoid missing wakeup events that
occur after saving an image to persistent storage (Chris Feng)
- Update hibernation code to enforce correct ordering during image
compression and decompression (Hongchen Zhang)
- Use kmap_local_page() instead of kmap_atomic() in copy_data_page()
during hibernation and restore (Chen Haonan)
- Adjust documentation and code comments to reflect recent tasks
freezer changes (Kevin Hao)
- Repair excess function parameter description warning in the
hibernation image-saving code (Randy Dunlap)
- Fix _set_required_opps when opp is NULL (Bryan O'Donoghue)
- Use device_get_match_data() in the OPP code for TI (Rob Herring)
- Clean up OPP level and other parts and call dev_pm_opp_set_opp()
recursively for required OPPs (Viresh Kumar)"
* tag 'pm-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (35 commits)
OPP: Rename 'rate_clk_single'
OPP: Pass rounded rate to _set_opp()
OPP: Relocate dev_pm_opp_sync_regulators()
PM: sleep: Fix possible deadlocks in core system-wide PM code
OPP: Move dev_pm_opp_icc_bw to internal opp.h
async: Introduce async_schedule_dev_nocall()
async: Split async_schedule_node_domain()
cpuidle: haltpoll: Do not enable interrupts when entering idle
OPP: Fix _set_required_opps when opp is NULL
OPP: The level field is always of unsigned int type
PM: hibernate: Repair excess function parameter description warning
PM: sleep: Remove obsolete comment from unlock_system_sleep()
cpufreq: intel_pstate: Add Emerald Rapids support in no-HWP mode
Documentation: PM: Adjust freezing-of-tasks.rst to the freezer changes
PM: hibernate: Use kmap_local_page() in copy_data_page()
intel_idle: add Sierra Forest SoC support
intel_idle: add Grand Ridge SoC support
PM / devfreq: Synchronize devfreq_monitor_[start/stop]
cpufreq: armada-8k: Fix parameter type warning
PM: hibernate: Enforce ordering during image compression/decompression
...
- Add dynamic thresholds for trip point crossing detection to prevent
trip point crossing notifications from being sent at incorrect times
or not at all in some cases (Rafael J. Wysocki).
- Fix synchronization issues related to the resume of thermal zones
during a system-wide resume and allow thermal zones to be resumed
concurrently (Rafael J. Wysocki).
- Modify the thermal zone unregistration to wait for the given zone to
go away completely before returning to the caller and rework the
sysfs interface for trip points on top of that (Rafael J. Wysocki).
- Fix a possible NULL pointer dereference in thermal zone registration
error path (Rafael J. Wysocki).
- Clean up the IPA thermal governor and modify it (with the help of a
new governor callback) to avoid allocating and freeing memory every
time its throttling callback is invoked (Lukasz Luba).
- Make the IPA thermal governor handle thermal instance weight changes
via sysfs correctly (Lukasz Luba).
- Update the thermal netlink code to avoid sending messages if there
are no recipients (Stanislaw Gruszka).
- Convert Mediatek Thermal to the json-schema (Rafał Miłecki).
- Fix thermal DT bindings issue on Loongson (Binbin Zhou).
- Fix returning NULL instead of -ENODEV during thermal probe on
Loogsoon (Binbin Zhou).
- Add thermal DT binding for tsens on the SM8650 platform (Neil
Armstrong).
- Add reboot on the critical trip point crossing option feature (Fabio
Estevam).
- Use DEFINE_SIMPLE_DEV_PM_OPS do define PM functions for thermal
suspend/resume on AmLogic (Uwe Kleine-König)
- Add D1/T113s THS controller support to the Sun8i thermal control
driver (Maxim Kiselev)
- Fix example in the thermal DT binding for QCom SPMI (Johan Hovold).
- Fix compilation warning in the tmon utility (Florian Eckert).
- Add support for interrupt-based thermal configuration on Exynos along
with a set of related cleanups (Mateusz Majewski).
- Make the Intel HFI thermal driver enable an HFI instance (eg. processor
package) from its first online CPU and disable it when the last CPU in
it goes offline (Ricardo Neri).
- Fix a kernel-doc warning and a spello in the cpuidle_cooling thermal
driver (Randy Dunlap).
- Move the .get_temp() thermal zone callback presence check to the
thermal zone registration code (Daniel Lezcano).
- Use the for_each_trip() macro for trip points table walks in a few
places in the thermal core (Rafael J. Wysocki).
- Make all trip point updates (via sysfs as well as from the platform
firmware) trigger trip change notifications (Rafael J. Wysocki).
- Drop redundant code from the thermal core and make one function in
it take a const pointer argument (Rafael J. Wysocki).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmWb8iUSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxKLkP/iDsuDwmhZAjbAu2iftk/8ad8Trm2VoK
+9eZ5Eqa8lKEcJLb0RxueTnFT4ppvT/hY99HOG4FM+mCnWeH/Z32N697DhqiUg4v
GZUpeOPzxYgsfOOTeuL5XgfrVMgBjJrJunTXmzgAd8lIhTmRbAMVmFVJ18CJO11O
RHgqvYznYFi5cywA9/NkG2xkhFB0VDoiTuIiuMMV+pMjqF0d5ooBMkhmjvPQ5Rp9
FjNJ7hqiTamAsDPdULAFqhIGGhKZWWFbh4+S+JPCwBW8nqvxyJpemsm20vrwctJR
bSXWQkgkDpWEeg9yrEAOO/Uk9yGd3jiLfkvPBKbK0x/YxGZ4hOYHcbF3cOUvmPYP
5K3ZJ61DNrzB/5S3LY54VYrWmTVRdK6Lk3HYNvfAUYFJZMZ5oMYZLCUmo4SswUdy
UUEIY27H7L18eLhP9zCcKo4njdaVG+vXQn/rJIFOpG0k9OElzPs1X8Dp/m9pKQDR
rDUsMXqB34NUVrIEhjAgqvwF5xHooW8gykpuJgxwBetA9w8Pls2A/mzLsDY3wgdQ
htiANGpKTDqBQSn+HrjzYckv9/R+1tDyTJmEDNZwllA1DJfrOlpCRD2VHRpgTZEA
Ldnq0bhyq6RQnousqxhgpYkIAoGaebs9XasRH0YtBG5gIumeWfqeVzmTcM5xdsNB
yf6RdQy8QunS
=QVyh
-----END PGP SIGNATURE-----
Merge tag 'thermal-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control updates from Rafael Wysocki:
"These add support for the D1/T113s THS controller to the sun8i driver
and a DT-based mechanism for platforms to indicate a preference to
reboot (instead of shutting down) on crossing a critical trip point,
fix issues, make other improvements (in the IPA governor, the Intel
HFI driver, the exynos driver and the thermal netlink interface among
other places) and clean up code.
One long-standing issue addressed here is that trip point crossing
notifications sent to user space might be unreliable due to the
incorrect handling of trip point hysteresis in the thermal core:
multiple notifications might be sent for the same event or there might
be events without any notification at all.
Specifics:
- Add dynamic thresholds for trip point crossing detection to prevent
trip point crossing notifications from being sent at incorrect
times or not at all in some cases (Rafael J. Wysocki)
- Fix synchronization issues related to the resume of thermal zones
during a system-wide resume and allow thermal zones to be resumed
concurrently (Rafael J. Wysocki)
- Modify the thermal zone unregistration to wait for the given zone
to go away completely before returning to the caller and rework the
sysfs interface for trip points on top of that (Rafael J. Wysocki)
- Fix a possible NULL pointer dereference in thermal zone
registration error path (Rafael J. Wysocki)
- Clean up the IPA thermal governor and modify it (with the help of a
new governor callback) to avoid allocating and freeing memory every
time its throttling callback is invoked (Lukasz Luba)
- Make the IPA thermal governor handle thermal instance weight
changes via sysfs correctly (Lukasz Luba)
- Update the thermal netlink code to avoid sending messages if there
are no recipients (Stanislaw Gruszka)
- Convert Mediatek Thermal to the json-schema (Rafał Miłecki)
- Fix thermal DT bindings issue on Loongson (Binbin Zhou)
- Fix returning NULL instead of -ENODEV during thermal probe on
Loogsoon (Binbin Zhou)
- Add thermal DT binding for tsens on the SM8650 platform (Neil
Armstrong)
- Add reboot on the critical trip point crossing option feature
(Fabio Estevam)
- Use DEFINE_SIMPLE_DEV_PM_OPS do define PM functions for thermal
suspend/resume on AmLogic (Uwe Kleine-König)
- Add D1/T113s THS controller support to the Sun8i thermal control
driver (Maxim Kiselev)
- Fix example in the thermal DT binding for QCom SPMI (Johan Hovold)
- Fix compilation warning in the tmon utility (Florian Eckert)
- Add support for interrupt-based thermal configuration on Exynos
along with a set of related cleanups (Mateusz Majewski)
- Make the Intel HFI thermal driver enable an HFI instance (eg.
processor package) from its first online CPU and disable it when
the last CPU in it goes offline (Ricardo Neri)
- Fix a kernel-doc warning and a spello in the cpuidle_cooling
thermal driver (Randy Dunlap)
- Move the .get_temp() thermal zone callback presence check to the
thermal zone registration code (Daniel Lezcano)
- Use the for_each_trip() macro for trip points table walks in a few
places in the thermal core (Rafael J. Wysocki)
- Make all trip point updates (via sysfs as well as from the platform
firmware) trigger trip change notifications (Rafael J. Wysocki)
- Drop redundant code from the thermal core and make one function in
it take a const pointer argument (Rafael J. Wysocki)"
* tag 'thermal-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (64 commits)
thermal: trip: Constify thermal zone argument of thermal_zone_trip_id()
thermal: intel: hfi: Disable an HFI instance when all its CPUs go offline
thermal: intel: hfi: Enable an HFI instance from its first online CPU
thermal: intel: hfi: Refactor enabling code into helper functions
thermal/drivers/exynos: Use set_trips ops
thermal/drivers/exynos: Use BIT wherever possible
thermal/drivers/exynos: Split initialization of TMU and the thermal zone
thermal/drivers/exynos: Stop using the threshold mechanism on Exynos 4210
thermal/drivers/exynos: Simplify regulator (de)initialization
thermal/drivers/exynos: Handle devm_regulator_get_optional return value correctly
thermal/drivers/exynos: Wwitch from workqueue-driven interrupt handling to threaded interrupts
thermal/drivers/exynos: Drop id field
thermal/drivers/exynos: Remove an unnecessary field description
tools/thermal/tmon: Fix compilation warning for wrong format
dt-bindings: thermal: qcom-spmi-adc-tm5/hc: Clean up examples
dt-bindings: thermal: qcom-spmi-adc-tm5/hc: Fix example node names
thermal/drivers/sun8i: Add D1/T113s THS controller support
dt-bindings: thermal: sun8i: Add binding for D1/T113s THS controller
thermal: amlogic: Use DEFINE_SIMPLE_DEV_PM_OPS for PM functions
thermal: amlogic: Make amlogic_thermal_disable() return void
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmWYKUIUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNyHw/+IKnqL1MZ5QS+/HtSzi4jCL47N9yZ
OHLol6XswyEGHH9myKPPGnT5lVA93v98v4ty2mws7EJUSGZQQUntYBPbU9Gi40+B
XDzYSRocoj96sdlKeOJMgaWo3NBRD9HYSoGPDNWZixy6m+bLPk/Dqhn3FabKf1lo
2qQSmstvChFRmVNkmgaQnBCAtWVqla4EJEL0EKX6cspHbuzRNTeJdTPn6Q/zOUVL
O2znOZuEtSVpYS7yg3uJT0hHD8H0GnIciAcDAhyPSBL5Uk5l6gwJiACcdRfLRbgp
QM5Z4qUFdKljV5XBCzYnfhhrx1df08h1SG84El8UK8HgTTfOZfYmawByJRWNJSQE
TdCmtyyvEbfb61CKBFVwD7Tzb9/y8WgcY5N3Un8uCQqRzFIO+6cghHri5NrVhifp
nPFlP4klxLHh3d7ZVekLmCMHbpaacRyJKwLy+f/nwbBEID47jpPkvZFIpbalat+r
QaKRBNWdTeV+GZ+Yu0uWsI029aQnpcO1kAnGg09fl6b/dsmxeKOVWebir25AzQ++
a702S8HRmj80X+VnXHU9a64XeGtBH7Nq0vu0lGHQPgwhSx/9P6/qICEPwsIriRjR
I9OulWt4OBPDtlsonHFgDs+lbnd0Z0GJUwYT8e9pjRDMxijVO9lhAXyglVRmuNR8
to2ByKP5BO+Vh8Y=
=Py+n
-----END PGP SIGNATURE-----
Merge tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull security module updates from Paul Moore:
- Add three new syscalls: lsm_list_modules(), lsm_get_self_attr(), and
lsm_set_self_attr().
The first syscall simply lists the LSMs enabled, while the second and
third get and set the current process' LSM attributes. Yes, these
syscalls may provide similar functionality to what can be found under
/proc or /sys, but they were designed to support multiple,
simultaneaous (stacked) LSMs from the start as opposed to the current
/proc based solutions which were created at a time when only one LSM
was allowed to be active at a given time.
We have spent considerable time discussing ways to extend the
existing /proc interfaces to support multiple, simultaneaous LSMs and
even our best ideas have been far too ugly to support as a kernel
API; after +20 years in the kernel, I felt the LSM layer had
established itself enough to justify a handful of syscalls.
Support amongst the individual LSM developers has been nearly
unanimous, with a single objection coming from Tetsuo (TOMOYO) as he
is worried that the LSM_ID_XXX token concept will make it more
difficult for out-of-tree LSMs to survive. Several members of the LSM
community have demonstrated the ability for out-of-tree LSMs to
continue to exist by picking high/unused LSM_ID values as well as
pointing out that many kernel APIs rely on integer identifiers, e.g.
syscalls (!), but unfortunately Tetsuo's objections remain.
My personal opinion is that while I have no interest in penalizing
out-of-tree LSMs, I'm not going to penalize in-tree development to
support out-of-tree development, and I view this as a necessary step
forward to support the push for expanded LSM stacking and reduce our
reliance on /proc and /sys which has occassionally been problematic
for some container users. Finally, we have included the linux-api
folks on (all?) recent revisions of the patchset and addressed all of
their concerns.
- Add a new security_file_ioctl_compat() LSM hook to handle the 32-bit
ioctls on 64-bit systems problem.
This patch includes support for all of the existing LSMs which
provide ioctl hooks, although it turns out only SELinux actually
cares about the individual ioctls. It is worth noting that while
Casey (Smack) and Tetsuo (TOMOYO) did not give explicit ACKs to this
patch, they did both indicate they are okay with the changes.
- Fix a potential memory leak in the CALIPSO code when IPv6 is disabled
at boot.
While it's good that we are fixing this, I doubt this is something
users are seeing in the wild as you need to both disable IPv6 and
then attempt to configure IPv6 labeled networking via
NetLabel/CALIPSO; that just doesn't make much sense.
Normally this would go through netdev, but Jakub asked me to take
this patch and of all the trees I maintain, the LSM tree seemed like
the best fit.
- Update the LSM MAINTAINERS entry with additional information about
our process docs, patchwork, bug reporting, etc.
I also noticed that the Lockdown LSM is missing a dedicated
MAINTAINERS entry so I've added that to the pull request. I've been
working with one of the major Lockdown authors/contributors to see if
they are willing to step up and assume a Lockdown maintainer role;
hopefully that will happen soon, but in the meantime I'll continue to
look after it.
- Add a handful of mailmap entries for Serge Hallyn and myself.
* tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (27 commits)
lsm: new security_file_ioctl_compat() hook
lsm: Add a __counted_by() annotation to lsm_ctx.ctx
calipso: fix memory leak in netlbl_calipso_add_pass()
selftests: remove the LSM_ID_IMA check in lsm/lsm_list_modules_test
MAINTAINERS: add an entry for the lockdown LSM
MAINTAINERS: update the LSM entry
mailmap: add entries for Serge Hallyn's dead accounts
mailmap: update/replace my old email addresses
lsm: mark the lsm_id variables are marked as static
lsm: convert security_setselfattr() to use memdup_user()
lsm: align based on pointer length in lsm_fill_user_ctx()
lsm: consolidate buffer size handling into lsm_fill_user_ctx()
lsm: correct error codes in security_getselfattr()
lsm: cleanup the size counters in security_getselfattr()
lsm: don't yet account for IMA in LSM_CONFIG_COUNT calculation
lsm: drop LSM_ID_IMA
LSM: selftests for Linux Security Module syscalls
SELinux: Add selfattr hooks
AppArmor: Add selfattr hooks
Smack: implement setselfattr and getselfattr hooks
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmWYKJAUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXML3hAAyP/6RwJjUMM9Gsi9ZJNRz79X8uIp
/MYONbzy1xKq/d78jZhjsJm9yIlLk3muVdd3oRcXdmahA5zs3jOKRaM+OfNLOrt6
nuNwS+yaMUYKsKNh/A8TLoxcmBuNAN6lubCKwbccR6hvugqrZuZFkAqCIkiWUDeb
N64u1rL1q/tLI+jI76GIiK4SMMQihF3MMVVTmBWYDiIdrfPhFIHxipLgZaEBUqZM
43+2Y/blV75jcqPTZRgT9tk0LVLkiFtO97qUp9j+pYZbeoJ7CAaDH5A8NVm38yIX
tyzYiTV2lGS3qf/HdLc3OpJQlBVkhbq6cRiLGvyiKQp60xiqYffoL7iFP4/DJMoT
JKzoqXCixINRqdHWYbVY9hHBGg6R5c+1QqZzsnEy2MnBF++iLwJQAMz5JO9Qdh8F
tD6fD82QzvfoNPuP0lBA67preqN3wiH1Zsv6cstoI/6QKCAMeTMZt/ywniBTKhX6
WMmhdmMQKTwGrnCosydAOonYesieiYPhxz6oSeRIqoHRHtNow8rjnFh7DR7yi8uc
nv1x5bDqEI+QTrDys0cAq6fvdUZT2B9joqSovzXUGllRRS7w17WNf1Cu16jMTrHH
FeZ2P1BvKE7YIFkqxcE/RY5NHX3ylxA4unFM8UgIheYiWbWLm5+xrwZdNL30KQJ4
4Hvvy3Buq6kb4HE=
=908g
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"The audit updates are fairly minor with only two patches:
- Send an audit ACK to userspace immediately upon receiving an auditd
registration event as opposed to waiting until the registration has
been fully processed and the audit backlog starts filling the
netlink buffers.
Sending the ACK earlier, as done here, is still safe as the
operation should not fail at the point when the ACK is done, and
doing so helps avoid the ACK being dropped in extreme situations.
- Update the audit MAINTAINERS entry with additional information.
There isn't anything in this update that should be new to regular
contributors or list subscribers, but I'm pushing to start
documenting our processes, conventions, etc. and this seems like an
important part of that"
* tag 'audit-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
MAINTAINERS: update the audit entry
audit: Send netlink ACK before setting connection in auditd_set
many places. The notable patch series are:
- nilfs2 folio conversion from Matthew Wilcox in "nilfs2: Folio
conversions for file paths".
- Additional nilfs2 folio conversion from Ryusuke Konishi in "nilfs2:
Folio conversions for directory paths".
- IA64 remnant removal in Heiko Carstens's "Remove unused code after
IA-64 removal".
- Arnd Bergmann has enabled the -Wmissing-prototypes warning everywhere
in "Treewide: enable -Wmissing-prototypes". This had some followup
fixes:
- Nathan Chancellor has cleaned up the hexagon build in the series
"hexagon: Fix up instances of -Wmissing-prototypes".
- Nathan also addressed some s390 warnings in "s390: A couple of
fixes for -Wmissing-prototypes".
- Arnd Bergmann addresses the same warnings for MIPS in his series
"mips: address -Wmissing-prototypes warnings".
- Baoquan He has made kexec_file operate in a top-down-fitting manner
similar to kexec_load in the series "kexec_file: Load kernel at top of
system RAM if required"
- Baoquan He has also added the self-explanatory "kexec_file: print out
debugging message if required".
- Some checkstack maintenance work from Tiezhu Yang in the series
"Modify some code about checkstack".
- Douglas Anderson has disentangled the watchdog code's logging when
multiple reports are occurring simultaneously. The series is "watchdog:
Better handling of concurrent lockups".
- Yuntao Wang has contributed some maintenance work on the crash code in
"crash: Some cleanups and fixes".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZ2R6AAKCRDdBJ7gKXxA
juCVAP4t76qUISDOSKugB/Dn5E4Nt9wvPY9PcufnmD+xoPsgkQD+JVl4+jd9+gAV
vl6wkJDiJO5JZ3FVtBtC3DFA/xHtVgk=
=kQw+
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-01-09-10-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Quite a lot of kexec work this time around. Many singleton patches in
many places. The notable patch series are:
- nilfs2 folio conversion from Matthew Wilcox in 'nilfs2: Folio
conversions for file paths'.
- Additional nilfs2 folio conversion from Ryusuke Konishi in 'nilfs2:
Folio conversions for directory paths'.
- IA64 remnant removal in Heiko Carstens's 'Remove unused code after
IA-64 removal'.
- Arnd Bergmann has enabled the -Wmissing-prototypes warning
everywhere in 'Treewide: enable -Wmissing-prototypes'. This had
some followup fixes:
- Nathan Chancellor has cleaned up the hexagon build in the series
'hexagon: Fix up instances of -Wmissing-prototypes'.
- Nathan also addressed some s390 warnings in 's390: A couple of
fixes for -Wmissing-prototypes'.
- Arnd Bergmann addresses the same warnings for MIPS in his series
'mips: address -Wmissing-prototypes warnings'.
- Baoquan He has made kexec_file operate in a top-down-fitting manner
similar to kexec_load in the series 'kexec_file: Load kernel at top
of system RAM if required'
- Baoquan He has also added the self-explanatory 'kexec_file: print
out debugging message if required'.
- Some checkstack maintenance work from Tiezhu Yang in the series
'Modify some code about checkstack'.
- Douglas Anderson has disentangled the watchdog code's logging when
multiple reports are occurring simultaneously. The series is
'watchdog: Better handling of concurrent lockups'.
- Yuntao Wang has contributed some maintenance work on the crash code
in 'crash: Some cleanups and fixes'"
* tag 'mm-nonmm-stable-2024-01-09-10-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (157 commits)
crash_core: fix and simplify the logic of crash_exclude_mem_range()
x86/crash: use SZ_1M macro instead of hardcoded value
x86/crash: remove the unused image parameter from prepare_elf_headers()
kdump: remove redundant DEFAULT_CRASH_KERNEL_LOW_SIZE
scripts/decode_stacktrace.sh: strip unexpected CR from lines
watchdog: if panicking and we dumped everything, don't re-enable dumping
watchdog/hardlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
watchdog/softlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
watchdog/hardlockup: adopt softlockup logic avoiding double-dumps
kexec_core: fix the assignment to kimage->control_page
x86/kexec: fix incorrect end address passed to kernel_ident_mapping_init()
lib/trace_readwrite.c:: replace asm-generic/io with linux/io
nilfs2: cpfile: fix some kernel-doc warnings
stacktrace: fix kernel-doc typo
scripts/checkstack.pl: fix no space expression between sp and offset
x86/kexec: fix incorrect argument passed to kexec_dprintk()
x86/kexec: use pr_err() instead of kexec_dprintk() when an error occurs
nilfs2: add missing set_freezable() for freezable kthread
kernel: relay: remove relay_file_splice_read dead code, doesn't work
docs: submit-checklist: remove all of "make namespacecheck"
...
are included in this merge do the following:
- Peng Zhang has done some mapletree maintainance work in the
series
"maple_tree: add mt_free_one() and mt_attr() helpers"
"Some cleanups of maple tree"
- In the series "mm: use memmap_on_memory semantics for dax/kmem"
Vishal Verma has altered the interworking between memory-hotplug
and dax/kmem so that newly added 'device memory' can more easily
have its memmap placed within that newly added memory.
- Matthew Wilcox continues folio-related work (including a few
fixes) in the patch series
"Add folio_zero_tail() and folio_fill_tail()"
"Make folio_start_writeback return void"
"Fix fault handler's handling of poisoned tail pages"
"Convert aops->error_remove_page to ->error_remove_folio"
"Finish two folio conversions"
"More swap folio conversions"
- Kefeng Wang has also contributed folio-related work in the series
"mm: cleanup and use more folio in page fault"
- Jim Cromie has improved the kmemleak reporting output in the
series "tweak kmemleak report format".
- In the series "stackdepot: allow evicting stack traces" Andrey
Konovalov to permits clients (in this case KASAN) to cause
eviction of no longer needed stack traces.
- Charan Teja Kalla has fixed some accounting issues in the page
allocator's atomic reserve calculations in the series "mm:
page_alloc: fixes for high atomic reserve caluculations".
- Dmitry Rokosov has added to the samples/ dorectory some sample
code for a userspace memcg event listener application. See the
series "samples: introduce cgroup events listeners".
- Some mapletree maintanance work from Liam Howlett in the series
"maple_tree: iterator state changes".
- Nhat Pham has improved zswap's approach to writeback in the
series "workload-specific and memory pressure-driven zswap
writeback".
- DAMON/DAMOS feature and maintenance work from SeongJae Park in
the series
"mm/damon: let users feed and tame/auto-tune DAMOS"
"selftests/damon: add Python-written DAMON functionality tests"
"mm/damon: misc updates for 6.8"
- Yosry Ahmed has improved memcg's stats flushing in the series
"mm: memcg: subtree stats flushing and thresholds".
- In the series "Multi-size THP for anonymous memory" Ryan Roberts
has added a runtime opt-in feature to transparent hugepages which
improves performance by allocating larger chunks of memory during
anonymous page faults.
- Matthew Wilcox has also contributed some cleanup and maintenance
work against eh buffer_head code int he series "More buffer_head
cleanups".
- Suren Baghdasaryan has done work on Andrea Arcangeli's series
"userfaultfd move option". UFFDIO_MOVE permits userspace heap
compaction algorithms to move userspace's pages around rather than
UFFDIO_COPY'a alloc/copy/free.
- Stefan Roesch has developed a "KSM Advisor", in the series
"mm/ksm: Add ksm advisor". This is a governor which tunes KSM's
scanning aggressiveness in response to userspace's current needs.
- Chengming Zhou has optimized zswap's temporary working memory
use in the series "mm/zswap: dstmem reuse optimizations and
cleanups".
- Matthew Wilcox has performed some maintenance work on the
writeback code, both code and within filesystems. The series is
"Clean up the writeback paths".
- Andrey Konovalov has optimized KASAN's handling of alloc and
free stack traces for secondary-level allocators, in the series
"kasan: save mempool stack traces".
- Andrey also performed some KASAN maintenance work in the series
"kasan: assorted clean-ups".
- David Hildenbrand has gone to town on the rmap code. Cleanups,
more pte batching, folio conversions and more. See the series
"mm/rmap: interface overhaul".
- Kinsey Ho has contributed some maintenance work on the MGLRU
code in the series "mm/mglru: Kconfig cleanup".
- Matthew Wilcox has contributed lruvec page accounting code
cleanups in the series "Remove some lruvec page accounting
functions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
=0NHs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Peng Zhang has done some mapletree maintainance work in the series
'maple_tree: add mt_free_one() and mt_attr() helpers'
'Some cleanups of maple tree'
- In the series 'mm: use memmap_on_memory semantics for dax/kmem'
Vishal Verma has altered the interworking between memory-hotplug
and dax/kmem so that newly added 'device memory' can more easily
have its memmap placed within that newly added memory.
- Matthew Wilcox continues folio-related work (including a few fixes)
in the patch series
'Add folio_zero_tail() and folio_fill_tail()'
'Make folio_start_writeback return void'
'Fix fault handler's handling of poisoned tail pages'
'Convert aops->error_remove_page to ->error_remove_folio'
'Finish two folio conversions'
'More swap folio conversions'
- Kefeng Wang has also contributed folio-related work in the series
'mm: cleanup and use more folio in page fault'
- Jim Cromie has improved the kmemleak reporting output in the series
'tweak kmemleak report format'.
- In the series 'stackdepot: allow evicting stack traces' Andrey
Konovalov to permits clients (in this case KASAN) to cause eviction
of no longer needed stack traces.
- Charan Teja Kalla has fixed some accounting issues in the page
allocator's atomic reserve calculations in the series 'mm:
page_alloc: fixes for high atomic reserve caluculations'.
- Dmitry Rokosov has added to the samples/ dorectory some sample code
for a userspace memcg event listener application. See the series
'samples: introduce cgroup events listeners'.
- Some mapletree maintanance work from Liam Howlett in the series
'maple_tree: iterator state changes'.
- Nhat Pham has improved zswap's approach to writeback in the series
'workload-specific and memory pressure-driven zswap writeback'.
- DAMON/DAMOS feature and maintenance work from SeongJae Park in the
series
'mm/damon: let users feed and tame/auto-tune DAMOS'
'selftests/damon: add Python-written DAMON functionality tests'
'mm/damon: misc updates for 6.8'
- Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
memcg: subtree stats flushing and thresholds'.
- In the series 'Multi-size THP for anonymous memory' Ryan Roberts
has added a runtime opt-in feature to transparent hugepages which
improves performance by allocating larger chunks of memory during
anonymous page faults.
- Matthew Wilcox has also contributed some cleanup and maintenance
work against eh buffer_head code int he series 'More buffer_head
cleanups'.
- Suren Baghdasaryan has done work on Andrea Arcangeli's series
'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
compaction algorithms to move userspace's pages around rather than
UFFDIO_COPY'a alloc/copy/free.
- Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
Add ksm advisor'. This is a governor which tunes KSM's scanning
aggressiveness in response to userspace's current needs.
- Chengming Zhou has optimized zswap's temporary working memory use
in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.
- Matthew Wilcox has performed some maintenance work on the writeback
code, both code and within filesystems. The series is 'Clean up the
writeback paths'.
- Andrey Konovalov has optimized KASAN's handling of alloc and free
stack traces for secondary-level allocators, in the series 'kasan:
save mempool stack traces'.
- Andrey also performed some KASAN maintenance work in the series
'kasan: assorted clean-ups'.
- David Hildenbrand has gone to town on the rmap code. Cleanups, more
pte batching, folio conversions and more. See the series 'mm/rmap:
interface overhaul'.
- Kinsey Ho has contributed some maintenance work on the MGLRU code
in the series 'mm/mglru: Kconfig cleanup'.
- Matthew Wilcox has contributed lruvec page accounting code cleanups
in the series 'Remove some lruvec page accounting functions'"
* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
mm, treewide: introduce NR_PAGE_ORDERS
selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
selftests/mm: skip test if application doesn't has root privileges
selftests/mm: conform test to TAP format output
selftests: mm: hugepage-mmap: conform to TAP format output
selftests/mm: gup_test: conform test to TAP format output
mm/selftests: hugepage-mremap: conform test to TAP format output
mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
mm/memcontrol: remove __mod_lruvec_page_state()
mm/khugepaged: use a folio more in collapse_file()
slub: use a folio in __kmalloc_large_node
slub: use folio APIs in free_large_kmalloc()
slub: use alloc_pages_node() in alloc_slab_page()
mm: remove inc/dec lruvec page state functions
mm: ratelimit stat flush from workingset shrinker
kasan: stop leaking stack trace handles
mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
mm/mglru: add dummy pmd_dirty()
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmWWu9EACgkQu+CwddJF
iJpXvQf/aGL7uEY57VpTm0t4gPwoZ9r2P89HxI/nQs9XgVzDcBmVp/cC0LDvSdcm
t91kJO538KeGjMgvlhLMTEuoShH5FlPs6cOwrGAYUoAGa4NwiOpGvliGky+nNHqY
w887ZgSzVLq0UOuSvn86N6enumMvewt4V+872+OWo6O1HWOJhC0SgHTIa8QPQtwb
yZ9BghO5IqMRXiZEsSIwyO+tQHcaU6l2G5huFXzgMFUhkQqAB9KTFc3h6rYI+i80
L4ppNXo2KNPGTDRb9dA8LNMWgvmfjhCb7chs8o1zSY2PwZlkzOix7EUBLCAIbc/2
EIaFC8AsZjfT47D1t72r8QpHB+C14Q==
=J+E7
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- SLUB: delayed freezing of CPU partial slabs (Chengming Zhou)
Freezing is an operation involving double_cmpxchg() that makes a slab
exclusive for a particular CPU. Chengming noticed that we use it also
in situations where we are not yet installing the slab as the CPU
slab, because freezing also indicates that the slab is not on the
shared list. This results in redundant freeze/unfreeze operation and
can be avoided by marking separately the shared list presence by
reusing the PG_workingset flag.
This approach neatly avoids the issues described in 9b1ea29bc0
("Revert "mm, slub: consider rest of partial list if acquire_slab()
fails"") as we can now grab a slab from the shared list in a quick
and guaranteed way without the cmpxchg_double() operation that
amplifies the lock contention and can fail.
As a result, lkp has reported 34.2% improvement of
stress-ng.rawudp.ops_per_sec
- SLAB removal and SLUB cleanups (Vlastimil Babka)
The SLAB allocator has been deprecated since 6.5 and nobody has
objected so far. We agreed at LSF/MM to wait until the next LTS,
which is 6.6, so we should be good to go now.
This doesn't yet erase all traces of SLAB outside of mm/ so some dead
code, comments or documentation remain, and will be cleaned up
gradually (some series are already in the works).
Removing the choice of allocators has already allowed to simplify and
optimize the code wiring up the kmalloc APIs to the SLUB
implementation.
* tag 'slab-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits)
mm/slub: free KFENCE objects in slab_free_hook()
mm/slub: handle bulk and single object freeing separately
mm/slub: introduce __kmem_cache_free_bulk() without free hooks
mm/slub: fix bulk alloc and free stats
mm/slub: optimize free fast path code layout
mm/slub: optimize alloc fastpath code layout
mm/slub: remove slab_alloc() and __kmem_cache_alloc_lru() wrappers
mm/slab: move kmalloc() functions from slab_common.c to slub.c
mm/slab: move kmalloc_slab() to mm/slab.h
mm/slab: move kfree() from slab_common.c to slub.c
mm/slab: move struct kmem_cache_node from slab.h to slub.c
mm/slab: move memcg related functions from slab.h to slub.c
mm/slab: move pre/post-alloc hooks from slab.h to slub.c
mm/slab: consolidate includes in the internal mm/slab.h
mm/slab: move the rest of slub_def.h to mm/slab.h
mm/slab: move struct kmem_cache_cpu declaration to slub.c
mm/slab: remove mm/slab.c and slab_def.h
mm/mempool/dmapool: remove CONFIG_DEBUG_SLAB ifdefs
mm/slab: remove CONFIG_SLAB code from slab common code
cpu/hotplug: remove CPUHP_SLAB_PREPARE hooks
...
The allocation request for swiotlb contiguous memory greater than
128*2KB cannot be fulfilled because it exceeds the maximum contiguous
memory limit. If the swiotlb memory we allocate is larger than 128*2KB,
swiotlb_find_slots() will still schedule the allocation of a new memory
pool, which will increase memory overhead.
Fix it by adding a check with alloc_size no more than 128*2KB before
scheduling the allocation of a new memory pool in swiotlb_find_slots().
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
- Yafang Shao added task_get_cgroup1() helper to enable a similar BPF helper
so that BPF progs can be more useful on cgroup1 hierarchies. While cgroup1
is mostly in maintenance mode, this addition is very small while having an
outsized usefulness for users who are still on cgroup1. Yafang also
optimized root cgroup list access by making it RCU protected in the
process.
- Waiman Long optimized rstat operation leading to substantially lower and
more consistent lock hold time while flushing the hierarchical statistics.
As the lock can be acquired briefly in various hot paths, this reduction
has cascading benefits.
- Waiman also improved the quality of isolation for cpuset's isolated
partitions. CPUs which are allocated to isolated partitions are now
excluded from running unbound work items and cpu_is_isolated() test which
is used by vmstat and memcg to reduce interference now includes cpuset
isolated CPUs. While it isn't there yet, the hope is eventually reaching
parity with the isolation level provided by the `isolcpus` boot param but
in a dynamic manner.
This involved a couple workqueue patches which were applied directly to
cgroup/for-6.8 rather than ping-ponged through the wq tree. This was
because the wq code change was small and the area is usually very static
and unlikely to cause conflicts. However, luck had it that there was a wq
bug fix in the area during the 6.7 cycle which caused a conflict. The
conflict is contextual but can be a bit confusing to resolve, so there is
one merge from wq/for-6.7-fixes.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZYnuJg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGQ5kAP9nMMWqi+R1HeG7+hWROTVjQZ0OM9KRcpZ1TmjF
FNbkJgEAzt+sPnoWwYDTSI7pkNeZ/IM7x1qkkKGvENNtUXrz0Ac=
=PyYN
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- Yafang Shao added task_get_cgroup1() helper to enable a similar BPF
helper so that BPF progs can be more useful on cgroup1 hierarchies.
While cgroup1 is mostly in maintenance mode, this addition is very
small while having an outsized usefulness for users who are still on
cgroup1. Yafang also optimized root cgroup list access by making it
RCU protected in the process.
- Waiman Long optimized rstat operation leading to substantially lower
and more consistent lock hold time while flushing the hierarchical
statistics. As the lock can be acquired briefly in various hot paths,
this reduction has cascading benefits.
- Waiman also improved the quality of isolation for cpuset's isolated
partitions. CPUs which are allocated to isolated partitions are now
excluded from running unbound work items and cpu_is_isolated() test
which is used by vmstat and memcg to reduce interference now includes
cpuset isolated CPUs. While it isn't there yet, the hope is
eventually reaching parity with the isolation level provided by the
`isolcpus` boot param but in a dynamic manner.
* tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Move rcu_head up near the top of cgroup_root
cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check
cgroup: Avoid false cacheline sharing of read mostly rstat_cpu
cgroup/rstat: Optimize cgroup_rstat_updated_list()
cgroup: Fix documentation for cpu.idle
cgroup/cpuset: Expose cpuset.cpus.isolated
workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS
cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask
cgroup/cpuset: Keep track of CPUs in isolated partitions
selftests/cgroup: Minor code cleanup and reorganization of test_cpuset_prs.sh
workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
selftests: cgroup: Fixes a typo in a comment
cgroup: Add a new helper for cgroup1 hierarchy
cgroup: Add annotation for holding namespace_sem in current_cgns_cgroup_from_root()
cgroup: Eliminate the need for cgroup_mutex in proc_cgroup_show()
cgroup: Make operations on the cgroup root_list RCU safe
cgroup: Remove unnecessary list_empty()
- Energy scheduling:
- Consolidate how the max compute capacity is
used in the scheduler and how we calculate
the frequency for a level of utilization.
- Rework interface between the scheduler and
the schedutil governor
- Simplify the util_est logic
- Deadline scheduler:
- Work more towards reducing SCHED_DEADLINE
starvation of low priority tasks (e.g., SCHED_OTHER)
tasks when higher priority tasks monopolize CPU
cycles, via the introduction of 'deadline servers'
(nested/2-level scheduling).
"Fair servers" to make use of this facility are
not introduced yet.
- EEVDF:
- Introduce O(1) fastpath for EEVDF task selection
- NUMA balancing:
- Tune the NUMA-balancing vma scanning logic some more,
to better distribute the probability
of a particular vma getting scanned.
- Plus misc fixes, cleanups and updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWcASMRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jLbg/+NOwF18M6klF1/3jUaV1PU09vRzYnnA7w
oF7Tru7JLV+/vZK+rwI1zxzj5Nj3sVBQPIyp1embEHx7Z/QH8MIaIVpcSFsDDCYY
Q8n6ZVRB+lKWEo5+Ti6JEJftDAWuLHXwFWDa57oWPuR0Tc736+zYHUfj7jdKk0RI
nT/lnOT6hXU8q26O4QFrBrrhvCCxc4byo7buKPQfqie0bDA70ppIWkFQoQME6mvQ
US9jvOyUipOiPV06DPwFvPDJUQBGq2VdJNk+5zCEtcqEfLREuo/Xq1Ww1x1BWaZI
761532EuDo73iMK4IFZrvVmj1ioz957qbje11MSSkDdKj692xxjXyvnY0NBvZuho
Ueog/jQ4D4I2qu7pPSCF8UfnI/Hw4Q+KJ89j3pcywRm4hmCTf9k3MGpAaVLVxH7G
e5REZ5MSsFZi4Cs+zF87Of5KCKLhTr1qSetNtShinKahg06WZ+MZ8tW4jb52qy0j
F8PMlvfBI3f7SOtA8s2P26mDGQ21YQehN2d5P+Fbwj/U3fjIlSTOyx6NwLpFwYaS
Vf+fctchGFV1Sh7c2JjCh+ecYfXx3ghT/pvyPOImJtxtCKSRUQ8c26ApC1OsWfOE
FdHv4f2dPqcyswCZzIv/2fyDXc9eaS2E05EMDNqVuMCGnzidzSs81n7hBioNMrnH
ZgHK90TmEbw=
=wTVh
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Energy scheduling:
- Consolidate how the max compute capacity is used in the scheduler
and how we calculate the frequency for a level of utilization.
- Rework interface between the scheduler and the schedutil governor
- Simplify the util_est logic
Deadline scheduler:
- Work more towards reducing SCHED_DEADLINE starvation of low
priority tasks (e.g., SCHED_OTHER) tasks when higher priority tasks
monopolize CPU cycles, via the introduction of 'deadline servers'
(nested/2-level scheduling).
"Fair servers" to make use of this facility are not introduced yet.
EEVDF:
- Introduce O(1) fastpath for EEVDF task selection
NUMA balancing:
- Tune the NUMA-balancing vma scanning logic some more, to better
distribute the probability of a particular vma getting scanned.
Plus misc fixes, cleanups and updates"
* tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
sched/fair: Fix tg->load when offlining a CPU
sched/fair: Remove unused 'next_buddy_marked' local variable in check_preempt_wakeup_fair()
sched/fair: Use all little CPUs for CPU-bound workloads
sched/fair: Simplify util_est
sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)
arm64/amu: Use capacity_ref_freq() to set AMU ratio
cpufreq/cppc: Set the frequency used for computing the capacity
cpufreq/cppc: Move and rename cppc_cpufreq_{perf_to_khz|khz_to_perf}()
energy_model: Use a fixed reference frequency
cpufreq/schedutil: Use a fixed reference frequency
cpufreq: Use the fixed and coherent frequency for scaling capacity
sched/topology: Add a new arch_scale_freq_ref() method
freezer,sched: Clean saved_state when restoring it during thaw
sched/fair: Update min_vruntime for reweight_entity() correctly
sched/doc: Update documentation after renames and synchronize Chinese version
sched/cpufreq: Rework iowait boost
sched/cpufreq: Rework schedutil governor performance estimation
sched/pelt: Avoid underestimation of task utilization
sched/timers: Explain why idle task schedules out on remote timer enqueue
sched/cpuidle: Comment about timers requirements VS idle handler
...
- Add branch stack counters ABI extension to better capture
the growing amount of information the PMU exposes via
branch stack sampling. There's matching tooling support.
- Fix race when creating the nr_addr_filters sysfs file
- Add Intel Sierra Forest and Grand Ridge intel/cstate
PMU support.
- Add Intel Granite Rapids, Sierra Forest and Grand Ridge
uncore PMU support.
- Misc cleanups & fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWb4lURHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jlnQ/+NSzrPQ9hEiS5a1iMMxdwC6IoXCmeFVsv
s5NsGaVC7FEgjm3oCfvQlP63HolMO9R7TNLZsgINzOda5IHtE7WUcgBK7gbZr+NT
WabdTyFrdmUr+Br0rLrEe0bxDSQU7r41ptqKE5HZRM9/3SbLhWgaXSJbfFAG2JV0
xboZ/2qzb7Puch6VTWv1YhuIpr1Pi817As4SOo7JR4V8jBB2bh2eZ7XBN1z23aw2
xuglbYml5gs4dOaFTqkRLWyn2PmrZ9wYKcdp63FVUscZ4LxvSw749BxEcNpTbxLp
PT6uXIKw9PnStNfscfrsk6fDocVJzqrOK71blgiOKbmhWTE0UimEpFf1Hd3ooewg
hFp3hmkE5Bc2MTUnwivkBxj96fz5rXH+3+Cue/5NsvDNlhlkswIIxzDw8M1G4rOI
KQMDUYFOhQPa3Hi1lSp2SgHI5AcYHudepr/Z3QMxD3iLs+Wo2cmDcp8d2VrMLfb7
GHSITG592iYcZPYsJosxby8CSFaUPxIl9l3AODQwWuEjd4PcOYa6iB2HbEa/mC3R
wXcs8mFIMAaH/HRYUlqUDA5pOqN5chb13iDtS4JqJqBKyWgdrDLCVxoZSQvB64+I
bldyy1e5oQSVVwJ42WLkUK3Eld2x75ki1JLZFwMgYuOgQv3jfu2VNenUWJ5ig0La
dPpHP8PwOoc=
=2O/5
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
- Add branch stack counters ABI extension to better capture the growing
amount of information the PMU exposes via branch stack sampling.
There's matching tooling support.
- Fix race when creating the nr_addr_filters sysfs file
- Add Intel Sierra Forest and Grand Ridge intel/cstate PMU support
- Add Intel Granite Rapids, Sierra Forest and Grand Ridge uncore PMU
support
- Misc cleanups & fixes
* tag 'perf-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Factor out topology_gidnid_map()
perf/x86/intel/uncore: Fix NULL pointer dereference issue in upi_fill_topology()
perf/x86/amd: Reject branch stack for IBS events
perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge
perf/x86/intel/uncore: Support IIO free-running counters on GNR
perf/x86/intel/uncore: Support Granite Rapids
perf/x86/uncore: Use u64 to replace unsigned for the uncore offsets array
perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO format of SPR
perf: Fix the nr_addr_filters fix
perf/x86/intel/cstate: Add Grand Ridge support
perf/x86/intel/cstate: Add Sierra Forest support
x86/smp: Export symbol cpu_clustergroup_mask()
perf/x86/intel/cstate: Cleanup duplicate attr_groups
perf/core: Fix narrow startup race when creating the perf nr_addr_filters sysfs file
perf/x86/intel: Support branch counters logging
perf/x86/intel: Reorganize attrs and is_visible
perf: Add branch_sample_call_stack
perf/x86: Add PERF_X86_EVENT_NEEDS_BRANCH_STACK flag
perf: Add branch stack counters
- Various preparatory cleanups & enhancements of the timer-wheel code,
in preparation for the WIP 'pull timers at expiry' timer migration model
series (which will replace the current 'push timers at enqueue' migration
model), by Anna-Maria Behnsen:
- Update comments and clean up confusing variable names
- Add debug check to warn about time travel
- Improve/expand timer-wheel tracepoints
- Optimize away unnecessary IPIs for deferrable timers
- Restructure & clean up next_expiry_recalc()
- Clean up forward_timer_base()
- Introduce __forward_timer_base() and use it to simplify
and micro-optimize get_next_timer_interrupt()
- Restructure the get_next_timer_interrupt()'s idle logic
for better readability and to enable a minor optimization.
- Fix the nextevt calculation when no timers are pending
- Fix the sysfs_get_uname() prototype declaration
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWb0XIRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1h9kg/9FpjbiogIKrDXb/pJHyhYkK6jzN4aNrQo
wsOz4FDKyvioqLfr5ndpFE++DwsyzUibPfHJzfwD5IilTyolm2jW44VSCBzNdm72
lI6NGIcIxmIeCuO4bLmJj/fuQAugQ+ajmA2pyC/aBSO4Q2jtnxjYMGiV9zMWmOsa
E816CK5zp6IVx+w0GWwK5yW5YR5dscSQCD+mBYVAdTWYoRNTy6xonsMTRuNi0ePx
donetpu0eWG9NGwUdax/65oKVLZMR/rKAI/3pInhkOS9BsL2o8Rt4o2Y+9aBFi2t
2m+ZbFg5hngJwhP8Mfc7A+I3qiWgCOMGNGrebyzlwb+0PnNBPzrwnNPveW3R9QRx
LMxSU3aH66bXeX+YCF4y2tjWSmYooAnztPstUGrs+sq36+NF0wyY6ip/36S6MRGk
zjedqWnrHQeeZlzOLiKNzB+FIBnOt6bhZEh1Wk1/zwi9UWxw+7+I6tR0b57NqRxZ
VHq38fp+O2OEAj5JvwJ6FomOd+onqQ2wwveG5OMCa+hwM2ZCuVXQRYgM2ohMfwl3
BMSd3KMZsBiHT0zyun3k/uJ7CaIjArPh016baSS10ArSl9sE64aJj7ELtuSLqtaD
idJFXu3tv6VgDT2rMhLWNHvzQoK+gb8+/qnms4Ea+wY2f7nubi0aH20qHfugkgis
4KOkw9cQw0U=
=n40J
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer subsystem updates from Ingo Molnar:
- Various preparatory cleanups & enhancements of the timer-wheel code,
in preparation for the WIP 'pull timers at expiry' timer migration
model series (which will replace the current 'push timers at enqueue'
migration model), by Anna-Maria Behnsen:
- Update comments and clean up confusing variable names
- Add debug check to warn about time travel
- Improve/expand timer-wheel tracepoints
- Optimize away unnecessary IPIs for deferrable timers
- Restructure & clean up next_expiry_recalc()
- Clean up forward_timer_base()
- Introduce __forward_timer_base() and use it to simplify and
micro-optimize get_next_timer_interrupt()
- Restructure the get_next_timer_interrupt()'s idle logic for better
readability and to enable a minor optimization.
- Fix the nextevt calculation when no timers are pending
- Fix the sysfs_get_uname() prototype declaration
* tag 'timers-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix nextevt calculation when no timers are pending
timers: Rework idle logic
timers: Use already existing function for forwarding timer base
timers: Split out forward timer base functionality
timers: Clarify check in forward_timer_base()
timers: Move store of next event into __next_timer_interrupt()
timers: Do not IPI for deferrable timers
tracing/timers: Add tracepoint for tracking timer base is_idle flag
tracing/timers: Enhance timer_start tracepoint
tick-sched: Warn when next tick seems to be in the past
tick/sched: Cleanup confusing variables
tick-sched: Fix function names in comments
time: Make sysfs_get_uname() function visible in header
and always-inline them, to improve syscall entry performance on s390 by ~11%.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWbxQ4RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1h5Cw//TEIlWPCLpIeiDsOCKb5g2e4U+AatNIGt
ysmCvTWsKOiBEItDbZpwdpcdv/Ed41UXkS7Zmwetw81P50rz/i+kIJZW4gdl9GiV
qhjj0gbhGQ43myQkGdYIcmdVaHl9fuyDGZSai6c17zgdOoL5CvCGGiL5Dn4Cn36x
skm8P66r9DuM9cLTnhqQHMKp7cf4HQAX+awhFeppCquhzh3M2I8GsUVrT7tZV+Jw
zOMLVjsI8Va4JyGsl07DoqWlyFWcoYvJ5ayzvDCaBxgeFIK9uZgwkKV0HT9q5tvg
RhsHQK4zbxgkaMMCgEt/WdT14YesO2+5+ml91Zkjp2NMud0O0gmd2YXZju1aOQQw
XCL3pm6DB4oN+IkW9lo6k3rqo9PEip9rt/FAfkNLeb50elHfSZSvE1ZxXSQwx5N4
pHDNMcK6SMsJhEdJInNotViKrpXX0Rjr7x1pY/2DA9bMP/jX/9+J3ODuGCDZrvjp
eq4JM15VSq6tVmg+LMcszThWz+9gIaLFAqQwFt3G082ANDkOvg0mK7T65gccDuyA
Gl6f/p3tAYHYxOI9KOBN6Daq3QAqlMT+M4YgNbbv8fanWYIzRd3U/Y+YrUCnryu8
Db+8FHlUkbJb/clUofJ5nj0Ene7xReJX/8m8XxA95Cc/UGYU8w0cachXDoPyKZUP
xtFW3xn8K4M=
=VHW9
-----END PGP SIGNATURE-----
Merge tag 'core-entry-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull generic syscall updates from Ingo Molnar:
"Move various entry functions from kernel/entry/common.c to a header
file, and always-inline them, to improve syscall entry performance
on s390 by ~11%"
* tag 'core-entry-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Move syscall_enter_from_user_mode() to header file
entry: Move enter_from_user_mode() to header file
entry: Move exit to usermode functions to header file
- lock guards:
- Use lock guards in the ptrace code
- Introduce conditional guards to extend to conditional lock
primitives like mutex_trylock()/mutex_lock_interruptible()/etc.
- lockdep:
- Optimize 'struct lock_class' to be smaller
- Update file patterns in MAINTAINERS
- mutexes: Document mutex lifetime rules a bit more
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWbur0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1i19xAAtAZs8Xi5X3IuLoygfMg81k91GvavLpv/
sVYGbKLm6+r0dOr4w3w76jVyWXNrVtTUFJysfK6nTNInEO+P0stL6aDmQVDYIaCw
jaQDO/UgPfxUxnBgMy7dUvsnmxBw4GO3zcQpx8GuAuuuEtNcQcnMoP2t/RvBpBEI
K2xCCkIT5LQPKbu9LkVZ/fAhZHtMypipuIMtEpfVYEKCMEwDmXoHuj3SNo4LGt04
0wZ5hHVhTcOQDm1/tjSXKsmxwQRVhI6OCcjXJ8hxDiPXg9vWO0+CzOibujl/jfhs
Dw45D7wwiCHFUsmKHKz335Jtk8wBpgWUtlxZ+GB/TVfAQkLv1xBdoE1F8yLcBEfx
yBNxh+0wecPWSrIsRLZEotRRu7obmBsIW04qUVP3oLWIu3tzhcud9gR43CeeplkT
RW1lkdLt5SzN//MLx1cWPqKjfi6wiolaveD1RrqIbVXRhhrFH3o7+EDEOqGNwOvu
0D9yx4Su7SYlAYXgGnGLnnmcmDt2cEoOXkY3K608sYW45dZOpNIu56+EfHiVH1fI
Q0lZNHXNDkX85Zzxoam6Y0SYo74lXYBmtL+RSYvvaKjRYgrJGxGcOciK+/c8kmY2
+Eazx5sxoR5nKDMP8MrZCZ5CQtqdPB9IYk7hUvQ31BYL1LSKhA5Mi/xjsFwMSNwv
pEBazHxty58=
=fOTd
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molar:
"Lock guards:
- Use lock guards in the ptrace code
- Introduce conditional guards to extend to conditional lock
primitives like mutex_trylock()/mutex_lock_interruptible()/etc.
lockdep:
- Optimize 'struct lock_class' to be smaller
- Update file patterns in MAINTAINERS
mutexes:
- Document mutex lifetime rules a bit more"
* tag 'locking-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/mutex: Clarify that mutex_unlock(), and most other sleeping locks, can still use the lock object after it's unlocked
locking/mutex: Document that mutex_unlock() is non-atomic
ptrace: Convert ptrace_attach() to use lock guards
locking/lockdep: Slightly reorder 'struct lock_class' to save some memory
MAINTAINERS: Add include/linux/lockdep*.h
cleanup: Add conditional guard support
commit 23baf831a3 ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive. This has caused
issues with code that was not yet upstream and depended on the previous
definition.
To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.
Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
NR_PAGE_ORDERS defines the number of page orders supported by the page
allocator, ranging from 0 to MAX_ORDER, MAX_ORDER + 1 in total.
NR_PAGE_ORDERS assists in defining arrays of page orders and allows for
more natural iteration over them.
[kirill.shutemov@linux.intel.com: fixup for kerneldoc warning]
Link: https://lkml.kernel.org/r/20240101111512.7empzyifq7kxtzk3@box
Link: https://lkml.kernel.org/r/20231228144704.14033-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzBQAKCRCRxhvAZXjc
ot+3AQCZw1PBD4azVxFMWH76qwlAGoVIFug4+ogKU/iUa4VLygEA2FJh1vLJw5iI
LpgBEIUTPVkwtzinAW94iJJo1Vr7NAI=
=p6PB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs iov_iter cleanups from Christian Brauner:
"This contains a minor cleanup. The patches drop an unused argument
from import_single_range() allowing to replace import_single_range()
with import_ubuf() and dropping import_single_range() completely"
* tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iov_iter: replace import_single_range() with import_ubuf()
iov_iter: remove unused 'iov' argument from import_single_range()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc
ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY
KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc=
=0bRl
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Add Jan Kara as VFS reviewer
- Show correct device and inode numbers in proc/<pid>/maps for vma
files on stacked filesystems. This is now easily doable thanks to
the backing file work from the last cycles. This comes with
selftests
Cleanups:
- Remove a redundant might_sleep() from wait_on_inode()
- Initialize pointer with NULL, not 0
- Clarify comment on access_override_creds()
- Rework and simplify eventfd_signal() and eventfd_signal_mask()
helpers
- Process aio completions in batches to avoid needless wakeups
- Completely decouple struct mnt_idmap from namespaces. We now only
keep the actual idmapping around and don't stash references to
namespaces
- Reformat maintainer entries to indicate that a given subsystem
belongs to fs/
- Simplify fput() for files that were never opened
- Get rid of various pointless file helpers
- Rename various file helpers
- Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from
last cycle
- Make relatime_need_update() return bool
- Use GFP_KERNEL instead of GFP_USER when allocating superblocks
- Replace deprecated ida_simple_*() calls with their current ida_*()
counterparts
Fixes:
- Fix comments on user namespace id mapping helpers. They aren't
kernel doc comments so they shouldn't be using /**
- s/Retuns/Returns/g in various places
- Add missing parameter documentation on can_move_mount_beneath()
- Rename i_mapping->private_data to i_mapping->i_private_data
- Fix a false-positive lockdep warning in pipe_write() for watch
queues
- Improve __fget_files_rcu() code generation to improve performance
- Only notify writer that pipe resizing has finished after setting
pipe->max_usage otherwise writers are never notified that the pipe
has been resized and hang
- Fix some kernel docs in hfsplus
- s/passs/pass/g in various places
- Fix kernel docs in ntfs
- Fix kcalloc() arguments order reported by gcc 14
- Fix uninitialized value in reiserfs"
* tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
reiserfs: fix uninit-value in comp_keys
watch_queue: fix kcalloc() arguments order
ntfs: dir.c: fix kernel-doc function parameter warnings
fs: fix doc comment typo fs tree wide
selftests/overlayfs: verify device and inode numbers in /proc/pid/maps
fs/proc: show correct device and inode numbers in /proc/pid/maps
eventfd: Remove usage of the deprecated ida_simple_xx() API
fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation
fs/hfsplus: wrapper.c: fix kernel-doc warnings
fs: add Jan Kara as reviewer
fs/inode: Make relatime_need_update return bool
pipe: wakeup wr_wait after setting max_usage
file: remove __receive_fd()
file: stop exposing receive_fd_user()
fs: replace f_rcuhead with f_task_work
file: remove pointless wrapper
file: s/close_fd_get_file()/file_close_fd()/g
Improve __fget_files_rcu() code generation (and thus __fget_light())
file: massage cleanup of files that failed to open
fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()
...
The parse_actions() function uses 'len = str_has_prefix()' to test which
action is in the string being parsed. But then it goes and repeats the
logic for each different action. This logic can be simplified and
duplicate code can be removed as 'len' contains the length of the found
prefix which should be used for all actions.
Link: https://lore.kernel.org/all/20240107112044.6702cb66@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240107203258.37e26d2b@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Merge system-wide power management updates for 6.8-rc1:
- Fix possible deadlocks in the core system-wide PM code that occur if
device-handling functions cannot be executed asynchronously during
resune from system-wide suspend (Rafael J. Wysocki).
- Clean up unnecessary local variable initializations in multiple
places in the hibernation code (Wang chaodong, Li zeming).
- Adjust core hibernation code to avoid missing wakeup events that
occur after saving an image to persistent storage (Chris Feng).
- Update hibernation code to enforce correct ordering during image
compression and decompression (Hongchen Zhang).
- Use kmap_local_page() instead of kmap_atomic() in copy_data_page()
during hibernation and restore (Chen Haonan).
- Adjust documentation and code comments to reflect recent task freezer
changes (Kevin Hao).
- Repair excess function parameter description warning in the
hibernation image-saving code (Randy Dunlap).
* pm-sleep:
PM: sleep: Fix possible deadlocks in core system-wide PM code
async: Introduce async_schedule_dev_nocall()
async: Split async_schedule_node_domain()
PM: hibernate: Repair excess function parameter description warning
PM: sleep: Remove obsolete comment from unlock_system_sleep()
Documentation: PM: Adjust freezing-of-tasks.rst to the freezer changes
PM: hibernate: Use kmap_local_page() in copy_data_page()
PM: hibernate: Enforce ordering during image compression/decompression
PM: hibernate: Avoid missing wakeup events during hibernation
PM: hibernate: Do not initialize error in snapshot_write_next()
PM: hibernate: Do not initialize error in swap_write_page()
PM: hibernate: Drop unnecessary local variable initialization
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZZgrfgAKCRDbK58LschI
g87JAQDu+oUG3aWnRJi+lJTK8vGnKRuBwUxgnI5Ze99N0tuPmAEAz1gpXLYP+fKE
eqRhZGGhujdHC9if3Le+nG6nvf8Gvw0=
=KPkZ
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-01-05
We've added 40 non-merge commits during the last 2 day(s) which contain
a total of 73 files changed, 1526 insertions(+), 951 deletions(-).
The main changes are:
1) Fix a memory leak when streaming AF_UNIX sockets were inserted
into multiple sockmap slots/maps, from John Fastabend.
2) Fix gotol in s390 BPF JIT with large offsets, from Ilya Leoshkevich.
3) Fix reattachment branch in bpf_tracing_prog_attach() and reject
the request if there is no valid attach_btf, from Jiri Olsa.
4) Remove deprecated bpfilter kernel leftovers given the project
is developed in user space (https://github.com/facebook/bpfilter),
from Quentin Deslandes.
5) Relax tracing BPF program recursive attach rules given right now
it is not possible to create tracing program call cycles,
from Dmitrii Dolgov.
6) Fix excessive memory consumption for the bpf_global_percpu_ma
for systems with a large number of CPUs, from Yonghong Song.
7) Small x86 BPF JIT cleanup to reuse emit_nops instead of open-coding
memcpy of x86_nops, from Leon Hwang.
8) Follow-up for libbpf to support __arg_ctx global function argument tag
semantics to complement the merged kernel side, from Andrii Nakryiko.
9) Introduce "volatile compare" macros for BPF selftests in order
to make the latter more robust against compiler optimization,
from Alexei Starovoitov.
10) Small simplification in verifier's size checking of helper accesses
along with additional selftests, from Andrei Matei.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (40 commits)
selftests/bpf: Test re-attachment fix for bpf_tracing_prog_attach
bpf: Fix re-attachment branch in bpf_tracing_prog_attach
selftests/bpf: Add test for recursive attachment of tracing progs
bpf: Relax tracing prog recursive attach rules
bpf, x86: Use emit_nops to replace memcpy x86_nops
selftests/bpf: Test gotol with large offsets
selftests/bpf: Double the size of test_loader log
s390/bpf: Fix gotol with large offsets
bpfilter: remove bpfilter
bpf: Remove unnecessary cpu == 0 check in memalloc
selftests/bpf: add __arg_ctx BTF rewrite test
selftests/bpf: add arg:ctx cases to test_global_funcs tests
libbpf: implement __arg_ctx fallback logic
libbpf: move BTF loading step after relocation step
libbpf: move exception callbacks assignment logic into relocation step
libbpf: use stable map placeholder FDs
libbpf: don't rely on map->fd as an indicator of map being created
libbpf: use explicit map reuse flag to skip map creation steps
libbpf: make uniform use of btf__fd() accessor inside libbpf
selftests/bpf: Add a selftest with > 512-byte percpu allocation size
...
====================
Link: https://lore.kernel.org/r/20240105170105.21070-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The purpose of crash_exclude_mem_range() is to remove all memory ranges
that overlap with [mstart-mend]. However, the current logic only removes
the first overlapping memory range.
Commit a2e9a95d21 ("kexec: Improve & fix crash_exclude_mem_range() to
handle overlapping ranges") attempted to address this issue, but it did
not fix all error cases.
Let's fix and simplify the logic of crash_exclude_mem_range().
Link: https://lkml.kernel.org/r/20240102144905.110047-4-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add CONFIG_LRU_GEN_WALKS_MMU such that if disabled, the code that
walks page tables to promote pages into the youngest generation will
not be built.
Also improves code readability by adding two helper functions
get_mm_state() and get_next_mm().
Link: https://lkml.kernel.org/r/20231227141205.2200125-3-kinseyho@google.com
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Co-developed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Donet Tom <donettom@linux.vnet.ibm.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The following case can cause a crash due to missing attach_btf:
1) load rawtp program
2) load fentry program with rawtp as target_fd
3) create tracing link for fentry program with target_fd = 0
4) repeat 3
In the end we have:
- prog->aux->dst_trampoline == NULL
- tgt_prog == NULL (because we did not provide target_fd to link_create)
- prog->aux->attach_btf == NULL (the program was loaded with attach_prog_fd=X)
- the program was loaded for tgt_prog but we have no way to find out which one
BUG: kernel NULL pointer dereference, address: 0000000000000058
Call Trace:
<TASK>
? __die+0x20/0x70
? page_fault_oops+0x15b/0x430
? fixup_exception+0x22/0x330
? exc_page_fault+0x6f/0x170
? asm_exc_page_fault+0x22/0x30
? bpf_tracing_prog_attach+0x279/0x560
? btf_obj_id+0x5/0x10
bpf_tracing_prog_attach+0x439/0x560
__sys_bpf+0x1cf4/0x2de0
__x64_sys_bpf+0x1c/0x30
do_syscall_64+0x41/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
Return -EINVAL in this situation.
Fixes: f3a9507554 ("bpf: Allow trampoline re-attach for tracing and lsm programs")
Cc: stable@vger.kernel.org
Signed-off-by: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Link: https://lore.kernel.org/r/20240103190559.14750-4-9erthalion6@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, it's not allowed to attach an fentry/fexit prog to another
one fentry/fexit. At the same time it's not uncommon to see a tracing
program with lots of logic in use, and the attachment limitation
prevents usage of fentry/fexit for performance analysis (e.g. with
"bpftool prog profile" command) in this case. An example could be
falcosecurity libs project that uses tp_btf tracing programs.
Following the corresponding discussion [1], the reason for that is to
avoid tracing progs call cycles without introducing more complex
solutions. But currently it seems impossible to load and attach tracing
programs in a way that will form such a cycle. The limitation is coming
from the fact that attach_prog_fd is specified at the prog load (thus
making it impossible to attach to a program loaded after it in this
way), as well as tracing progs not implementing link_detach.
Replace "no same type" requirement with verification that no more than
one level of attachment nesting is allowed. In this way only one
fentry/fexit program could be attached to another fentry/fexit to cover
profiling use case, and still no cycle could be formed. To implement,
add a new field into bpf_prog_aux to track nested attachment for tracing
programs.
[1]: https://lore.kernel.org/bpf/20191108064039.2041889-16-ast@kernel.org/
Acked-by: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Link: https://lore.kernel.org/r/20240103190559.14750-2-9erthalion6@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The driver core now can handle a const struct bus_type pointer, and the
dma_debug_add_bus() call just passes on the pointer give to it to the
driver core, so make this pointer const as well to allow everyone to use
read-only struct bus_type pointers going forward.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: <iommu@lists.linux.dev>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/2023121941-dejected-nugget-681e@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For percpu data structure allocation with bpf_global_percpu_ma,
the maximum data size is 4K. But for a system with large
number of cpus, bigger data size (e.g., 2K, 4K) might consume
a lot of memory. For example, the percpu memory consumption
with unit size 2K and 1024 cpus will be 2K * 1K * 1k = 2GB
memory.
We should discourage such usage. Let us limit the maximum data
size to be 512 for bpf_global_percpu_ma allocation.
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031801.1290841-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, refill low/high marks are set with the assumption
of normal non-percpu memory allocation. For example, for
an allocation size 256, for non-percpu memory allocation,
low mark is 32 and high mark is 96, resulting in the
batch allocation of 48 elements and the allocated memory
will be 48 * 256 = 12KB for this particular cpu.
Assuming an 128-cpu system, the total memory consumption
across all cpus will be 12K * 128 = 1.5MB memory.
This might be okay for non-percpu allocation, but may not be
good for percpu allocation, which will consume 1.5MB * 128 = 192MB
memory in the worst case if every cpu has a chance of memory
allocation.
In practice, percpu allocation is very rare compared to
non-percpu allocation. So let us have smaller low/high marks
which can avoid unnecessary memory consumption.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231222031755.1289671-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Typically for percpu map element or data structure, once allocated,
most operations are lookup or in-place update. Deletion are really
rare. Currently, for percpu data strcture, 4 elements will be
refilled if the size is <= 256. Let us just do with one element
for percpu data. For example, for size 256 and 128 cpus, the
potential saving will be 3 * 256 * 128 * 128 = 12MB.
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031750.1289290-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 41a5db8d81 ("Add support for non-fix-size percpu mem allocation")
added support for non-fix-size percpu memory allocation.
Such allocation will allocate percpu memory for all buckets on all
cpus and the memory consumption is in the order to quadratic.
For example, let us say, 4 cpus, unit size 16 bytes, so each
cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes.
Then let us say, 8 cpus with the same unit size, each cpu
has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes.
So if the number of cpus doubles, the number of memory consumption
will be 4 times. So for a system with large number of cpus, the
memory consumption goes up quickly with quadratic order.
For example, for 4KB percpu allocation, 128 cpus. The total memory
consumption will 4KB * 128 * 128 = 64MB. Things will become
worse if the number of cpus is bigger (e.g., 512, 1024, etc.)
In Commit 41a5db8d81, the non-fix-size percpu memory allocation is
done in boot time, so for system with large number of cpus, the initial
percpu memory consumption is very visible. For example, for 128 cpu
system, the total percpu memory allocation will be at least
(16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096)
* 128 * 128 = ~138MB.
which is pretty big. It will be even bigger for larger number of cpus.
Note that the current prefill also allocates 4 entries if the unit size
is less than 256. So on top of 138MB memory consumption, this will
add more consumption with
3 * (16 + 32 + 64 + 96 + 128 + 196 + 256) * 128 * 128 = ~38MB.
Next patch will try to reduce this memory consumption.
Later on, Commit 1fda5bb66a ("bpf: Do not allocate percpu memory
at init stage") moved the non-fix-size percpu memory allocation
to bpf verificaiton stage. Once a particular bpf_percpu_obj_new()
is called by bpf program, the memory allocator will try to fill in
the cache with all sizes, causing the same amount of percpu memory
consumption as in the boot stage.
To reduce the initial percpu memory consumption for non-fix-size
percpu memory allocation, instead of filling the cache with all
supported allocation sizes, this patch intends to fill the cache
only for the requested size. As typically users will not use large
percpu data structure, this can save memory significantly.
For example, the allocation size is 64 bytes with 128 cpus.
Then total percpu memory amount will be 64 * 128 * 128 = 1MB,
much less than previous 138MB.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231222031745.1289082-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The objcg is a bpf_mem_alloc level property since all bpf_mem_cache's
are with the same objcg. This patch made such a property explicit.
The next patch will use this property to save and restore objcg
for percpu unit allocator.
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031739.1288590-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, for percpu memory allocation, say if the user
requests allocation size to be 32 bytes, the actually
calculated size will be 40 bytes and it further rounds
to 64 bytes, and eventually 64 bytes are allocated,
wasting 32-byte memory.
Change bpf_mem_alloc() to calculate the cache index
based on the user-provided allocation size so unnecessary
extra memory can be avoided.
Suggested-by: Hou Tao <houtao1@huawei.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231222031734.1288400-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch simplifies the verification of size arguments associated to
pointer arguments to helpers and kfuncs. Many helpers take a pointer
argument followed by the size of the memory access performed to be
performed through that pointer. Before this patch, the handling of the
size argument in check_mem_size_reg() was confusing and wasteful: if the
size register's lower bound was 0, then the verification was done twice:
once considering the size of the access to be the lower-bound of the
respective argument, and once considering the upper bound (even if the
two are the same). The upper bound checking is a super-set of the
lower-bound checking(*), except: the only point of the lower-bound check
is to handle the case where zero-sized-accesses are explicitly not
allowed and the lower-bound is zero. This static condition is now
checked explicitly, replacing a much more complex, expensive and
confusing verification call to check_helper_mem_access().
Error messages change in this patch. Before, messages about illegal
zero-size accesses depended on the type of the pointer and on other
conditions, and sometimes the message was plain wrong: in some tests
that changed you'll see that the old message was something like "R1 min
value is outside of the allowed memory range", where R1 is the pointer
register; the error was wrongly claiming that the pointer was bad
instead of the size being bad. Other times the information that the size
came for a register with a possible range of values was wrong, and the
error presented the size as a fixed zero. Now the errors refer to the
right register. However, the old error messages did contain useful
information about the pointer register which is now lost; recovering
this information was deemed not important enough.
(*) Besides standing to reason that the checks for a bigger size access
are a super-set of the checks for a smaller size access, I have also
mechanically verified this by reading the code for all types of
pointers. I could convince myself that it's true for all but
PTR_TO_BTF_ID (check_ptr_to_btf_access). There, simply looking
line-by-line does not immediately prove what we want. If anyone has any
qualms, let me know.
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231221232225.568730-2-andreimatei1@gmail.com
In preparation for subsequent changes, introduce a specialized variant
of async_schedule_dev() that will not invoke the argument function
synchronously when it cannot be scheduled for asynchronous execution.
The new function, async_schedule_dev_nocall(), will be used for fixing
possible deadlocks in the system-wide power management core code.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> for the series.
Tested-by: Youngmin Nam <youngmin.nam@samsung.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
In preparation for subsequent changes, split async_schedule_node_domain()
in two pieces so as to allow the bottom part of it to be called from a
somewhat different code path.
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Tested-by: Youngmin Nam <youngmin.nam@samsung.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Introduce thermal_zone_device_critical_reboot() to trigger an
emergency reboot.
It is a counterpart of thermal_zone_device_critical() with the
difference that it will force a reboot instead of shutdown.
The motivation for doing this is to allow the thermal subystem
to trigger a reboot when the temperature reaches the critical
temperature.
Signed-off-by: Fabio Estevam <festevam@denx.de>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20231129124330.519423-3-festevam@gmail.com
Add some helper functions to make it easier introducing the support
for thermal reboot.
No functional change.
Signed-off-by: Fabio Estevam <festevam@denx.de>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20231129124330.519423-2-festevam@gmail.com
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZYVEqQAKCRDbK58LschI
gzH6AP9hVXLpHFTWMT0+2GK2lx69VX8zW1C0SmN7WHaxUbPN9QEAwzGnELfKk00P
0IKRHSl5abhVMX7JOM3sSOhCILeKjQg=
=wRLJ
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
bpf-next-for-netdev
The following pull-request contains BPF updates for your *net-next* tree.
We've added 22 non-merge commits during the last 3 day(s) which contain
a total of 23 files changed, 652 insertions(+), 431 deletions(-).
The main changes are:
1) Add verifier support for annotating user's global BPF subprogram arguments
with few commonly requested annotations for a better developer experience,
from Andrii Nakryiko.
These tags are:
- Ability to annotate a special PTR_TO_CTX argument
- Ability to annotate a generic PTR_TO_MEM as non-NULL
2) Support BPF verifier tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the like, from
Menglong Dong.
3) Fix a warning in bpf_mem_cache's check_obj_size() as reported by LKP, from Hou Tao.
4) Re-support uid/gid options when mounting bpffs which had to be reverted with
the prior token series revert to avoid conflicts, from Daniel Borkmann.
5) Fix a libbpf NULL pointer dereference in bpf_object__collect_prog_relos() found
from fuzzing the library with malformed ELF files, from Mingyi Zhang.
6) Skip DWARF sections in libbpf's linker sanity check given compiler options to
generate compressed debug sections can trigger a rejection due to misalignment,
from Alyssa Ross.
7) Fix an unnecessary use of the comma operator in BPF verifier, from Simon Horman.
8) Fix format specifier for unsigned long values in cpustat sample, from Colin Ian King.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- Fix readers that are blocked on the ring buffer when buffer_percent is
100%. They are supposed to wake up when the buffer is full, but
because the sub-buffer that the writer is on is never considered
"dirty" in the calculation, dirty pages will never equal nr_pages.
Add +1 to the dirty count in order to count for the sub-buffer that
the writer is on.
- When a reader is blocked on the "snapshot_raw" file, it is to be
woken up when a snapshot is done and be able to read the snapshot
buffer. But because the snapshot swaps the buffers (the main one
with the snapshot one), and the snapshot reader is waiting on the
old snapshot buffer, it was not woken up (because it is now on
the main buffer after the swap). Worse yet, when it reads the buffer
after a snapshot, it's not reading the snapshot buffer, it's reading
the live active main buffer.
Fix this by forcing a wakeup of all readers on the snapshot buffer when
a new snapshot happens, and then update the buffer that the reader
is reading to be back on the snapshot buffer.
- Fix the modification of the direct_function hash. There was a race
when new functions were added to the direct_function hash as when
it moved function entries from the old hash to the new one, a direct
function trace could be hit and not see its entry.
This is fixed by allocating the new hash, copy all the old entries
onto it as well as the new entries, and then use rcu_assign_pointer()
to update the new direct_function hash with it.
This also fixes a memory leak in that code.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZZAzTxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qs9IAP9e6wZ74aEjMED9nsbC49EpyCNTqa72
y0uDS/p9ppv52gD7Be+l+kJQzYNh6bZU0+B19hNC2QVn38jb7sOadfO/1Q8=
=NDkf
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix readers that are blocked on the ring buffer when buffer_percent
is 100%. They are supposed to wake up when the buffer is full, but
because the sub-buffer that the writer is on is never considered
"dirty" in the calculation, dirty pages will never equal nr_pages.
Add +1 to the dirty count in order to count for the sub-buffer that
the writer is on.
- When a reader is blocked on the "snapshot_raw" file, it is to be
woken up when a snapshot is done and be able to read the snapshot
buffer. But because the snapshot swaps the buffers (the main one with
the snapshot one), and the snapshot reader is waiting on the old
snapshot buffer, it was not woken up (because it is now on the main
buffer after the swap). Worse yet, when it reads the buffer after a
snapshot, it's not reading the snapshot buffer, it's reading the live
active main buffer.
Fix this by forcing a wakeup of all readers on the snapshot buffer
when a new snapshot happens, and then update the buffer that the
reader is reading to be back on the snapshot buffer.
- Fix the modification of the direct_function hash. There was a race
when new functions were added to the direct_function hash as when it
moved function entries from the old hash to the new one, a direct
function trace could be hit and not see its entry.
This is fixed by allocating the new hash, copy all the old entries
onto it as well as the new entries, and then use rcu_assign_pointer()
to update the new direct_function hash with it.
This also fixes a memory leak in that code.
- Fix eventfs ownership
* tag 'trace-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Fix modification of direct_function hash while in use
tracing: Fix blocked reader of snapshot buffer
ring-buffer: Fix wake ups when buffer_percent is set to 100
eventfs: Fix file and directory uid and gid ownership
Directly return NULL or 'next' instead of breaking out of the loop.
Signed-off-by: David Laight <david.laight@aculab.com>
[ Split original patch into two independent parts - Linus ]
Link: https://lore.kernel.org/lkml/7c8828aec72e42eeb841ca0ee3397e9a@AcuMS.aculab.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
osq_wait_next() is passed 'prev' from osq_lock() and NULL from
osq_unlock() but only needs the 'cpu' value to write to lock->tail.
Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
Should have no effect on the generated code since gcc manages to assume
that 'prev != NULL' due to an earlier dereference.
Signed-off-by: David Laight <david.laight@aculab.com>
[ Changed 'old' to 'old_cpu' by request from Waiman Long - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct optimistic_spin_node is private to the implementation.
Move it into the C file to ensure nothing is accessing it.
Signed-off-by: David Laight <david.laight@aculab.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Masami Hiramatsu reported a memory leak in register_ftrace_direct() where
if the number of new entries are added is large enough to cause two
allocations in the loop:
for (i = 0; i < size; i++) {
hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
new = ftrace_add_rec_direct(entry->ip, addr, &free_hash);
if (!new)
goto out_remove;
entry->direct = addr;
}
}
Where ftrace_add_rec_direct() has:
if (ftrace_hash_empty(direct_functions) ||
direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
struct ftrace_hash *new_hash;
int size = ftrace_hash_empty(direct_functions) ? 0 :
direct_functions->count + 1;
if (size < 32)
size = 32;
new_hash = dup_hash(direct_functions, size);
if (!new_hash)
return NULL;
*free_hash = direct_functions;
direct_functions = new_hash;
}
The "*free_hash = direct_functions;" can happen twice, losing the previous
allocation of direct_functions.
But this also exposed a more serious bug.
The modification of direct_functions above is not safe. As
direct_functions can be referenced at any time to find what direct caller
it should call, the time between:
new_hash = dup_hash(direct_functions, size);
and
direct_functions = new_hash;
can have a race with another CPU (or even this one if it gets interrupted),
and the entries being moved to the new hash are not referenced.
That's because the "dup_hash()" is really misnamed and is really a
"move_hash()". It moves the entries from the old hash to the new one.
Now even if that was changed, this code is not proper as direct_functions
should not be updated until the end. That is the best way to handle
function reference changes, and is the way other parts of ftrace handles
this.
The following is done:
1. Change add_hash_entry() to return the entry it created and inserted
into the hash, and not just return success or not.
2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove
the former.
3. Allocate a "new_hash" at the start that is made for holding both the
new hash entries as well as the existing entries in direct_functions.
4. Copy (not move) the direct_function entries over to the new_hash.
5. Copy the entries of the added hash to the new_hash.
6. If everything succeeds, then use rcu_pointer_assign() to update the
direct_functions with the new_hash.
This simplifies the code and fixes both the memory leak as well as the
race condition mentioned above.
Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/
Link: https://lore.kernel.org/linux-trace-kernel/20231229115134.08dd5174@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 763e34e74b ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If, as part of handling a hardlockup or softlockup, we've already dumped
all CPUs and we're just about to panic, don't reenable dumping and give
some other CPU a chance to hop in there and add some confusing logs right
as the panic is happening.
Link: https://lkml.kernel.org/r/20231220131534.4.Id3a9c7ec2d7d83e4080da6f8662ba2226b40543f@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If two CPUs end up reporting a hardlockup at the same time then their logs
could get interleaved which is hard to read.
The interleaving problem was especially bad with the "perf" hardlockup
detector where the locked up CPU is always the same as the running CPU and
we end up in show_regs(). show_regs() has no inherent serialization so we
could mix together two crawls if two hardlockups happened at the same time
(and if we didn't have `sysctl_hardlockup_all_cpu_backtrace` set). With
this change we'll fully serialize hardlockups when using the "perf"
hardlockup detector.
The interleaving problem was less bad with the "buddy" hardlockup
detector. With "buddy" we always end up calling
`trigger_single_cpu_backtrace(cpu)` on some CPU other than the running
one. trigger_single_cpu_backtrace() always at least serializes the
individual stack crawls because it eventually uses
printk_cpu_sync_get_irqsave(). Unfortunately the fact that
trigger_single_cpu_backtrace() eventually calls
printk_cpu_sync_get_irqsave() (on a different CPU) means that we have to
drop the "lock" before calling it and we can't fully serialize all
printouts associated with a given hardlockup. However, we still do get
the advantage of serializing the output of print_modules() and
print_irqtrace_events().
Aside from serializing hardlockups from each other, this change also has
the advantage of serializing hardlockups and softlockups from each other
if they happen to happen at the same time since they are both using the
same "lock".
Even though nobody is expected to hang while holding the lock associated
with printk_cpu_sync_get_irqsave(), out of an abundance of caution, we
don't call printk_cpu_sync_get_irqsave() until after we print out about
the hardlockup. This makes extra sure that, even if
printk_cpu_sync_get_irqsave() somehow never runs we at least print that we
saw the hardlockup. This is different than the choice made for softlockup
because hardlockup is really our last resort.
Link: https://lkml.kernel.org/r/20231220131534.3.I6ff691b3b40f0379bc860f80c6e729a0485b5247@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Instead of introducing a spinlock, use printk_cpu_sync_get_irqsave() and
printk_cpu_sync_put_irqrestore() to serialize softlockup reporting. Alone
this doesn't have any real advantage over the spinlock, but this will
allow us to use the same function in a future change to also serialize
hardlockup crawls.
NOTE: for the most part this serialization is important because we often
end up in the show_regs() path and that has no built-in serialization if
there are multiple callers at once. However, even in the case where we
end up in the dump_stack() path this still has some advantages because the
stack will be guaranteed to be together in the logs with the lockup
message with no interleaving.
NOTE: the fact that printk_cpu_sync_get_irqsave() is allowed to be called
multiple times on the same CPU is important here. Specifically we hold
the "lock" while calling dump_stack() which also gets the same "lock".
This is explicitly documented to be OK and means we don't need to
introduce a variant of dump_stack() that doesn't grab the lock.
Link: https://lkml.kernel.org/r/20231220131534.2.Ia5906525d440d8e8383cde31b7c61c2aadc8f907@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Li Zhe <lizhe.67@bytedance.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "watchdog: Better handling of concurrent lockups".
When we get multiple lockups at roughly the same time, the output in the
kernel logs can be very confusing since the reports about the lockups end
up interleaved in the logs. There is some code in the kernel to try to
handle this but it wasn't that complete.
Li Zhe recently made this a bit better for softlockups (specifically for
the case where `kernel.softlockup_all_cpu_backtrace` is not set) in commit
9d02330abd ("softlockup: serialized softlockup's log"), but that only
handled softlockup reports. Hardlockup reports still had similar issues.
This series also has a small fix to avoid dumping all stacks a second time
in the case of a panic. This is a bit unrelated to the interleaving fixes
but it does also improve the clarity of lockup reports.
This patch (of 4):
The hardlockup detector and softlockup detector both have the ability to
dump the stack of all CPUs (`kernel.hardlockup_all_cpu_backtrace` and
`kernel.softlockup_all_cpu_backtrace`). Both detectors also have some
logic to attempt to avoid interleaving printouts if two CPUs were trying
to do dumps of all CPUs at the same time. However:
- The hardlockup detector's logic still allowed interleaving some
information. Specifically another CPU could print modules and dump
the stack of the locked CPU at the same time we were dumping all
CPUs.
- In the case where `kernel.hardlockup_panic` was set in addition to
`kernel.hardlockup_all_cpu_backtrace`, when two CPUs both detected
hardlockups at the same time the second CPU could call panic() while
the first was still dumping stacks. This was especially bad if the
locked up CPU wasn't responding to the request for a backtrace since
the function nmi_trigger_cpumask_backtrace() can wait up to 10
seconds.
Let's resolve this by adopting the softlockup logic in the hardlockup
handler.
NOTES:
- As part of this, one might think that we should make a helper
function that both the hard and softlockup detectors call. This
turns out not to be super trivial since it would have to be
parameterized quite a bit since there are separate global variables
controlling each lockup detector and they print log messages that
are just different enough that it would be a pain. We probably don't
want to change the messages that are printed without good reason to
avoid throwing log parsers for a loop.
- One might also think that it would be a good idea to have the
hardlockup and softlockup detector use the same global variable to
prevent interleaving. This would make sure that softlockups and
hardlockups can't interleave each other. That _almost_ works but has
a dangerous flaw if `kernel.hardlockup_panic` is not the same as
`kernel.softlockup_panic` because we might skip a call to panic() if
one type of lockup was detected at the same time as another.
Link: https://lkml.kernel.org/r/20231220211640.2023645-1-dianders@chromium.org
Link: https://lkml.kernel.org/r/20231220131534.1.I4f35a69fbb124b5f0c71f75c631e11fabbe188ff@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
image->control_page represents the starting address for allocating the
next control page, while hole_end represents the address of the last valid
byte of the currently allocated control page.
This bug actually does not affect the correctness of allocating control
pages, because image->control_page is currently only used in
kimage_alloc_crash_control_pages(), and this function, when allocating
control pages, will first align image->control_page up to the nearest
`(1 << order) << PAGE_SHIFT` boundary, then use this value as the
starting address of the next control page. This ensures that the newly
allocated control page will use the correct starting address and not
overlap with previously allocated control pages.
Although it does not affect the correctness of the final result, it is
better for us to set image->control_page to the correct value, in case
it might be used elsewhere in the future, potentially causing errors.
Therefore, after successfully allocating a control page,
image->control_page should be updated to `hole_end + 1`, rather than
hole_end.
Link: https://lkml.kernel.org/r/20231221042308.11076-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Change @task to @tsk to prevent kernel-doc warnings:
kernel/stacktrace.c:138: warning: Excess function parameter 'task' description in 'stack_trace_save_tsk'
kernel/stacktrace.c:138: warning: Function parameter or member 'tsk' not described in 'stack_trace_save_tsk'
Link: https://lkml.kernel.org/r/20231220054945.17663-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Documentation/filesystems/relay.rst says to use
return debugfs_create_file(filename, mode, parent, buf,
&relay_file_operations);
and this is the only way relay_file_operations is used.
Thus: debugfs_create_file(&relay_file_operations)
-> __debugfs_create_file(&debugfs_full_proxy_file_operations,
&relay_file_operations)
-> dentry{inode: {i_fop: &debugfs_full_proxy_file_operations},
d_fsdata: &relay_file_operations
| DEBUGFS_FSDATA_IS_REAL_FOPS_BIT}
debugfs_full_proxy_file_operations.open is full_proxy_open, which extracts
the &relay_file_operations from the dentry, and allocates via
__full_proxy_fops_init() new fops, with trivial wrappers around release,
llseek, read, write, poll, and unlocked_ioctl, then replaces the fops on
the opened file therewith.
Naturally, all thusly-created debugfs files have .splice_read = NULL.
This was introduced in commit 49d200deaa ("debugfs: prevent access to
removed files' private data") from 2016-03-22.
AFAICT, relay_file_operations is the only struct file_operations used for
debugfs which defines a .splice_read callback. Hooking it up with
> diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
> index 5063434be0fc..952fcf5b2afa 100644
> --- a/fs/debugfs/file.c
> +++ b/fs/debugfs/file.c
> @@ -328,6 +328,11 @@ FULL_PROXY_FUNC(write, ssize_t, filp,
> loff_t *ppos),
> ARGS(filp, buf, size, ppos));
>
> +FULL_PROXY_FUNC(splice_read, long, in,
> + PROTO(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe,
> + size_t len, unsigned int flags),
> + ARGS(in, ppos, pipe, len, flags));
> +
> FULL_PROXY_FUNC(unlocked_ioctl, long, filp,
> PROTO(struct file *filp, unsigned int cmd, unsigned long arg),
> ARGS(filp, cmd, arg));
> @@ -382,6 +387,8 @@ static void __full_proxy_fops_init(struct file_operations *proxy_fops,
> proxy_fops->write = full_proxy_write;
> if (real_fops->poll)
> proxy_fops->poll = full_proxy_poll;
> + if (real_fops->splice_read)
> + proxy_fops->splice_read = full_proxy_splice_read;
> if (real_fops->unlocked_ioctl)
> proxy_fops->unlocked_ioctl = full_proxy_unlocked_ioctl;
> }
shows it just doesn't work, and splicing always instantly returns empty
(subsequent reads actually return the contents).
No-one noticed it became dead code in 2016, who knows if it worked back
then. Clearly no-one cares; just delete it.
Link: https://lkml.kernel.org/r/dtexwpw6zcdx7dkx3xj5gyjp5syxmyretdcbcdtvrnukd4vvuh@tarta.nabijaczleweli.xyz
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Li kunyu <kunyu@nfschina.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zhang Zhengming <zhang.zhengming@h3c.com>
Cc: Zhao Lei <zhao_lei1@hoperun.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
temp_end represents the address of the last available byte. Therefore,
the starting address of the memory segment with temp_end as its last
available byte and a size of `kbuf->memsz`, that is, the value of
temp_start, should be `temp_end - kbuf->memsz + 1` instead of `temp_end -
kbuf->memsz`.
Additionally, use the ALIGN_DOWN macro instead of open-coding it directly
in locate_mem_hole_top_down() to improve code readability.
Link: https://lkml.kernel.org/r/20231217033528.303333-3-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The end parameter received by kimage_is_destination_range() should be the
last valid byte address of the target memory segment plus 1. However, in
the locate_mem_hole_bottom_up() and locate_mem_hole_top_down() functions,
the corresponding value passed to kimage_is_destination_range() is the
last valid byte address of the target memory segment, which is 1 less.
There are two ways to fix this bug. We can either correct the logic of
the locate_mem_hole_bottom_up() and locate_mem_hole_top_down() functions,
or we can fix kimage_is_destination_range() by making the end parameter
represent the last valid byte address of the target memory segment. Here,
we choose the second approach.
Due to the modification to kimage_is_destination_range(), we also need to
adjust its callers, such as kimage_alloc_normal_control_pages() and
kimage_alloc_page().
Link: https://lkml.kernel.org/r/20231217033528.303333-2-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We already have the folio in these functions, we just need to use it.
folio_add_new_anon_rmap() didn't exist at the time they were converted to
folios.
Link: https://lkml.kernel.org/r/20231211162214.2146080-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If an application blocks on the snapshot or snapshot_raw files, expecting
to be woken up when a snapshot occurs, it will not happen. Or it may
happen with an unexpected result.
That result is that the application will be reading the main buffer
instead of the snapshot buffer. That is because when the snapshot occurs,
the main and snapshot buffers are swapped. But the reader has a descriptor
still pointing to the buffer that it originally connected to.
This is fine for the main buffer readers, as they may be blocked waiting
for a watermark to be hit, and when a snapshot occurs, the data that the
main readers want is now on the snapshot buffer.
But for waiters of the snapshot buffer, they are waiting for an event to
occur that will trigger the snapshot and they can then consume it quickly
to save the snapshot before the next snapshot occurs. But to do this, they
need to read the new snapshot buffer, not the old one that is now
receiving new data.
Also, it does not make sense to have a watermark "buffer_percent" on the
snapshot buffer, as the snapshot buffer is static and does not receive new
data except all at once.
Link: https://lore.kernel.org/linux-trace-kernel/20231228095149.77f5b45d@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: debdd57f51 ("tracing: Make a snapshot feature available from userspace")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The tracefs file "buffer_percent" is to allow user space to set a
water-mark on how much of the tracing ring buffer needs to be filled in
order to wake up a blocked reader.
0 - is to wait until any data is in the buffer
1 - is to wait for 1% of the sub buffers to be filled
50 - would be half of the sub buffers are filled with data
100 - is not to wake the waiter until the ring buffer is completely full
Unfortunately the test for being full was:
dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
return (dirty * 100) > (full * nr_pages);
Where "full" is the value for "buffer_percent".
There is two issues with the above when full == 100.
1. dirty * 100 > 100 * nr_pages will never be true
That is, the above is basically saying that if the user sets
buffer_percent to 100, more pages need to be dirty than exist in the
ring buffer!
2. The page that the writer is on is never considered dirty, as dirty
pages are only those that are full. When the writer goes to a new
sub-buffer, it clears the contents of that sub-buffer.
That is, even if the check was ">=" it would still not be equal as the
most pages that can be considered "dirty" is nr_pages - 1.
To fix this, add one to dirty and use ">=" in the compare.
Link: https://lore.kernel.org/linux-trace-kernel/20231226125902.4a057f1d@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 03329f9939 ("tracing: Add tracefs file buffer_percentage")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When a CPU is taken offline, the contribution of its cfs_rqs to task_groups'
load may remain and will negatively impact the calculation of the share of
the online CPUs.
To fix this bug, clear the contribution of an offlining CPU to task groups'
load and skip its contribution while it is inactive.
Here's the reproducer of the anomaly, by Imran Khan:
"So far I have encountered only one rather lengthy way of reproducing this issue,
which is as follows:
1. Take a KVM guest (booted with 4 CPUs and can be scaled up to 124 CPUs) and
create 2 custom cgroups: /sys/fs/cgroup/cpu/test_group_1 and /sys/fs/cgroup/
cpu/test_group_2
2. Assign a CPU intensive workload to each of these cgroups and start the
workload.
For my tests I am using following app:
int main(int argc, char *argv[])
{
unsigned long count, i, val;
if (argc != 2) {
printf("usage: ./a.out <number of random nums to generate> \n");
return 0;
}
count = strtoul(argv[1], NULL, 10);
printf("Generating %lu random numbers \n", count);
for (i = 0; i < count; i++) {
val = rand();
val = val % 2;
//usleep(1);
}
printf("Generated %lu random numbers \n", count);
return 0;
}
Also since the system is booted with 4 CPUs, in order to completely load the
system I am also launching 4 instances of same test app under:
/sys/fs/cgroup/cpu/
3. We can see that both of the cgroups get similar CPU time:
# systemd-cgtop --depth 1
Path Tasks %CPU Memory Input/s Output/s
/ 659 - 5.5G - -
/system.slice - - 5.7G - -
/test_group_1 4 - - - -
/test_group_2 3 - - - -
/user.slice 31 - 56.5M - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 394.6 5.5G - -
/test_group_2 3 65.7 - - -
/user.slice 29 55.1 48.0M - -
/test_group_1 4 47.3 - - -
/system.slice - 2.2 5.7G - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 394.8 5.5G - -
/test_group_1 4 62.9 - - -
/user.slice 28 44.9 54.2M - -
/test_group_2 3 44.7 - - -
/system.slice - 0.9 5.7G - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 394.4 5.5G - -
/test_group_2 3 58.8 - - -
/test_group_1 4 51.9 - - -
/user.slice 30 39.3 59.6M - -
/system.slice - 1.9 5.7G - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 394.7 5.5G - -
/test_group_1 4 60.9 - - -
/test_group_2 3 57.9 - - -
/user.slice 28 43.5 36.9M - -
/system.slice - 3.0 5.7G - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 395.0 5.5G - -
/test_group_1 4 66.8 - - -
/test_group_2 3 56.3 - - -
/user.slice 29 43.1 51.8M - -
/system.slice - 0.7 5.7G - -
4. Now move systemd-udevd to one of these test groups, say test_group_1, and
perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
host side.
5. Run the same workload i.e 4 instances of CPU hogger under /sys/fs/cgroup/cpu
and one instance of CPU hogger each in /sys/fs/cgroup/cpu/test_group_1 and
/sys/fs/cgroup/test_group_2.
It can be seen that test_group_1 (the one where systemd-udevd was moved) is getting
much less CPU time than the test_group_2, even though at this point of time both of
these groups have only CPU hogger running:
# systemd-cgtop --depth 1
Path Tasks %CPU Memory Input/s Output/s
/ 1219 - 5.4G - -
/system.slice - - 5.6G - -
/test_group_1 4 - - - -
/test_group_2 3 - - - -
/user.slice 26 - 91.3M - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 394.3 5.4G - -
/test_group_2 3 82.7 - - -
/test_group_1 4 14.3 - - -
/system.slice - 0.8 5.6G - -
/user.slice 26 0.4 91.2M - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 394.6 5.4G - -
/test_group_2 3 67.4 - - -
/system.slice - 24.6 5.6G - -
/test_group_1 4 12.5 - - -
/user.slice 26 0.4 91.2M - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.2 5.4G - -
/test_group_2 3 60.9 - - -
/system.slice - 27.9 5.6G - -
/test_group_1 4 12.2 - - -
/user.slice 26 0.4 91.2M - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.2 5.4G - -
/test_group_2 3 69.4 - - -
/test_group_1 4 13.9 - - -
/user.slice 28 1.6 92.0M - -
/system.slice - 1.0 5.6G - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.6 5.4G - -
/test_group_2 3 59.3 - - -
/test_group_1 4 14.1 - - -
/user.slice 28 1.3 92.2M - -
/system.slice - 0.7 5.6G - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.5 5.4G - -
/test_group_2 3 67.2 - - -
/test_group_1 4 11.5 - - -
/user.slice 28 1.3 92.5M - -
/system.slice - 0.6 5.6G - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.1 5.4G - -
/test_group_2 3 76.8 - - -
/test_group_1 4 12.9 - - -
/user.slice 28 1.3 92.8M - -
/system.slice - 1.2 5.6G - -
From sched_debug data it can be seen that in bad case the load.weight of per-CPU
sched entities corresponding to test_group_1 has reduced significantly and
also load_avg of test_group_1 remains much higher than that of test_group_2,
even though systemd-udevd stopped running long time back and at this point of
time both cgroups just have the CPU hogger app as running entity."
[ mingo: Added details from the original discussion, plus minor edits to the patch. ]
Reported-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Imran Khan <imran.f.khan@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lore.kernel.org/r/20231223111545.62135-1-vincent.guittot@linaro.org
are not considered backporting material.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZYys4AAKCRDdBJ7gKXxA
jtmaAQC+o04Ia7IfB8MIqp1p7dNZQo64x/EnGA8YjUnQ8N6IwQD+ImU7dHl9g9Oo
ROiiAbtMRBUfeJRsExX/Yzc1DV9E9QM=
=ZGcs
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-12-27-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"11 hotfixes. 7 are cc:stable and the other 4 address post-6.6 issues
or are not considered backporting material"
* tag 'mm-hotfixes-stable-2023-12-27-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mailmap: add an old address for Naoya Horiguchi
mm/memory-failure: cast index to loff_t before shifting it
mm/memory-failure: check the mapcount of the precise page
mm/memory-failure: pass the folio and the page to collect_procs()
selftests: secretmem: floor the memory size to the multiple of page_size
mm: migrate high-order folios in swap cache correctly
maple_tree: do not preallocate nodes for slot stores
mm/filemap: avoid buffered read/write race to read inconsistent data
kunit: kasan_test: disable fortify string checker on kmalloc_oob_memset
kexec: select CRYPTO from KEXEC_FILE instead of depending on it
kexec: fix KEXEC_FILE dependencies
by moving cond_resched_rcu() to rcupdate_wait.h, we can kill another big
sched.h dependency.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're trying to get sched.h down to more or less just types only, not
code - rseq can live in its own header.
This helps us kill the dependency on preempt.h in sched.h.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Running N CPU-bound tasks on an N CPUs platform:
- with asymmetric CPU capacity
- not being a DynamIq system (i.e. having a PKG level sched domain
without the SD_SHARE_PKG_RESOURCES flag set)
.. might result in a task placement where two tasks run on a big CPU
and none on a little CPU. This placement could be more optimal by
using all CPUs.
Testing platform:
Juno-r2:
- 2 big CPUs (1-2), maximum capacity of 1024
- 4 little CPUs (0,3-5), maximum capacity of 383
Testing workload ([1]):
Spawn 6 CPU-bound tasks. During the first 100ms (step 1), each tasks
is affine to a CPU, except for:
- one little CPU which is left idle.
- one big CPU which has 2 tasks affine.
After the 100ms (step 2), remove the cpumask affinity.
Behavior before the patch:
During step 2, the load balancer running from the idle CPU tags sched
domains as:
- little CPUs: 'group_has_spare'. Cf. group_has_capacity() and
group_is_overloaded(), 3 CPU-bound tasks run on a 4 CPUs
sched-domain, and the idle CPU provides enough spare capacity
regarding the imbalance_pct
- big CPUs: 'group_overloaded'. Indeed, 3 tasks run on a 2 CPUs
sched-domain, so the following path is used:
group_is_overloaded()
\-if (sgs->sum_nr_running <= sgs->group_weight) return true;
The following path which would change the migration type to
'migrate_task' is not taken:
calculate_imbalance()
\-if (env->idle != CPU_NOT_IDLE && env->imbalance == 0)
as the local group has some spare capacity, so the imbalance
is not 0.
The migration type requested is 'migrate_util' and the busiest
runqueue is the big CPU's runqueue having 2 tasks (each having a
utilization of 512). The idle little CPU cannot pull one of these
task as its capacity is too small for the task. The following path
is used:
detach_tasks()
\-case migrate_util:
\-if (util > env->imbalance) goto next;
After the patch:
As the number of failed balancing attempts grows (with
'nr_balance_failed'), progressively make it easier to migrate
a big task to the idling little CPU. A similar mechanism is
used for the 'migrate_load' migration type.
Improvement:
Running the testing workload [1] with the step 2 representing
a ~10s load for a big CPU:
Before patch: ~19.3s
After patch: ~18s (-6.7%)
Similar issue reported at:
https://lore.kernel.org/lkml/20230716014125.139577-1-qyousef@layalina.io/
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Qais Yousef <qyousef@layalina.io>
Link: https://lore.kernel.org/r/20231206090043.634697-1-pierre.gondois@arm.com
With UTIL_EST_FASTUP now being permanent, we can take advantage of the
fact that the ewma jumps directly to a higher utilization at dequeue to
simplify util_est and remove the enqueued field.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/20231201161652.1241695-3-vincent.guittot@linaro.org
sched_feat(UTIL_EST_FASTUP) has been added to easily disable the feature
in order to check for possibly related regressions. After 3 years, it has
never been used and no regression has been reported. Let's remove it
and make fast increase a permanent behavior.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Yanteng Si <siyanteng@loongson.cn> [for the Chinese translation]
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/20231201161652.1241695-2-vincent.guittot@linaro.org
cpuinfo.max_freq can change at runtime because of boost as an example. This
implies that the value could be different than the one that has been
used when computing the capacity of a CPU.
The new arch_scale_freq_ref() returns a fixed and coherent reference
frequency that can be used when computing a frequency based on utilization.
Use this arch_scale_freq_ref() when available and fallback to
policy otherwise.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/r/20231211104855.558096-4-vincent.guittot@linaro.org
Although it does not seem to have any untoward side-effects, the use
of ';' to separate to assignments seems more appropriate than ','.
Flagged by clang-17 -Wcomma
No functional change intended. Compile tested only.
Signed-off-by: Simon Horman <horms@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/bpf/20231221-bpf-verifier-comma-v1-1-cde2530912e9@kernel.org
- Fix another kerneldoc warning
- Fix eventfs files to inherit the ownership of its parent directory.
The dynamic creating of dentries in eventfs did not take into
account if the tracefs file system was mounted with a gid/uid,
and would still default to the gid/uid of root. This is a regression.
- Fix warning when synthetic event testing is enabled along with
startup event tracing testing is enabled
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZYRYjhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qs0aAQCXWcBeDEWsi8VxAOBU5Q6isvXn2koM
+xSX6LJPh6hFVAD+Pc3oLgvyE5IyqNUM9RYtpwPVMhpAsyE9FIz3TWarEww=
=LY0i
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.7-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix another kerneldoc warning
- Fix eventfs files to inherit the ownership of its parent directory.
The dynamic creation of dentries in eventfs did not take into account
if the tracefs file system was mounted with a gid/uid, and would
still default to the gid/uid of root. This is a regression.
- Fix warning when synthetic event testing is enabled along with
startup event tracing testing is enabled
* tag 'trace-v6.7-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing / synthetic: Disable events after testing in synth_event_gen_test_init()
eventfs: Have event files and directories default to parent uid and gid
tracing/synthetic: fix kernel-doc warnings
The comparisons to PAGE_SIZE were all converted to use the
buffer->subbuf_order, but the use of PAGE_MASK was missed.
Convert all the PAGE_MASK usages over to:
(PAGE_SIZE << cpu_buffer->buffer->subbuf_order) - 1
Link: https://lore.kernel.org/linux-trace-kernel/20231219173800.66eefb7a@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: 139f840021 ("ring-buffer: Page size per ring buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Using page order for deciding what the size of the ring buffer sub buffers
are is exposing a bit too much of the implementation. Although the sub
buffers are only allocated in orders of pages, allow the user to specify
the minimum size of each sub-buffer via kilobytes like they can with the
buffer size itself.
If the user specifies 3 via:
echo 3 > buffer_subbuf_size_kb
Then the sub-buffer size will round up to 4kb (on a 4kb page size system).
If they specify:
echo 6 > buffer_subbuf_size_kb
The sub-buffer size will become 8kb.
and so on.
Link: https://lore.kernel.org/linux-trace-kernel/20231219185631.809766769@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The ring_buffer_subbuf_order_set() was creating ring_buffer_per_cpu
cpu_buffers with the new subbuffers with the updated order, and if they
all successfully were created, then they the ring_buffer's per_cpu buffers
would be freed and replaced by them.
The problem is that the freed per_cpu buffers contains state that would be
lost. Running the following commands:
1. # echo 3 > /sys/kernel/tracing/buffer_subbuf_order
2. # echo 0 > /sys/kernel/tracing/tracing_cpumask
3. # echo 1 > /sys/kernel/tracing/snapshot
4. # echo ff > /sys/kernel/tracing/tracing_cpumask
5. # echo test > /sys/kernel/tracing/trace_marker
Would result in:
-bash: echo: write error: Bad file descriptor
That's because the state of the per_cpu buffers of the snapshot buffer is
lost when the order is changed (the order of a freed snapshot buffer goes
to 0 to save memory, and when the snapshot buffer is allocated again, it
goes back to what the main buffer is).
In operation 2, the snapshot buffers were set to "disable" (as all the
ring buffers CPUs were disabled).
In operation 3, the snapshot is allocated and a call to
ring_buffer_subbuf_order_set() replaced the per_cpu buffers losing the
"record_disable" count.
When it was enabled again, the atomic_dec(&cpu_buffer->record_disable) was
decrementing a zero, setting it to -1. Writing 1 into the snapshot would
swap the snapshot buffer with the main buffer, so now the main buffer is
"disabled", and nothing can write to the ring buffer anymore.
Instead of creating new per_cpu buffers and losing the state of the old
buffers, basically do what the resize does and just allocate new subbuf
pages into the new_pages link list of the per_cpu buffer and if they all
succeed, then replace the old sub buffers with the new ones. This keeps
the per_cpu buffer descriptor in tact and by doing so, keeps its state.
Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.944104939@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: f9b94daa54 ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function ring_buffer_subbuf_order_set() just updated the sub-buffers
to the new size, but this also changes the size of the buffer in doing so.
As the size is determined by nr_pages * subbuf_size. If the subbuf_size is
increased without decreasing the nr_pages, this causes the total size of
the buffer to increase.
This broke the latency tracers as the snapshot needs to be the same size
as the main buffer. The size of the snapshot buffer is only expanded when
needed, and because the order is still the same, the size becomes out of
sync with the main buffer, as the main buffer increased in size without
the tracing system knowing.
Calculate the nr_pages to allocate with the new subbuf_size to be
buffer_size / new_subbuf_size.
Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.649397785@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: f9b94daa54 ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Because the main buffer and the snapshot buffer need to be the same for
some tracers, otherwise it will fail and disable all tracing, the tracers
need to be stopped while updating the sub buffer sizes so that the tracers
see the main and snapshot buffers with the same sub buffer size.
Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.353222794@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: 2808e31ec1 ("ring-buffer: Add interface for configuring trace sub buffer size")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When updating the order of the sub buffers for the main buffer, make sure
that if the snapshot buffer exists, that it gets its order updated as
well.
Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.054668186@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Now that the ring buffer specifies the size of its sub buffers, they all
need to be the same size. When doing a read, a swap is done with a spare
page. Make sure they are the same size before doing the swap, otherwise
the read will fail.
Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.763664788@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
As all the subbuffer order (subbuffer sizes) must be the same throughout
the ring buffer, check the order of the buffers that are doing a CPU
buffer swap in ring_buffer_swap_cpu() to make sure they are the same.
If the are not the same, then fail to do the swap, otherwise the ring
buffer will think the CPU buffer has a specific subbuffer size when it
does not.
Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.467894710@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
On failure to allocate ring buffer pages, the pointer to the CPU buffer
pages is freed, but the pages that were allocated previously were not.
Make sure they are freed too.
Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.179352802@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: f9b94daa54 ("tracing: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The synth_event_gen_test module can be built in, if someone wants to run
the tests at boot up and not have to load them.
The synth_event_gen_test_init() function creates and enables the synthetic
events and runs its tests.
The synth_event_gen_test_exit() disables the events it created and
destroys the events.
If the module is builtin, the events are never disabled. The issue is, the
events should be disable after the tests are run. This could be an issue
if the rest of the boot up tests are enabled, as they expect the events to
be in a known state before testing. That known state happens to be
disabled.
When CONFIG_SYNTH_EVENT_GEN_TEST=y and CONFIG_EVENT_TRACE_STARTUP_TEST=y
a warning will trigger:
Running tests on trace events:
Testing event create_synth_test:
Enabled event during self test!
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1 at kernel/trace/trace_events.c:4150 event_trace_self_tests+0x1c2/0x480
Modules linked in:
CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.7.0-rc2-test-00031-gb803d7c664d5-dirty #276
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:event_trace_self_tests+0x1c2/0x480
Code: bb e8 a2 ab 5d fc 48 8d 7b 48 e8 f9 3d 99 fc 48 8b 73 48 40 f6 c6 01 0f 84 d6 fe ff ff 48 c7 c7 20 b6 ad bb e8 7f ab 5d fc 90 <0f> 0b 90 48 89 df e8 d3 3d 99 fc 48 8b 1b 4c 39 f3 0f 85 2c ff ff
RSP: 0000:ffffc9000001fdc0 EFLAGS: 00010246
RAX: 0000000000000029 RBX: ffff88810399ca80 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb9f19478 RDI: ffff88823c734e64
RBP: ffff88810399f300 R08: 0000000000000000 R09: fffffbfff79eb32a
R10: ffffffffbcf59957 R11: 0000000000000001 R12: ffff888104068090
R13: ffffffffbc89f0a0 R14: ffffffffbc8a0f08 R15: 0000000000000078
FS: 0000000000000000(0000) GS:ffff88823c700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001f6282001 CR4: 0000000000170ef0
Call Trace:
<TASK>
? __warn+0xa5/0x200
? event_trace_self_tests+0x1c2/0x480
? report_bug+0x1f6/0x220
? handle_bug+0x6f/0x90
? exc_invalid_op+0x17/0x50
? asm_exc_invalid_op+0x1a/0x20
? tracer_preempt_on+0x78/0x1c0
? event_trace_self_tests+0x1c2/0x480
? __pfx_event_trace_self_tests_init+0x10/0x10
event_trace_self_tests_init+0x27/0xe0
do_one_initcall+0xd6/0x3c0
? __pfx_do_one_initcall+0x10/0x10
? kasan_set_track+0x25/0x30
? rcu_is_watching+0x38/0x60
kernel_init_freeable+0x324/0x450
? __pfx_kernel_init+0x10/0x10
kernel_init+0x1f/0x1e0
? _raw_spin_unlock_irq+0x33/0x50
ret_from_fork+0x34/0x60
? __pfx_kernel_init+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
This is because the synth_event_gen_test_init() left the synthetic events
that it created enabled. By having it disable them after testing, the
other selftests will run fine.
Link: https://lore.kernel.org/linux-trace-kernel/20231220111525.2f0f49b0@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 9fe41efaca ("tracing: Add synth event generation test module")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: Alexander Graf <graf@amazon.com>
Tested-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
For a clean, conflict-free revert of the token-related patches in commit
d17aff807f ("Revert BPF token-related functionality"), the bpf fs commit
750e785796 ("bpf: Support uid and gid when mounting bpffs") was undone
temporarily as well.
This patch manually re-adds the functionality from the original one back
in 750e785796, no other functional changes intended.
Testing:
# mount -t bpf -o uid=65534,gid=65534 bpffs ./foo
# ls -la . | grep foo
drwxrwxrwt 2 nobody nogroup 0 Dec 20 13:16 foo
# mount -t bpf
bpffs on /root/foo type bpf (rw,relatime,uid=65534,gid=65534)
Also, passing invalid arguments for uid/gid are properly rejected as expected.
Fixes: d17aff807f ("Revert BPF token-related functionality")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Cc: Jie Jiang <jiejiang@chromium.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/bpf/20231220133805.20953-1-daniel@iogearbox.net
Bring in the changes to the file infrastructure for this cycle. Mostly
cleanups and some performance tweaks.
* file: remove __receive_fd()
* file: stop exposing receive_fd_user()
* fs: replace f_rcuhead with f_task_work
* file: remove pointless wrapper
* file: s/close_fd_get_file()/file_close_fd()/g
* Improve __fget_files_rcu() code generation (and thus __fget_light())
* file: massage cleanup of files that failed to open
Signed-off-by: Christian Brauner <brauner@kernel.org>
When compiling with gcc version 14.0.0 20231220 (experimental)
and W=1, I've noticed the following warning:
kernel/watch_queue.c: In function 'watch_queue_set_size':
kernel/watch_queue.c:273:32: warning: 'kcalloc' sizes specified with 'sizeof'
in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
273 | pages = kcalloc(sizeof(struct page *), nr_pages, GFP_KERNEL);
| ^~~~~~
Since 'n' and 'size' arguments of 'kcalloc()' are multiplied to
calculate the final size, their actual order doesn't affect the
result and so this is not a bug. But it's still worth to fix it.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://lore.kernel.org/r/20231221090139.12579-1-dmantipov@yandex.ru
Signed-off-by: Christian Brauner <brauner@kernel.org>
Only the posix timer system calls use this (when the posix timer support
is disabled, which does not actually happen in any normal case), because
they had debug code to print out a warning about missing system calls.
Get rid of that special case, and just use the standard COND_SYSCALL
interface that creates weak system call stubs that return -ENOSYS for
when the system call does not exist.
This fixes a kCFI issue with the SYS_NI() hackery:
CFI failure at int80_emulation+0x67/0xb0 (target: sys_ni_posix_timers+0x0/0x70; expected type: 0xb02b34d9)
WARNING: CPU: 0 PID: 48 at int80_emulation+0x67/0xb0
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's really no overlap between uapi/linux/wait.h and linux/wait.h.
There are two files which rely on the uapi file being implcitly included,
so explicitly include it there and remove it from the main header file.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Christian Brauner <brauner@kernel.org>
This is needed for killing the sched.h dependency on rcupdate.h, and
pid.h is a better place for this code anyways.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Moving these stub functions to a .c file means we can kill a sched.h
dependency on printk.h.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Remove second include of linux/kexec.h
Link: https://lkml.kernel.org/r/202312151654+0800-wangjinchao@xfusion.com
Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use ALIGN macro instead of open-coding it to improve code readability.
Link: https://lkml.kernel.org/r/20231212142706.25149-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
TASK_KILLABLE already includes TASK_UNINTERRUPTIBLE, so there is no
need to add a separate TASK_UNINTERRUPTIBLE.
Link: https://lkml.kernel.org/r/20231208084115.1973285-1-haokexin@gmail.com
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Then when specifying '-d' for kexec_file_load interface, loaded locations
of kernel/initrd/cmdline etc can be printed out to help debug.
Here replace pr_debug() with the newly added kexec_dprintk() in kexec_file
loading related codes.
And also print out type/start/head of kimage and flags to help debug.
Link: https://lkml.kernel.org/r/20231213055747.61826-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Conor Dooley <conor@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kexec_file: print out debugging message if required", v4.
Currently, specifying '-d' on kexec command will print a lot of debugging
informationabout kexec/kdump loading with kexec_load interface.
However, kexec_file_load prints nothing even though '-d' is specified.
It's very inconvenient to debug or analyze the kexec/kdump loading when
something wrong happened with kexec/kdump itself or develper want to check
the kexec/kdump loading.
In this patchset, a kexec_file flag is KEXEC_FILE_DEBUG added and checked
in code. If it's passed in, debugging message of kexec_file code will be
printed out and can be seen from console and dmesg. Otherwise, the
debugging message is printed like beofre when pr_debug() is taken.
Note:
****
=====
1) The code in kexec-tools utility also need be changed to support
passing KEXEC_FILE_DEBUG to kernel when 'kexec -s -d' is specified.
The patch link is here:
=========
[PATCH] kexec_file: add kexec_file flag to support debug printing
http://lists.infradead.org/pipermail/kexec/2023-November/028505.html
2) s390 also has kexec_file code, while I am not sure what debugging
information is necessary. So leave it to s390 developer.
Test:
****
====
Testing was done in v1 on x86_64 and arm64. For v4, tested on x86_64
again. And on x86_64, the printed messages look like below:
--------------------------------------------------------------
kexec measurement buffer for the loaded kernel at 0x207fffe000.
Loaded purgatory at 0x207fff9000
Loaded boot_param, command line and misc at 0x207fff3000 bufsz=0x1180 memsz=0x1180
Loaded 64bit kernel at 0x207c000000 bufsz=0xc88200 memsz=0x3c4a000
Loaded initrd at 0x2079e79000 bufsz=0x2186280 memsz=0x2186280
Final command line is: root=/dev/mapper/fedora_intel--knightslanding--lb--02-root ro
rd.lvm.lv=fedora_intel-knightslanding-lb-02/root console=ttyS0,115200N81 crashkernel=256M
E820 memmap:
0000000000000000-000000000009a3ff (1)
000000000009a400-000000000009ffff (2)
00000000000e0000-00000000000fffff (2)
0000000000100000-000000006ff83fff (1)
000000006ff84000-000000007ac50fff (2)
......
000000207fff6150-000000207fff615f (128)
000000207fff6160-000000207fff714f (1)
000000207fff7150-000000207fff715f (128)
000000207fff7160-000000207fff814f (1)
000000207fff8150-000000207fff815f (128)
000000207fff8160-000000207fffffff (1)
nr_segments = 5
segment[0]: buf=0x000000004e5ece74 bufsz=0x211 mem=0x207fffe000 memsz=0x1000
segment[1]: buf=0x000000009e871498 bufsz=0x4000 mem=0x207fff9000 memsz=0x5000
segment[2]: buf=0x00000000d879f1fe bufsz=0x1180 mem=0x207fff3000 memsz=0x2000
segment[3]: buf=0x000000001101cd86 bufsz=0xc88200 mem=0x207c000000 memsz=0x3c4a000
segment[4]: buf=0x00000000c6e38ac7 bufsz=0x2186280 mem=0x2079e79000 memsz=0x2187000
kexec_file_load: type:0, start:0x207fff91a0 head:0x109e004002 flags:0x8
---------------------------------------------------------------------------
This patch (of 7):
When specifying 'kexec -c -d', kexec_load interface will print loading
information, e.g the regions where kernel/initrd/purgatory/cmdline are
put, the memmap passed to 2nd kernel taken as system RAM ranges, and
printing all contents of struct kexec_segment, etc. These are very
helpful for analyzing or positioning what's happening when kexec/kdump
itself failed. The debugging printing for kexec_load interface is made in
user space utility kexec-tools.
Whereas, with kexec_file_load interface, 'kexec -s -d' print nothing.
Because kexec_file code is mostly implemented in kernel space, and the
debugging printing functionality is missed. It's not convenient when
debugging kexec/kdump loading and jumping with kexec_file_load interface.
Now add KEXEC_FILE_DEBUG to kexec_file flag to control the debugging
message printing. And add global variable kexec_file_dbg_print and macro
kexec_dprintk() to facilitate the printing.
This is a preparation, later kexec_dprintk() will be used to replace the
existing pr_debug(). Once 'kexec -s -d' is specified, it will print out
kexec/kdump loading information. If '-d' is not specified, it regresses
to pr_debug().
Link: https://lkml.kernel.org/r/20231213055747.61826-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20231213055747.61826-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Conor Dooley <conor@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
All other users of crypto code use 'select' instead of 'depends on', so do
the same thing with KEXEC_FILE for consistency.
In practice this makes very little difference as kernels with kexec
support are very likely to also include some other feature that already
selects both crypto and crypto_sha256, but being consistent here helps for
usability as well as to avoid potential circular dependencies.
This reverts the dependency back to what it was originally before commit
74ca317c26 ("kexec: create a new config option CONFIG_KEXEC_FILE for
new syscall"), which changed changed it with the comment "This should be
safer as "select" is not recursive", but that appears to have been done in
error, as "select" is indeed recursive, and there are no other
dependencies that prevent CRYPTO_SHA256 from being selected here.
Link: https://lkml.kernel.org/r/20231023110308.1202042-2-arnd@kernel.org
Fixes: 74ca317c26 ("kexec: create a new config option CONFIG_KEXEC_FILE for new syscall")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Eric DeVolder <eric_devolder@yahoo.com>
Tested-by: Eric DeVolder <eric_devolder@yahoo.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Conor Dooley <conor@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The cleanup for the CONFIG_KEXEC Kconfig logic accidentally changed the
'depends on CRYPTO=y' dependency to a plain 'depends on CRYPTO', which
causes a link failure when all the crypto support is in a loadable module
and kexec_file support is built-in:
x86_64-linux-ld: vmlinux.o: in function `__x64_sys_kexec_file_load':
(.text+0x32e30a): undefined reference to `crypto_alloc_shash'
x86_64-linux-ld: (.text+0x32e58e): undefined reference to `crypto_shash_update'
x86_64-linux-ld: (.text+0x32e6ee): undefined reference to `crypto_shash_final'
Both s390 and x86 have this problem, while ppc64 and riscv have the
correct dependency already. On riscv, the dependency is only used for the
purgatory, not for the kexec_file code itself, which may be a bit
surprising as it means that with CONFIG_CRYPTO=m, it is possible to enable
KEXEC_FILE but then the purgatory code is silently left out.
Move this into the common Kconfig.kexec file in a way that is correct
everywhere, using the dependency on CRYPTO_SHA256=y only when the
purgatory code is available. This requires reversing the dependency
between ARCH_SUPPORTS_KEXEC_PURGATORY and KEXEC_FILE, but the effect
remains the same, other than making riscv behave like the other ones.
On s390, there is an additional dependency on CRYPTO_SHA256_S390, which
should technically not be required but gives better performance. Remove
this dependency here, noting that it was not present in the initial
Kconfig code but was brought in without an explanation in commit
71406883fd ("s390/kexec_file: Add kexec_file_load system call").
[arnd@arndb.de: fix riscv build]
Link: https://lkml.kernel.org/r/67ddd260-d424-4229-a815-e3fcfb864a77@app.fastmail.com
Link: https://lkml.kernel.org/r/20231023110308.1202042-1-arnd@kernel.org
Fixes: 6af5138083 ("x86/kexec: refactor for kernel/Kconfig.kexec")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Eric DeVolder <eric_devolder@yahoo.com>
Tested-by: Eric DeVolder <eric_devolder@yahoo.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Conor Dooley <conor@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
At present, bpf memory allocator uses check_obj_size() to ensure that
ksize() of allocated pointer is equal with the unit_size of used
bpf_mem_cache. Its purpose is to prevent bpf_mem_free() from selecting
a bpf_mem_cache which has different unit_size compared with the
bpf_mem_cache used for allocation. But as reported by lkp, the return
value of ksize() or kmalloc_size_roundup() may change due to slab merge
and it will lead to the warning report in check_obj_size().
The reported warning happened as follows:
(1) in bpf_mem_cache_adjust_size(), kmalloc_size_roundup(96) returns the
object_size of kmalloc-96 instead of kmalloc-cg-96. The object_size of
kmalloc-96 is 96, so size_index for 96 is not adjusted accordingly.
(2) the object_size of kmalloc-cg-96 is adjust from 96 to 128 due to
slab merge in __kmem_cache_alias(). For SLAB, SLAB_HWCACHE_ALIGN is
enabled by default for kmalloc slab, so align is 64 and size is 128 for
kmalloc-cg-96. SLUB has a similar merge logic, but its object_size will
not be changed, because its align is 8 under x86-64.
(3) when unit_alloc() does kmalloc_node(96, __GFP_ACCOUNT, node),
ksize() returns 128 instead of 96 for the returned pointer.
(4) the warning in check_obj_size() is triggered.
Considering the slab merge can happen in anytime (e.g, a slab created in
a new module), the following case is also possible: during the
initialization of bpf_global_ma, there is no slab merge and ksize() for
a 96-bytes object returns 96. But after that a new slab created by a
kernel module is merged to kmalloc-cg-96 and the object_size of
kmalloc-cg-96 is adjust from 96 to 128 (which is possible for x86-64 +
CONFIG_SLAB, because its alignment requirement is 64 for 96-bytes slab).
So soon or later, when bpf_global_ma frees a 96-byte-sized pointer
which is allocated from bpf_mem_cache with unit_size=96, bpf_mem_free()
will free the pointer through a bpf_mem_cache in which unit_size is 128,
because the return value of ksize() changes. The warning for the
mismatch will be triggered again.
A feasible fix is introducing similar APIs compared with ksize() and
kmalloc_size_roundup() to return the actually-allocated size instead of
size which may change due to slab merge, but it will introduce
unnecessary dependency on the implementation details of mm subsystem.
As for now the pointer of bpf_mem_cache is saved in the 8-bytes area
(or 4-bytes under 32-bit host) above the returned pointer, using
unit_size in the saved bpf_mem_cache to select the target cache instead
of inferring the size from the pointer itself. Beside no extra
dependency on mm subsystem, the performance for bpf_mem_free_rcu() is
also improved as shown below.
Before applying the patch, the performances of bpf_mem_alloc() and
bpf_mem_free_rcu() on 8-CPUs VM with one producer are as follows:
kmalloc : alloc 11.69 ± 0.28M/s free 29.58 ± 0.93M/s
percpu : alloc 14.11 ± 0.52M/s free 14.29 ± 0.99M/s
After apply the patch, the performance for bpf_mem_free_rcu() increases
9% and 146% for kmalloc memory and per-cpu memory respectively:
kmalloc: alloc 11.01 ± 0.03M/s free 32.42 ± 0.48M/s
percpu: alloc 12.84 ± 0.12M/s free 35.24 ± 0.23M/s
After the fixes, there is no need to adjust size_index to fix the
mismatch between allocation and free, so remove it as well. Also return
NULL instead of ZERO_SIZE_PTR for zero-sized alloc in bpf_mem_alloc(),
because there is no bpf_mem_cache pointer saved above ZERO_SIZE_PTR.
Fixes: 9077fc228f ("bpf: Use kmalloc_size_roundup() to adjust size_index")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/bpf/202310302113.9f8fe705-oliver.sang@intel.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231216131052.27621-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Function swsusp_close() does not have any parameters, so remove the
description of parameter @exclusive to prevent this warning.
swap.c:1573: warning: Excess function parameter 'exclusive' description in 'swsusp_close'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
With the freezer changes introduced by commit f5d39b0208
("freezer,sched: Rewrite core freezer logic"), the comment in
unlock_system_sleep() has become obsolete, there is no need to
retain it.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
scripts/kernel-doc warns about using @args: for variadic arguments to
functions. Documentation/doc-guide/kernel-doc.rst says that this should
be written as @...: instead, so update the source code to match that,
preventing the warnings.
trace_events_synth.c:1165: warning: Excess function parameter 'args' description in '__synth_event_gen_cmd_start'
trace_events_synth.c:1714: warning: Excess function parameter 'args' description in 'synth_event_trace'
Link: https://lore.kernel.org/linux-trace-kernel/20231220061226.30962-1-rdunlap@infradead.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 35ca5207c2 ("tracing: Add synthetic event command generation functions")
Fixes: 8dcc53ad95 ("tracing: Add synth_event_trace() and related functions")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When no timer is queued into an empty timer base, the next_expiry will not
be updated. It was originally calculated as
base->clk + NEXT_TIMER_MAX_DELTA
When the timer base stays empty long enough (> NEXT_TIMER_MAX_DELTA), the
next_expiry value of the empty base suggests that there is a timer pending
soon. This might be more a kind of a theoretical problem, but the fix
doesn't hurt.
Use only base->next_expiry value as nextevt when timers are
pending. Otherwise nextevt will be jiffies + NEXT_TIMER_MAX_DELTA. As all
information is in place, update base->next_expiry value of the empty timer
base as well.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-13-anna-maria@linutronix.de
To improve readability of the code, split base->idle calculation and
expires calculation into separate parts. While at it, update the comment
about timer base idle marking.
Thereby the following subtle change happens if the next event is just one
jiffy ahead and the tick was already stopped: Originally base->is_idle
remains true in this situation. Now base->is_idle turns to false. This may
spare an IPI if a timer is enqueued remotely to an idle CPU that is going
to tick on the next jiffy.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-12-anna-maria@linutronix.de
There is an already existing function for forwarding the timer
base. Forwarding the timer base is implemented directly in
get_next_timer_interrupt() as well.
Remove the code duplication and invoke __forward_timer_base() instead.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-11-anna-maria@linutronix.de
Forwarding timer base is done when the next expiry value is calculated and
when a new timer is enqueued. When the next expiry value is calculated the
jiffies value is already available and does not need to be reread a second
time.
Splitting out the forward timer base functionality to make it executable
via both contextes - those where jiffies are already known and those, where
jiffies need to be read.
No functional change.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-10-anna-maria@linutronix.de
The current check whether a forward of the timer base is required can be
simplified by using an already existing comparison function which is easier
to read. The related comment is outdated and was not updated when the check
changed in commit 36cd28a4cd ("timers: Lower base clock forwarding
threshold").
Use time_before_eq() for the check and replace the comment by copying the
comment from the same check inside get_next_timer_interrupt(). Move the
precious information of the outdated comment to the proper place in
__run_timers().
No functional change.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-9-anna-maria@linutronix.de
Both call sites of __next_timer_interrupt() store the return value directly
in base->next_expiry. Move the store into __next_timer_interrupt() and to
make its purpose more clear, rename the function to next_expiry_recalc().
No functional change.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-8-anna-maria@linutronix.de
Deferrable timers do not prevent CPU from going idle and are not taken into
account on idle path. Sending an IPI to a remote CPU when a new first
deferrable timer was enqueued will wake up the remote CPU but nothing will
be done regarding the deferrable timers.
Drop IPI completely when a new first deferrable timer was enqueued.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-7-anna-maria@linutronix.de
When debugging timer code the timer tracepoints are very important. There
is no tracepoint when the is_idle flag of the timer base changes. Instead
of always adding manually trace_printk(), add tracepoints which can be
easily enabled whenever required.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-6-anna-maria@linutronix.de
For starting a timer, the timer is enqueued into a bucket of the timer
wheel. The bucket expiry is the defacto expiry of the timer but it is not
equal the timer expiry because of increasing granularity when bucket is in
a higher level of the wheel. To be able to figure out in a trace whether a
timer expired in time or not, the bucket expiry time is required as well.
Add bucket expiry time to the timer_start tracepoint and thereby simplify
the arguments.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-5-anna-maria@linutronix.de
When the next tick is in the past, the delta between basemono and the next
tick gets negativ. But the next tick should never be in the past. The
negative effect of a wrong next tick might be a stop of the tick and timers
might expire late.
To prevent expensive debugging when changing underlying code, add a
WARN_ON_ONCE into this code path. To prevent complete misbehaviour, also
reset next_tick to basemono in this case.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-4-anna-maria@linutronix.de
tick_nohz_stop_tick() contains the expires (u64 variable) and tick
(ktime_t) variable. In the beginning the value of expires is written to
tick. Afterwards none of the variables is changed. They are only used for
checks.
Drop the not required variable tick and use always expires instead.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-3-anna-maria@linutronix.de
When referencing functions in comments, it might be helpful to use full
function names (including the prefix) to be able to find it when grepping.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-2-anna-maria@linutronix.de
As the size of the ring sub buffer page can be changed dynamically,
the logic that reads and writes to the buffer should be fixed to take
that into account. Some internal ring buffer APIs are changed:
ring_buffer_alloc_read_page()
ring_buffer_free_read_page()
ring_buffer_read_page()
A new API is introduced:
ring_buffer_read_page_data()
Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-6-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.875145995@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
[ Fixed kerneldoc on data_page parameter in ring_buffer_free_read_page() ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
There are two approaches when changing the size of the ring buffer
sub page:
1. Destroying all pages and allocating new pages with the new size.
2. Allocating new pages, copying the content of the old pages before
destroying them.
The first approach is easier, it is selected in the proposed
implementation. Changing the ring buffer sub page size is supposed to
not happen frequently. Usually, that size should be set only once,
when the buffer is not in use yet and is supposed to be empty.
Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-5-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.588995543@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The trace ring buffer sub page size can be configured, per trace
instance. A new ftrace file "buffer_subbuf_order" is added to get and
set the size of the ring buffer sub page for current trace instance.
The size must be an order of system page size, that's why the new
interface works with system page order, instead of absolute page size:
0 means the ring buffer sub page is equal to 1 system page and so
forth:
0 - 1 system page
1 - 2 system pages
2 - 4 system pages
...
The ring buffer sub page size is limited between 1 and 128 system
pages. The default value is 1 system page.
New ring buffer APIs are introduced:
ring_buffer_subbuf_order_set()
ring_buffer_subbuf_order_get()
ring_buffer_subbuf_size_get()
Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-4-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.298324722@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently the size of one sub buffer page is global for all buffers and
it is hard coded to one system page. In order to introduce configurable
ring buffer sub page size, the internal logic should be refactored to
work with sub page size per ring buffer.
Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-3-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.009147038@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In order to introduce sub-buffer size per ring buffer, some internal
refactoring is needed. As ring_buffer_print_page_header() will depend on
the trace_buffer structure, it is moved after the structure definition.
Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-2-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185627.723857541@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add ability to pass a pointer to dynptr into global functions.
This allows to have global subprogs that accept and work with generic
dynptrs that are created by caller. Dynptr argument is detected based on
the name of a struct type, if it's "bpf_dynptr", it's assumed to be
a proper dynptr pointer. Both actual struct and forward struct
declaration types are supported.
This is conceptually exactly the same semantics as
bpf_user_ringbuf_drain()'s use of dynptr to pass a variable-sized
pointer to ringbuf record. So we heavily rely on CONST_PTR_TO_DYNPTR
bits of already existing logic in the verifier.
During global subprog validation, we mark such CONST_PTR_TO_DYNPTR as
having LOCAL type, as that's the most unassuming type of dynptr and it
doesn't have any special helpers that can try to free or acquire extra
references (unlike skb, xdp, or ringbuf dynptr). So that seems like a safe
"choice" to make from correctness standpoint. It's still possible to
pass any type of dynptr to such subprog, though, because generic dynptr
helpers, like getting data/slice pointers, read/write memory copying
routines, dynptr adjustment and getter routines all work correctly with
any type of dynptr.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add support for annotating global BPF subprog arguments to provide more
information about expected semantics of the argument. Currently,
verifier relies purely on argument's BTF type information, and supports
three general use cases: scalar, pointer-to-context, and
pointer-to-fixed-size-memory.
Scalar and pointer-to-fixed-mem work well in practice and are quite
natural to use. But pointer-to-context is a bit problematic, as typical
BPF users don't realize that they need to use a special type name to
signal to verifier that argument is not just some pointer, but actually
a PTR_TO_CTX. Further, even if users do know which type to use, it is
limiting in situations where the same BPF program logic is used across
few different program types. Common case is kprobes, tracepoints, and
perf_event programs having a helper to send some data over BPF perf
buffer. bpf_perf_event_output() requires `ctx` argument, and so it's
quite cumbersome to share such global subprog across few BPF programs of
different types, necessitating extra static subprog that is context
type-agnostic.
Long story short, there is a need to go beyond types and allow users to
add hints to global subprog arguments to define expectations.
This patch adds such support for two initial special tags:
- pointer to context;
- non-null qualifier for generic pointer arguments.
All of the above came up in practice already and seem generally useful
additions. Non-null qualifier is an often requested feature, which
currently has to be worked around by having unnecessary NULL checks
inside subprogs even if we know that arguments are never NULL. Pointer
to context was discussed earlier.
As for implementation, we utilize btf_decl_tag attribute and set up an
"arg:xxx" convention to specify argument hint. As such:
- btf_decl_tag("arg:ctx") is a PTR_TO_CTX hint;
- btf_decl_tag("arg:nonnull") marks pointer argument as not allowed to
be NULL, making NULL check inside global subprog unnecessary.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Remove duplicated BTF parsing logic when it comes to subprog call check.
Instead, use (potentially cached) results of btf_prepare_func_args() to
abstract away expectations of each subprog argument in generic terms
(e.g., "this is pointer to context", or "this is a pointer to memory of
size X"), and then use those simple high-level argument type
expectations to validate actual register states to check if they match
expectations.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Subprog call logic in btf_check_subprog_call() currently has both a lot
of BTF parsing logic (which is, presumably, what justified putting it
into btf.c), but also a bunch of register state checks, some of each
utilize deep verifier logic helpers, necessarily exported from
verifier.c: check_ptr_off_reg(), check_func_arg_reg_off(),
and check_mem_reg().
Going forward, btf_check_subprog_call() will have a minimum of
BTF-related logic, but will get more internal verifier logic related to
register state manipulation. So move it into verifier.c to minimize
amount of verifier-specific logic exposed to btf.c.
We do this move before refactoring btf_check_func_arg_match() to
preserve as much history post-refactoring as possible.
No functional changes.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Generalize btf_prepare_func_args() to support both global and static
subprogs. We are going to utilize this property in the next patch,
reusing btf_prepare_func_args() for subprog call logic instead of
reparsing BTF information in a completely separate implementation.
btf_prepare_func_args() now detects whether subprog is global or static
makes slight logic adjustments for static func cases, like not failing
fatally (-EFAULT) for conditions that are allowable for static subprogs.
Somewhat subtle (but major!) difference is the handling of pointer arguments.
Both global and static functions need to handle special context
arguments (which are pointers to predefined type names), but static
subprogs give up on any other pointers, falling back to marking subprog
as "unreliable", disabling the use of BTF type information altogether.
For global functions, though, we are assuming that such pointers to
unrecognized types are just pointers to fixed-sized memory region (or
error out if size cannot be established, like for `void *` pointers).
This patch accommodates these small differences and sets up a stage for
refactoring in the next patch, eliminating a separate BTF-based parsing
logic in btf_check_func_arg_match().
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Instead of btf_check_subprog_arg_match(), use btf_prepare_func_args()
logic to validate "trustworthiness" of main BPF program's BTF information,
if it is present.
We ignored results of original BTF check anyway, often times producing
confusing and ominously-sounding "reg type unsupported for arg#0
function" message, which has no apparent effect on program correctness
and verification process.
All the -EFAULT returning sanity checks are already performed in
check_btf_info_early(), so there is zero reason to have this duplication
of logic between btf_check_subprog_call() and btf_check_subprog_arg_match().
Dropping btf_check_subprog_arg_match() simplifies
btf_check_func_arg_match() further removing `bool processing_call` flag.
One subtle bit that was done by btf_check_subprog_arg_match() was
potentially marking main program's BTF as unreliable. We do this
explicitly now with a dedicated simple check, preserving the original
behavior, but now based on well factored btf_prepare_func_args() logic.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
btf_prepare_func_args() is used to understand expectations and
restrictions on global subprog arguments. But current implementation is
hard to extend, as it intermixes BTF-based func prototype parsing and
interpretation logic with setting up register state at subprog entry.
Worse still, those registers are not completely set up inside
btf_prepare_func_args(), requiring some more logic later in
do_check_common(). Like calling mark_reg_unknown() and similar
initialization operations.
This intermixing of BTF interpretation and register state setup is
problematic. First, it causes duplication of BTF parsing logic for global
subprog verification (to set up initial state of global subprog) and
global subprog call sites analysis (when we need to check that whatever
is being passed into global subprog matches expectations), performed in
btf_check_subprog_call().
Given we want to extend global func argument with tags later, this
duplication is problematic. So refactor btf_prepare_func_args() to do
only BTF-based func proto and args parsing, returning high-level
argument "expectations" only, with no regard to specifics of register
state. I.e., if it's a context argument, instead of setting register
state to PTR_TO_CTX, we return ARG_PTR_TO_CTX enum for that argument as
"an argument specification" for further processing inside
do_check_common(). Similarly for SCALAR arguments, PTR_TO_MEM, etc.
This allows to reuse btf_prepare_func_args() in following patches at
global subprog call site analysis time. It also keeps register setup
code consistently in one place, do_check_common().
Besides all this, we cache this argument specs information inside
env->subprog_info, eliminating the need to redo these potentially
expensive BTF traversals, especially if BPF program's BTF is big and/or
there are lots of global subprog calls.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We can derive some new information for BPF_JNE in regs_refine_cond_op().
Take following code for example:
/* The type of "a" is u32 */
if (a > 0 && a < 100) {
/* the range of the register for a is [0, 99], not [1, 99],
* and will cause the following error:
*
* invalid zero-sized read
*
* as a can be 0.
*/
bpf_skb_store_bytes(skb, xx, xx, a, 0);
}
In the code above, "a > 0" will be compiled to "jmp xxx if a == 0". In the
TRUE branch, the dst_reg will be marked as known to 0. However, in the
fallthrough(FALSE) branch, the dst_reg will not be handled, which makes
the [min, max] for a is [0, 99], not [1, 99].
For BPF_JNE, we can reduce the range of the dst reg if the src reg is a
const and is exactly the edge of the dst reg.
Signed-off-by: Menglong Dong <menglong8.dong@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231219134800.1550388-2-menglong8.dong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
TASK_KILLABLE already includes TASK_UNINTERRUPTIBLE, so there is no
need to add a separate TASK_UNINTERRUPTIBLE.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
While working on the ring buffer, I found one more bug with the timestamp
code, and the fix for this removed the need for the final 64-bit cmpxchg!
The ring buffer events hold a "delta" from the previous event. If it is
determined that the delta can not be calculated, it falls back to adding an
absolute timestamp value. The way to know if the delta can be used is via
two stored timestamps in the per-cpu buffer meta data:
before_stamp and write_stamp
The before_stamp is written by every event before it tries to allocate its
space on the ring buffer. The write_stamp is written after it allocates its
space and knows that nothing came in after it read the previous
before_stamp and write_stamp and the two matched.
A previous fix dd93942570 ("ring-buffer: Do not try to put back
write_stamp") removed putting back the write_stamp to match the
before_stamp so that the next event could use the delta, but races were
found where the two would match, but not be for of the previous event.
It was determined to allow the event reservation to not have a valid
write_stamp when it is finished, and this fixed a lot of races.
The last use of the 64-bit timestamp cmpxchg depended on the write_stamp
being valid after an interruption. But this is no longer the case, as if an
event is interrupted by a softirq that writes an event, and that event gets
interrupted by a hardirq or NMI and that writes an event, then the softirq
could finish its reservation without a valid write_stamp.
In the slow path of the event reservation, a delta can still be used if the
write_stamp is valid. Instead of using a cmpxchg against the write stamp,
the before_stamp needs to be read again to validate the write_stamp. The
cmpxchg is not needed.
This updates the slowpath to validate the write_stamp by comparing it to
the before_stamp and removes all rb_time_cmpxchg() as there are no more
users of that function.
The removal of the 32-bit updates of rb_time_t will be done in the next
merge window.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZYHVxhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhk5AQDT56Uis34ewzeEzkwBSs8nsV2HDhnA
d0CU4BHsf0GUVQD9E2eWVbIB9z8MiQwNMvKslpFJYmGCzr359pCMzoOmcws=
=0rcD
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
"While working on the ring buffer, I found one more bug with the
timestamp code, and the fix for this removed the need for the final
64-bit cmpxchg!
The ring buffer events hold a "delta" from the previous event. If it
is determined that the delta can not be calculated, it falls back to
adding an absolute timestamp value. The way to know if the delta can
be used is via two stored timestamps in the per-cpu buffer meta data:
before_stamp and write_stamp
The before_stamp is written by every event before it tries to allocate
its space on the ring buffer. The write_stamp is written after it
allocates its space and knows that nothing came in after it read the
previous before_stamp and write_stamp and the two matched.
A previous fix dd93942570 ("ring-buffer: Do not try to put back
write_stamp") removed putting back the write_stamp to match the
before_stamp so that the next event could use the delta, but races
were found where the two would match, but not be for of the previous
event.
It was determined to allow the event reservation to not have a valid
write_stamp when it is finished, and this fixed a lot of races.
The last use of the 64-bit timestamp cmpxchg depended on the
write_stamp being valid after an interruption. But this is no longer
the case, as if an event is interrupted by a softirq that writes an
event, and that event gets interrupted by a hardirq or NMI and that
writes an event, then the softirq could finish its reservation without
a valid write_stamp.
In the slow path of the event reservation, a delta can still be used
if the write_stamp is valid. Instead of using a cmpxchg against the
write stamp, the before_stamp needs to be read again to validate the
write_stamp. The cmpxchg is not needed.
This updates the slowpath to validate the write_stamp by comparing it
to the before_stamp and removes all rb_time_cmpxchg() as there are no
more users of that function.
The removal of the 32-bit updates of rb_time_t will be done in the
next merge window"
* tag 'trace-v6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Fix slowpath of interrupted event
kmap_atomic() has been deprecated in favor of kmap_local_page().
kmap_atomic() disables page-faults and preemption (the latter
only for !PREEMPT_RT kernels).The code between the mapping and
un-mapping in this patch does not depend on the above-mentioned
side effects.So simply replaced kmap_atomic() with kmap_local_page().
Signed-off-by: Chen Haonan <chen.haonan2@zte.com.cn>
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The check_buffer() which checks the timestamps of the ring buffer
sub-buffer page, when enabled, only checks if the adding of deltas of the
events from the last absolute timestamp or the timestamp of the sub-buffer
page adds up to the current event.
What it does not check is if the absolute timestamp causes the time of the
events to go backwards, as that can cause issues elsewhere.
Test for the timestamp going backwards too.
This also fixes a slight issue where if the warning triggers at boot up
(because of the resetting of the tsc), it will disable all further checks,
even those that are after boot Have it continue checking if the warning
was ignored during boot up.
Link: https://lore.kernel.org/linux-trace-kernel/20231219074732.18b092d4@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the ring buffer timestamp verifier triggers, it dumps the content of
the sub-buffer. But currently it only dumps the timestamps and the offset
of the data as well as the deltas. It would be even more informative if
the event data also showed the interrupt context level it was in.
That is, if each event showed that the event was written in normal,
softirq, irq or NMI context. Then a better idea about how the events may
have been interrupted from each other.
As the payload of the ring buffer is really a black box of the ring
buffer, just assume that if the payload is larger than a trace entry, that
it is a trace entry. As trace entries have the interrupt context
information saved in a flags field, look at that location and report the
output of the flags.
If the payload is not a trace entry, there's no way to really know, and
the information will be garbage. But that's OK, because this is for
debugging only (this output is not used in production as the buffer check
that calls it causes a huge overhead to the tracing). This information,
when available, is crucial for debugging timestamp issues. If it's
garbage, it will also be pretty obvious that its garbage too.
As this output usually happens in kselftests of the tracing code, the user
will know what the payload is at the time.
Link: https://lore.kernel.org/linux-trace-kernel/20231219074542.6f304601@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Each event has a 27 bit timestamp delta that is used to hold the delta
from the last event. If the time between events is greater than 2^27, then
a timestamp is added that holds a 59 bit absolute timestamp.
Until a389d86f7f ("ring-buffer: Have nested events still record running
time stamp"), if an interrupt interrupted an event in progress, all the
events delta would be zero to not deal with the races that need to be
handled. The commit a389d86f7f changed that to handle the races giving
all events, even those that preempt other events, still have an accurate
timestamp.
To handle those races requires performing 64-bit cmpxchg on the
timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered
very slow. To try to deal with this the timestamp logic was broken into
two and then three 32-bit cmpxchgs, with the thought that two (or three)
32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
architectures.
Part of the problem with this is that I didn't have any 32-bit
architectures to test on. After hitting several subtle bugs in this code,
an effort was made to try and see if three 32-bit cmpxchgs are indeed
faster than a single 64-bit. After a few people brushed off the dust of
their old 32-bit machines, tests were done, and even though 32-bit cmpxchg
was faster than a single 64-bit, it was in the order of 50% at best, not
300%.
After some more refactoring of the code, all 4 64-bit cmpxchg were removed:
https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.homehttps://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.homehttps://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.homehttps://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home/
With all the 64-bit cmpxchg removed, the complex 32-bit workaround can also be
removed.
The 32-bit and 64-bit logic is now exactly the same.
Link: https://lore.kernel.org/all/20231213214632.15047c40@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231219074303.28f9abda@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
There's no reason to give an arbitrary limit to the size of a raw trace
marker. Just let it be as big as the size that is allowed by the ring
buffer itself.
And there's also no reason to artificially break up the write to
TRACE_BUF_SIZE, as that's not even used.
Link: https://lore.kernel.org/linux-trace-kernel/20231213104218.2efc70c1@gandalf.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If a trace_marker write is bigger than what trace_seq can hold, then it
will print "LINE TOO BIG" message and not what was written.
Instead, check if the write is bigger than the trace_seq and break it
up by that size.
Ideally, we could make the trace_seq dynamic that could hold this. But
that's for another time.
Link: https://lore.kernel.org/linux-trace-kernel/20231212190422.1eaf224f@gandalf.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Now that trace_marker can hold more than 1KB string, and can write as much
as the ring buffer can hold, the trace_seq is not big enough to hold
writes:
~# a="1234567890"
~# cnt=4080
~# s=""
~# while [ $cnt -gt 10 ]; do
~# s="${s}${a}"
~# cnt=$((cnt-10))
~# done
~# echo $s > trace_marker
~# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 2/2 #P:8
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<...>-860 [002] ..... 105.543465: tracing_mark_write[LINE TOO BIG]
<...>-860 [002] ..... 105.543496: tracing_mark_write: 789012345678901234567890
By increasing the trace_seq buffer to almost two pages, it can now print
out the first line.
This also subtracts the rest of the trace_seq fields from the buffer, so
that the entire trace_seq is now PAGE_SIZE aligned.
Link: https://lore.kernel.org/linux-trace-kernel/20231209175220.19867af4@gandalf.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Allow a trace write to be as big as the ring buffer tracing data will
allow. Currently, it only allows writes of 1KB in size, but there's no
reason that it cannot allow what the ring buffer can hold.
Link: https://lore.kernel.org/linux-trace-kernel/20231212131901.5f501e72@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
On bugs that have the ring buffer timestamp get out of sync, the config
CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS, that checks for it and if it is
detected it causes a dump of the bad sub buffer.
It shows each event and their timestamp as well as the delta in the event.
But it's also good to see the offset into the subbuffer for that event to
know if how close to the end it is.
Also print where the last event actually ended compared to where it was
expected to end.
Link: https://lore.kernel.org/linux-trace-kernel/20231211131623.59eaebd2@gandalf.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
A trace instance may only need to enable specific events. As the eventfs
directory of an instance currently creates all events which adds overhead,
allow internal instances to be created with just the events in systems
that they care about. This currently only deals with systems and not
individual events, but this should bring down the overhead of creating
instances for specific use cases quite bit.
The trace_array_get_by_name() now has another parameter "systems". This
parameter is a const string pointer of a comma/space separated list of
event systems that should be created by the trace_array. (Note if the
trace_array already exists, this parameter is ignored).
The list of systems is saved and if a module is loaded, its events will
not be added unless the system for those events also match the systems
string.
Link: https://lore.kernel.org/linux-trace-kernel/20231213093701.03fddec0@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sean Paul <seanpaul@chromium.org>
Cc: Arun Easi <aeasi@marvell.com>
Cc: Daniel Wagner <dwagner@suse.de>
Tested-by: Dmytro Maluka <dmaluka@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
To synchronize the timestamps with the ring buffer reservation, there are
two timestamps that are saved in the buffer meta data.
1. before_stamp
2. write_stamp
When the two are equal, the write_stamp is considered valid, as in, it may
be used to calculate the delta of the next event as the write_stamp is the
timestamp of the previous reserved event on the buffer.
This is done by the following:
/*A*/ w = current position on the ring buffer
before = before_stamp
after = write_stamp
ts = read current timestamp
if (before != after) {
write_stamp is not valid, force adding an absolute
timestamp.
}
/*B*/ before_stamp = ts
/*C*/ write = local_add_return(event length, position on ring buffer)
if (w == write - event length) {
/* Nothing interrupted between A and C */
/*E*/ write_stamp = ts;
delta = ts - after
/*
* If nothing interrupted again,
* before_stamp == write_stamp and write_stamp
* can be used to calculate the delta for
* events that come in after this one.
*/
} else {
/*
* The slow path!
* Was interrupted between A and C.
*/
This is the place that there's a bug. We currently have:
after = write_stamp
ts = read current timestamp
/*F*/ if (write == current position on the ring buffer &&
after < ts && cmpxchg(write_stamp, after, ts)) {
delta = ts - after;
} else {
delta = 0;
}
The assumption is that if the current position on the ring buffer hasn't
moved between C and F, then it also was not interrupted, and that the last
event written has a timestamp that matches the write_stamp. That is the
write_stamp is valid.
But this may not be the case:
If a task context event was interrupted by softirq between B and C.
And the softirq wrote an event that got interrupted by a hard irq between
C and E.
and the hard irq wrote an event (does not need to be interrupted)
We have:
/*B*/ before_stamp = ts of normal context
---> interrupted by softirq
/*B*/ before_stamp = ts of softirq context
---> interrupted by hardirq
/*B*/ before_stamp = ts of hard irq context
/*E*/ write_stamp = ts of hard irq context
/* matches and write_stamp valid */
<----
/*E*/ write_stamp = ts of softirq context
/* No longer matches before_stamp, write_stamp is not valid! */
<---
w != write - length, go to slow path
// Right now the order of events in the ring buffer is:
//
// |-- softirq event --|-- hard irq event --|-- normal context event --|
//
after = write_stamp (this is the ts of softirq)
ts = read current timestamp
if (write == current position on the ring buffer [true] &&
after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {
delta = ts - after [Wrong!]
The delta is to be between the hard irq event and the normal context
event, but the above logic made the delta between the softirq event and
the normal context event, where the hard irq event is between the two. This
will shift all the remaining event timestamps on the sub-buffer
incorrectly.
The write_stamp is only valid if it matches the before_stamp. The cmpxchg
does nothing to help this.
Instead, the following logic can be done to fix this:
before = before_stamp
ts = read current timestamp
before_stamp = ts
after = write_stamp
if (write == current position on the ring buffer &&
after == before && after < ts) {
delta = ts - after
} else {
delta = 0;
}
The above will only use the write_stamp if it still matches before_stamp
and was tested to not have changed since C.
As a bonus, with this logic we do not need any 64-bit cmpxchg() at all!
This means the 32-bit rb_time_t workaround can finally be removed. But
that's for a later time.
Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: dd93942570 ("ring-buffer: Do not try to put back write_stamp")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmWAz2EACgkQ6rmadz2v
bToqrw/9EwroZCc8GEHOKAlb/fzrMvn92rLo0ZW/cGN84QJPnx4zM6Zo0+fgLaaN
oqqztwMUwdzGC3uX3FfVXaaLKbJ/MeHeL9BXFZNW8zkRHciw4R7kIBhOdPnHyET7
uT+rQ4xPe1Mt7e9PjepKlSL5mEsxWfBkdUgsdn19Z2Vjdfr9mZMhYWYMJGcfTCD1
TwxHKBPhq5fN3IsshmMBB8IrRp1HStUKb65MgZ4dI22LJXxTsFkx5XMFXcmuqvkH
NhKj8jDcPEEh31bYcb6aG2Z4onw5F2lquygjk1Qyy5cyw45m/ipJKAXKdAyvJG+R
VZCWOET/9wbRwFSK5wxwihCuKghFiofK52i2PcGtXZh0PCouyZZneSJOKM0yVWKO
BvuJBxK4ETRnQyN6ZxhuJiEXG3/YMBBhyR2TX1LntVK9ct/k7qFVzATG49J39/sR
SYMbptBRj4a5oMJ1qn0nFVEDFkg0jTnTDNnsEpcz60Ayt6EsJ1XosO5yz2huf861
xgRMTKMseyG1/uV45tQ8ZPzbSPpBxjUi9Dl3coYsIm1a+y6clWUXcarONY5KVrpS
CR98DuFgl+E7dXuisd/Kz2p2KxxSPq8nytsmLlgOvrUqhwiXqB+TKN8EHgIapVOt
l1A5LrzXFTcGlT9MlaWBqEIy83Bu1nqQqbxrAFOE0k8A5jomXaw=
=stU2
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2023-12-18
This PR is larger than usual and contains changes in various parts
of the kernel.
The main changes are:
1) Fix kCFI bugs in BPF, from Peter Zijlstra.
End result: all forms of indirect calls from BPF into kernel
and from kernel into BPF work with CFI enabled. This allows BPF
to work with CONFIG_FINEIBT=y.
2) Introduce BPF token object, from Andrii Nakryiko.
It adds an ability to delegate a subset of BPF features from privileged
daemon (e.g., systemd) through special mount options for userns-bound
BPF FS to a trusted unprivileged application. The design accommodates
suggestions from Christian Brauner and Paul Moore.
Example:
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei.
- Complete precision tracking support for register spills
- Fix verification of possibly-zero-sized stack accesses
- Fix access to uninit stack slots
- Track aligned STACK_ZERO cases as imprecise spilled registers.
It improves the verifier "instructions processed" metric from single
digit to 50-60% for some programs.
- Fix verifier retval logic
4) Support for VLAN tag in XDP hints, from Larysa Zaremba.
5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu.
End result: better memory utilization and lower I$ miss for calls to BPF
via BPF trampoline.
6) Fix race between BPF prog accessing inner map and parallel delete,
from Hou Tao.
7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu.
It allows BPF interact with IPSEC infra. The intent is to support
software RSS (via XDP) for the upcoming ipsec pcpu work.
Experiments on AWS demonstrate single tunnel pcpu ipsec reaching
line rate on 100G ENA nics.
8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao.
9) BPF file verification via fsverity, from Song Liu.
It allows BPF progs get fsverity digest.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (164 commits)
bpf: Ensure precise is reset to false in __mark_reg_const_zero()
selftests/bpf: Add more uprobe multi fail tests
bpf: Fail uprobe multi link with negative offset
selftests/bpf: Test the release of map btf
s390/bpf: Fix indirect trampoline generation
selftests/bpf: Temporarily disable dummy_struct_ops test on s390
x86/cfi,bpf: Fix bpf_exception_cb() signature
bpf: Fix dtor CFI
cfi: Add CFI_NOSEAL()
x86/cfi,bpf: Fix bpf_struct_ops CFI
x86/cfi,bpf: Fix bpf_callback_t CFI
x86/cfi,bpf: Fix BPF JIT call
cfi: Flip headers
selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment
selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test
selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment
bpf: Limit the number of kprobes when attaching program to multiple kprobes
bpf: Limit the number of uprobes when attaching program to multiple uprobes
bpf: xdp: Register generic_kfunc_set with XDP programs
selftests/bpf: utilize string values for delegate_xxx mount options
...
====================
Link: https://lore.kernel.org/r/20231219000520.34178-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It is safe to always start with imprecise SCALAR_VALUE register.
Previously __mark_reg_const_zero() relied on caller to reset precise
mark, but it's very error prone and we already missed it in a few
places. So instead make __mark_reg_const_zero() reset precision always,
as it's a safe default for SCALAR_VALUE. Explanation is basically the
same as for why we are resetting (or rather not setting) precision in
current state. If necessary, precision propagation will set it to
precise correctly.
As such, also remove a big comment about forward precision propagation
in mark_reg_stack_read() and avoid unnecessarily setting precision to
true after reading from STACK_ZERO stack. Again, precision propagation
will correctly handle this, if that SCALAR_VALUE register will ever be
needed to be precise.
Reported-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231218173601.53047-1-andrii@kernel.org
Add KUNIT_INIT_TABLE to the INIT_DATA linker section.
Alter the KUnit macros to create init tests:
kunit_test_init_section_suites
Update lib/kunit/executor.c to run both the suites in KUNIT_TABLE and
KUNIT_INIT_TABLE.
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Rae Moar <rmoar@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Currently the __uprobe_register will return 0 (success) when called with
negative offset. The reason is that the call to register_for_each_vma and
then build_map_info won't return error for negative offset. They just won't
do anything - no matching vma is found so there's no registered breakpoint
for the uprobe.
I don't think we can change the behaviour of __uprobe_register and fail
for negative uprobe offset, because apps might depend on that already.
But I think we can still make the change and check for it on bpf multi
link syscall level.
Also moving the __get_user call and check for the offsets to the top of
loop, to fail early without extra __get_user calls for ref_ctr_offset
and cookie arrays.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20231217215538.3361991-2-jolsa@kernel.org
because there are none, and thus prevent a lockdep splat
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmV/WSEACgkQEsHwGGHe
VUoQ7RAAoc9AdZJ8ZmTMLy/5/SS3542z3w3Ts5DxHziZrIzBclRx5RYBw3YgBA2q
AgbPIgqr1Y4+45gChhwKIAp7G6g2K8lpqNSJbqP/nFpwkmx7v/m5T/9DH7id1UyQ
uRKwxtEAwKz5XeR2czXI5Z+VvG9vIqkrdAR+dSRmcXgcfi8oJCKOywW+n7QFUarf
sYWdMQwbMNenl0y/o8MQXmTraQWkRJCqI5JXc23GIkr03z6ZUOwt2qAx2YW4GQQo
enXzdk9MokKU4IpRz/rU9j7qaOd9h/AZZXhACSMUVJqDQVDFJJO1rbktKcS17sNH
EgRv57xSAYsPvARk2wvw3INEIPOvL4Jb1s86MLa0eN2mH4mVwkqlv/KkUl1/RuHY
IuCgpNWliQ3nb6dQeEsp83EW6Ao8FTn3D8+66tbtCVXnMBFQEfUHWFnSfHqcUizb
JsRnBA9ke2t3Wu0ph/nZBzck+9kxp0PeUvio//x2IznjfeZ31fQYdVDYU0o8QgOe
Ns6MyCn1OcCulfZZTpUbMhy/5FjokKLf2Sfit1r8duDMXVu4cicSnuMUpVPvZd+A
8XcbH73kCO0DuUVtYMbYerFJLgZrcN1gzyPgAmegoaDXBVu4KkOIbXBGEJjv7pPs
h5p84zKibyjKhwa+bqlIgy9R9EYb9sPwbNv5eWgQFXzrvQVm5FA=
=s8Bq
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.7_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- Avoid iterating over newly created group leader event's siblings
because there are none, and thus prevent a lockdep splat
* tag 'perf_urgent_for_v6.7_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix perf_event_validate_size() lockdep splat
- Fix alloc_free_mem_region()'s scan for address space, prevent false
negative out-of-space events
- Fix sleeping lock acquisition from CXL trace event (atomic context)
- Fix put_device() like for the new CXL PMU driver
- Fix wrong pointer freed on error path
- Fixup several lockdep reports (missing lock hold) from new assertion
in cxl_num_decoders_committed() and new tests
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCZX6oZwAKCRDfioYZHlFs
ZyLRAPwLXinja3lpUab4mV6P6w87oO7qz1n4ly8vKpTTZZxaJAD/QGlqYS6YtiPo
IXA8QiHe9RX3bGKhYmzSOd2/JFjyhQc=
=2+M9
-----END PGP SIGNATURE-----
Merge tag 'cxl-fixes-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull CXL (Compute Express Link) fixes from Dan Williams:
"A collection of CXL fixes.
The touch outside of drivers/cxl/ is for a helper that allocates
physical address space. Device hotplug tests showed that the driver
failed to utilize (skipped over) valid capacity when allocating a new
memory region. Outside of that, new tests uncovered a small crop of
lockdep reports.
There is also some miscellaneous error path and leak fixups that are
not urgent, but useful to cleanup now.
- Fix alloc_free_mem_region()'s scan for address space, prevent false
negative out-of-space events
- Fix sleeping lock acquisition from CXL trace event (atomic context)
- Fix put_device() like for the new CXL PMU driver
- Fix wrong pointer freed on error path
- Fixup several lockdep reports (missing lock hold) from new
assertion in cxl_num_decoders_committed() and new tests"
* tag 'cxl-fixes-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/pmu: Ensure put_device on pmu devices
cxl/cdat: Free correct buffer on checksum error
cxl/hdm: Fix dpa translation locking
kernel/resource: Increment by align value in get_free_mem_region()
cxl: Add cxl_num_decoders_committed() usage to cxl_test
cxl/memdev: Hold region_rwsem during inject and clear poison ops
cxl/core: Always hold region_rwsem while reading poison lists
cxl/hdm: Fix a benign lockdep splat
Trying to probe update_sd_lb_stats() using perf results in the below
message in the kernel log:
trace_kprobe: Could not probe notrace function _text
This is because 'perf probe' specifies the kprobe location as an offset
from '_text':
$ sudo perf probe -D update_sd_lb_stats
p:probe/update_sd_lb_stats _text+1830728
However, the error message is misleading and doesn't help convey the
actual notrace function that is being probed. Fix this by looking up the
actual function name that is being probed. With this fix, we now get the
below message in the kernel log:
trace_kprobe: Could not probe notrace function update_sd_lb_stats.constprop.0
Link: https://lore.kernel.org/all/20231214051702.1687300-1-naveen@kernel.org/
Signed-off-by: Naveen N Rao <naveen@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
- Fix eventfs to check creating new files for events with names greater than
NAME_MAX. The eventfs lookup needs to check the return result of
simple_lookup().
- Fix the ring buffer to check the proper max data size. Events must be able to
fit on the ring buffer sub-buffer, if it cannot, then it fails to be written
and the logic to add the event is avoided. The code to check if an event can
fit failed to add the possible absolute timestamp which may make the event
not be able to fit. This causes the ring buffer to go into an infinite loop
trying to find a sub-buffer that would fit the event. Luckily, there's a check
that will bail out if it looped over a 1000 times and it also warns.
The real fix is not to add the absolute timestamp to an event that is
starting at the beginning of a sub-buffer because it uses the sub-buffer
timestamp. By avoiding the timestamp at the start of the sub-buffer allows
events that pass the first check to always find a sub-buffer that it can fit
on.
- Have large events that do not fit on a trace_seq to print "LINE TOO BIG" like
it does for the trace_pipe instead of what it does now which is to silently
drop the output.
- Fix a memory leak of forgetting to free the spare page that is saved by a
trace instance.
- Update the size of the snapshot buffer when the main buffer is updated if the
snapshot buffer is allocated.
- Fix ring buffer timestamp logic by removing all the places that tried to put
the before_stamp back to the write stamp so that the next event doesn't add
an absolute timestamp. But each of these updates added a race where by making
the two timestamp equal, it was validating the write_stamp so that it can be
incorrectly used for calculating the delta of an event.
- There's a temp buffer used for printing the event that was using the event
data size for allocation when it needed to use the size of the entire event
(meta-data and payload data)
- For hardening, use "%.*s" for printing the trace_marker output, to limit the
amount that is printed by the size of the event. This was discovered by
development that added a bug that truncated the '\0' and caused a crash.
- Fix a use-after-free bug in the use of the histogram files when an instance
is being removed.
- Remove a useless update in the rb_try_to_discard of the write_stamp. The
before_stamp was already changed to force the next event to add an absolute
timestamp that the write_stamp is not used. But the write_stamp is modified
again using an unneeded 64-bit cmpxchg.
- Fix several races in the 32-bit implementation of the rb_time_cmpxchg() that
does a 64-bit cmpxchg.
- While looking at fixing the 64-bit cmpxchg, I noticed that because the ring
buffer uses normal cmpxchg, and this can be done in NMI context, there's some
architectures that do not have a working cmpxchg in NMI context. For these
architectures, fail recording events that happen in NMI context.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZX0nChQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlOMAQD3iegTcceQl9lAsroa3tb3xdweC1GP
51MsX5athxSyoQEAutI/2pBCtLFXgTLMHAMd5F23EM1U9rha7W0myrnvKQY=
=d3bS
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix eventfs to check creating new files for events with names greater
than NAME_MAX. The eventfs lookup needs to check the return result of
simple_lookup().
- Fix the ring buffer to check the proper max data size. Events must be
able to fit on the ring buffer sub-buffer, if it cannot, then it
fails to be written and the logic to add the event is avoided. The
code to check if an event can fit failed to add the possible absolute
timestamp which may make the event not be able to fit. This causes
the ring buffer to go into an infinite loop trying to find a
sub-buffer that would fit the event. Luckily, there's a check that
will bail out if it looped over a 1000 times and it also warns.
The real fix is not to add the absolute timestamp to an event that is
starting at the beginning of a sub-buffer because it uses the
sub-buffer timestamp.
By avoiding the timestamp at the start of the sub-buffer allows
events that pass the first check to always find a sub-buffer that it
can fit on.
- Have large events that do not fit on a trace_seq to print "LINE TOO
BIG" like it does for the trace_pipe instead of what it does now
which is to silently drop the output.
- Fix a memory leak of forgetting to free the spare page that is saved
by a trace instance.
- Update the size of the snapshot buffer when the main buffer is
updated if the snapshot buffer is allocated.
- Fix ring buffer timestamp logic by removing all the places that tried
to put the before_stamp back to the write stamp so that the next
event doesn't add an absolute timestamp. But each of these updates
added a race where by making the two timestamp equal, it was
validating the write_stamp so that it can be incorrectly used for
calculating the delta of an event.
- There's a temp buffer used for printing the event that was using the
event data size for allocation when it needed to use the size of the
entire event (meta-data and payload data)
- For hardening, use "%.*s" for printing the trace_marker output, to
limit the amount that is printed by the size of the event. This was
discovered by development that added a bug that truncated the '\0'
and caused a crash.
- Fix a use-after-free bug in the use of the histogram files when an
instance is being removed.
- Remove a useless update in the rb_try_to_discard of the write_stamp.
The before_stamp was already changed to force the next event to add
an absolute timestamp that the write_stamp is not used. But the
write_stamp is modified again using an unneeded 64-bit cmpxchg.
- Fix several races in the 32-bit implementation of the
rb_time_cmpxchg() that does a 64-bit cmpxchg.
- While looking at fixing the 64-bit cmpxchg, I noticed that because
the ring buffer uses normal cmpxchg, and this can be done in NMI
context, there's some architectures that do not have a working
cmpxchg in NMI context. For these architectures, fail recording
events that happen in NMI context.
* tag 'trace-v6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI
ring-buffer: Have rb_time_cmpxchg() set the msb counter too
ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()
ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs
ring-buffer: Remove useless update to write_stamp in rb_try_to_discard()
ring-buffer: Do not try to put back write_stamp
tracing: Fix uaf issue when open the hist or hist_debug file
tracing: Add size check when printing trace_marker output
ring-buffer: Have saved event hold the entire event
ring-buffer: Do not update before stamp when switching sub-buffers
tracing: Update snapshot buffer on resize if it is allocated
ring-buffer: Fix memory leak of free page
eventfs: Fix events beyond NAME_MAX blocking tasks
tracing: Have large events show up as '[LINE TOO BIG]' instead of nothing
ring-buffer: Fix writing to the buffer with max_data_size
Ensure the various dtor functions match their prototype and retain
their CFI signatures, since they don't have their address taken, they
are prone to not getting CFI, making them impossible to call
indirectly.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.799451071@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF struct_ops uses __arch_prepare_bpf_trampoline() to write
trampolines for indirect function calls. These tramplines much have
matching CFI.
In order to obtain the correct CFI hash for the various methods, add a
matching structure that contains stub functions, the compiler will
generate correct CFI which we can pilfer for the trampolines.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.566977112@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The current BPF call convention is __nocfi, except when it calls !JIT things,
then it calls regular C functions.
It so happens that with FineIBT the __nocfi and C calling conventions are
incompatible. Specifically __nocfi will call at func+0, while FineIBT will have
endbr-poison there, which is not a valid indirect target. Causing #CP.
Notably this only triggers on IBT enabled hardware, which is probably why this
hasn't been reported (also, most people will have JIT on anyway).
Implement proper CFI prologues for the BPF JIT codegen and drop __nocfi for
x86.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.345270396@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This code is rarely (never?) enabled by distros, and it hasn't caught
anything in decades. Let's kill off this legacy debug code.
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are multiple ways to grab references to credentials, and the only
protection we have against overflowing it is the memory required to do
so.
With memory sizes only moving in one direction, let's bump the reference
count to 64-bit and move it outside the realm of feasibly overflowing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An abnormally big cnt may also be assigned to kprobe_multi.cnt when
attaching multiple kprobes. It will trigger the following warning in
kvmalloc_node():
if (unlikely(size > INT_MAX)) {
WARN_ON_ONCE(!(flags & __GFP_NOWARN));
return NULL;
}
Fix the warning by limiting the maximal number of kprobes in
bpf_kprobe_multi_link_attach(). If the number of kprobes is greater than
MAX_KPROBE_MULTI_CNT, the attachment will fail and return -E2BIG.
Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231215100708.2265609-3-houtao@huaweicloud.com
An abnormally big cnt may be passed to link_create.uprobe_multi.cnt,
and it will trigger the following warning in kvmalloc_node():
if (unlikely(size > INT_MAX)) {
WARN_ON_ONCE(!(flags & __GFP_NOWARN));
return NULL;
}
Fix the warning by limiting the maximal number of uprobes in
bpf_uprobe_multi_link_attach(). If the number of uprobes is greater than
MAX_UPROBE_MULTI_CNT, the attachment will return -E2BIG.
Fixes: 89ae89f53d ("bpf: Add multi uprobe link")
Reported-by: Xingwei Lee <xrivendell7@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Closes: https://lore.kernel.org/bpf/CABOYnLwwJY=yFAGie59LFsUsBAgHfroVqbzZ5edAXbFE3YiNVA@mail.gmail.com
Link: https://lore.kernel.org/bpf/20231215100708.2265609-2-houtao@huaweicloud.com
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZXxs8wAKCRDdBJ7gKXxA
junbAQCdItfHHinkWziciOrb0387wW+5WZ1ohqRFW8pGYLuasQEArpKmw13bvX7z
e+ec9K1Ek9MlIsO2RwORR4KHH4MAbwA=
=YpZh
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-12-15-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"17 hotfixes. 8 are cc:stable and the other 9 pertain to post-6.6
issues"
* tag 'mm-hotfixes-stable-2023-12-15-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/mglru: reclaim offlined memcgs harder
mm/mglru: respect min_ttl_ms with memcgs
mm/mglru: try to stop at high watermarks
mm/mglru: fix underprotected page cache
mm/shmem: fix race in shmem_undo_range w/THP
Revert "selftests: error out if kernel header files are not yet built"
crash_core: fix the check for whether crashkernel is from high memory
x86, kexec: fix the wrong ifdeffery CONFIG_KEXEC
sh, kexec: fix the incorrect ifdeffery and dependency of CONFIG_KEXEC
mips, kexec: fix the incorrect ifdeffery and dependency of CONFIG_KEXEC
m68k, kexec: fix the incorrect ifdeffery and build dependency of CONFIG_KEXEC
loongarch, kexec: change dependency of object files
mm/damon/core: make damon_start() waits until kdamond_fn() starts
selftests/mm: cow: print ksft header before printing anything else
mm: fix VMA heap bounds checking
riscv: fix VMALLOC_START definition
kexec: drop dependency on ARCH_SUPPORTS_KEXEC from CRASH_DUMP
One of the last remaining users of strlcpy() in the kernel is
kernfs_path_from_node_locked(), which passes back the problematic "length
we _would_ have copied" return value to indicate truncation. Convert the
chain of all callers to use the negative return value (some of which
already doing this explicitly). All callers were already also checking
for negative return values, so the risk to missed checks looks very low.
In this analysis, it was found that cgroup1_release_agent() actually
didn't handle the "too large" condition, so this is technically also a
bug fix. :)
Here's the chain of callers, and resolution identifying each one as now
handling the correct return value:
kernfs_path_from_node_locked()
kernfs_path_from_node()
pr_cont_kernfs_path()
returns void
kernfs_path()
sysfs_warn_dup()
return value ignored
cgroup_path()
blkg_path()
bfq_bic_update_cgroup()
return value ignored
TRACE_IOCG_PATH()
return value ignored
TRACE_CGROUP_PATH()
return value ignored
perf_event_cgroup()
return value ignored
task_group_path()
return value ignored
damon_sysfs_memcg_path_eq()
return value ignored
get_mm_memcg_path()
return value ignored
lru_gen_seq_show()
return value ignored
cgroup_path_from_kernfs_id()
return value ignored
cgroup_show_path()
already converted "too large" error to negative value
cgroup_path_ns_locked()
cgroup_path_ns()
bpf_iter_cgroup_show_fdinfo()
return value ignored
cgroup1_release_agent()
wasn't checking "too large" error
proc_cgroup_show()
already converted "too large" to negative value
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: <cgroups@vger.kernel.org>
Co-developed-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Link: https://lore.kernel.org/r/20231116192127.1558276-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231212211741.164376-3-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
By passing the fsugid to kernfs_create_dir_ns(), we don't need
cgroup_kn_set_ugid() any longer. That function was added for exactly
this purpose by commit 49957f8e2a ("cgroup: newly created dirs and
files should be owned by the creator").
Eliminating this piece of duplicate code means we benefit from future
improvements to kernfs_create_dir_ns(); for example, both are lacking
S_ISGID support currently, which my next patch will add to
kernfs_create_dir_ns(). It cannot (easily) be added to
cgroup_kn_set_ugid() because we can't dereference struct kernfs_iattrs
from there.
--
v1 -> v2: 12-digit commit id
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231208093310.297233-1-max.kellermann@ionos.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As the ring buffer recording requires cmpxchg() to work, if the
architecture does not support cmpxchg in NMI, then do not do any recording
within an NMI.
Link: https://lore.kernel.org/linux-trace-kernel/20231213175403.6fc18540@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The rb_time_cmpxchg() on 32-bit architectures requires setting three
32-bit words to represent the 64-bit timestamp, with some salt for
synchronization. Those are: msb, top, and bottom
The issue is, the rb_time_cmpxchg() did not properly salt the msb portion,
and the msb that was written was stale.
Link: https://lore.kernel.org/linux-trace-kernel/20231215084114.20899342@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: f03f2abce4 ("ring-buffer: Have 32 bit time stamps use all 64 bits")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The following race can cause rb_time_read() to observe a corrupted time
stamp:
rb_time_cmpxchg()
[...]
if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
return false;
if (!rb_time_read_cmpxchg(&t->top, top, top2))
return false;
<interrupted before updating bottom>
__rb_time_read()
[...]
do {
c = local_read(&t->cnt);
top = local_read(&t->top);
bottom = local_read(&t->bottom);
msb = local_read(&t->msb);
} while (c != local_read(&t->cnt));
*cnt = rb_time_cnt(top);
/* If top and msb counts don't match, this interrupted a write */
if (*cnt != rb_time_cnt(msb))
return false;
^ this check fails to catch that "bottom" is still not updated.
So the old "bottom" value is returned, which is wrong.
Fix this by checking that all three of msb, top, and bottom 2-bit cnt
values match.
The reason to favor checking all three fields over requiring a specific
update order for both rb_time_set() and rb_time_cmpxchg() is because
checking all three fields is more robust to handle partial failures of
rb_time_cmpxchg() when interrupted by nested rb_time_set().
Link: https://lore.kernel.org/lkml/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231212193049.680122-1-mathieu.desnoyers@efficios.com
Fixes: f458a14534 ("ring-buffer: Test last update in 32bit version of __rb_time_read()")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Mathieu Desnoyers pointed out an issue in the rb_time_cmpxchg() for 32 bit
architectures. That is:
static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
{
unsigned long cnt, top, bottom, msb;
unsigned long cnt2, top2, bottom2, msb2;
u64 val;
/* The cmpxchg always fails if it interrupted an update */
if (!__rb_time_read(t, &val, &cnt2))
return false;
if (val != expect)
return false;
<<<< interrupted here!
cnt = local_read(&t->cnt);
The problem is that the synchronization counter in the rb_time_t is read
*after* the value of the timestamp is read. That means if an interrupt
were to come in between the value being read and the counter being read,
it can change the value and the counter and the interrupted process would
be clueless about it!
The counter needs to be read first and then the value. That way it is easy
to tell if the value is stale or not. If the counter hasn't been updated,
then the value is still good.
Link: https://lore.kernel.org/linux-trace-kernel/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231212115301.7a9c9a64@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: 10464b4aa6 ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit")
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When filtering is enabled, a temporary buffer is created to place the
content of the trace event output so that the filter logic can decide
from the trace event output if the trace event should be filtered out or
not. If it is to be filtered out, the content in the temporary buffer is
simply discarded, otherwise it is written into the trace buffer.
But if an interrupt were to come in while a previous event was using that
temporary buffer, the event written by the interrupt would actually go
into the ring buffer itself to prevent corrupting the data on the
temporary buffer. If the event is to be filtered out, the event in the
ring buffer is discarded, or if it fails to discard because another event
were to have already come in, it is turned into padding.
The update to the write_stamp in the rb_try_to_discard() happens after a
fix was made to force the next event after the discard to use an absolute
timestamp by setting the before_stamp to zero so it does not match the
write_stamp (which causes an event to use the absolute timestamp).
But there's an effort in rb_try_to_discard() to put back the write_stamp
to what it was before the event was added. But this is useless and
wasteful because nothing is going to be using that write_stamp for
calculations as it still will not match the before_stamp.
Remove this useless update, and in doing so, we remove another
cmpxchg64()!
Also update the comments to reflect this change as well as remove some
extra white space in another comment.
Link: https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Fixes: b2dd797543 ("ring-buffer: Force absolute timestamp on discard of event")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If an update to an event is interrupted by another event between the time
the initial event allocated its buffer and where it wrote to the
write_stamp, the code try to reset the write stamp back to the what it had
just overwritten. It knows that it was overwritten via checking the
before_stamp, and if it didn't match what it wrote to the before_stamp
before it allocated its space, it knows it was overwritten.
To put back the write_stamp, it uses the before_stamp it read. The problem
here is that by writing the before_stamp to the write_stamp it makes the
two equal again, which means that the write_stamp can be considered valid
as the last timestamp written to the ring buffer. But this is not
necessarily true. The event that interrupted the event could have been
interrupted in a way that it was interrupted as well, and can end up
leaving with an invalid write_stamp. But if this happens and returns to
this context that uses the before_stamp to update the write_stamp again,
it can possibly incorrectly make it valid, causing later events to have in
correct time stamps.
As it is OK to leave this function with an invalid write_stamp (one that
doesn't match the before_stamp), there's no reason to try to make it valid
again in this case. If this race happens, then just leave with the invalid
write_stamp and the next event to come along will just add a absolute
timestamp and validate everything again.
Bonus points: This gets rid of another cmpxchg64!
Link: https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
An S4 (suspend to disk) test on the LoongArch 3A6000 platform sometimes
fails with the following error messaged in the dmesg log:
Invalid LZO compressed length
That happens because when compressing/decompressing the image, the
synchronization between the control thread and the compress/decompress/crc
thread is based on a relaxed ordering interface, which is unreliable, and the
following situation may occur:
CPU 0 CPU 1
save_image_lzo lzo_compress_threadfn
atomic_set(&d->stop, 1);
atomic_read(&data[thr].stop)
data[thr].cmp = data[thr].cmp_len;
WRITE data[thr].cmp_len
Then CPU0 gets a stale cmp_len and writes it to disk. During resume from S4,
wrong cmp_len is loaded.
To maintain data consistency between the two threads, use the acquire/release
variants of atomic set and read operations.
Fixes: 081a9d043c ("PM / Hibernate: Improve performance of LZO/plain hibernation, checksum image")
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Hongchen Zhang <zhanghongchen@loongson.cn>
Co-developed-by: Weihao Li <liweihao@loongson.cn>
Signed-off-by: Weihao Li <liweihao@loongson.cn>
[ rjw: Subject rewrite and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Wakeup events that occur in the hibernation process's
hibernation_platform_enter() cannot wake up the system. Although the
current hibernation framework will execute part of the recovery process
after a wakeup event occurs, it ultimately performs a shutdown operation
because the system does not check the return value of
hibernation_platform_enter(). In short, if a wakeup event occurs before
putting the system into the final low-power state, it will be missed.
To solve this problem, check the return value of
hibernation_platform_enter(). When it returns -EAGAIN or -EBUSY (indicate
the occurrence of a wakeup event), execute the hibernation recovery
process, discard the previously saved image, and ultimately return to the
working state.
Signed-off-by: Chris Feng <chris.feng@mediatek.com>
[ rjw: Rephrase the message printed when going back to the working state ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When lockdep is enabled, the for_each_sibling_event(sibling, event)
macro checks that event->ctx->mutex is held. When creating a new group
leader event, we call perf_event_validate_size() on a partially
initialized event where event->ctx is NULL, and so when
for_each_sibling_event() attempts to check event->ctx->mutex, we get a
splat, as reported by Lucas De Marchi:
WARNING: CPU: 8 PID: 1471 at kernel/events/core.c:1950 __do_sys_perf_event_open+0xf37/0x1080
This only happens for a new event which is its own group_leader, and in
this case there cannot be any sibling events. Thus it's safe to skip the
check for siblings, which avoids having to make invasive and ugly
changes to for_each_sibling_event().
Avoid the splat by bailing out early when the new event is its own
group_leader.
Fixes: 382c27f4ed ("perf: Fix perf_event_validate_size()")
Closes: https://lore.kernel.org/lkml/20231214000620.3081018-1-lucas.demarchi@intel.com/
Closes: https://lore.kernel.org/lkml/ZXpm6gQ%2Fd59jGsuW@xpf.sh.intel.com/
Reported-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231215112450.3972309-1-mark.rutland@arm.com
Reproduced with below sequence:
dma_declare_coherent_memory()->dma_release_coherent_memory()
->dma_declare_coherent_memory()->"return -EBUSY" error
It will return -EBUSY from the dma_assign_coherent_memory()
in dma_declare_coherent_memory(), the reason is that dev->dma_mem
pointer has not been set to NULL after it's freed.
Fixes: cf65a0f6f6 ("dma-mapping: move all DMA mapping code to kernel/dma")
Signed-off-by: Joakim Zhang <joakim.zhang@cixtech.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
If multiple areas and multiple IO TLB pools exist, first iterate the
current CPU specific area in all pools. Then move to the next area index.
This is best illustrated by a diagram:
area 0 | area 1 | ... | area M |
pool 0 A B C
pool 1 D E
...
pool N F G H
Currently, each pool is searched before moving on to the next pool,
i.e. the search order is A, B ... C, D, E ... F, G ... H. With this patch,
each area is searched in all pools before moving on to the next area,
i.e. the search order is A, D ... F, B, E ... G ... C ... H.
Note that preemption is not disabled, and raw_smp_processor_id() may not
return a stable result, but it is called only once to determine the initial
area index. The search will iterate over all areas eventually, even if the
current task is preempted.
Next, some pools may have less (but not more) areas than default_nareas.
Skip such pools if the distance from the initial area index is greater than
pool->nareas. This logic ensures that for every pool the search starts in
the initial CPU's own area and never tries any area twice.
To verify performance impact, I booted the kernel with a minimum pool
size ("swiotlb=512,4,force"), so multiple pools get allocated, and I ran
these benchmarks:
- small: single-threaded I/O of 4 KiB blocks,
- big: single-threaded I/O of 64 KiB blocks,
- 4way: 4-way parallel I/O of 4 KiB blocks.
The "var" column in the tables below is the coefficient of variance over 5
runs of the test, the "diff" column is the relative difference against base
in read-write I/O bandwidth (MiB/s).
Tested on an x86 VM against a QEMU virtio SATA driver backed by a RAM-based
block device on the host:
base patched
var var diff
small 0.69% 0.62% +25.4%
big 2.14% 2.27% +25.7%
4way 2.65% 1.70% +23.6%
Tested on a Raspberry Pi against a class-10 A1 microSD card:
base patched
var var diff
small 0.53% 1.96% -0.3%
big 0.02% 0.57% +0.8%
4way 6.17% 0.40% +0.3%
These results confirm that there is significant performance boost in the
software IO TLB slot allocation itself. Where performance is dominated by
actual hardware, there is no measurable change.
Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Reviewed-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
Signed-off-by: Christoph Hellwig <hch@lst.de>
A bus_dma_region necessarily stores both CPU and DMA base addresses for
a range, so there's no need to also store the difference between them.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Registering generic_kfunc_set with XDP programs enables some of the
newer BPF features inside XDP -- namely tree based data structures and
BPF exceptions.
The current motivation for this commit is to enable assertions inside
XDP bpf progs. Assertions are a standard and useful tool to encode
intent.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/d07d4614b81ca6aada44fcb89bb6b618fb66e4ca.1702594357.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.
Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.
There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.
Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:
12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
$ mount | grep token
bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231214225016.1209867-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When running `./test_progs -j` in my local vm with latest kernel,
I once hit a kasan error like below:
[ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0
[ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830
[ 1887.186498]
[ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G OEL 6.7.0-rc3-00699-g90679706d486-dirty #494
[ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred
[ 1887.190341] Call Trace:
[ 1887.190666] <TASK>
[ 1887.190949] dump_stack_lvl+0xac/0xe0
[ 1887.191423] ? nf_tcp_handle_invalid+0x1b0/0x1b0
[ 1887.192019] ? panic+0x3c0/0x3c0
[ 1887.192449] print_report+0x14f/0x720
[ 1887.192930] ? preempt_count_sub+0x1c/0xd0
[ 1887.193459] ? __virt_addr_valid+0xac/0x120
[ 1887.194004] ? bpf_rb_root_free+0x1f8/0x2b0
[ 1887.194572] kasan_report+0xc3/0x100
[ 1887.195085] ? bpf_rb_root_free+0x1f8/0x2b0
[ 1887.195668] bpf_rb_root_free+0x1f8/0x2b0
[ 1887.196183] ? __bpf_obj_drop_impl+0xb0/0xb0
[ 1887.196736] ? preempt_count_sub+0x1c/0xd0
[ 1887.197270] ? preempt_count_sub+0x1c/0xd0
[ 1887.197802] ? _raw_spin_unlock+0x1f/0x40
[ 1887.198319] bpf_obj_free_fields+0x1d4/0x260
[ 1887.198883] array_map_free+0x1a3/0x260
[ 1887.199380] bpf_map_free_deferred+0x7b/0xe0
[ 1887.199943] process_scheduled_works+0x3a2/0x6c0
[ 1887.200549] worker_thread+0x633/0x890
[ 1887.201047] ? __kthread_parkme+0xd7/0xf0
[ 1887.201574] ? kthread+0x102/0x1d0
[ 1887.202020] kthread+0x1ab/0x1d0
[ 1887.202447] ? pr_cont_work+0x270/0x270
[ 1887.202954] ? kthread_blkcg+0x50/0x50
[ 1887.203444] ret_from_fork+0x34/0x50
[ 1887.203914] ? kthread_blkcg+0x50/0x50
[ 1887.204397] ret_from_fork_asm+0x11/0x20
[ 1887.204913] </TASK>
[ 1887.204913] </TASK>
[ 1887.205209]
[ 1887.205416] Allocated by task 2197:
[ 1887.205881] kasan_set_track+0x3f/0x60
[ 1887.206366] __kasan_kmalloc+0x6e/0x80
[ 1887.206856] __kmalloc+0xac/0x1a0
[ 1887.207293] btf_parse_fields+0xa15/0x1480
[ 1887.207836] btf_parse_struct_metas+0x566/0x670
[ 1887.208387] btf_new_fd+0x294/0x4d0
[ 1887.208851] __sys_bpf+0x4ba/0x600
[ 1887.209292] __x64_sys_bpf+0x41/0x50
[ 1887.209762] do_syscall_64+0x4c/0xf0
[ 1887.210222] entry_SYSCALL_64_after_hwframe+0x63/0x6b
[ 1887.210868]
[ 1887.211074] Freed by task 36:
[ 1887.211460] kasan_set_track+0x3f/0x60
[ 1887.211951] kasan_save_free_info+0x28/0x40
[ 1887.212485] ____kasan_slab_free+0x101/0x180
[ 1887.213027] __kmem_cache_free+0xe4/0x210
[ 1887.213514] btf_free+0x5b/0x130
[ 1887.213918] rcu_core+0x638/0xcc0
[ 1887.214347] __do_softirq+0x114/0x37e
The error happens at bpf_rb_root_free+0x1f8/0x2b0:
00000000000034c0 <bpf_rb_root_free>:
; {
34c0: f3 0f 1e fa endbr64
34c4: e8 00 00 00 00 callq 0x34c9 <bpf_rb_root_free+0x9>
34c9: 55 pushq %rbp
34ca: 48 89 e5 movq %rsp, %rbp
...
; if (rec && rec->refcount_off >= 0 &&
36aa: 4d 85 ed testq %r13, %r13
36ad: 74 a9 je 0x3658 <bpf_rb_root_free+0x198>
36af: 49 8d 7d 10 leaq 0x10(%r13), %rdi
36b3: e8 00 00 00 00 callq 0x36b8 <bpf_rb_root_free+0x1f8>
<==== kasan function
36b8: 45 8b 7d 10 movl 0x10(%r13), %r15d
<==== use-after-free load
36bc: 45 85 ff testl %r15d, %r15d
36bf: 78 8c js 0x364d <bpf_rb_root_free+0x18d>
So the problem is at rec->refcount_off in the above.
I did some source code analysis and find the reason.
CPU A CPU B
bpf_map_put:
...
btf_put with rcu callback
...
bpf_map_free_deferred
with system_unbound_wq
... ... ...
... btf_free_rcu: ...
... ... bpf_map_free_deferred:
... ...
... ---------> btf_struct_metas_free()
... | race condition ...
... ---------> map->ops->map_free()
...
... btf->struct_meta_tab = NULL
In the above, map_free() corresponds to array_map_free() and eventually
calling bpf_rb_root_free() which calls:
...
__bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
...
Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with following code:
meta = btf_find_struct_meta(btf, btf_id);
if (!meta)
return -EFAULT;
rec->fields[i].graph_root.value_rec = meta->record;
So basically, 'value_rec' is a pointer to the record in struct_metas_tab.
And it is possible that that particular record has been freed by
btf_struct_metas_free() and hence we have a kasan error here.
Actually it is very hard to reproduce the failure with current bpf/bpf-next
code, I only got the above error once. To increase reproducibility, I added
a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which
significantly increased reproducibility.
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5e43ddd1b83f..aae5b5213e93 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
struct bpf_map *map = container_of(work, struct bpf_map, work);
struct btf_record *rec = map->record;
+ mdelay(100);
security_bpf_map_free(map);
bpf_map_release_memcg(map);
/* implementation dependent freeing */
Hao also provided test cases ([1]) for easily reproducing the above issue.
There are two ways to fix the issue, the v1 of the patch ([2]) moving
btf_put() after map_free callback, and the v5 of the patch ([3]) using
a kptr style fix which tries to get a btf reference during
map_check_btf(). Each approach has its pro and cons. The first approach
delays freeing btf while the second approach needs to acquire reference
depending on context which makes logic not very elegant and may
complicate things with future new data structures. Alexei
suggested in [4] going back to v1 which is what this patch
tries to do.
Rerun './test_progs -j' with the above mdelay() hack for a couple
of times and didn't observe the error for the above rb_root test cases.
Running Hou's test ([1]) is also successful.
[1] https://lore.kernel.org/bpf/20231207141500.917136-1-houtao@huaweicloud.com/
[2] v1: https://lore.kernel.org/bpf/20231204173946.3066377-1-yonghong.song@linux.dev/
[3] v5: https://lore.kernel.org/bpf/20231208041621.2968241-1-yonghong.song@linux.dev/
[4] v4: https://lore.kernel.org/bpf/CAADnVQJ3FiXUhZJwX_81sjZvSYYKCFB3BT6P8D59RS2Gu+0Z7g@mail.gmail.com/
Cc: Hou Tao <houtao@huaweicloud.com>
Fixes: 958cf2e273 ("bpf: Introduce bpf_obj_new")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231214203815.1469107-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
rcu_read_lock() is no longer held when invoking bpf_event_entry_gen()
which is called by perf_event_fd_array_get_ptr(), so using GFP_KERNEL
instead of GFP_ATOMIC to reduce the possibility of failures due to
out-of-memory.
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231214043010.3458072-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or
ops->map_fd_put_ptr(), so doesn't use rcu-read-lock for these two
callbacks.
For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need
rcu-read-lock because array->ptrs must still be allocated. For
bpf_fd_htab_map_update_elem(), htab_map_update_elem() only requires
rcu-read-lock to be held to avoid the WARN_ON_ONCE(), so only use
rcu_read_lock() during the invocation of htab_map_update_elem().
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231214043010.3458072-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
KASAN report following issue. The root cause is when opening 'hist'
file of an instance and accessing 'trace_event_file' in hist_show(),
but 'trace_event_file' has been freed due to the instance being removed.
'hist_debug' file has the same problem. To fix it, call
tracing_{open,release}_file_tr() in file_operations callback to have
the ref count and avoid 'trace_event_file' being freed.
BUG: KASAN: slab-use-after-free in hist_show+0x11e0/0x1278
Read of size 8 at addr ffff242541e336b8 by task head/190
CPU: 4 PID: 190 Comm: head Not tainted 6.7.0-rc5-g26aff849438c #133
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x98/0xf8
show_stack+0x1c/0x30
dump_stack_lvl+0x44/0x58
print_report+0xf0/0x5a0
kasan_report+0x80/0xc0
__asan_report_load8_noabort+0x1c/0x28
hist_show+0x11e0/0x1278
seq_read_iter+0x344/0xd78
seq_read+0x128/0x1c0
vfs_read+0x198/0x6c8
ksys_read+0xf4/0x1e0
__arm64_sys_read+0x70/0xa8
invoke_syscall+0x70/0x260
el0_svc_common.constprop.0+0xb0/0x280
do_el0_svc+0x44/0x60
el0_svc+0x34/0x68
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x168/0x170
Allocated by task 188:
kasan_save_stack+0x28/0x50
kasan_set_track+0x28/0x38
kasan_save_alloc_info+0x20/0x30
__kasan_slab_alloc+0x6c/0x80
kmem_cache_alloc+0x15c/0x4a8
trace_create_new_event+0x84/0x348
__trace_add_new_event+0x18/0x88
event_trace_add_tracer+0xc4/0x1a0
trace_array_create_dir+0x6c/0x100
trace_array_create+0x2e8/0x568
instance_mkdir+0x48/0x80
tracefs_syscall_mkdir+0x90/0xe8
vfs_mkdir+0x3c4/0x610
do_mkdirat+0x144/0x200
__arm64_sys_mkdirat+0x8c/0xc0
invoke_syscall+0x70/0x260
el0_svc_common.constprop.0+0xb0/0x280
do_el0_svc+0x44/0x60
el0_svc+0x34/0x68
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x168/0x170
Freed by task 191:
kasan_save_stack+0x28/0x50
kasan_set_track+0x28/0x38
kasan_save_free_info+0x34/0x58
__kasan_slab_free+0xe4/0x158
kmem_cache_free+0x19c/0x508
event_file_put+0xa0/0x120
remove_event_file_dir+0x180/0x320
event_trace_del_tracer+0xb0/0x180
__remove_instance+0x224/0x508
instance_rmdir+0x44/0x78
tracefs_syscall_rmdir+0xbc/0x140
vfs_rmdir+0x1cc/0x4c8
do_rmdir+0x220/0x2b8
__arm64_sys_unlinkat+0xc0/0x100
invoke_syscall+0x70/0x260
el0_svc_common.constprop.0+0xb0/0x280
do_el0_svc+0x44/0x60
el0_svc+0x34/0x68
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x168/0x170
Link: https://lore.kernel.org/linux-trace-kernel/20231214012153.676155-1-zhengyejian1@huawei.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since commit 638e4b825d ("bpf: Allows per-cpu maps and map-in-map in
sleepable programs"), sleepable BPF program can also use map-in-map, but
maybe_wait_bpf_programs() doesn't handle it accordingly. The main reason
is that using synchronize_rcu_tasks_trace() to wait for the completions
of these sleepable BPF programs may incur a very long delay and
userspace may think it is hung, so the wait for sleepable BPF programs
is skipped. Update the comments in maybe_wait_bpf_programs() to reflect
the reason.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20231211083447.1921178-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
security_path_* based LSM hooks appear to be generally missing from
the sleepable_lsm_hooks list. Initially add a small subset of them to
the preexisting sleepable_lsm_hooks list so that sleepable BPF helpers
like bpf_d_path() can be used from sleepable BPF LSM based programs.
The security_path_* hooks added in this patch are similar to the
security_inode_* counterparts that already exist in the
sleepable_lsm_hooks list, and are called in roughly similar points and
contexts. Presumably, making them OK to be also annotated as
sleepable.
Building a kernel with DEBUG_ATOMIC_SLEEP options enabled and running
reasonable workloads stimulating activity that would be intercepted by
such security hooks didn't show any splats.
Notably, I haven't added all the security_path_* LSM hooks that are
available as I don't need them at this point in time.
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/ZXM3IHHXpNY9y82a@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It's quite confusing in practice when it's possible to successfully
create a BPF token from BPF FS that didn't have any of delegate_xxx
mount options set up. While it's not wrong, it's actually more
meaningful to reject BPF_TOKEN_CREATE with specific error code (-ENOENT)
to let user-space know that no token delegation is setup up.
So, instead of creating empty BPF token that will be always ignored
because it doesn't have any of the allow_xxx bits set, reject it with
-ENOENT. If we ever need empty BPF token to be possible, we can support
that with extra flag passed into BPF_TOKEN_CREATE.
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231213190842.3844987-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Parse uid and gid in bpf_parse_param() so that they can be passed in as
the `data` parameter when mount() bpffs. This will be useful when we
want to control which user/group has the control to the mounted bpffs,
otherwise a separate chown() call will be needed.
Signed-off-by: Jie Jiang <jiejiang@chromium.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Mike Frysinger <vapier@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231212093923.497838-1-jiejiang@chromium.org
If an rcutorture test scenario creates an fqs_task kthread, it will
periodically invoke rcu_force_quiescent_state() in order to start
force-quiescent-state (FQS) operations. However, an FQS operation
will be started even if there is no RCU grace period in progress.
Although testing FQS operations startup when there is no grace period in
progress is necessary, it need not happen all that often. This commit
therefore causes rcu_force_quiescent_state() to take an early exit
if there is no grace period in progress.
Note that there will still be attempts to start an FQS scan in the
absence of a grace period because the grace period might end right
after the rcu_force_quiescent_state() function's check. In actual
testing, this happens about once every ten minutes, which should
provide adequate testing.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
If for some reason the trace_marker write does not have a nul byte for the
string, it will overflow the print:
trace_seq_printf(s, ": %s", field->buf);
The field->buf could be missing the nul byte. To prevent overflow, add the
max size that the buf can be by using the event size and the field
location.
int max = iter->ent_size - offsetof(struct print_entry, buf);
trace_seq_printf(s, ": %*.s", max, field->buf);
Link: https://lore.kernel.org/linux-trace-kernel/20231212084444.4619b8ce@gandalf.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
For the ring buffer iterator (non-consuming read), the event needs to be
copied into the iterator buffer to make sure that a writer does not
overwrite it while the user is reading it. If a write happens during the
copy, the buffer is simply discarded.
But the temp buffer itself was not big enough. The allocation of the
buffer was only BUF_MAX_DATA_SIZE, which is the maximum data size that can
be passed into the ring buffer and saved. But the temp buffer needs to
hold the meta data as well. That would be BUF_PAGE_SIZE and not
BUF_MAX_DATA_SIZE.
Link: https://lore.kernel.org/linux-trace-kernel/20231212072558.61f76493@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 785888c544 ("ring-buffer: Have rb_iter_head_event() handle concurrent writer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The ring buffer timestamps are synchronized by two timestamp placeholders.
One is the "before_stamp" and the other is the "write_stamp" (sometimes
referred to as the "after stamp" but only in the comments. These two
stamps are key to knowing how to handle nested events coming in with a
lockless system.
When moving across sub-buffers, the before stamp is updated but the write
stamp is not. There's an effort to put back the before stamp to something
that seems logical in case there's nested events. But as the current event
is about to cross sub-buffers, and so will any new nested event that happens,
updating the before stamp is useless, and could even introduce new race
conditions.
The first event on a sub-buffer simply uses the sub-buffer's timestamp
and keeps a "delta" of zero. The "before_stamp" and "write_stamp" are not
used in the algorithm in this case. There's no reason to try to fix the
before_stamp when this happens.
As a bonus, it removes a cmpxchg() when crossing sub-buffers!
Link: https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If crash_base is equal to CRASH_ADDR_LOW_MAX, it also indicates that
the crashkernel memory is allocated from high memory. However, the
current check only considers the case where crash_base is greater than
CRASH_ADDR_LOW_MAX. Fix it.
The runtime effects is that crashkernel high memory is successfully
reserved, whereas the crashkernel low memory is bypassed in this case,
then kdump kernel bootup will fail because of no low memory under 4G.
This patch also includes some minor cleanups.
Link: https://lkml.kernel.org/r/20231209141438.77233-1-ytcoode@gmail.com
Fixes: 0ab97169aa ("crash_core: add generic function to do reservation")
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In commit f8ff23429c62 ("kernel/Kconfig.kexec: drop select of KEXEC for
CRASH_DUMP") we tried to fix a config regression, where CONFIG_CRASH_DUMP
required CONFIG_KEXEC.
However, it was not enough at least for arm64 platforms. While further
testing the patch with our arm64 config I noticed that CONFIG_CRASH_DUMP
is unavailable in menuconfig. This is because CONFIG_CRASH_DUMP still
depends on the new CONFIG_ARCH_SUPPORTS_KEXEC introduced in commit
91506f7e5d ("arm64/kexec: refactor for kernel/Kconfig.kexec") and on
arm64 CONFIG_ARCH_SUPPORTS_KEXEC requires CONFIG_PM_SLEEP_SMP=y, which in
turn requires either CONFIG_SUSPEND=y or CONFIG_HIBERNATION=y neither of
which are set in our config.
Given that we already established that CONFIG_KEXEC (which is a switch for
kexec system call itself) is not required for CONFIG_CRASH_DUMP drop
CONFIG_ARCH_SUPPORTS_KEXEC dependency as well. The arm64 kernel builds
just fine with CONFIG_CRASH_DUMP=y and with both CONFIG_KEXEC=n and
CONFIG_KEXEC_FILE=n after f8ff23429c62 ("kernel/Kconfig.kexec: drop select
of KEXEC for CRASH_DUMP") and this patch are applied given that the
necessary shared bits are included via CONFIG_KEXEC_CORE dependency.
[bhe@redhat.com: don't export some symbols when CONFIG_MMU=n]
Link: https://lkml.kernel.org/r/ZW03ODUKGGhP1ZGU@MiWiFi-R3L-srv
[bhe@redhat.com: riscv, kexec: fix dependency of two items]
Link: https://lkml.kernel.org/r/ZW04G/SKnhbE5mnX@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20231129220409.55006-1-ignat@cloudflare.com
Fixes: 91506f7e5d ("arm64/kexec: refactor for kernel/Kconfig.kexec")
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: <stable@vger.kernel.org> # 6.6+: f8ff234: kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
Cc: <stable@vger.kernel.org> # 6.6+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The snapshot buffer is to mimic the main buffer so that when a snapshot is
needed, the snapshot and main buffer are swapped. When the snapshot buffer
is allocated, it is set to the minimal size that the ring buffer may be at
and still functional. When it is allocated it becomes the same size as the
main ring buffer, and when the main ring buffer changes in size, it should
do.
Currently, the resize only updates the snapshot buffer if it's used by the
current tracer (ie. the preemptirqsoff tracer). But it needs to be updated
anytime it is allocated.
When changing the size of the main buffer, instead of looking to see if
the current tracer is utilizing the snapshot buffer, just check if it is
allocated to know if it should be updated or not.
Also fix typo in comment just above the code change.
Link: https://lore.kernel.org/linux-trace-kernel/20231210225447.48476a6a@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: ad909e21bb ("tracing: Add internal tracing_snapshot() functions")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reading the ring buffer does a swap of a sub-buffer within the ring buffer
with a empty sub-buffer. This allows the reader to have full access to the
content of the sub-buffer that was swapped out without having to worry
about contention with the writer.
The readers call ring_buffer_alloc_read_page() to allocate a page that
will be used to swap with the ring buffer. When the code is finished with
the reader page, it calls ring_buffer_free_read_page(). Instead of freeing
the page, it stores it as a spare. Then next call to
ring_buffer_alloc_read_page() will return this spare instead of calling
into the memory management system to allocate a new page.
Unfortunately, on freeing of the ring buffer, this spare page is not
freed, and causes a memory leak.
Link: https://lore.kernel.org/linux-trace-kernel/20231210221250.7b9cc83c@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 73a757e631 ("ring-buffer: Return reader page back into existing ring buffer")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The maximum ring buffer data size is the maximum size of data that can be
recorded on the ring buffer. Events must be smaller than the sub buffer
data size minus any meta data. This size is checked before trying to
allocate from the ring buffer because the allocation assumes that the size
will fit on the sub buffer.
The maximum size was calculated as the size of a sub buffer page (which is
currently PAGE_SIZE minus the sub buffer header) minus the size of the
meta data of an individual event. But it missed the possible adding of a
time stamp for events that are added long enough apart that the event meta
data can't hold the time delta.
When an event is added that is greater than the current BUF_MAX_DATA_SIZE
minus the size of a time stamp, but still less than or equal to
BUF_MAX_DATA_SIZE, the ring buffer would go into an infinite loop, looking
for a page that can hold the event. Luckily, there's a check for this loop
and after 1000 iterations and a warning is emitted and the ring buffer is
disabled. But this should never happen.
This can happen when a large event is added first, or after a long period
where an absolute timestamp is prefixed to the event, increasing its size
by 8 bytes. This passes the check and then goes into the algorithm that
causes the infinite loop.
For events that are the first event on the sub-buffer, it does not need to
add a timestamp, because the sub-buffer itself contains an absolute
timestamp, and adding one is redundant.
The fix is to check if the event is to be the first event on the
sub-buffer, and if it is, then do not add a timestamp.
This also fixes 32 bit adding a timestamp when a read of before_stamp or
write_stamp is interrupted. There's still no need to add that timestamp if
the event is going to be the first event on the sub buffer.
Also, if the buffer has "time_stamp_abs" set, then also check if the
length plus the timestamp is greater than the BUF_MAX_DATA_SIZE.
Link: https://lore.kernel.org/all/20231212104549.58863438@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231212071837.5fdd6c13@gandalf.local.home
Link: https://lore.kernel.org/linux-trace-kernel/20231212111617.39e02849@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: a4543a2fa9 ("ring-buffer: Get timestamp after event is allocated")
Fixes: 58fbc3c632 ("ring-buffer: Consolidate add_timestamp to remove some branches")
Reported-by: Kent Overstreet <kent.overstreet@linux.dev> # (on IRC)
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This patch adds a comment to check_mem_size_reg -- a function whose
meaning is not very transparent. The function implicitly deals with two
registers connected by convention, which is not obvious.
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231210225149.67639-1-andreimatei1@gmail.com
The function are defined in the verifier.c file, but not called
elsewhere, so delete the unused function.
kernel/bpf/verifier.c:3448:20: warning: unused function 'bt_set_slot'
kernel/bpf/verifier.c:3453:20: warning: unused function 'bt_clear_slot'
kernel/bpf/verifier.c:3488:20: warning: unused function 'bt_is_slot_set'
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231212005436.103829-1-yang.lee@linux.alibaba.com
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=7714
Honestly, there's little value in having a helper with and without that
int __user *ufd argument. It's just messy and doesn't really give us
anything. Just expose receive_fd() with that argument and get rid of
that helper.
Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-5-e73ca6f4ea83@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Linus suggested that the kconfig here is confusing:
https://lore.kernel.org/all/CAHk-=wgUiAtiszwseM1p2fCJ+sC4XWQ+YN4TanFhUgvUqjr9Xw@mail.gmail.com/
Let's break it into three kconfigs controlling distinct things:
- CONFIG_IOMMU_MM_DATA controls if the mm_struct has the additional
fields for the IOMMU. Currently only PASID, but later patches store
a struct iommu_mm_data *
- CONFIG_ARCH_HAS_CPU_PASID controls if the arch needs the scheduling bit
for keeping track of the ENQCMD instruction. x86 will select this if
IOMMU_SVA is enabled
- IOMMU_SVA controls if the IOMMU core compiles in the SVA support code
for iommu driver use and the IOMMU exported API
This way ARM will not enable CONFIG_ARCH_HAS_CPU_PASID
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20231027000525.1278806-2-tina.zhang@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Use the fact that we are passing subprog index around and have
a corresponding struct bpf_subprog_info in bpf_verifier_env for each
subprogram. We don't need to separately pass around a flag whether
subprog is exception callback or not, each relevant verifier function
can determine this using provided subprog index if we maintain
bpf_subprog_info properly.
Also move out exception callback-specific logic from
btf_prepare_func_args(), keeping it generic. We can enforce all these
restriction right before exception callback verification pass. We add
out parameter, arg_cnt, for now, but this will be unnecessary with
subsequent refactoring and will be removed.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231204233931.49758-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If an SRCU barrier is queued while callbacks are running and a new
callbacks invocator for the same sdp were to run concurrently, the
RCU barrier might execute too early. As this requirement is non-obvious,
make sure to keep a record.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
While in grace period start, there is nothing to accelerate and
therefore no need to advance the callbacks either if no callback is
to be enqueued.
Spare these needless operations in this case.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Callbacks advancing on SRCU must be performed on two specific places:
1) On enqueue time in order to make room for the acceleration of the
new callback.
2) On invocation time in order to move the callbacks ready to invoke.
Any other callback advancing callsite is needless. Remove the remaining
one in srcu_gp_start().
Co-developed-by: Yong He <zhuangel570@gmail.com>
Signed-off-by: Yong He <zhuangel570@gmail.com>
Co-developed-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Co-developed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
The error variable in snapshot_write_next() gets a value before it is
used, so don't initialize it to 0 upfront.
Signed-off-by: Li zeming <zeming@nfschina.com>
[ rjw: Subject and changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
'error' first receives the function result before it is used, and it
does not need to be assigned a value during definition.
Signed-off-by: Li zeming <zeming@nfschina.com>
[ rjw: Subject rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is not necessary to intialize the error variable in
create_basic_memory_bitmaps(), because it is only read after
being assigned a value.
Signed-off-by: Wang chaodong <chaodong@nfschina.com>
[ rjw: Subject and changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Although the RCU CPU stall notifiers can be useful for dumping state when
tracking down delicate forward-progress bugs where NUMA effects cause
cache lines to be delivered to a given CPU regularly, but always in a
state that prevents that CPU from making forward progress. These bugs can
be detected by the RCU CPU stall-warning mechanism, but in some cases,
the stall-warnings printk()s disrupt the forward-progress bug before
any useful state can be obtained.
Unfortunately, the notifier mechanism added by commit 5b404fdaba ("rcu:
Add RCU CPU stall notifier") can make matters worse if used at all
carelessly. For example, if the stall warning was caused by a lock not
being released, then any attempt to acquire that lock in the notifier
will hang. This will prevent not only the notifier from producing any
useful output, but it will also prevent the stall-warning message from
ever appearing.
This commit therefore hides this new RCU CPU stall notifier
mechanism under a new RCU_CPU_STALL_NOTIFIER Kconfig option that
depends on both DEBUG_KERNEL and RCU_EXPERT. In addition, the
rcupdate.rcu_cpu_stall_notifiers=1 kernel boot parameter must also
be specified. The RCU_CPU_STALL_NOTIFIER Kconfig option's help text
contains a warning and explains the dangers of careless use, recommending
lockless notifier code. In addition, a WARN() is triggered each time
that an attempt is made to register a stall-warning notifier in kernels
built with CONFIG_RCU_CPU_STALL_NOTIFIER=y.
This combination of measures will keep use of this mechanism confined to
debug kernels and away from routine deployments.
[ paulmck: Apply Dan Carpenter feedback. ]
Fixes: 5b404fdaba ("rcu: Add RCU CPU stall notifier")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
The task_struct structure's ->rcu_tasks_idle_cpu can be concurrently
read and written from the RCU Tasks grace-period kthread and from the
CPU on which the task_struct structure's task is running. This commit
therefore marks the accesses appropriately.
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
If multiple CPUs trigger softlockup at the same time with
'softlockup_all_cpu_backtrace=0', the softlockup's logs will appear
staggeredly in dmesg, which will affect the viewing of the logs for
developer. Since the code path for outputting softlockup logs is not a
kernel hotspot and the performance requirements for the code are not
strict, locks are used to serialize the softlockup log output to improve
the readability of the logs.
Link: https://lkml.kernel.org/r/20231123084022.10302-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kexec_file: Load kernel at top of system RAM if required".
Justification:
==============
Kexec_load interface has been doing top down searching and loading
kernel/initrd/purgtory etc to prepare for kexec reboot. In that way, the
benefits are that it avoids to consume and fragment limited low memory
which satisfy DMA buffer allocation and big chunk of continuous memory
during system init; and avoids to stir with BIOS/FW reserved or occupied
areas, or corner case handling/work around/quirk occupied areas when doing
system init. By the way, the top-down searching and loading of kexec-ed
kernel is done in user space utility code.
For kexec_file loading, even if kexec_buf.top_down is 'true', it's simply
ignored. It calls walk_system_ram_res() directly to go through all
resources of System RAM bottom up, to find an available memory region,
then call locate_mem_hole_callback() to allocate memory in that found
memory region from top to down. This is not expected and inconsistent
with kexec_load.
Implementation
===============
In patch 1, introduce a new function walk_system_ram_res_rev() which is a
variant of walk_system_ram_res(), it walks through a list of all the
resources of System RAM in reversed order, i.e., from higher to lower.
In patch 2, check if kexec_buf.top_down is 'true' in
kexec_walk_resources(), if yes, call walk_system_ram_res_rev() to find
memory region of system RAM from top to down to load kernel/initrd etc.
Background information: ======================= And I ever tried this in
the past in a different way, please see below link. In the post, I tried
to adjust struct sibling linking code, replace the the singly linked list
with list_head so that walk_system_ram_res_rev() can be implemented in a
much easier way. Finally I failed.
https://lore.kernel.org/all/20180718024944.577-4-bhe@redhat.com/
This time, I picked up the patch from AKASHI Takahiro's old post and made
some change to take as the current patch 1:
https://lists.infradead.org/pipermail/linux-arm-kernel/2017-September/531456.html
This patch (of 2):
Kexec_load interface has been doing top down searching and loading
kernel/initrd/purgtory etc to prepare for kexec reboot. In that way, the
benefits are that it avoids to consume and fragment limited low memory
which satisfy DMA buffer allocation and big chunk of continuous memory
during system init; and avoids to stir with BIOS/FW reserved or occupied
areas, or corner case handling/work around/quirk occupied areas when doing
system init. By the way, the top-down searching and loading of kexec-ed
kernel is done in user space utility code.
For kexec_file loading, even if kexec_buf.top_down is 'true', it's simply
ignored. It calls walk_system_ram_res() directly to go through all
resources of System RAM bottom up, to find an available memory region,
then call locate_mem_hole_callback() to allocate memory in that found
memory region from top to down. This is not expected and inconsistent
with kexec_load.
Here check if kexec_buf.top_down is 'true' in kexec_walk_resources(), if
yes, call the newly added walk_system_ram_res_rev() to find memory region
of system RAM from top to down to load kernel/initrd etc.
Link: https://lkml.kernel.org/r/20231114091658.228030-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20231114091658.228030-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This function, being a variant of walk_system_ram_res() introduced in
commit 8c86e70ace ("resource: provide new functions to walk through
resources"), walks through a list of all the resources of System RAM in
reversed order, i.e., from higher to lower.
It will be used in kexec_file code to load kernel, initrd etc when
preparing kexec reboot.
Link: https://lkml.kernel.org/r/ZVTA6z/06cLnWKUz@MiWiFi-R3L-srv
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
These four functions have a normal definition for CONFIG_FAIR_GROUP_SCHED,
and empty one that is only referenced when FAIR_GROUP_SCHED is disabled
but CGROUP_SCHED is still enabled. If both are turned off, the functions
are still defined but the misisng prototype causes a W=1 warning:
kernel/sched/fair.c:12544:6: error: no previous prototype for 'free_fair_sched_group'
kernel/sched/fair.c:12546:5: error: no previous prototype for 'alloc_fair_sched_group'
kernel/sched/fair.c:12553:6: error: no previous prototype for 'online_fair_sched_group'
kernel/sched/fair.c:12555:6: error: no previous prototype for 'unregister_fair_sched_group'
Move the alternatives into the header as static inline functions with the
correct combination of #ifdef checks to avoid the warning without adding
even more complexity.
[A different patch with the same description got applied by accident
and was later reverted, but the original patch is still missing]
Link: https://lkml.kernel.org/r/20231123110506.707903-4-arnd@kernel.org
Fixes: 7aa55f2a59 ("sched/fair: Move unused stub functions to header")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tudor Ambarus <tudor.ambarus@linaro.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use atomic_try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
crash_kexec(). x86 CMPXCHG instruction returns success in ZF flag,
so this change saves a compare after cmpxchg.
No functional change intended.
Link: https://lkml.kernel.org/r/20231114161228.108516-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The corner case described by the comment is no longer possible after the
commit 7b3c36fc4c ("ptrace: fix task_join_group_stop() for the case when
current is traced"), task_join_group_stop() ensures that the new thread
has the correct signr in JOBCTL_STOP_SIGMASK regardless of ptrace.
Link: https://lkml.kernel.org/r/20231121162650.GA6635@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The purpose of recalc_sigpending_and_wake() is not clear, it looks
"obviously unneeded" because we are going to send the signal which can't
be blocked or ignored.
Add the comment to explain why we can't rely on send_signal_locked() and
make this logic more simple/explicit. recalc_sigpending_and_wake() has no
other users, it can die.
In fact I think we don't even need signal_wake_up(), the target task must
be either current or a TASK_TRACED child, otherwise the usage of siglock
is not safe. But this needs another change.
Link: https://lkml.kernel.org/r/20231120151649.GA15995@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
IA-64 was the only architecture which selected ARCH_TASK_STRUCT_ALLOCATOR.
IA-64 was removed with commit cf8e865810 ("arch: Remove Itanium (IA-64)
architecture"). Therefore remove support for ARCH_THREAD_STACK_ALLOCATOR
as well.
Link: https://lkml.kernel.org/r/20231116133638.1636277-3-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Remove unused code after IA-64 removal".
While looking into something different I noticed that there are a couple
of Kconfig options which were only selected by IA-64 and which are now
unused.
So remove them and simplify the code a bit.
This patch (of 3):
IA-64 was the only architecture which selected ARCH_THREAD_STACK_ALLOCATOR.
IA-64 was removed with commit cf8e865810 ("arch: Remove Itanium (IA-64)
architecture"). Therefore remove support for ARCH_THREAD_STACK_ALLOCATOR as
well.
Link: https://lkml.kernel.org/r/20231116133638.1636277-1-hca@linux.ibm.com
Link: https://lkml.kernel.org/r/20231116133638.1636277-2-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cosmetic, but imho it makes the usage look more clear and simple, the new
helper doesn't require to initialize "t".
After this change while_each_thread() has only 3 users, and it is only
used in the do/while loops.
Link: https://lkml.kernel.org/r/20231030155710.GA9095@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When kernel_can_power_off() returns false, and reboot has called with
LINUX_REBOOT_CMD_POWER_OFF, kernel_halt() will be initiated instead of
actual power off function.
However, in this situation, Kernel never explicitly notifies user that
system halted instead of requested power off.
Since halt and power off perform different behavior, and user initiated
reboot call with power off command, not halt, This could be unintended
behavior to user, like this:
~ # poweroff -f
[ 3.581482] reboot: System halted
Therefore, this explicitly notifies user that poweroff is not available,
and halting has been occured as an alternative behavior instead:
~ # poweroff -f
[ 4.123668] reboot: Power off not available: System halted instead
[akpm@linux-foundation.org: tweak comment text]
Link: https://lkml.kernel.org/r/20231104113320.72440-1-ldmldm05@gmail.com
Signed-off-by: Dongmin Lee <ldmldm05@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
directly replacing the entries of VMAs in the new maple tree can result in
better performance. __mt_dup() uses DFS pre-order to duplicate the maple
tree, so it is efficient.
The average time complexity of __mt_dup() is O(n), where n is the number
of VMAs. The proof of the time complexity is provided in the commit log
that introduces __mt_dup(). After duplicating the maple tree, each
element is traversed and replaced (ignoring the cases of deletion, which
are rare). Since it is only a replacement operation for each element,
this process is also O(n).
Analyzing the exact time complexity of the previous algorithm is
challenging because each insertion can involve appending to a node,
pushing data to adjacent nodes, or even splitting nodes. The frequency of
each action is difficult to calculate. The worst-case scenario for a
single insertion is when the tree undergoes splitting at every level. If
we consider each insertion as the worst-case scenario, we can determine
that the upper bound of the time complexity is O(n*log(n)), although this
is a loose upper bound. However, based on the test data, it appears that
the actual time complexity is likely to be O(n).
As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
fails, there will be a portion of VMAs that have not been duplicated in
the maple tree. To handle this, we mark the failure point with
XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop
releasing VMAs that have not been duplicated after this point.
There is a "spawn" in byte-unixbench[1], which can be used to test the
performance of fork(). I modified it slightly to make it work with
different number of VMAs.
Below are the test results. The first row shows the number of VMAs. The
second and third rows show the number of fork() calls per ten seconds,
corresponding to next-20231006 and the this patchset, respectively. The
test results were obtained with CPU binding to avoid scheduler load
balancing that could cause unstable results. There are still some
fluctuations in the test results, but at least they are better than the
original performance.
21 121 221 421 821 1621 3221 6421 12821 25621 51221
112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393
114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599
2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%
[1] https://github.com/kdlucas/byte-unixbench/tree/master
Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
getting corrupted
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmV1nDUACgkQEsHwGGHe
VUqC3Q/9GF3IjEzKZAwTqw9ir2Nq9fFKkDZVT1ZCkXcg3bc6t5Dp68NcMPEoPdNE
6ONaEwKhZxqPyivI7u1ExdZnHYDMRWolZmjs/x19c+g3Zo6QzT+6blMdoWvl6nV2
RD3macPt5w5bcJ8ugSM4ekTQgo4nPU5VhBS52zDARx0W9ufpIk3YKmxmVQjhuV5J
z/nfewUuUtAHDxnbF8pRvN8WoSg15Z5iERksdcj8Wagjx79cMAR6liuauJNkj9dP
lldG69ODdJeZc9L/SUkLEgYPVaq+G6BOKgWXbzeiRM9LedHN3iQlT9JUttLHN383
NdTbQ6lboViP1O64WuoqJFVDYvY0DvVLUll4URywfT3lPbISGvxhg0Xj+4E8F5W9
A9pB9TDZwRXwrNuRLksaY0v/Glfo7eUr6252aDbgrUovJCDOwfRB+pI4ywpfoL/+
2eKkJR1mUjoCXirkbYjcm7EhnTSKxiKmCYK7pyol3fJCsK/4bQF7mJ4UyDFIB3Na
VXVD41KkMsaAdIQp4HbdduYaPSCQvQee6ahtobQwcxyBWGXRzurTw4ubHlzSeN9F
fIfxF9PfSY+So2J9IrU1uYKPvfbUWfU3b1urQPhPvVlbVlZmfG579ek6+4bhagsg
UztDRvv9lCxvBskruIMfelAduXsDkDi0UwJ0/TXlPnQGzYlDdeI=
=07a8
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
- Make sure tasks are thawed exactly and only once to avoid their state
getting corrupted
* tag 'sched_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
freezer,sched: Do not restore saved_state of a thawed task
group
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmV1mfQACgkQEsHwGGHe
VUrW9w/9EVMf1/cu1rY4XN68NbOgdfoic2oPan60WJwiYhYto9uA1quR4Q8ziwVh
UbuO4e0up1ZCUzutZGFnx0ZHrlQIb0/YTQj8kDKX6m7g8s2Vers7YbkRwneDsNqA
JDp58yGXdc1TipVYrKqa0leNrezvaEeoVFPIPKoelzi3673xrlslRseJ/n7vJd4u
SnMjT7LQZIlEe/pecz01nHAo6SSwfI/Ynh2WSorHnhSTuE5gMUzJwBYSXvpZ2gyg
207keTiIcrvxgT+a32NMeEYsFFFvpYKFHI5nxxV1pB8AWXdWaNpuYHNItTDIh81D
fSb8hu+EpNSWtZYzXl/esgULfMgHXez+4VknTpX/vsbfcV1Yif4aHlZP8tgP6gZ5
QyA2NMA5vJypjzLsAgCyZjpTyEVPYQ3f4+iYg4EGlMlgLgoXtHIV+zP765SzDVkC
yPO4xVf+Ypo9AKcGKjBrxyMlRq40zos40k6l2yOjSUlTE2IfOLMhjgVHeLcgD+uv
E9pi0/KtfGvrm3nWgIhDtcvd5Jg6vrilaRWl9bAN6g6xgaqLPXuIZbOjPaRpKSNa
L32XBMg5fUt4eesZv458qu4Zw1ybHCd6qoe3OieFzW5ocR61O946MHX3kkbpmsWC
PzH1mBsPa3F8/utJ06p+9pank3M5yKHdkDPQXfSvImuZ3DPKEGI=
=QxHj
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fix from Borislav Petkov:
- Make sure perf event size validation is done on every event in the
group
* tag 'perf_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix perf_event_validate_size()
It can be useful to query how many bits are set in a cpumask. For
example, if you want to perform special logic for the last remaining
core that's set in a mask. Let's therefore add a new
bpf_cpumask_weight() kfunc which checks how many bits are set in a mask.
Signed-off-by: David Vernet <void@manifault.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231207210843.168466-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When verifier validates BPF_ST_MEM instruction that stores known
constant to stack (e.g., *(u64 *)(r10 - 8) = 123), it effectively spills
a fake register with a constant (but initially imprecise) value to
a stack slot. Because read-side logic treats it as a proper register
fill from stack slot, we need to mark such stack slot initialization as
INSN_F_STACK_ACCESS instruction to stop precision backtracking from
missing it.
Fixes: 41f6f64e69 ("bpf: support non-r10 register spill/fill to/from stack in precision tracking")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231209010958.66758-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
generic_map_{delete,update}_batch() doesn't set uattr->batch.count as
zero before it tries to allocate memory for key. If the memory
allocation fails, the value of uattr->batch.count will be incorrect.
Fix it by setting uattr->batch.count as zero beore batched update or
deletion.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There is no need to call maybe_wait_bpf_programs() if update or deletion
operation fails. So only call maybe_wait_bpf_programs() if update or
deletion operation succeeds.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When doing batched lookup and deletion operations on htab of maps,
maybe_wait_bpf_programs() is needed to ensure all programs don't use the
inner map after the bpf syscall returns.
Instead of adding the wait in __htab_map_lookup_and_delete_batch(),
adding the wait in bpf_map_do_batch() and also removing the calling of
maybe_wait_bpf_programs() from generic_map_{delete,update}_batch().
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Just like commit 9087c6ff8d ("bpf: Call maybe_wait_bpf_programs() only
once from generic_map_delete_batch()"), there is also no need to call
maybe_wait_bpf_programs() for each update in batched update, so only
call it once in generic_map_update_batch().
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Both map_lookup_elem() and generic_map_lookup_batch() use
bpf_map_copy_value() to lookup and copy the value, and there is no
update operation in bpf_map_copy_value(), so just remove the invocation
of maybe_wait_bpf_programs() from it.
Fixes: 15c14a3dca ("bpf: Add bpf_map_{value_size, update_value, map_copy_value} functions")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231208102355.2628918-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In the current cgroup1 environment, associating operations between cgroups
and applications in a BPF program requires storing a mapping of cgroup_id
to application either in a hash map or maintaining it in userspace.
However, by enabling bpf_cgrp_storage for cgroup1, it becomes possible to
conveniently store application-specific information in cgroup-local storage
and utilize it within BPF programs. Furthermore, enabling this feature for
cgroup1 involves minor modifications for the non-attach case, streamlining
the process.
However, when it comes to enabling this functionality for the cgroup1
attach case, it presents challenges. Therefore, the decision is to focus on
enabling it solely for the cgroup1 non-attach case at present. If
attempting to attach to a cgroup1 fd, the operation will simply fail with
the error code -EBADF.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231206115326.4295-2-laoar.shao@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Push the rounding up of stack offsets into the function responsible for
growing the stack, rather than relying on all the callers to do it.
Uncertainty about whether the callers did it or not tripped up people in
a previous review.
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231208032519.260451-4-andreimatei1@gmail.com
Privileged programs are supposed to be able to read uninitialized stack
memory (ever since 6715df8d5) but, before this patch, these accesses
were permitted inconsistently. In particular, accesses were permitted
above state->allocated_stack, but not below it. In other words, if the
stack was already "large enough", the access was permitted, but
otherwise the access was rejected instead of being allowed to "grow the
stack". This undesired rejection was happening in two places:
- in check_stack_slot_within_bounds()
- in check_stack_range_initialized()
This patch arranges for these accesses to be permitted. A bunch of tests
that were relying on the old rejection had to change; all of them were
changed to add also run unprivileged, in which case the old behavior
persists. One tests couldn't be updated - global_func16 - because it
can't run unprivileged for other reasons.
This patch also fixes the tracking of the stack size for variable-offset
reads. This second fix is bundled in the same commit as the first one
because they're inter-related. Before this patch, writes to the stack
using registers containing a variable offset (as opposed to registers
with fixed, known values) were not properly contributing to the
function's needed stack size. As a result, it was possible for a program
to verify, but then to attempt to read out-of-bounds data at runtime
because a too small stack had been allocated for it.
Each function tracks the size of the stack it needs in
bpf_subprog_info.stack_depth, which is maintained by
update_stack_depth(). For regular memory accesses, check_mem_access()
was calling update_state_depth() but it was passing in only the fixed
part of the offset register, ignoring the variable offset. This was
incorrect; the minimum possible value of that register should be used
instead.
This tracking is now fixed by centralizing the tracking of stack size in
grow_stack_state(), and by lifting the calls to grow_stack_state() to
check_stack_access_within_bounds() as suggested by Andrii. The code is
now simpler and more convincingly tracks the correct maximum stack size.
check_stack_range_initialized() can now rely on enough stack having been
allocated for the access; this helps with the fix for the first issue.
A few tests were changed to also check the stack depth computation. The
one that fails without this patch is verifier_var_off:stack_write_priv_vs_unpriv.
Fixes: 01f810ace9 ("bpf: Allow variable-offset stack access")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231208032519.260451-3-andreimatei1@gmail.com
Closes: https://lore.kernel.org/bpf/CABWLsev9g8UP_c3a=1qbuZUi20tGoUXoU07FPf-5FLvhOKOY+Q@mail.gmail.com/
- Snapshot buffer issues
1. When instances started allowing latency tracers, it uses
a snapshot buffer (another buffer that is not written to
but swapped with the main buffer that is). The snapshot buffer
needs to be the same size as the main buffer. But when the
snapshot buffers were added to instances, the code to make
the snapshot equal to the main buffer still was only doing it
for the main buffer and not the instances.
2. Need to stop the current tracer when resizing the buffers.
Otherwise there can be a race if the tracer decides to make
a snapshot between resizing the main buffer and the snapshot
buffer.
3. When a tracer is "stopped" in disables both the main buffer
and the snapshot buffer. This needs to be done for instances
and not only the main buffer, now that instances also have
a snapshot buffer.
- Buffered event for filtering issues
When filtering is enabled, because events can be dropped often,
it is quicker to copy the event into a temp buffer and write that
into the main buffer if it is not filtered or just drop the event
if it is, than to write the event into the ring buffer and then
try to discard it. This temp buffer is allocated and needs special
synchronization to do so. But there were some issues with that:
1. When disabling the filter and freeing the buffer, a call to all
CPUs is required to stop each per_cpu usage. But the code
called smp_call_function_many() which does not include the
current CPU. If the task is migrated to another CPU when it
enables the CPUs via smp_call_function_many(), it will not enable
the one it is currently on and this causes issues later on.
Use on_each_cpu_mask() instead, which includes the current CPU.
2. When the allocation of the buffered event fails, it can give
a warning. But the buffered event is just an optimization
(it's still OK to write to the ring buffer and free it).
Do not WARN in this case.
3. The freeing of the buffer event requires synchronization.
First a counter is decremented to zero so that no new uses
of it will happen. Then it sets the buffered event to NULL,
and finally it frees the buffered event. There's a synchronize_rcu()
between the counter decrement and the setting the variable to
NULL, but only a smp_wmb() between that and the freeing of the
buffer. It is theoretically possible that a user missed seeing
the decrement, but will use the buffer after it is free. Another
synchronize_rcu() is needed in place of that smp_wmb().
- ring buffer timestamps on 32 bit machines
The ring buffer timestamp on 32 bit machines has to break the 64 bit
number into multiple values as cmpxchg is required on it, and a
64 bit cmpxchg on 32 bit architectures is very slow. The code use
to just use two 32 bit values and make it a 60 bit timestamp where
the other 4 bits were used as counters for synchronization. It later
came known that the timestamp on 32 bit still need all 64 bits in
some cases. So 3 words were created to handle the 64 bits. But issues
arised with this:
1. The synchronization logic still only compared the counter
with the first two, but not with the third number, so the
synchronization could fail unknowingly.
2. A check on discard of an event could race if an event happened
between the discard and updating one of the counters. The
counter needs to be updated (forcing an absolute timestamp
and not to use a delta) before the actual discard happens.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZXIP5hQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qmJxAQDXBZwBUFQjWqZHLJn0S9aaz5FggkeR
RmlsOMND0PXcjwD+N6U905i553ehu3SSyOP+5svoi0hyCB2qhj3ZF0LzZQU=
=us1V
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Snapshot buffer issues:
1. When instances started allowing latency tracers, it uses a
snapshot buffer (another buffer that is not written to but swapped
with the main buffer that is). The snapshot buffer needs to be the
same size as the main buffer. But when the snapshot buffers were
added to instances, the code to make the snapshot equal to the
main buffer still was only doing it for the main buffer and not
the instances.
2. Need to stop the current tracer when resizing the buffers.
Otherwise there can be a race if the tracer decides to make a
snapshot between resizing the main buffer and the snapshot buffer.
3. When a tracer is "stopped" in disables both the main buffer and
the snapshot buffer. This needs to be done for instances and not
only the main buffer, now that instances also have a snapshot
buffer.
- Buffered event for filtering issues:
When filtering is enabled, because events can be dropped often, it is
quicker to copy the event into a temp buffer and write that into the
main buffer if it is not filtered or just drop the event if it is,
than to write the event into the ring buffer and then try to discard
it. This temp buffer is allocated and needs special synchronization
to do so. But there were some issues with that:
1. When disabling the filter and freeing the buffer, a call to all
CPUs is required to stop each per_cpu usage. But the code called
smp_call_function_many() which does not include the current CPU.
If the task is migrated to another CPU when it enables the CPUs
via smp_call_function_many(), it will not enable the one it is
currently on and this causes issues later on. Use
on_each_cpu_mask() instead, which includes the current CPU.
2.When the allocation of the buffered event fails, it can give a
warning. But the buffered event is just an optimization (it's
still OK to write to the ring buffer and free it). Do not WARN in
this case.
3.The freeing of the buffer event requires synchronization. First a
counter is decremented to zero so that no new uses of it will
happen. Then it sets the buffered event to NULL, and finally it
frees the buffered event. There's a synchronize_rcu() between the
counter decrement and the setting the variable to NULL, but only a
smp_wmb() between that and the freeing of the buffer. It is
theoretically possible that a user missed seeing the decrement,
but will use the buffer after it is free. Another
synchronize_rcu() is needed in place of that smp_wmb().
- ring buffer timestamps on 32 bit machines
The ring buffer timestamp on 32 bit machines has to break the 64 bit
number into multiple values as cmpxchg is required on it, and a 64
bit cmpxchg on 32 bit architectures is very slow. The code use to
just use two 32 bit values and make it a 60 bit timestamp where the
other 4 bits were used as counters for synchronization. It later came
known that the timestamp on 32 bit still need all 64 bits in some
cases. So 3 words were created to handle the 64 bits. But issues
arised with this:
1. The synchronization logic still only compared the counter with
the first two, but not with the third number, so the
synchronization could fail unknowingly.
2. A check on discard of an event could race if an event happened
between the discard and updating one of the counters. The counter
needs to be updated (forcing an absolute timestamp and not to use
a delta) before the actual discard happens.
* tag 'trace-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Test last update in 32bit version of __rb_time_read()
ring-buffer: Force absolute timestamp on discard of event
tracing: Fix a possible race when disabling buffered events
tracing: Fix a warning when allocating buffered events fails
tracing: Fix incomplete locking when disabling buffered events
tracing: Disable snapshot buffer when stopping instance tracers
tracing: Stop current tracer when resizing buffer
tracing: Always update snapshot buffer size
The remainder address post-6.6 issues or aren't considered serious enough
to justify backporting.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZXKEfwAKCRDdBJ7gKXxA
jlRpAQCiAp1nSqIz/fOKTzoQRaTDXU/m+C+6ZAXdKLDfvQBhpwEAnxxjZ8IgF+8Z
Klz/GirHX5w5o7jE2wb8iObo1nR75Qo=
=omRq
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-12-07-18-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"31 hotfixes. Ten of these address pre-6.6 issues and are marked
cc:stable. The remainder address post-6.6 issues or aren't considered
serious enough to justify backporting"
* tag 'mm-hotfixes-stable-2023-12-07-18-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits)
mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range()
nilfs2: prevent WARNING in nilfs_sufile_set_segment_usage()
mm/hugetlb: have CONFIG_HUGETLB_PAGE select CONFIG_XARRAY_MULTI
scripts/gdb: fix lx-device-list-bus and lx-device-list-class
MAINTAINERS: drop Antti Palosaari
highmem: fix a memory copy problem in memcpy_from_folio
nilfs2: fix missing error check for sb_set_blocksize call
kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
units: add missing header
drivers/base/cpu: crash data showing should depends on KEXEC_CORE
mm/damon/sysfs-schemes: add timeout for update_schemes_tried_regions
scripts/gdb/tasks: fix lx-ps command error
mm/Kconfig: make userfaultfd a menuconfig
selftests/mm: prevent duplicate runs caused by TEST_GEN_PROGS
mm/damon/core: copy nr_accesses when splitting region
lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly
checkstack: fix printed address
mm/memory_hotplug: fix error handling in add_memory_resource()
mm/memory_hotplug: add missing mem_hotplug_lock
.mailmap: add a new address mapping for Chester Lin
...
Current release - regressions:
- veth: fix packet segmentation in veth_convert_skb_to_xdp_buff
Current release - new code bugs:
- tcp: assorted fixes to the new Auth Option support
Older releases - regressions:
- tcp: fix mid stream window clamp
- tls: fix incorrect splice handling
- ipv4: ip_gre: handle skb_pull() failure in ipgre_xmit()
- dsa: mv88e6xxx: restore USXGMII support for 6393X
- arcnet: restore support for multiple Sohard Arcnet cards
Older releases - always broken:
- tcp: do not accept ACK of bytes we never sent
- require admin privileges to receive packet traces via netlink
- packet: move reference count in packet_sock to atomic_long_t
- bpf:
- fix incorrect branch offset comparison with cpu=v4
- fix prog_array_map_poke_run map poke update
- netfilter:
- 3 fixes for crashes on bad admin commands
- xt_owner: fix race accessing sk->sk_socket, TOCTOU null-deref
- nf_tables: fix 'exist' matching on bigendian arches
- leds: netdev: fix RTNL handling to prevent potential deadlock
- eth: tg3: prevent races in error/reset handling
- eth: r8169: fix rtl8125b PAUSE storm when suspended
- eth: r8152: improve reset and surprise removal handling
- eth: hns: fix race between changing features and sending
- eth: nfp: fix sleep in atomic for bonding offload
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmVyGxsACgkQMUZtbf5S
IrvziA//XZQLEQ3OsZnnYuuGkH0lPnY6ABaK/hcjCHnk9xs8SfIKPVYpq1LaShEp
TY6mBhLMIANbdNO+yPzaszWVTkBPyb0w8JNy43bhLhOL3m/6FS6qwsgN8SAL2qVv
8rnDF9Gsb4yU27aMZ6+2m92WiuyPptf4HrWU2ISSv/oCYH9TWsPUrTwt+QuVUboN
eSbvMzgIAkFIQVSbhMuinR9bOzAypSJPi18m1kkID5NsNUP/OToxPE7IFDEVS/oo
f4P7Ru6g1Gw9pAJmVXy5c0528Hy2P4Pyyw3LD5i2FWZ7rhYJRADOC4EMs9lINzrn
uscNUyztldaMHkKcZRqKbaXsnA3MPvuf3qycRH0wyHa1+OjL9N4A9P077FugtBln
UlmgVokfONVlxRgwy7AqapQbZ30QmnUEOvWjFWV3dsCBS3ziq1h7ujCTaQkl6R/6
i96xuiUPMrAnxAlbFOjoF8NeGvcvwujYCqs/q5JC43f+xZRGf52Pwf5U/AliOFym
aBX1mF/mdMLjYIBlGwFABiybACRPMceT2RuCfvhfIdQiM01OHlydO933jS+R3I4O
cB03ppK0QiNo5W4RlMqDGuXfVnBJ36pv/2tY8IUOZGXSR+jSQOxZHrhYrtzMM5F8
sWjpEIrfzdtuz0ssEg9wwGBTffEf07uZyPttov3Pm+VnDrsmCMU=
=bkyC
-----END PGP SIGNATURE-----
Merge tag 'net-6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and netfilter.
Current release - regressions:
- veth: fix packet segmentation in veth_convert_skb_to_xdp_buff
Current release - new code bugs:
- tcp: assorted fixes to the new Auth Option support
Older releases - regressions:
- tcp: fix mid stream window clamp
- tls: fix incorrect splice handling
- ipv4: ip_gre: handle skb_pull() failure in ipgre_xmit()
- dsa: mv88e6xxx: restore USXGMII support for 6393X
- arcnet: restore support for multiple Sohard Arcnet cards
Older releases - always broken:
- tcp: do not accept ACK of bytes we never sent
- require admin privileges to receive packet traces via netlink
- packet: move reference count in packet_sock to atomic_long_t
- bpf:
- fix incorrect branch offset comparison with cpu=v4
- fix prog_array_map_poke_run map poke update
- netfilter:
- three fixes for crashes on bad admin commands
- xt_owner: fix race accessing sk->sk_socket, TOCTOU null-deref
- nf_tables: fix 'exist' matching on bigendian arches
- leds: netdev: fix RTNL handling to prevent potential deadlock
- eth: tg3: prevent races in error/reset handling
- eth: r8169: fix rtl8125b PAUSE storm when suspended
- eth: r8152: improve reset and surprise removal handling
- eth: hns: fix race between changing features and sending
- eth: nfp: fix sleep in atomic for bonding offload"
* tag 'net-6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
vsock/virtio: fix "comparison of distinct pointer types lacks a cast" warning
net/smc: fix missing byte order conversion in CLC handshake
net: dsa: microchip: provide a list of valid protocols for xmit handler
drop_monitor: Require 'CAP_SYS_ADMIN' when joining "events" group
psample: Require 'CAP_NET_ADMIN' when joining "packets" group
bpf: sockmap, updating the sg structure should also update curr
net: tls, update curr on splice as well
nfp: flower: fix for take a mutex lock in soft irq context and rcu lock
net: dsa: mv88e6xxx: Restore USXGMII support for 6393X
tcp: do not accept ACK of bytes we never sent
selftests/bpf: Add test for early update in prog_array_map_poke_run
bpf: Fix prog_array_map_poke_run map poke update
netfilter: xt_owner: Fix for unsafe access of sk->sk_socket
netfilter: nf_tables: validate family when identifying table via handle
netfilter: nf_tables: bail out on mismatching dynset and set expressions
netfilter: nf_tables: fix 'exist' matching on bigendian arches
netfilter: nft_set_pipapo: skip inactive elements during set walk
netfilter: bpf: fix bad registration on nf_defrag
leds: trigger: netdev: fix RTNL handling to prevent potential deadlock
octeontx2-af: Update Tx link register range
...
This patch promotes the arithmetic around checking stack bounds to be
done in the 64-bit domain, instead of the current 32bit. The arithmetic
implies adding together a 64-bit register with a int offset. The
register was checked to be below 1<<29 when it was variable, but not
when it was fixed. The offset either comes from an instruction (in which
case it is 16 bit), from another register (in which case the caller
checked it to be below 1<<29 [1]), or from the size of an argument to a
kfunc (in which case it can be a u32 [2]). Between the register being
inconsistently checked to be below 1<<29, and the offset being up to an
u32, it appears that we were open to overflowing the `int`s which were
currently used for arithmetic.
[1] 815fb87b75/kernel/bpf/verifier.c (L7494-L7498)
[2] 815fb87b75/kernel/bpf/verifier.c (L11904)
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231207041150.229139-4-andreimatei1@gmail.com
This patch fixes a bug around the verification of possibly-zero-sized
stack accesses. When the access was done through a var-offset stack
pointer, check_stack_access_within_bounds was incorrectly computing the
maximum-offset of a zero-sized read to be the same as the register's min
offset. Instead, we have to take in account the register's maximum
possible value. The patch also simplifies how the max offset is checked;
the check is now simpler than for min offset.
The bug was allowing accesses to erroneously pass the
check_stack_access_within_bounds() checks, only to later crash in
check_stack_range_initialized() when all the possibly-affected stack
slots are iterated (this time with a correct max offset).
check_stack_range_initialized() is relying on
check_stack_access_within_bounds() for its accesses to the
stack-tracking vector to be within bounds; in the case of zero-sized
accesses, we were essentially only verifying that the lowest possible
slot was within bounds. We would crash when the max-offset of the stack
pointer was >= 0 (which shouldn't pass verification, and hopefully is
not something anyone's code attempts to do in practice).
Thanks Hao for reporting!
Fixes: 01f810ace9 ("bpf: Allow variable-offset stack access")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231207041150.229139-2-andreimatei1@gmail.com
Closes: https://lore.kernel.org/bpf/CACkBjsZGEUaRCHsmaX=h-efVogsRfK1FPxmkgb0Os_frnHiNdw@mail.gmail.com/
Just one patch.
f5d39b0208 ("freezer,sched: Rewrite core freezer logic") changed how
freezing state is recorded which cgroup_freezing() disagree with the actual
state of the task while thawing triggering a warning. Fix it by updating
cgroup_freezing().
-----BEGIN PGP SIGNATURE-----
iIMEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZXDMtQ4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGX2kAQDqVMZJ+fq3+nGYZNWdsCP+GtDOjsMf8GaadsMT
e8Iu4QDzBo6QwRyFIA4glPQVWiTh3R35XeN0TCN1qEfCTYEICw==
=F8Zn
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.7-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
"Just one fix.
Commit f5d39b0208 ("freezer,sched: Rewrite core freezer logic")
changed how freezing state is recorded which made cgroup_freezing()
disagree with the actual state of the task while thawing triggering a
warning. Fix it by updating cgroup_freezing()"
* tag 'cgroup-for-6.7-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup_freezer: cgroup_freezing: Check if not frozen
Just one patch to fix a bug which can crash the kernel if the housekeeping
and wq_unbound_cpu cpumask configuration combination leaves the latter
empty.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZXDKTg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGTmMAP9kuC9JkII2J5JnxQpkJLDd/qeRHrigrClx3F0+
gBiK8AD/XgsGY5J/OOMjsU1Px7BYvy6w0MEEqqhx2vOVEkEFPAo=
=pH9n
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.7-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
"Just one patch to fix a bug which can crash the kernel if the
housekeeping and wq_unbound_cpu cpumask configuration combination
leaves the latter empty"
* tag 'wq-for-6.7-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Make sure that wq_unbound_cpumask is never empty
Instead of blindly allocating PAGE_SIZE for each trampoline, check the size
of the trampoline with arch_bpf_trampoline_size(). This size is saved in
bpf_tramp_image->size, and used for modmem charge/uncharge. The fallback
arch_alloc_bpf_trampoline() still allocates a whole page because we need to
use set_memory_* to protect the memory.
struct_ops trampoline still uses a whole page for multiple trampolines.
With this size check at caller (regular trampoline and struct_ops
trampoline), remove arch_bpf_trampoline_size() from
arch_prepare_bpf_trampoline() in archs.
Also, update bpf_image_ksym_add() to handle symbol of different sizes.
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv
Link: https://lore.kernel.org/r/20231206224054.492250-7-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This helper will be used to calculate the size of the trampoline before
allocating the memory.
arch_prepare_bpf_trampoline() for arm64 and riscv64 can use
arch_bpf_trampoline_size() to check the trampoline fits in the image.
OTOH, arch_prepare_bpf_trampoline() for s390 has to call the JIT process
twice, so it cannot use arch_bpf_trampoline_size().
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv
Link: https://lore.kernel.org/r/20231206224054.492250-6-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
As BPF trampoline of different archs moves from bpf_jit_[alloc|free]_exec()
to bpf_prog_pack_[alloc|free](), we need to use different _alloc, _free for
different archs during the transition. Add the following helpers for this
transition:
void *arch_alloc_bpf_trampoline(unsigned int size);
void arch_free_bpf_trampoline(void *image, unsigned int size);
void arch_protect_bpf_trampoline(void *image, unsigned int size);
void arch_unprotect_bpf_trampoline(void *image, unsigned int size);
The fallback version of these helpers require size <= PAGE_SIZE, but they
are only called with size == PAGE_SIZE. They will be called with size <
PAGE_SIZE when arch_bpf_trampoline_size() helper is introduced later.
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20231206224054.492250-4-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We are using "im" for "struct bpf_tramp_image" and "tr" for "struct
bpf_trampoline" in most of the code base. The only exception is the
prototype and fallback version of arch_prepare_bpf_trampoline(). Update
them to match the rest of the code base.
We mix "orig_call" and "func_addr" for the argument in different versions
of arch_prepare_bpf_trampoline(). s/orig_call/func_addr/g so they match.
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20231206224054.492250-3-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, bpf_prog_pack_free only can only free pointer to struct
bpf_binary_header, which is not flexible. Add a size argument to
bpf_prog_pack_free so that it can handle any pointer.
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20231206224054.492250-2-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ignat Korchagin complained that a potential config regression was
introduced by commit 89cde45591 ("kexec: consolidate kexec and crash
options into kernel/Kconfig.kexec"). Before the commit, CONFIG_CRASH_DUMP
has no dependency on CONFIG_KEXEC. After the commit, CRASH_DUMP selects
KEXEC. That enforces system to have CONFIG_KEXEC=y as long as
CONFIG_CRASH_DUMP=Y which people may not want.
In Ignat's case, he sets CONFIG_CRASH_DUMP=y, CONFIG_KEXEC_FILE=y and
CONFIG_KEXEC=n because kexec_load interface could have security issue if
kernel/initrd has no chance to be signed and verified.
CRASH_DUMP has select of KEXEC because Eric, author of above commit, met a
LKP report of build failure when posting patch of earlier version. Please
see below link to get detail of the LKP report:
https://lore.kernel.org/all/3e8eecd1-a277-2cfb-690e-5de2eb7b988e@oracle.com/T/#u
In fact, that LKP report is triggered because arm's <asm/kexec.h> is
wrapped in CONFIG_KEXEC ifdeffery scope. That is wrong. CONFIG_KEXEC
controls the enabling/disabling of kexec_load interface, but not kexec
feature. Removing the wrongly added CONFIG_KEXEC ifdeffery scope in
<asm/kexec.h> of arm allows us to drop the select KEXEC for CRASH_DUMP.
Meanwhile, change arch/arm/kernel/Makefile to let machine_kexec.o
relocate_kernel.o depend on KEXEC_CORE.
Link: https://lkml.kernel.org/r/20231128054457.659452-1-bhe@redhat.com
Fixes: 89cde45591 ("kexec: consolidate kexec and crash options into kernel/Kconfig.kexec")
Signed-off-by: Baoquan He <bhe@redhat.com>
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Tested-by: Ignat Korchagin <ignat@cloudflare.com> [compile-time only]
Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Eric DeVolder <eric_devolder@yahoo.com>
Tested-by: Eric DeVolder <eric_devolder@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lee pointed out issue found by syscaller [0] hitting BUG in prog array
map poke update in prog_array_map_poke_run function due to error value
returned from bpf_arch_text_poke function.
There's race window where bpf_arch_text_poke can fail due to missing
bpf program kallsym symbols, which is accounted for with check for
-EINVAL in that BUG_ON call.
The problem is that in such case we won't update the tail call jump
and cause imbalance for the next tail call update check which will
fail with -EBUSY in bpf_arch_text_poke.
I'm hitting following race during the program load:
CPU 0 CPU 1
bpf_prog_load
bpf_check
do_misc_fixups
prog_array_map_poke_track
map_update_elem
bpf_fd_array_map_update_elem
prog_array_map_poke_run
bpf_arch_text_poke returns -EINVAL
bpf_prog_kallsyms_add
After bpf_arch_text_poke (CPU 1) fails to update the tail call jump, the next
poke update fails on expected jump instruction check in bpf_arch_text_poke
with -EBUSY and triggers the BUG_ON in prog_array_map_poke_run.
Similar race exists on the program unload.
Fixing this by moving the update to bpf_arch_poke_desc_update function which
makes sure we call __bpf_arch_text_poke that skips the bpf address check.
Each architecture has slightly different approach wrt looking up bpf address
in bpf_arch_text_poke, so instead of splitting the function or adding new
'checkip' argument in previous version, it seems best to move the whole
map_poke_run update as arch specific code.
[0] https://syzkaller.appspot.com/bug?extid=97a4fe20470e9bc30810
Fixes: ebf7d1f508 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
Reported-by: syzbot+97a4fe20470e9bc30810@syzkaller.appspotmail.com
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Cc: Lee Jones <lee@kernel.org>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20231206083041.1306660-2-jolsa@kernel.org
Since 64 bit cmpxchg() is very expensive on 32bit architectures, the
timestamp used by the ring buffer does some interesting tricks to be able
to still have an atomic 64 bit number. It originally just used 60 bits and
broke it up into two 32 bit words where the extra 2 bits were used for
synchronization. But this was not enough for all use cases, and all 64
bits were required.
The 32bit version of the ring buffer timestamp was then broken up into 3
32bit words using the same counter trick. But one update was not done. The
check to see if the read operation was done without interruption only
checked the first two words and not last one (like it had before this
update). Fix it by making sure all three updates happen without
interruption by comparing the initial counter with the last updated
counter.
Link: https://lore.kernel.org/linux-trace-kernel/20231206100050.3100b7bb@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: f03f2abce4 ("ring-buffer: Have 32 bit time stamps use all 64 bits")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
There's a race where if an event is discarded from the ring buffer and an
interrupt were to happen at that time and insert an event, the time stamp
is still used from the discarded event as an offset. This can screw up the
timings.
If the event is going to be discarded, set the "before_stamp" to zero.
When a new event comes in, it compares the "before_stamp" with the
"write_stamp" and if they are not equal, it will insert an absolute
timestamp. This will prevent the timings from getting out of sync due to
the discarded event.
Link: https://lore.kernel.org/linux-trace-kernel/20231206100244.5130f9b3@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 6f6be606e7 ("ring-buffer: Force before_stamp and write_stamp to be different on discard")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently, the cpu_is_isolated() function checks only the statically
isolated CPUs specified via the "isolcpus" and "nohz_full" kernel
command line options. This function is used by vmstat and memcg to
reduce interference with isolated CPUs by not doing stat flushing
or scheduling works on those CPUs.
Workloads running on isolated CPUs within isolated cpuset
partitions should receive the same treatment to reduce unnecessary
interference. This patch introduces a new cpuset_cpu_is_isolated()
function to be called by cpu_is_isolated() so that the set of dynamically
created cpuset isolated CPUs will be included in the check.
Assuming that testing a bit in a cpumask is atomic, no synchronization
primitive is currently used to synchronize access to the cpuset's
isolated_cpus mask.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Wire up bpf_token_create and bpf_token_free LSM hooks, which allow to
allocate LSM security blob (we add `void *security` field to struct
bpf_token for that), but also control who can instantiate BPF token.
This follows existing pattern for BPF map and BPF prog.
Also add security_bpf_token_allow_cmd() and security_bpf_token_capable()
LSM hooks that allow LSM implementation to control and negate (if
necessary) BPF token's delegation of a specific bpf_cmd and capability,
respectively.
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-12-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Similarly to bpf_prog_alloc LSM hook, rename and extend bpf_map_alloc
hook into bpf_map_create, taking not just struct bpf_map, but also
bpf_attr and bpf_token, to give a fuller context to LSMs.
Unlike bpf_prog_alloc, there is no need to move the hook around, as it
currently is firing right before allocating BPF map ID and FD, which
seems to be a sweet spot.
But like bpf_prog_alloc/bpf_prog_free combo, make sure that bpf_map_free
LSM hook is called even if bpf_map_create hook returned error, as if few
LSMs are combined together it could be that one LSM successfully
allocated security blob for its needs, while subsequent LSM rejected BPF
map creation. The former LSM would still need to free up LSM blob, so we
need to ensure security_bpf_map_free() is called regardless of the
outcome.
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-11-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Based on upstream discussion ([0]), rework existing
bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead
of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF
program struct. Also, we pass bpf_attr union with all the user-provided
arguments for BPF_PROG_LOAD command. This will give LSMs as much
information as we can basically provide.
The hook is also BPF token-aware now, and optional bpf_token struct is
passed as a third argument. bpf_prog_load LSM hook is called after
a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were
allocated and filled out, but right before performing full-fledged BPF
verification step.
bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for
consistency. SELinux code is adjusted to all new names, types, and
signatures.
Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be
used by some LSMs to allocate extra security blob, but also by other
LSMs to reject BPF program loading, we need to make sure that
bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one
*even* if the hook itself returned error. If we don't do that, we run
the risk of leaking memory. This seems to be possible today when
combining SELinux and BPF LSM, as one example, depending on their
relative ordering.
Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to
sleepable LSM hooks list, as they are both executed in sleepable
context. Also drop bpf_prog_load hook from untrusted, as there is no
issue with refcount or anything else anymore, that originally forced us
to add it to untrusted list in c0c852dd18 ("bpf: Do not mark certain LSM
hook arguments as trusted"). We now trigger this hook much later and it
should not be an issue anymore.
[0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Remove remaining direct queries to perfmon_capable() and bpf_capable()
in BPF verifier logic and instead use BPF token (if available) to make
decisions about privileges.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add basic support of BPF token to BPF_PROG_LOAD. Wire through a set of
allowed BPF program types and attach types, derived from BPF FS at BPF
token creation time. Then make sure we perform bpf_token_capable()
checks everywhere where it's relevant.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
through delegated BPF token. BTF loading is a pretty straightforward
operation, so as long as BPF token is created with allow_cmds granting
BPF_BTF_LOAD command, kernel proceeds to parsing BTF data and creating
BTF object.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
BPF map creation from unprivileged process through delegated BPF token.
Wire through a set of allowed BPF map types to BPF token, derived from
BPF FS at BPF token creation time. This, in combination with allowed_cmds
allows to create a narrowly-focused BPF token (controlled by privileged
agent) with a restrictive set of BPF maps that application can attempt
to create.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.
This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).
BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having an BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.
When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.
Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).
Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding up the ns_capable()
story of BPF token.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
takes over and grants whatever part of BPF syscall is required.
Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
BPF_PROG_LOAD, if the type of BPF program is "network-related" either
CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.
But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
is set:
- when creating DEVMAP/XDKMAP/CPU_MAP maps;
- when attaching CGROUP_SKB programs;
- when handling BPF_PROG_QUERY command.
This patch is changing the latter three cases to follow BPF_PROG_LOAD
model, that is allowing to proceed under either CAP_NET_ADMIN or
CAP_SYS_ADMIN.
This also makes it cleaner in subsequent BPF token patches to switch
wholesomely to a generic bpf_token_capable(int cap) check, that always
falls back to CAP_SYS_ADMIN if requested capability is missing.
Cc: Jakub Kicinski <kuba@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Function trace_buffered_event_disable() is responsible for freeing pages
backing buffered events and this process can run concurrently with
trace_event_buffer_lock_reserve().
The following race is currently possible:
* Function trace_buffered_event_disable() is called on CPU 0. It
increments trace_buffered_event_cnt on each CPU and waits via
synchronize_rcu() for each user of trace_buffered_event to complete.
* After synchronize_rcu() is finished, function
trace_buffered_event_disable() has the exclusive access to
trace_buffered_event. All counters trace_buffered_event_cnt are at 1
and all pointers trace_buffered_event are still valid.
* At this point, on a different CPU 1, the execution reaches
trace_event_buffer_lock_reserve(). The function calls
preempt_disable_notrace() and only now enters an RCU read-side
critical section. The function proceeds and reads a still valid
pointer from trace_buffered_event[CPU1] into the local variable
"entry". However, it doesn't yet read trace_buffered_event_cnt[CPU1]
which happens later.
* Function trace_buffered_event_disable() continues. It frees
trace_buffered_event[CPU1] and decrements
trace_buffered_event_cnt[CPU1] back to 0.
* Function trace_event_buffer_lock_reserve() continues. It reads and
increments trace_buffered_event_cnt[CPU1] from 0 to 1. This makes it
believe that it can use the "entry" that it already obtained but the
pointer is now invalid and any access results in a use-after-free.
Fix the problem by making a second synchronize_rcu() call after all
trace_buffered_event values are set to NULL. This waits on all potential
users in trace_event_buffer_lock_reserve() that still read a previous
pointer from trace_buffered_event.
Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
Link: https://lkml.kernel.org/r/20231205161736.19663-4-petr.pavlu@suse.com
Cc: stable@vger.kernel.org
Fixes: 0fc1b09ff1 ("tracing: Use temp buffer when filtering events")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Function trace_buffered_event_disable() produces an unexpected warning
when the previous call to trace_buffered_event_enable() fails to
allocate pages for buffered events.
The situation can occur as follows:
* The counter trace_buffered_event_ref is at 0.
* The soft mode gets enabled for some event and
trace_buffered_event_enable() is called. The function increments
trace_buffered_event_ref to 1 and starts allocating event pages.
* The allocation fails for some page and trace_buffered_event_disable()
is called for cleanup.
* Function trace_buffered_event_disable() decrements
trace_buffered_event_ref back to 0, recognizes that it was the last
use of buffered events and frees all allocated pages.
* The control goes back to trace_buffered_event_enable() which returns.
The caller of trace_buffered_event_enable() has no information that
the function actually failed.
* Some time later, the soft mode is disabled for the same event.
Function trace_buffered_event_disable() is called. It warns on
"WARN_ON_ONCE(!trace_buffered_event_ref)" and returns.
Buffered events are just an optimization and can handle failures. Make
trace_buffered_event_enable() exit on the first failure and left any
cleanup later to when trace_buffered_event_disable() is called.
Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
Link: https://lkml.kernel.org/r/20231205161736.19663-3-petr.pavlu@suse.com
Fixes: 0fc1b09ff1 ("tracing: Use temp buffer when filtering events")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The following warning appears when using buffered events:
[ 203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420
[...]
[ 203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G E 6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
[ 203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
[ 203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
[ 203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 fd ff ff 48 8b 42 10 f0 ff 40 08 0f 0b e9 f5 fe ff ff
[ 203.735734] RSP: 0018:ffffb4ae4f7b7d80 EFLAGS: 00010202
[ 203.735745] RAX: 0000000000000000 RBX: ffffb4ae4f7b7de0 RCX: ffff8ac10662c000
[ 203.735754] RDX: ffff8ac0c750be00 RSI: ffff8ac10662c000 RDI: ffff8ac0c004d400
[ 203.781832] RBP: ffff8ac0c039cea0 R08: 0000000000000000 R09: 0000000000000000
[ 203.781839] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 203.781842] R13: ffff8ac10662c000 R14: ffff8ac0c004d400 R15: ffff8ac10662c008
[ 203.781846] FS: 00007f4cd8a67740(0000) GS:ffff8ad798880000(0000) knlGS:0000000000000000
[ 203.781851] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 203.781855] CR2: 0000559766a74028 CR3: 00000001804c4000 CR4: 00000000001506f0
[ 203.781862] Call Trace:
[ 203.781870] <TASK>
[ 203.851949] trace_event_buffer_commit+0x1ea/0x250
[ 203.851967] trace_event_raw_event_sys_enter+0x83/0xe0
[ 203.851983] syscall_trace_enter.isra.0+0x182/0x1a0
[ 203.851990] do_syscall_64+0x3a/0xe0
[ 203.852075] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 203.852090] RIP: 0033:0x7f4cd870fa77
[ 203.982920] Code: 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 43 0e 00 f7 d8 64 89 01 48
[ 203.982932] RSP: 002b:00007fff99717dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
[ 203.982942] RAX: ffffffffffffffda RBX: 0000558ea1d7b6f0 RCX: 00007f4cd870fa77
[ 203.982948] RDX: 0000000000000000 RSI: 00007fff99717de0 RDI: 0000558ea1d7b6f0
[ 203.982957] RBP: 00007fff99717de0 R08: 00007fff997180e0 R09: 00007fff997180e0
[ 203.982962] R10: 00007fff997180e0 R11: 0000000000000246 R12: 00007fff99717f40
[ 204.049239] R13: 00007fff99718590 R14: 0000558e9f2127a8 R15: 00007fff997180b0
[ 204.049256] </TASK>
For instance, it can be triggered by running these two commands in
parallel:
$ while true; do
echo hist:key=id.syscall:val=hitcount > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger;
done
$ stress-ng --sysinfo $(nproc)
The warning indicates that the current ring_buffer_per_cpu is not in the
committing state. It happens because the active ring_buffer_event
doesn't actually come from the ring_buffer_per_cpu but is allocated from
trace_buffered_event.
The bug is in function trace_buffered_event_disable() where the
following normally happens:
* The code invokes disable_trace_buffered_event() via
smp_call_function_many() and follows it by synchronize_rcu(). This
increments the per-CPU variable trace_buffered_event_cnt on each
target CPU and grants trace_buffered_event_disable() the exclusive
access to the per-CPU variable trace_buffered_event.
* Maintenance is performed on trace_buffered_event, all per-CPU event
buffers get freed.
* The code invokes enable_trace_buffered_event() via
smp_call_function_many(). This decrements trace_buffered_event_cnt and
releases the access to trace_buffered_event.
A problem is that smp_call_function_many() runs a given function on all
target CPUs except on the current one. The following can then occur:
* Task X executing trace_buffered_event_disable() runs on CPU 0.
* The control reaches synchronize_rcu() and the task gets rescheduled on
another CPU 1.
* The RCU synchronization finishes. At this point,
trace_buffered_event_disable() has the exclusive access to all
trace_buffered_event variables except trace_buffered_event[CPU0]
because trace_buffered_event_cnt[CPU0] is never incremented and if the
buffer is currently unused, remains set to 0.
* A different task Y is scheduled on CPU 0 and hits a trace event. The
code in trace_event_buffer_lock_reserve() sees that
trace_buffered_event_cnt[CPU0] is set to 0 and decides the use the
buffer provided by trace_buffered_event[CPU0].
* Task X continues its execution in trace_buffered_event_disable(). The
code incorrectly frees the event buffer pointed by
trace_buffered_event[CPU0] and resets the variable to NULL.
* Task Y writes event data to the now freed buffer and later detects the
created inconsistency.
The issue is observable since commit dea499781a ("tracing: Fix warning
in trace_buffered_event_disable()") which moved the call of
trace_buffered_event_disable() in __ftrace_event_enable_disable()
earlier, prior to invoking call->class->reg(.. TRACE_REG_UNREGISTER ..).
The underlying problem in trace_buffered_event_disable() is however
present since the original implementation in commit 0fc1b09ff1
("tracing: Use temp buffer when filtering events").
Fix the problem by replacing the two smp_call_function_many() calls with
on_each_cpu_mask() which invokes a given callback on all CPUs.
Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
Link: https://lkml.kernel.org/r/20231205161736.19663-2-petr.pavlu@suse.com
Cc: stable@vger.kernel.org
Fixes: 0fc1b09ff1 ("tracing: Use temp buffer when filtering events")
Fixes: dea499781a ("tracing: Fix warning in trace_buffered_event_disable()")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
It use to be that only the top level instance had a snapshot buffer (for
latency tracers like wakeup and irqsoff). When stopping a tracer in an
instance would not disable the snapshot buffer. This could have some
unintended consequences if the irqsoff tracer is enabled.
Consolidate the tracing_start/stop() with tracing_start/stop_tr() so that
all instances behave the same. The tracing_start/stop() functions will
just call their respective tracing_start/stop_tr() with the global_array
passed in.
Link: https://lkml.kernel.org/r/20231205220011.041220035@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 6d9b3fa5e7 ("tracing: Move tracing_max_latency into trace_array")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the ring buffer is being resized, it can cause side effects to the
running tracer. For instance, there's a race with irqsoff tracer that
swaps individual per cpu buffers between the main buffer and the snapshot
buffer. The resize operation modifies the main buffer and then the
snapshot buffer. If a swap happens in between those two operations it will
break the tracer.
Simply stop the running tracer before resizing the buffers and enable it
again when finished.
Link: https://lkml.kernel.org/r/20231205220010.748996423@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 3928a8a2d9 ("ftrace: make work with new ring buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
It use to be that only the top level instance had a snapshot buffer (for
latency tracers like wakeup and irqsoff). The update of the ring buffer
size would check if the instance was the top level and if so, it would
also update the snapshot buffer as it needs to be the same as the main
buffer.
Now that lower level instances also has a snapshot buffer, they too need
to update their snapshot buffer sizes when the main buffer is changed,
otherwise the following can be triggered:
# cd /sys/kernel/tracing
# echo 1500 > buffer_size_kb
# mkdir instances/foo
# echo irqsoff > instances/foo/current_tracer
# echo 1000 > instances/foo/buffer_size_kb
Produces:
WARNING: CPU: 2 PID: 856 at kernel/trace/trace.c:1938 update_max_tr_single.part.0+0x27d/0x320
Which is:
ret = ring_buffer_swap_cpu(tr->max_buffer.buffer, tr->array_buffer.buffer, cpu);
if (ret == -EBUSY) {
[..]
}
WARN_ON_ONCE(ret && ret != -EAGAIN && ret != -EBUSY); <== here
That's because ring_buffer_swap_cpu() has:
int ret = -EINVAL;
[..]
/* At least make sure the two buffers are somewhat the same */
if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
goto out;
[..]
out:
return ret;
}
Instead, update all instances' snapshot buffer sizes when their main
buffer size is updated.
Link: https://lkml.kernel.org/r/20231205220010.454662151@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 6d9b3fa5e7 ("tracing: Move tracing_max_latency into trace_array")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Similar to special handling of STACK_ZERO, when reading 1/2/4 bytes from
stack from slot that has register spilled into it and that register has
a constant value zero, preserve that zero and mark spilled register as
precise for that. This makes spilled const zero register and STACK_ZERO
cases equivalent in their behavior.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Instead of always forcing STACK_ZERO slots to STACK_MISC, preserve it in
situations where this is possible. E.g., when spilling register as
1/2/4-byte subslots on the stack, all the remaining bytes in the stack
slot do not automatically become unknown. If we knew they contained
zeroes, we can preserve those STACK_ZERO markers.
Add a helper mark_stack_slot_misc(), similar to scrub_spilled_slot(),
but that doesn't overwrite either STACK_INVALID nor STACK_ZERO. Note
that we need to take into account possibility of being in unprivileged
mode, in which case STACK_INVALID is forced to STACK_MISC for correctness,
as treating STACK_INVALID as equivalent STACK_MISC is only enabled in
privileged mode.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When register is spilled onto a stack as a 1/2/4-byte register, we set
slot_type[BPF_REG_SIZE - 1] (plus potentially few more below it,
depending on actual spill size). So to check if some stack slot has
spilled register we need to consult slot_type[7], not slot_type[0].
To avoid the need to remember and double-check this in the future, just
use is_spilled_reg() helper.
Fixes: 27113c59b6 ("bpf: Check the other end of slot_type for STACK_SPILL")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Use instruction (jump) history to record instructions that performed
register spill/fill to/from stack, regardless if this was done through
read-only r10 register, or any other register after copying r10 into it
*and* potentially adjusting offset.
To make this work reliably, we push extra per-instruction flags into
instruction history, encoding stack slot index (spi) and stack frame
number in extra 10 bit flags we take away from prev_idx in instruction
history. We don't touch idx field for maximum performance, as it's
checked most frequently during backtracking.
This change removes basically the last remaining practical limitation of
precision backtracking logic in BPF verifier. It fixes known
deficiencies, but also opens up new opportunities to reduce number of
verified states, explored in the subsequent patches.
There are only three differences in selftests' BPF object files
according to veristat, all in the positive direction (less states).
File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF)
-------------------------------------- ------------- --------- --------- ------------- ---------- ---------- -------------
test_cls_redirect_dynptr.bpf.linked3.o cls_redirect 2987 2864 -123 (-4.12%) 240 231 -9 (-3.75%)
xdp_synproxy_kern.bpf.linked3.o syncookie_tc 82848 82661 -187 (-0.23%) 5107 5073 -34 (-0.67%)
xdp_synproxy_kern.bpf.linked3.o syncookie_xdp 85116 84964 -152 (-0.18%) 5162 5130 -32 (-0.62%)
Note, I avoided renaming jmp_history to more generic insn_hist to
minimize number of lines changed and potential merge conflicts between
bpf and bpf-next trees.
Notice also cur_hist_entry pointer reset to NULL at the beginning of
instruction verification loop. This pointer avoids the problem of
relying on last jump history entry's insn_idx to determine whether we
already have entry for current instruction or not. It can happen that we
added jump history entry because current instruction is_jmp_point(), but
also we need to add instruction flags for stack access. In this case, we
don't want to entries, so we need to reuse last added entry, if it is
present.
Relying on insn_idx comparison has the same ambiguity problem as the one
that was fixed recently in [0], so we avoid that.
[0] https://patchwork.kernel.org/project/netdevbpf/patch/20231110002638.4168352-3-andrii@kernel.org/
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Tao Lyu <tao.lyu@epfl.ch>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With the removal of the 'iov' argument to import_single_range(), the two
functions are now fully identical. Convert the import_single_range()
callers to import_ubuf(), and remove the former fully.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20231204174827.1258875-3-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
The CPUHP_SLAB_PREPARE hooks are only used by SLAB which is removed.
SLUB defines them as NULL, so we can remove those altogether.
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Rientjes <rientjes@google.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
When removing the inner map from the outer map, the inner map will be
freed after one RCU grace period and one RCU tasks trace grace
period, so it is certain that the bpf program, which may access the
inner map, has exited before the inner map is freed.
However there is no need to wait for one RCU tasks trace grace period if
the outer map is only accessed by non-sleepable program. So adding
sleepable_refcnt in bpf_map and increasing sleepable_refcnt when adding
the outer map into env->used_maps for sleepable program. Although the
max number of bpf program is INT_MAX - 1, the number of bpf programs
which are being loaded may be greater than INT_MAX, so using atomic64_t
instead of atomic_t for sleepable_refcnt. When removing the inner map
from the outer map, using sleepable_refcnt to decide whether or not a
RCU tasks trace grace period is needed before freeing the inner map.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When updating or deleting an inner map in map array or map htab, the map
may still be accessed by non-sleepable program or sleepable program.
However bpf_map_fd_put_ptr() decreases the ref-counter of the inner map
directly through bpf_map_put(), if the ref-counter is the last one
(which is true for most cases), the inner map will be freed by
ops->map_free() in a kworker. But for now, most .map_free() callbacks
don't use synchronize_rcu() or its variants to wait for the elapse of a
RCU grace period, so after the invocation of ops->map_free completes,
the bpf program which is accessing the inner map may incur
use-after-free problem.
Fix the free of inner map by invoking bpf_map_free_deferred() after both
one RCU grace period and one tasks trace RCU grace period if the inner
map has been removed from the outer map before. The deferment is
accomplished by using call_rcu() or call_rcu_tasks_trace() when
releasing the last ref-counter of bpf map. The newly-added rcu_head
field in bpf_map shares the same storage space with work field to
reduce the size of bpf_map.
Fixes: bba1dc0b55 ("bpf: Remove redundant synchronize_rcu.")
Fixes: 638e4b825d ("bpf: Allows per-cpu maps and map-in-map in sleepable programs")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Both map deletion operation, map release and map free operation use
fd_array_map_delete_elem() to remove the element from fd array and
need_defer is always true in fd_array_map_delete_elem(). For the map
deletion operation and map release operation, need_defer=true is
necessary, because the bpf program, which accesses the element in fd
array, may still alive. However for map free operation, it is certain
that the bpf program which owns the fd array has already been exited, so
setting need_defer as false is appropriate for map free operation.
So fix it by adding need_defer parameter to bpf_fd_array_map_clear() and
adding a new helper __fd_array_map_delete_elem() to handle the map
deletion, map release and map free operations correspondingly.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
map is the pointer of outer map, and need_defer needs some explanation.
need_defer tells the implementation to defer the reference release of
the passed element and ensure that the element is still alive before
the bpf program, which may manipulate it, exits.
The following three cases will invoke map_fd_put_ptr() and different
need_defer values will be passed to these callers:
1) release the reference of the old element in the map during map update
or map deletion. The release must be deferred, otherwise the bpf
program may incur use-after-free problem, so need_defer needs to be
true.
2) release the reference of the to-be-added element in the error path of
map update. The to-be-added element is not visible to any bpf
program, so it is OK to pass false for need_defer parameter.
3) release the references of all elements in the map during map release.
Any bpf program which has access to the map must have been exited and
released, so need_defer=false will be OK.
These two parameters will be used by the following patches to fix the
potential use-after-free problem for map-in-map.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently get_free_mem_region() searches for available capacity
in increments equal to the region size being requested. This can
cause the search to take giant steps through the resource leaving
needless gaps and missing available space.
Specifically 'cxl create-region' fails with ERANGE even though capacity
of the given size and CXL's expected 256M x InterleaveWays alignment can
be satisfied.
Replace the total-request-size increment with a next alignment increment
so that the next possible address is always examined for availability.
Fixes: 14b80582c4 ("resource: Introduce alloc_free_mem_region()")
Reported-by: Dmytro Adamenko <dmytro.adamenko@intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/20231113221324.1118092-1-alison.schofield@intel.com
Cc: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
- objpool: Fix objpool overrun case on memory/cache access delay especially
on the big.LITTLE SoC. The objpool uses a copy of object slot index
internal loop, but the slot index can be changed on another processor
in parallel. In that case, the difference of 'head' local copy and the
'slot->last' index will be bigger than local slot size. In that case,
we need to re-read the slot::head to update it.
- kretprobe: Fix to use appropriate rcu API for kretprobe holder. Since
kretprobe_holder::rp is RCU managed, it should use rcu_assign_pointer()
and rcu_dereference_check() correctly. Also adding __rcu tag for
finding wrong usage by sparse.
- rethook: Fix to use appropriate rcu API for rethook::handler. The same
as kretprobe, rethook::handler is RCU managed and it should use
rcu_assign_pointer() and rcu_dereference_check(). This also adds __rcu
tag for finding wrong usage by sparse.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmVpfBobHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bNyMIAJSLICKQNuFiBJEn/rty
ACWJ9QMOnwi0DoVaepG/m9QJh6AIUUFW4//9helmSm0GIVzxQ2+f8UeKU+sYiVtH
ro9atea4W4+FMTvtEB1cU8oG5CDVT4WQdUXbjMktqYe3+WB8Zt8+fIP0mnbTFAVr
yStpliGPecmlupJVRYqrJGyDdbkUxXxVlPsP/eDrHFgbBWv8Incw0f+MLGSi6LSE
sZ1MaKCdi2tlHbtD/fiowfLoBMZwQAKY4hq/XguVsWh+BGaGUgwtif+8ESwPeu22
KEZLyWDQ1N8XBHyOBotV7vsBEwh6LKtLGVXIBsO4KxVyGw6msxWBis0dt/tkn+kk
LEg=
=B9WK
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- objpool: Fix objpool overrun case on memory/cache access delay
especially on the big.LITTLE SoC. The objpool uses a copy of object
slot index internal loop, but the slot index can be changed on
another processor in parallel. In that case, the difference of 'head'
local copy and the 'slot->last' index will be bigger than local slot
size. In that case, we need to re-read the slot::head to update it.
- kretprobe: Fix to use appropriate rcu API for kretprobe holder. Since
kretprobe_holder::rp is RCU managed, it should use
rcu_assign_pointer() and rcu_dereference_check() correctly. Also
adding __rcu tag for finding wrong usage by sparse.
- rethook: Fix to use appropriate rcu API for rethook::handler. The
same as kretprobe, rethook::handler is RCU managed and it should use
rcu_assign_pointer() and rcu_dereference_check(). This also adds
__rcu tag for finding wrong usage by sparse.
* tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rethook: Use __rcu pointer for rethook::handler
kprobes: consistent rcu api usage for kretprobe holder
lib: objpool: fix head overrun on RK3588 SBC
Emit tnum representation as just a constant if all bits are known.
Use decimal-vs-hex logic to determine exact format of emitted
constant value, just like it's done for register range values.
For that move tnum_strn() to kernel/bpf/log.c to reuse decimal-vs-hex
determination logic and constants.
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-12-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Given we enforce a valid range for program and async callback return
value, we must mark R0 as precise to avoid incorrect state pruning.
Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Use common logic to verify program return values and async callback
return values. This allows to avoid duplication of any extra steps
necessary, like precision marking, which will be added in the next
patch.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Similarly to subprog/callback logic, enforce return value of BPF program
using more precise smin/smax range.
We need to adjust a bunch of tests due to a changed format of an error
message.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Instead of relying on potentially imprecise tnum representation of
expected return value range for callbacks and subprogs, validate that
smin/smax range satisfy exact expected range of return values.
E.g., if callback would need to return [0, 2] range, tnum can't
represent this precisely and instead will allow [0, 3] range. By
checking smin/smax range, we can make sure that subprog/callback indeed
returns only valid [0, 2] range.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Given verifier checks actual value, r0 has to be precise, so we need to
propagate precision properly. r0 also has to be marked as read,
otherwise subsequent state comparisons will ignore such register as
unimportant and precision won't really help here.
Fixes: 69c087ba62 ("bpf: Add bpf_for_each_map_elem() helper")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It is common practice for security solutions to store tags/labels in
xattrs. To implement similar functionalities in BPF LSM, add new kfunc
bpf_get_file_xattr().
The first use case of bpf_get_file_xattr() is to implement file
verifications with asymmetric keys. Specificially, security applications
could use fsverity for file hashes and use xattr to store file signatures.
(kfunc for fsverity hash will be added in a separate commit.)
Currently, only xattrs with "user." prefix can be read with kfunc
bpf_get_file_xattr(). As use cases evolve, we may add a dedicated prefix
for bpf_get_file_xattr().
To avoid recursion, bpf_get_file_xattr can be only called from LSM hooks.
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20231129234417.856536-2-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Bpf cpu=v4 support is introduced in [1] and Commit 4cd58e9af8
("bpf: Support new 32bit offset jmp instruction") added support for new
32bit offset jmp instruction. Unfortunately, in function
bpf_adj_delta_to_off(), for new branch insn with 32bit offset, the offset
(plus/minor a small delta) compares to 16-bit offset bound
[S16_MIN, S16_MAX], which caused the following verification failure:
$ ./test_progs-cpuv4 -t verif_scale_pyperf180
...
insn 10 cannot be patched due to 16-bit range
...
libbpf: failed to load object 'pyperf180.bpf.o'
scale_test:FAIL:expect_success unexpected error: -12 (errno 12)
#405 verif_scale_pyperf180:FAIL
Note that due to recent llvm18 development, the patch [2] (already applied
in bpf-next) needs to be applied to bpf tree for testing purpose.
The fix is rather simple. For 32bit offset branch insn, the adjusted
offset compares to [S32_MIN, S32_MAX] and then verification succeeded.
[1] https://lore.kernel.org/all/20230728011143.3710005-1-yonghong.song@linux.dev
[2] https://lore.kernel.org/bpf/20231110193644.3130906-1-yonghong.song@linux.dev
Fixes: 4cd58e9af8 ("bpf: Support new 32bit offset jmp instruction")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231201024640.3417057-1-yonghong.song@linux.dev
strlcpy() reads the entire source buffer first. This read may exceed
the destination size limit. This is both inefficient and can lead
to linear read overflows if a source string is not NUL-terminated[1].
Additionally, it returns the size of the source string, not the
resulting size of the destination string. In an effort to remove strlcpy()
completely[2], replace strlcpy() here with strscpy().
The negative return value is already handled by this code so no new
handling is needed here.
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [1]
Link: https://github.com/KSPP/linux/issues/89 [2]
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: linux-trace-kernel@vger.kernel.org
Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20231130205607.work.463-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
The multi-line comment style in the file is rather arbitrary.
Make it follow the standard one.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20231120151419.1661807-6-andriy.shevchenko@linux.intel.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Sort the headers in alphabetic order in order to ease
the maintenance for this part.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20231120151419.1661807-5-andriy.shevchenko@linux.intel.com
Signed-off-by: Kees Cook <keescook@chromium.org>
We can use strnlen() even on early stages and it prevents from
going over the string boundaries in case it's already too long.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20231120151419.1661807-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Introduce a new type for the callback to parse an unknown argument.
This unifies function prototypes which takes that as a parameter.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20231120151419.1661807-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Kees Cook <keescook@chromium.org>
The current design of cgroup_rstat_cpu_pop_updated() is to traverse
the updated tree in a way to pop out the leaf nodes first before
their parents. This can cause traversal of multiple nodes before a
leaf node can be found and popped out. IOW, a given node in the tree
can be visited multiple times before the whole operation is done. So
it is not very efficient and the code can be hard to read.
With the introduction of cgroup_rstat_updated_list() to build a list
of cgroups to be flushed first before any flushing operation is being
done, we can optimize the way the updated tree nodes are being popped
by pushing the parents first to the tail end of the list before their
children. In this way, most updated tree nodes will be visited only
once with the exception of the subtree root as we still need to go
back to its parent and popped it out of its updated_children list.
This also makes the code easier to read.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
I have seen several cases of attempts to use mutex_unlock() to release an
object such that the object can then be freed by another task.
This is not safe because mutex_unlock(), in the
MUTEX_FLAG_WAITERS && !MUTEX_FLAG_HANDOFF case, accesses the mutex
structure after having marked it as unlocked; so mutex_unlock() requires
its caller to ensure that the mutex stays alive until mutex_unlock()
returns.
If MUTEX_FLAG_WAITERS is set and there are real waiters, those waiters
have to keep the mutex alive, but we could have a spurious
MUTEX_FLAG_WAITERS left if an interruptible/killable waiter bailed
between the points where __mutex_unlock_slowpath() did the cmpxchg
reading the flags and where it acquired the wait_lock.
( With spinlocks, that kind of code pattern is allowed and, from what I
remember, used in several places in the kernel. )
Document this, such a semantic difference between mutexes and spinlocks
is fairly unintuitive.
[ mingo: Made the changelog a bit more assertive, refined the comments. ]
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231130204817.2031407-1-jannh@google.com
Since the rethook::handler is an RCU-maganged pointer so that it will
notice readers the rethook is stopped (unregistered) or not, it should
be an __rcu pointer and use appropriate functions to be accessed. This
will use appropriate memory barrier when accessing it. OTOH,
rethook::data is never changed, so we don't need to check it in
get_kretprobe().
NOTE: To avoid sparse warning, rethook::handler is defined by a raw
function pointer type with __rcu instead of rethook_handler_t.
Link: https://lore.kernel.org/all/170126066201.398836.837498688669005979.stgit@devnote2/
Fixes: 54ecbe6f1e ("rethook: Add a generic return hook")
Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311241808.rv9ceuAh-lkp@intel.com/
Tested-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
It seems that the pointer-to-kretprobe "rp" within the kretprobe_holder is
RCU-managed, based on the (non-rethook) implementation of get_kretprobe().
The thought behind this patch is to make use of the RCU API where possible
when accessing this pointer so that the needed barriers are always in place
and to self-document the code.
The __rcu annotation to "rp" allows for sparse RCU checking. Plain writes
done to the "rp" pointer are changed to make use of the RCU macro for
assignment. For the single read, the implementation of get_kretprobe()
is simplified by making use of an RCU macro which accomplishes the same,
but note that the log warning text will be more generic.
I did find that there is a difference in assembly generated between the
usage of the RCU macros vs without. For example, on arm64, when using
rcu_assign_pointer(), the corresponding store instruction is a
store-release (STLR) which has an implicit barrier. When normal assignment
is done, a regular store (STR) is found. In the macro case, this seems to
be a result of rcu_assign_pointer() using smp_store_release() when the
value to write is not NULL.
Link: https://lore.kernel.org/all/20231122132058.3359-1-inwardvessel@gmail.com/
Fixes: d741bf41d7 ("kprobes: Remove kretprobe hash")
Cc: stable@vger.kernel.org
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZWiCPAAKCRDbK58LschI
g4djAQC1FdqCRIFkhbiIRNHTgHjnfQShELQbd9ofJqzylLqmmgD+JI1E7D9SXagm
pIXQ26EGmq8/VcCT3VLncA8EsC76Gg4=
=Xowm
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-11-30
We've added 30 non-merge commits during the last 7 day(s) which contain
a total of 58 files changed, 1598 insertions(+), 154 deletions(-).
The main changes are:
1) Add initial TX metadata implementation for AF_XDP with support in mlx5
and stmmac drivers. Two types of offloads are supported right now, that
is, TX timestamp and TX checksum offload, from Stanislav Fomichev with
stmmac implementation from Song Yoong Siang.
2) Change BPF verifier logic to validate global subprograms lazily instead
of unconditionally before the main program, so they can be guarded using
BPF CO-RE techniques, from Andrii Nakryiko.
3) Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter, from Jiri Olsa.
4) Use pkg-config in BPF selftests to determine ld flags which is
in particular needed for linking statically, from Akihiko Odaki.
5) Fix a few BPF selftest failures to adapt to the upcoming LLVM18,
from Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (30 commits)
bpf/tests: Remove duplicate JSGT tests
selftests/bpf: Add TX side to xdp_hw_metadata
selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP
selftests/bpf: Add TX side to xdp_metadata
selftests/bpf: Add csum helpers
selftests/xsk: Support tx_metadata_len
xsk: Add option to calculate TX checksum in SW
xsk: Validate xsk_tx_metadata flags
xsk: Document tx_metadata_len layout
net: stmmac: Add Tx HWTS support to XDP ZC
net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
tools: ynl: Print xsk-features from the sample
xsk: Add TX timestamp and TX checksum offload support
xsk: Support tx_metadata_len
selftests/bpf: Use pkg-config for libelf
selftests/bpf: Override PKG_CONFIG for static builds
selftests/bpf: Choose pkg-config for the target
bpftool: Add support to display uprobe_multi links
selftests/bpf: Add link_info test for uprobe_multi link
selftests/bpf: Use bpf_link__destroy in fill_link_info tests
...
====================
Conflicts:
Documentation/netlink/specs/netdev.yaml:
839ff60df3 ("net: page_pool: add nlspec for basic access to page pools")
48eb03dd26 ("xsk: Add TX timestamp and TX checksum offload support")
https://lore.kernel.org/all/20231201094705.1ee3cab8@canb.auug.org.au/
While at it also regen, tree is dirty after:
48eb03dd26 ("xsk: Add TX timestamp and TX checksum offload support")
looks like code wasn't re-rendered after "render-max" was removed.
Link: https://lore.kernel.org/r/20231130145708.32573-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current release - regressions:
- neighbour: fix __randomize_layout crash in struct neighbour
- r8169: fix deadlock on RTL8125 in jumbo mtu mode
Previous releases - regressions:
- wifi:
- mac80211: fix warning at station removal time
- cfg80211: fix CQM for non-range use
- tools: ynl-gen: fix unexpected response handling
- octeontx2-af: fix possible buffer overflow
- dpaa2: recycle the RX buffer only after all processing done
- rswitch: fix missing dev_kfree_skb_any() in error path
Previous releases - always broken:
- ipv4: fix uaf issue when receiving igmp query packet
- wifi: mac80211: fix debugfs deadlock at device removal time
- bpf:
- sockmap: af_unix stream sockets need to hold ref for pair sock
- netdevsim: don't accept device bound programs
- selftests: fix a char signedness issue
- dsa: mv88e6xxx: fix marvell 6350 probe crash
- octeontx2-pf: restore TC ingress police rules when interface is up
- wangxun: fix memory leak on msix entry
- ravb: keep reverse order of operations in ravb_remove()
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmVobzISHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOk4rwP/2qaUstOJVpkO8cG+bRYi3idH9uO/8Yu
dYgFI4LM826YgbVNVzuiu9Sh7t78dbep/fWQ2quDuZUinWtPmv6RV3UKbDyNWLRr
iV7sZvXElGsUefixxGANYDUPuCrlr3O230Y8zCN0R65BMurppljs9Pp8FwIqaD+v
pVs2alb/PeX7g+hPACKPr4Knu8QeZYmzdHoyYeLoMG3PqIgJVU3/8OHHfmnYCdxT
VSss2LB5FKFCOgetEPGy83KQP7QVaK22GDphZJ4xh7aSewRVP92ORfauiI8To4vQ
0VnLNcQ+1pXnYzgGdv8oF02e4EP5b0jvrTpqCw1U0QU2s2PARJarzajCXBkwa308
gXELRpVRpY4+7WEBSX4RGUigurwGGEh/IP/puVtPDr9KU3lFgaTI8wM624Y3Ob/e
/LVI7a5kUSJysq9/H/QrHjoiuTtV7nCmzBgDqEFSN5hQinSHYKyD0XsUPcLlMJmn
p6CyQDGHv2ibbg+8TStig0xfmC83N8KfDfcCekSrYxquDMTRtfa2VXofzQiQKDnr
XNyIURmZAAUVPR6enxlg5Iqzc0mQGumYif7wzsO1bzVzmVZgIDCVxU95hkoRrutU
qnWXuUGUdieUvXA9HltntTzy2BgJVtg7L/p8YEbd97dxtgK80sbdnjfDswFvEeE4
nTvE+IDKdCmb
=QiQp
-----END PGP SIGNATURE-----
Merge tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from bpf and wifi.
Current release - regressions:
- neighbour: fix __randomize_layout crash in struct neighbour
- r8169: fix deadlock on RTL8125 in jumbo mtu mode
Previous releases - regressions:
- wifi:
- mac80211: fix warning at station removal time
- cfg80211: fix CQM for non-range use
- tools: ynl-gen: fix unexpected response handling
- octeontx2-af: fix possible buffer overflow
- dpaa2: recycle the RX buffer only after all processing done
- rswitch: fix missing dev_kfree_skb_any() in error path
Previous releases - always broken:
- ipv4: fix uaf issue when receiving igmp query packet
- wifi: mac80211: fix debugfs deadlock at device removal time
- bpf:
- sockmap: af_unix stream sockets need to hold ref for pair sock
- netdevsim: don't accept device bound programs
- selftests: fix a char signedness issue
- dsa: mv88e6xxx: fix marvell 6350 probe crash
- octeontx2-pf: restore TC ingress police rules when interface is up
- wangxun: fix memory leak on msix entry
- ravb: keep reverse order of operations in ravb_remove()"
* tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits)
net: ravb: Keep reverse order of operations in ravb_remove()
net: ravb: Stop DMA in case of failures on ravb_open()
net: ravb: Start TX queues after HW initialization succeeded
net: ravb: Make write access to CXR35 first before accessing other EMAC registers
net: ravb: Use pm_runtime_resume_and_get()
net: ravb: Check return value of reset_control_deassert()
net: libwx: fix memory leak on msix entry
ice: Fix VF Reset paths when interface in a failed over aggregate
bpf, sockmap: Add af_unix test with both sockets in map
bpf, sockmap: af_unix stream sockets need to hold ref for pair sock
tools: ynl-gen: always construct struct ynl_req_state
ethtool: don't propagate EOPNOTSUPP from dumps
ravb: Fix races between ravb_tx_timeout_work() and net related ops
r8169: prevent potential deadlock in rtl8169_close
r8169: fix deadlock on RTL8125 in jumbo mtu mode
neighbour: Fix __randomize_layout crash in struct neighbour
octeontx2-pf: Restore TC ingress police rules when interface is up
octeontx2-pf: Fix adding mbox work queue entry when num_vfs > 64
net: stmmac: xgmac: Disable FPE MMC interrupts
octeontx2-af: Fix possible buffer overflow
...
Created as testing for the conditional guard infrastructure.
Specifically this makes use of the following form:
scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR,
&task->signal->cred_guard_mutex) {
...
}
...
return 0;
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20231102110706.568467727%40infradead.org
Since reweight_entity() may have chance to change the weight of
cfs_rq->curr entity, we should also update_min_vruntime() if
this is the case
Fixes: eab03c23c2 ("sched/eevdf: Fix vruntime adjustment on reweight")
Signed-off-by: Yiwei Lin <s921975628@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
Link: https://lore.kernel.org/r/20231117080106.12890-1-s921975628@gmail.com
Budimir noted that perf_event_validate_size() only checks the size of
the newly added event, even though the sizes of all existing events
can also change due to not all events having the same read_format.
When we attach the new event, perf_group_attach(), we do re-compute
the size for all events.
Fixes: a723968c0e ("perf: Fix u16 overflows")
Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
It is possible for a task to be thawed multiple times when mixing the
*legacy* cgroup freezer and system-wide freezer. To do this, freeze the
cgroup, do system-wide freeze/thaw, then thaw the cgroup. When this
happens, then a stale saved_state can be written to the task's state
and cause task to hang indefinitely. Fix this by only trying to thaw
tasks that are actually frozen.
This change also has the marginal benefit avoiding unnecessary
wake_up_state(p, TASK_FROZEN) if we know the task is already thawed.
There is not possibility of time-of-compare/time-of-use race when we skip
the wake_up_state because entering/exiting TASK_FROZEN is guarded by
freezer_lock.
Fixes: 8f0eed4a78 ("freezer,sched: Use saved_state to reduce some spurious wakeups")
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abhijeet Dharmapurikar <quic_adharmap@quicinc.com>
Link: https://lore.kernel.org/r/20231120-freezer-state-multiple-thaws-v1-1-f2e1dd7ce5a2@quicinc.com
Adding support to get uprobe_link details through bpf_link_info
interface.
Adding new struct uprobe_multi to struct bpf_link_info to carry
the uprobe_multi link details.
The uprobe_multi.count is passed from user space to denote size
of array fields (offsets/ref_ctr_offsets/cookies). The actual
array size is stored back to uprobe_multi.count (allowing user
to find out the actual array size) and array fields are populated
up to the user passed size.
All the non-array fields (path/count/flags/pid) are always set.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20231125193130.834322-4-jolsa@kernel.org
We will need to return ref_ctr_offsets values through link_info
interface in following change, so we need to keep them around.
Storing ref_ctr_offsets values directly into bpf_uprobe array.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20231125193130.834322-3-jolsa@kernel.org
__thaw_task() was recently updated to warn if the task being thawed was
part of a freezer cgroup that is still currently freezing:
void __thaw_task(struct task_struct *p)
{
...
if (WARN_ON_ONCE(freezing(p)))
goto unlock;
This has exposed a bug in cgroup1 freezing where when CGROUP_FROZEN is
asserted, the CGROUP_FREEZING bits are not also cleared at the same
time. Meaning, when a cgroup is marked FROZEN it continues to be marked
FREEZING as well. This causes the WARNING to trigger, because
cgroup_freezing() thinks the cgroup is still freezing.
There are two ways to fix this:
1. Whenever FROZEN is set, clear FREEZING for the cgroup and all
children cgroups.
2. Update cgroup_freezing() to also verify that FROZEN is not set.
This patch implements option (2), since it's smaller and more
straightforward.
Signed-off-by: Tim Van Patten <timvp@google.com>
Tested-by: Mark Hasemeyer <markhas@chromium.org>
Fixes: f5d39b0208 ("freezer,sched: Rewrite core freezer logic")
Cc: stable@vger.kernel.org # v6.1+
Signed-off-by: Tejun Heo <tj@kernel.org>
The root-only cpuset.cpus.isolated control file shows the current set
of isolated CPUs in isolated partitions. This control file is currently
exposed only with the cgroup_debug boot command line option which also
adds the ".__DEBUG__." prefix. This is actually a useful control file if
users want to find out which CPUs are currently in an isolated state by
the cpuset controller. Remove CFTYPE_DEBUG flag for this control file and
make it available by default without any prefix.
The test_cpuset_prs.sh test script and the cgroup-v2.rst documentation
file are also updated accordingly. Minor code change is also made in
test_cpuset_prs.sh to avoid false test failure when running on debug
kernel.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There's no reason we need to couple mnt idmapping to namespaces in the
way we currently do. Copy the idmapping when an idmapped mount is
created and don't take any reference on the namespace at all.
We also can't easily refcount struct uid_gid_map because it needs to
stay the size of a cacheline otherwise we risk performance regressions
(Ignoring for a second that right now struct uid_gid_map isn't actually
64 byte but 72 but that's a fix for another patch series.).
Link: https://lore.kernel.org/r/20231122-vfs-mnt_idmap-v1-3-dae4abdde5bd@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
bpf_mem_cache_alloc_flags() may call __alloc() directly when there is no
free object in free list, but it doesn't initialize the allocation hint
for the returned pointer. It may lead to bad memory dereference when
freeing the pointer, so fix it by initializing the allocation hint.
Fixes: 822fb26bdb ("bpf: Add a hint to allocated objects.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231111043821.2258513-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Kent reported an occasional KASAN splat in lockdep. Mark then noted:
> I suspect the dodgy access is to chain_block_buckets[-1], which hits the last 4
> bytes of the redzone and gets (incorrectly/misleadingly) attributed to
> nr_large_chain_blocks.
That would mean @size == 0, at which point size_to_bucket() returns -1
and the above happens.
alloc_chain_hlocks() has 'size - req', for the first with the
precondition 'size >= rq', which allows the 0.
This code is trying to split a block, del_chain_block() takes what we
need, and add_chain_block() puts back the remainder, except in the
above case the remainder is 0 sized and things go sideways.
Fixes: 810507fe6f ("locking/lockdep: Reuse freed chain_hlocks entries")
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lkml.kernel.org/r/20231121114126.GH8262@noisy.programming.kicks-ass.net
Slightly change BPF verifier logic around eagerness and order of global
subprog validation. Instead of going over every global subprog eagerly
and validating it before main (entry) BPF program is verified, turn it
around. Validate main program first, mark subprogs that were called from
main program for later verification, but otherwise assume it is valid.
Afterwards, go over marked global subprogs and validate those,
potentially marking some more global functions as being called. Continue
this process until all (transitively) callable global subprogs are
validated. It's a BFS traversal at its heart and will always converge.
This is an important change because it allows to feature-gate some
subprograms that might not be verifiable on some older kernel, depending
on supported set of features.
E.g., at some point, global functions were allowed to accept a pointer
to memory, which size is identified by user-provided type.
Unfortunately, older kernels don't support this feature. With BPF CO-RE
approach, the natural way would be to still compile BPF object file once
and guard calls to this global subprog with some CO-RE check or using
.rodata variables. That's what people do to guard usage of new helpers
or kfuncs, and any other new BPF-side feature that might be missing on
old kernels.
That's currently impossible to do with global subprogs, unfortunately,
because they are eagerly and unconditionally validated. This patch set
aims to change this, so that in the future when global funcs gain new
features, those can be guarded using BPF CO-RE techniques in the same
fashion as any other new kernel feature.
Two selftests had to be adjusted in sync with these changes.
test_global_func12 relied on eager global subprog validation failing
before main program failure is detected (unknown return value). Fix by
making sure that main program is always valid.
verifier_subprog_precision's parent_stack_slot_precise subtest relied on
verifier checkpointing heuristic to do a checkpoint at instruction #5,
but that's no longer true because we don't have enough jumps validated
before reaching insn #5 due to global subprogs being validated later.
Other than that, no changes, as one would expect.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231124035937.403208-3-andrii@kernel.org
We have the name, instead of emitting just func#N to identify global
subprog, augment verifier log messages with actual function name to make
it more user-friendly.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231124035937.403208-2-andrii@kernel.org
Current release - regressions:
- Revert "net: r8169: Disable multicast filter for RTL8168H
and RTL8107E"
- kselftest: rtnetlink: fix ip route command typo
Current release - new code bugs:
- s390/ism: make sure ism driver implies smc protocol in kconfig
- two build fixes for tools/net
Previous releases - regressions:
- rxrpc: couple of ACK/PING/RTT handling fixes
Previous releases - always broken:
- bpf: verify bpf_loop() callbacks as if they are called unknown
number of times
- improve stability of auto-bonding with Hyper-V
- account BPF-neigh-redirected traffic in interface statistics
Misc:
- net: fill in some more MODULE_DESCRIPTION()s
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmVfiBoACgkQMUZtbf5S
IrukFhAAiY5XyqiVyEBsm10AgYSpl0BbnxywfK27nR2SbxSTvSxyuXseV2EyEynU
iNn6CksHe2rG1/DXbKoQohsIYey/YjY5c6omT5JzuxIT2h69J4iYKMIj/Ptk5joZ
MQsDK5J9aCvxXBazYF2CuOCgVcwmqNFbCHf1FAFhk0RuqI3RoC5/OGbLM0tmTMQw
BceNUHBn8iPcSkRbnntwLLHVxMrX9bay6F+pcy5/b40VWBATM3uBkw/2zBqPZ+n1
Z0SNWvLtnO6T66Y07vaE1sRiqN4KHtS4WWelVYcmYX2rY1HkXx/TUjvrJ7R/uQQQ
/5yUB6G294NmFv/2X+Yjt5AB8PjnFzjm/BqCBrjXcnnMPOiB0YZg554s59RhRrSr
cmZ4bveUgCQV/jJWMxwGYvZSAqtle8uN+8DhxdjbCpVJocbrseDHKyWJ6bOy85BN
zbJuUOUeFDs53nhV+Z9fiuUFuxhIwDCKHHFmEp7R5VotX0ZURiDnqlj9WEIxKZrC
UidWRXv/VP+bV4BB2GVIFncEWMuhrnWOQY8DR6eC33uq7JkwTZD3R8IGR8up/+tm
CtVyPvefAYZB8/IVU/mOSVrx04ERupNVvBkXzOMQe7UqRq3okPsQFPW8HmSrmnQG
KrJWyBIqG3jnJCuqoXwlt0rKP3MmgCjowhTbZ3uDjeVf9UJTu2U=
=2sG4
-----END PGP SIGNATURE-----
Merge tag 'net-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf.
Current release - regressions:
- Revert "net: r8169: Disable multicast filter for RTL8168H and
RTL8107E"
- kselftest: rtnetlink: fix ip route command typo
Current release - new code bugs:
- s390/ism: make sure ism driver implies smc protocol in kconfig
- two build fixes for tools/net
Previous releases - regressions:
- rxrpc: couple of ACK/PING/RTT handling fixes
Previous releases - always broken:
- bpf: verify bpf_loop() callbacks as if they are called unknown
number of times
- improve stability of auto-bonding with Hyper-V
- account BPF-neigh-redirected traffic in interface statistics
Misc:
- net: fill in some more MODULE_DESCRIPTION()s"
* tag 'net-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
tools: ynl: fix duplicate op name in devlink
tools: ynl: fix header path for nfsd
net: ipa: fix one GSI register field width
tls: fix NULL deref on tls_sw_splice_eof() with empty record
net: axienet: Fix check for partial TX checksum
vsock/test: fix SEQPACKET message bounds test
i40e: Fix adding unsupported cloud filters
ice: restore timestamp configuration after device reset
ice: unify logic for programming PFINT_TSYN_MSK
ice: remove ptp_tx ring parameter flag
amd-xgbe: propagate the correct speed and duplex status
amd-xgbe: handle the corner-case during tx completion
amd-xgbe: handle corner-case during sfp hotplug
net: veth: fix ethtool stats reporting
octeontx2-pf: Fix ntuple rule creation to direct packet to VF with higher Rx queue than its PF
net: usb: qmi_wwan: claim interface 4 for ZTE MF290
Revert "net: r8169: Disable multicast filter for RTL8168H and RTL8107E"
net/smc: avoid data corruption caused by decline
nfc: virtual_ncidev: Add variable to check if ndev is running
dpll: Fix potential msg memleak when genlmsg_put_reply failed
...
Thomas reported that commit 652ffc2104 ("perf/core: Fix narrow
startup race when creating the perf nr_addr_filters sysfs file") made
the entire attribute group vanish, instead of only the nr_addr_filters
attribute.
Additionally a stray return.
Insufficient coffee was involved with both writing and merging the
patch.
Fixes: 652ffc2104 ("perf/core: Fix narrow startup race when creating the perf nr_addr_filters sysfs file")
Reported-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Link: https://lkml.kernel.org/r/20231122100756.GP8262@noisy.programming.kicks-ass.net
Use the max value that has already been computed inside sugov_get_util()
to cap the iowait boost and remove dependency with uclamp_rq_util_with()
which is not used anymore.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231122133904.446032-3-vincent.guittot@linaro.org
The current method to take into account uclamp hints when estimating the
target frequency can end in a situation where the selected target
frequency is finally higher than uclamp hints, whereas there are no real
needs. Such cases mainly happen because we are currently mixing the
traditional scheduler utilization signal with the uclamp performance
hints. By adding these 2 metrics, we loose an important information when
it comes to select the target frequency, and we have to make some
assumptions which can't fit all cases.
Rework the interface between the scheduler and schedutil governor in order
to propagate all information down to the cpufreq governor.
effective_cpu_util() interface changes and now returns the actual
utilization of the CPU with 2 optional inputs:
- The minimum performance for this CPU; typically the capacity to handle
the deadline task and the interrupt pressure. But also uclamp_min
request when available.
- The maximum targeting performance for this CPU which reflects the
maximum level that we would like to not exceed. By default it will be
the CPU capacity but can be reduced because of some performance hints
set with uclamp. The value can be lower than actual utilization and/or
min performance level.
A new sugov_effective_cpu_perf() interface is also available to compute
the final performance level that is targeted for the CPU, after applying
some cpufreq headroom and taking into account all inputs.
With these 2 functions, schedutil is now able to decide when it must go
above uclamp hints. It now also has a generic way to get the min
performance level.
The dependency between energy model and cpufreq governor and its headroom
policy doesn't exist anymore.
eenv_pd_max_util() asks schedutil for the targeted performance after
applying the impact of the waking task.
[ mingo: Refined the changelog & C comments. ]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231122133904.446032-2-vincent.guittot@linaro.org
Lukasz Luba reported that a thread's util_est can significantly decrease as
a result of sharing the CPU with other threads.
The use case can be easily reproduced with a periodic task TA that runs 1ms
and sleeps 100us. When the task is alone on the CPU, its max utilization and
its util_est is around 888. If another similar task starts to run on the
same CPU, TA will have to share the CPU runtime and its maximum utilization
will decrease around half the CPU capacity (512) then TA's util_est will
follow this new maximum trend which is only the result of sharing the CPU
with others tasks.
Such situation can be detected with runnable_avg wich is close or
equal to util_avg when TA is alone, but increases above util_avg when TA
shares the CPU with other threads and wait on the runqueue.
[ We prefer an util_est that overestimate rather than under estimate
because in 1st case we will not provide enough performance to the
task which will remain under-provisioned, whereas in the other case we
will create some idle time which will enable to reduce contention and
as a result reduces the util_est so the overestimate will be transient
whereas the underestimate will remain. ]
[ mingo: Refined the changelog, added comments from the LKML discussion. ]
Reported-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/lkml/CAKfTPtDd-HhF-YiNTtL9i5k0PfJbF819Yxu4YquzfXgwi7voyw@mail.gmail.com/#t
Link: https://lore.kernel.org/r/20231122140119.472110-1-vincent.guittot@linaro.org
Cc: Hongyan Xia <hongyan.xia2@arm.com>
For rcutorture tests on RCU implementations that support
force-quiescent-state operations and that set the fqs_duration module
parameter greater than zero, the fqs_task kthread will be created.
However, if the fqs_holdoff module parameter is not set, then its default
value of zero will cause fqs_task enter a long-term busy loop until
stopped by kthread_stop(). This commit therefore adds a fqs_holdoff
check before the fqs_task is created, making sure that whenever the
fqs_task is created, the fqs_holdoff will be greater than zero.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
One letter difference is really not enough, so this commit changes
call_rcu_chain to call_rcu_chain_list.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
cgroup/for-6.8 is carrying two workqueue changes to allow cpuset to restrict
the CPUs used by unbound workqueues. Unfortunately, this conflicts with a
new bug fix in wq/for-6.7-fixes. The conflict is contextual but can be a bit
confusing to resolve. Pull the fix branch to resolve the conflict.
Signed-off-by: Tejun Heo <tj@kernel.org>
During boot, depending on how the housekeeping and workqueue.unbound_cpus
masks are set, wq_unbound_cpumask can end up empty. Since 8639ecebc9
("workqueue: Implement non-strict affinity scope for unbound workqueues"),
this may end up feeding -1 as a CPU number into scheduler leading to oopses.
BUG: unable to handle page fault for address: ffffffff8305e9c0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
...
Call Trace:
<TASK>
select_idle_sibling+0x79/0xaf0
select_task_rq_fair+0x1cb/0x7b0
try_to_wake_up+0x29c/0x5c0
wake_up_process+0x19/0x20
kick_pool+0x5e/0xb0
__queue_work+0x119/0x430
queue_work_on+0x29/0x30
...
An empty wq_unbound_cpumask is a clear misconfiguration and already
disallowed once system is booted up. Let's warn on and ignore
unbound_cpumask restrictions which lead to no unbound cpus. While at it,
also remove now unncessary empty check on wq_unbound_cpumask in
wq_select_unbound_cpu().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Yong He <alexyonghe@tencent.com>
Link: http://lkml.kernel.org/r/20231120121623.119780-1-alexyonghe@tencent.com
Fixes: 8639ecebc9 ("workqueue: Implement non-strict affinity scope for unbound workqueues")
Cc: stable@vger.kernel.org # v6.6+
Reviewed-by: Waiman Long <longman@redhat.com>
This function is defined globally in clocksource.c and used conditionally
in clockevent.c, which the declaration hidden when clockevent support
is disabled. This causes a harmless warning in the definition:
kernel/time/clocksource.c:1324:9: warning: no previous prototype for 'sysfs_get_uname' [-Wmissing-prototypes]
1324 | ssize_t sysfs_get_uname(const char *buf, char *dst, size_t cnt)
Move the declaration out of the #ifdef so it is always visible.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20231108125843.3806765-5-arnd@kernel.org
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZV0kjgAKCRDbK58LschI
gy0EAP9XwncW2OhO72DpITluFzvWPgB0N97OANKBXjzKJrRAlQD/aUe9nlvBQuad
WsbMKLeC4wvI2X/4PEIR4ukbuZ3ypAA=
=LMVg
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-11-21
We've added 85 non-merge commits during the last 12 day(s) which contain
a total of 63 files changed, 4464 insertions(+), 1484 deletions(-).
The main changes are:
1) Huge batch of verifier changes to improve BPF register bounds logic
and range support along with a large test suite, and verifier log
improvements, all from Andrii Nakryiko.
2) Add a new kfunc which acquires the associated cgroup of a task within
a specific cgroup v1 hierarchy where the latter is identified by its id,
from Yafang Shao.
3) Extend verifier to allow bpf_refcount_acquire() of a map value field
obtained via direct load which is a use-case needed in sched_ext,
from Dave Marchevsky.
4) Fix bpf_get_task_stack() helper to add the correct crosstask check
for the get_perf_callchain(), from Jordan Rome.
5) Fix BPF task_iter internals where lockless usage of next_thread()
was wrong. The rework also simplifies the code, from Oleg Nesterov.
6) Fix uninitialized tail padding via LIBBPF_OPTS_RESET, and another
fix for certain BPF UAPI structs to fix verifier failures seen
in bpf_dynptr usage, from Yonghong Song.
7) Add BPF selftest fixes for map_percpu_stats flakes due to per-CPU BPF
memory allocator not being able to allocate per-CPU pointer successfully,
from Hou Tao.
8) Add prep work around dynptr and string handling for kfuncs which
is later going to be used by file verification via BPF LSM and fsverity,
from Song Liu.
9) Improve BPF selftests to update multiple prog_tests to use ASSERT_*
macros, from Yuran Pereira.
10) Optimize LPM trie lookup to check prefixlen before walking the trie,
from Florian Lehner.
11) Consolidate virtio/9p configs from BPF selftests in config.vm file
given they are needed consistently across archs, from Manu Bretelle.
12) Small BPF verifier refactor to remove register_is_const(),
from Shung-Hsi Yu.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits)
selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in vmlinux
selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_obj_id
selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bind_perm
selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_tcp_ca
selftests/bpf: reduce verboseness of reg_bounds selftest logs
bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)
bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()
bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()
bpf: emit frameno for PTR_TO_STACK regs if it differs from current one
bpf: smarter verifier log number printing logic
bpf: omit default off=0 and imm=0 in register state log
bpf: emit map name in register state if applicable and available
bpf: print spilled register state in stack slot
bpf: extract register state printing
bpf: move verifier state printing code to kernel/bpf/log.c
bpf: move verbose_linfo() into kernel/bpf/log.c
bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS
bpf: Remove test for MOVSX32 with offset=32
selftests/bpf: add iter test requiring range x range logic
veristat: add ability to set BPF_F_TEST_SANITY_STRICT flag with -r flag
...
====================
Link: https://lore.kernel.org/r/20231122000500.28126-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit fe28f631fa ("workqueue: Add workqueue_unbound_exclude_cpumask()
to exclude CPUs from wq_unbound_cpumask") makes
workqueue_set_unbound_cpumask() static as it is not used elsewhere in
the kernel. However, this triggers a kernel test robot warning about
'workqueue_set_unbound_cpumask' defined but not used when CONFIG_SYS
isn't defined. It happens that workqueue_set_unbound_cpumask() is only
called when CONFIG_SYS is defined.
Move workqueue_set_unbound_cpumask() and its helpers inside the
CONFIG_SYSFS compilation block to avoid the warning. There is no
functional change.
Fixes: fe28f631fa ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311130831.uh0AoCd1-lkp@intel.com/
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In some cases verifier can't infer convergence of the bpf_loop()
iteration. E.g. for the following program:
static int cb(__u32 idx, struct num_context* ctx)
{
ctx->i++;
return 0;
}
SEC("?raw_tp")
int prog(void *_)
{
struct num_context ctx = { .i = 0 };
__u8 choice_arr[2] = { 0, 1 };
bpf_loop(2, cb, &ctx, 0);
return choice_arr[ctx.i];
}
Each 'cb' simulation would eventually return to 'prog' and reach
'return choice_arr[ctx.i]' statement. At which point ctx.i would be
marked precise, thus forcing verifier to track multitude of separate
states with {.i=0}, {.i=1}, ... at bpf_loop() callback entry.
This commit allows "brute force" handling for such cases by limiting
number of callback body simulations using 'umax' value of the first
bpf_loop() parameter.
For this, extend bpf_func_state with 'callback_depth' field.
Increment this field when callback visiting state is pushed to states
traversal stack. For frame #N it's 'callback_depth' field counts how
many times callback with frame depth N+1 had been executed.
Use bpf_func_state specifically to allow independent tracking of
callback depths when multiple nested bpf_loop() calls are present.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-11-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Callbacks are similar to open coded iterators, so add imprecise
widening logic for callback body processing. This makes callback based
loops behave identically to open coded iterators, e.g. allowing to
verify programs like below:
struct ctx { u32 i; };
int cb(u32 idx, struct ctx* ctx)
{
++ctx->i;
return 0;
}
...
struct ctx ctx = { .i = 0 };
bpf_loop(100, cb, &ctx, 0);
...
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-9-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Prior to this patch callbacks were handled as regular function calls,
execution of callback body was modeled exactly once.
This patch updates callbacks handling logic as follows:
- introduces a function push_callback_call() that schedules callback
body verification in env->head stack;
- updates prepare_func_exit() to reschedule callback body verification
upon BPF_EXIT;
- as calls to bpf_*_iter_next(), calls to callback invoking functions
are marked as checkpoints;
- is_state_visited() is updated to stop callback based iteration when
some identical parent state is found.
Paths with callback function invoked zero times are now verified first,
which leads to necessity to modify some selftests:
- the following negative tests required adding release/unlock/drop
calls to avoid previously masked unrelated error reports:
- cb_refs.c:underflow_prog
- exceptions_fail.c:reject_rbtree_add_throw
- exceptions_fail.c:reject_with_cp_reference
- the following precision tracking selftests needed change in expected
log trace:
- verifier_subprog_precision.c:callback_result_precise
(note: r0 precision is no longer propagated inside callback and
I think this is a correct behavior)
- verifier_subprog_precision.c:parent_callee_saved_reg_precise_with_callback
- verifier_subprog_precision.c:parent_stack_slot_precise_with_callback
Reported-by: Andrew Werner <awerner32@gmail.com>
Closes: https://lore.kernel.org/bpf/CA+vRuzPChFNXmouzGG+wsy=6eMcfr1mFG0F3g7rbg-sedGKW3w@mail.gmail.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Move code for simulated stack frame creation to a separate utility
function. This function would be used in the follow-up change for
callbacks handling.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Split check_reg_arg() into two utility functions:
- check_reg_arg() operating on registers from current verifier state;
- __check_reg_arg() operating on a specific set of registers passed as
a parameter;
The __check_reg_arg() function would be used by a follow-up change for
callbacks handling.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
offlined earlier in the offlining process in order to prevent
a deadlock
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmVaB28ACgkQEsHwGGHe
VUr3ZBAAwOLL5vimHB3Y59cTRLPN+zGKhzyVLMLnbkKs4sGJ+9srP4HLX4Q9PoAb
kR9Hzq90+48YuyLe+S/R2pvm1x88K33spS+4w4fl3x6EeToqvUlop2GPuMS2yzXY
yECdqCLEd3Q6DeI8hN35lv899qyfGSD+6WxezLCT+uwx6AMHljMAsDy2249UtMZv
1bqZnYCtN2zv3MQuV1uli/AVxTDv4vXcumza17inuw0IpEA26Wz2kWruxeyZnUXU
/sWZudUdhiErfg428ok3oTL1BOwPzyiIWjhN2MzqlKFmyp463DwV7KoAc3SxYINE
8qbODN93CFdnU6h29+VQoRxO9vcmWL6w7A/Swc9ar/0/Qnt7H9JdzUKtJ4+EaTCY
/IpRWcNcX4WI6BKuHHl6kOBvX3YW77PKaIsxj8JDNZTMk6rq6lMGi+tIaVsAki92
3MQZ9+Lkm0baykIZAWz4jajbA98KvJMeJ60qZQI6sWWdpyrncEqG9pH/ulkLY4aZ
gT94LiRpdwT0LWrX0J6xPMTNi9NYWjdB/uyo6Drer42SB9J7ol4rAbOxs50srG8i
z46VGDtgWz6C5MSkonhQqrpGzc/HF9xCWVVSF1UENT4K+2W55JhJrDZBs5XCPJiz
Bj8T3Maz7wcVkA41DA7C5xlVed+ST1ID8/4y5cWImnrmWOdG5Zw=
=Tekh
-----END PGP SIGNATURE-----
Merge tag 'timers_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:
- Do the push of pending hrtimers away from a CPU which is being
offlined earlier in the offlining process in order to prevent a
deadlock
* tag 'timers_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimers: Push pending hrtimers away from outgoing CPU earlier
weights
- Fix wrongly rejected unprivileged poll requests to the cgroup psi
pressure files
- Make sure the load balancing is done by only one CPU
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmVaBe0ACgkQEsHwGGHe
VUqX4hAAmrlp7bcloMRto6j4yC+pjDIQlFym7opa7kaEPeY3icOydfpSGEDnEwbv
HxOOmveb2sC8DBE+Rkum4bHb2I46SD/5LlM/MZHvSguEGNgAJEYCcPfGZJDgGlW1
MgALG78ThA/mVKr5i3/Q1U6U71+vuNHJOpCY1s4o+jgF/sG3AYIdK1sqaVI++ygz
q0WK31jGo+YelPpNDKnXpVGIuOcUlh9v/Hu2zGBBJD9pf4kfTelseiV7rc+rk0yI
YHSKpw2jCnuJaGS748Q4aIG+8kLRBz+HqUKDWQPlq3pRWjJWTBbH+i8TZef7keZQ
gAk/uJpdQ9z4Z7suwY3vcEBVRo4e6AoD99XDG1eUX07C+f1d9p54EVDkgFIZMIle
pT2yd5GT/zl0UfcZ8B96y2lJHoa6pHnv83uZKtRZhBMiN5F4iN88lhQFVpZDoCBg
xZ+NPfpXcZxm4HpKFjfsGyxQJxIkC6NDdf6Rfhtc3sV1rx4AT1Qii4fDnBHOkaBs
iFgpFOCeb+K9UUXB0ONJ5PWZVnc8OGPtm/22TwtZ9rBzVqrmtVJb+VDg2YWpwFwU
xhy0hMWxwZFsn0VjjsBbgfm1/+WGjCKjbPa1SvS3oH3+H9EbWiBjxe1zwkS46PUf
HjC0RCMPxfnYG4+h9JHEaFioGvUqQQ6Ub3K8epd8MPUtD9DCnro=
=hJzS
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Fix virtual runtime calculation when recomputing a sched entity's
weights
- Fix wrongly rejected unprivileged poll requests to the cgroup psi
pressure files
- Make sure the load balancing is done by only one CPU
* tag 'sched_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix the decision for load balance
sched: psi: fix unprivileged polling against cgroups
sched/eevdf: Fix vruntime adjustment on reweight
failure
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmVaA2sACgkQEsHwGGHe
VUoOdBAAmvDdbMNVi0p33kqLhSQLwzxsqkrGyNkAfSbuuaGNsH8mQ87VA0dMQYpe
bXzJzvoccHYxJYnFyExv0d7PtN3xquh2q32D1pL6gzaA974oEmQiyQag9++gkJGh
+/NYQwo0Y2ucEsvgeMN+knE0q0OdelUAiKNPF9nE9LG0d9TLFC45jwLH+9pa5jAF
jtLBxrexeU49UBBDnoPR2CNrDi9TlNYRas2V5xlQnUXl5kZlVNcQLMo1Rcd7+dTF
6I414ZVXiS6u02Vs7wcrKC50BdBIa4h2WaOX+Nb2j9ibJ5uY14B1nwewAztmaQY7
szpaI2EtSMk0+Ap0QHTaxZvi/UREWed5n0AykqTy97f0vsvkK9zCiPk3LMJsoupu
vfEApclAIMzDi6qnB/zGhHkHLMBHsiXrANGCe6SbjphD9ic0ClKwAyqJ9kpB43JE
pnqdrTcrYLuTCV+fE9r/WfXt5Z09xmlF+usmOS4T7y35gzrl4+BPVzu2V80SlZSj
CtDSvMG7z7LLK5o8XsvQk1VlAYCXEPfOldkoRaisD82yKw0r38YqXf+cigE4noyq
55ChMwNmlqtetvPNK/6SsPtj8F/502Lqo/xAJjSRo/vO1KYpNa3sfXUZpZ5J+xuc
zVGXzcBGsNgteVin2I0jhdOvRd7apA7rKiXd0duTtiSj2N++b5U=
=T4AK
-----END PGP SIGNATURE-----
Merge tag 'locking_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:
- Fix a hardcoded futex flags case which lead to one robust futex test
failure
* tag 'locking_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Fix hardcoded flags
This looks more clear and simplifies the code. While at it, remove the
unnecessary initialization of pos/task at the start of bpf_iter_task_new().
Note that we can even kill kit->task, we can just use pos->group_leader,
but I don't understand the BUILD_BUG_ON() checks in bpf_iter_task_new().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163239.GA903@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c
is the last user and the usage is wrong.
bpf_iter_task_next() can loop forever, "kit->pos == kit->task" can never
happen if kit->pos execs. Change this code to use __next_thread().
With or without this change the usage of kit->pos/task and next_task()
doesn't look nice, see the next patch.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163237.GA897@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c
is the last user and the usage is wrong.
task_group_seq_get_next() can return the group leader twice if it races
with mt-thread exec which changes the group->leader's pid.
Change the main loop to use __next_thread(), kill "next_tid == common->pid"
check.
__next_thread() can't loop forever, we can also change this code to retry
if next_tid == 0.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163234.GA890@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Fix power soft-off on qemu
- Disable prctl(PR_SET_MDWE) since parisc sometimes still needs
writeable stacks
- Use strscpy instead of strlcpy in show_cpuinfo()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQS86RI+GtKfB8BJu973ErUQojoPXwUCZVkHjgAKCRD3ErUQojoP
X196AP9I9w/4Go3HfvFNgEGUpVSbQq8679um13mlMdlFC6z3NAD+J32vmvU1keL1
0f4C7IltOr2ntU4QIXJUCLAPWO7NWgQ=
=r7N6
-----END PGP SIGNATURE-----
Merge tag 'parisc-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"On parisc we still sometimes need writeable stacks, e.g. if programs
aren't compiled with gcc-14. To avoid issues with the upcoming
systemd-254 we therefore have to disable prctl(PR_SET_MDWE) for now
(for parisc only).
The other two patches are minor: a bugfix for the soft power-off on
qemu with 64-bit kernel and prefer strscpy() over strlcpy():
- Fix power soft-off on qemu
- Disable prctl(PR_SET_MDWE) since parisc sometimes still needs
writeable stacks
- Use strscpy instead of strlcpy in show_cpuinfo()"
* tag 'parisc-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
prctl: Disable prctl(PR_SET_MDWE) on parisc
parisc/power: Fix power soft-off when running on qemu
parisc: Replace strlcpy() with strscpy()
It's possible to pass a pointer to parent's stack to child subprogs. In
such case verifier state output is ambiguous not showing whether
register container a pointer to "current" stack, belonging to current
subprog (frame), or it's actually a pointer to one of parent frames.
So emit this information if frame number differs between the state which
register is part of. E.g., if current state is in frame 2 and it has
a register pointing to stack in grand parent state (frame #0), we'll see
something like 'R1=fp[0]-16', while "local stack pointer" will be just
'R2=fp-16'.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Instead of always printing numbers as either decimals (and in some
cases, like for "imm=%llx", in hexadecimals), decide the form based on
actual values. For numbers in a reasonably small range (currently,
[0, U16_MAX] for unsigned values, and [S16_MIN, S16_MAX] for signed ones),
emit them as decimals. In all other cases, even for signed values,
emit them in hexadecimals.
For large values hex form is often times way more useful: it's easier to
see an exact difference between 0xffffffff80000000 and 0xffffffff7fffffff,
than between 18446744071562067966 and 18446744071562067967, as one
particular example.
Small values representing small pointer offsets or application
constants, on the other hand, are way more useful to be represented in
decimal notation.
Adjust reg_bounds register state parsing logic to take into account this
change.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Simplify BPF verifier log further by omitting default (and frequently
irrelevant) off=0 and imm=0 parts for non-SCALAR_VALUE registers. As can
be seen from fixed tests, this is often a visual noise for PTR_TO_CTX
register and even for PTR_TO_PACKET registers.
Omitting default values follows the rest of register state logic: we
omit default values to keep verifier log succinct and to highlight
interesting state that deviates from default one. E.g., we do the same
for var_off, when it's unknown, which gives no additional information.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In complicated real-world applications, whenever debugging some
verification error through verifier log, it often would be very useful
to see map name for PTR_TO_MAP_VALUE register. Usually this needs to be
inferred from key/value sizes and maybe trying to guess C code location,
but it's not always clear.
Given verifier has the name, and it's never too long, let's just emit it
for ptr_to_map_key, ptr_to_map_value, and const_ptr_to_map registers. We
reshuffle the order a bit, so that map name, key size, and value size
appear before offset and immediate values, which seems like a more
logical order.
Current output:
R1_w=map_ptr(map=array_map,ks=4,vs=8,off=0,imm=0)
But we'll get rid of useless off=0 and imm=0 parts in the next patch.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Print the same register state representation when printing stack state,
as we do for normal registers. Note that if stack slot contains
subregister spill (1, 2, or 4 byte long), we'll still emit "m0?" mask
for those bytes that are not part of spilled register.
While means we can get something like fp-8=0000scalar() for a 4-byte
spill with other 4 bytes still being STACK_ZERO.
Some example before and after, taken from the log of
pyperf_subprogs.bpf.o:
49: (7b) *(u64 *)(r10 -256) = r1 ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx
49: (7b) *(u64 *)(r10 -256) = r1 ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx(off=0,imm=0)
150: (7b) *(u64 *)(r10 -264) = r0 ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null
150: (7b) *(u64 *)(r10 -264) = r0 ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0)
5192: (61) r1 = *(u32 *)(r10 -272) ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272=
5192: (61) r1 = *(u32 *)(r10 -272) ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272=????scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf))
While at it, do a few other simple clean ups:
- skip slot if it's not scratched before detecting whether it's valid;
- move taking spilled_reg pointer outside of switch (only DYNPTR has
to adjust that to get to the "main" slot);
- don't recalculate types_buf second time for MISC/ZERO/default case.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Extract printing register state representation logic into a separate
helper, as we are going to reuse it for spilled register state printing
in the next patch. This also nicely reduces code nestedness.
No functional changes.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Move a good chunk of code from verifier.c to log.c: verifier state
verbose printing logic. This is an important and very much
logging/debugging oriented code. It fits the overlall log.c's focus on
verifier logging, and moving it allows to keep growing it without
unnecessarily adding to verifier.c code that otherwise contains a core
verification logic.
There are not many shared dependencies between this code and the rest of
verifier.c code, except a few single-line helpers for various register
type checks and a bit of state "scratching" helpers. We move all such
trivial helpers into include/bpf/bpf_verifier.h as static inlines.
No functional changes in this patch.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
verifier.c is huge. Let's try to move out parts that are logging-related
into log.c, as we previously did with bpf_log() and other related stuff.
This patch moves line info verbose output routines: it's pretty
self-contained and isolated code, so there is no problem with this.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
systemd-254 tries to use prctl(PR_SET_MDWE) for it's MemoryDenyWriteExecute
functionality, but fails on parisc which still needs executable stacks in
certain combinations of gcc/glibc/kernel.
Disable prctl(PR_SET_MDWE) by returning -EINVAL for now on parisc, until
userspace has catched up.
Signed-off-by: Helge Deller <deller@gmx.de>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Sam James <sam@gentoo.org>
Closes: https://github.com/systemd/systemd/issues/29775
Tested-by: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/all/875y2jro9a.fsf@gentoo.org/
Cc: <stable@vger.kernel.org> # v6.3+
Drop the kernel-doc "/**" notation from 8 structs or functions to
prevent 22 kernel-doc warnings (samples below).
user_namespace.c:239: warning: Function parameter or member 'map_up' not described in 'idmap_key'
user_namespace.c:246: warning: Function parameter or member 'k' not described in 'cmp_map_id'
user_namespace.c:277: warning: Function parameter or member 'extents' not described in 'map_id_range_down_max'
user_namespace.c:295: warning: Function parameter or member 'extents' not described in 'map_id_range_down_base'
user_namespace.c:344: warning: Function parameter or member 'extents' not described in 'map_id_up_base'
user_namespace.c:364: warning: Function parameter or member 'extents' not described in 'map_id_up_max'
user_namespace.c:776: warning: Function parameter or member 'map' not described in 'insert_extent'
user_namespace.c:844: warning: Function parameter or member 'map' not described in 'sort_idmaps'
Fixes: 6397fac491 ("userns: bump idmap limits to 340")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20230830163215.13193-1-rdunlap@infradead.org
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmVWX8cUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPl8hAA2D4DAmbnM4wLGk5FX1ruFpACmabx
7iNPonV7loDiGZInvlTgvQxTQ6hafvs6aFqu69ZplLuCaBLiSn6U3J/bOXneQxzn
nRjLQEfJLcSmTd39M82QxpaihCtVltDRT4jPfq4AGN+6nV0TB4KyFjrIvOw7udfX
fJF096Lt9rqxbYyKk2Lgy8LZZdVqFN9pbstpH7Vas8LOi4bnvogRljhFA3vipn45
0tzMrFR9b/myOPFm1ktvAUSUdWIzNGmxsYkrxHkQ2TemhuFEiNl3n86juWzeXCzN
wjaGPLIUqJQW+C+kXRmEZo/SytiqKS5Wo97mMVDPKpYFwp6IbgjSg01LPNdmLoVY
2i1jxOFTDnANLZgXa31kjzTO2Ceu61GFVqLZGuOh2lB7rjj3+JAkL0U/YLQWWBMO
RG8MbmQnHOGZlHdqiPRKJKo/qHPW7vBkgSPJ/K0tRNXMFoZtGAfcHjxJJQNystPU
BoRd2Tdw0jMrrS5cLNXfkxhHKwNHGFny4TRyqOJo9G7/jK56JWU+3ZXNoWH9OKFJ
Ln2wH7NT16CLMnb/kZ2CSh8UQXIJpkBL1OuG6IrOQuBoNun7AzGnmXW7vqywV5bo
dqOgxtkBYhrfjUEXmRzEii2oOoc/esr1vZYnmj5K4RpWksIXTJ4BZjC9kH/fHEpg
2ZO03UwyZ7ZiE2U=
=00KW
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20231116' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fix from Paul Moore:
"One small audit patch to convert a WARN_ON_ONCE() into a normal
conditional to avoid scary looking console warnings when eBPF code
generates audit records from unexpected places"
* tag 'audit-pr-20231116' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()
Current release - regressions:
- core: fix undefined behavior in netdev name allocation
- bpf: do not allocate percpu memory at init stage
- netfilter: nf_tables: split async and sync catchall in two functions
- mptcp: fix possible NULL pointer dereference on close
Current release - new code bugs:
- eth: ice: dpll: fix initial lock status of dpll
Previous releases - regressions:
- bpf: fix precision backtracking instruction iteration
- af_unix: fix use-after-free in unix_stream_read_actor()
- tipc: fix kernel-infoleak due to uninitialized TLV value
- eth: bonding: stop the device in bond_setup_by_slave()
- eth: mlx5:
- fix double free of encap_header
- avoid referencing skb after free-ing in drop path
- eth: hns3: fix VF reset
- eth: mvneta: fix calls to page_pool_get_stats
Previous releases - always broken:
- core: set SOCK_RCU_FREE before inserting socket into hashtable
- bpf: fix control-flow graph checking in privileged mode
- eth: ppp: limit MRU to 64K
- eth: stmmac: avoid rx queue overrun
- eth: icssg-prueth: fix error cleanup on failing initialization
- eth: hns3: fix out-of-bounds access may occur when coalesce info is
read via debugfs
- eth: cortina: handle large frames
Misc:
- selftests: gso: support CONFIG_MAX_SKB_FRAGS up to 45
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmVV9akSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkICMP/1+QHUaD4JG1mW9oYc2zINPfQl3dqQt3
2CGSE2yrtbQvyQl39BDa0WFzV5X6So6/U50twhTNM+UAJsCaOvxCUDvUP9eY9Dcm
z2H4oITZimyP4CEb3l7JpL2PImvfImL7D/fCPPMUZVzNY6dkEFznaQrnawbJz4gg
mZXDnjwIXq7OchoJy3dHzyOn4ZQj2Df5VcfBzkVMdMcwV55Sd5JezbhwJ6NOmnKA
uoXlq4pFYj3ahAhEQfLWUwXmF3e6esHs/WUCMe5FR9YkanJlu4oHUmY3RLzfcdQA
PPIPDRxOzthcXyymqvqs7gnZ3ruMUll4B7tGTVFpJch8ts+DwGdUyBIIoDd/1BUT
gmjipP5HPia3Qdtk3Jc4vMkcf5AwoGo0hXku7YYJ1K7+4+t8ep3/hDbQc0PLWX6J
afiQgqpnNXHSTqBO5zl91vSwhGr/AAtAkDlPnsQL/RDAxY4teIwxHuoMvwPWaHZJ
sMo5ZcHXvNnBbGhpozFtmrnbf1nduUrQmW5LkJViCLf25Sj6pDYbo8WnhMuOKSnZ
7an2YqniCgBtrX4MEVn2jsWgavI+SxndVIQR04u0uwqmP+dn8s9LUfjKKDtPWHsK
+zMFtk+Op03TW5ur9w3+dgrGH0cLogPO3BJkho7xXKBfZ6/tN/pOef3/nV9xY6g8
JjnBUdpZRTWI
=VjWw
-----END PGP SIGNATURE-----
Merge tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from BPF and netfilter.
Current release - regressions:
- core: fix undefined behavior in netdev name allocation
- bpf: do not allocate percpu memory at init stage
- netfilter: nf_tables: split async and sync catchall in two
functions
- mptcp: fix possible NULL pointer dereference on close
Current release - new code bugs:
- eth: ice: dpll: fix initial lock status of dpll
Previous releases - regressions:
- bpf: fix precision backtracking instruction iteration
- af_unix: fix use-after-free in unix_stream_read_actor()
- tipc: fix kernel-infoleak due to uninitialized TLV value
- eth: bonding: stop the device in bond_setup_by_slave()
- eth: mlx5:
- fix double free of encap_header
- avoid referencing skb after free-ing in drop path
- eth: hns3: fix VF reset
- eth: mvneta: fix calls to page_pool_get_stats
Previous releases - always broken:
- core: set SOCK_RCU_FREE before inserting socket into hashtable
- bpf: fix control-flow graph checking in privileged mode
- eth: ppp: limit MRU to 64K
- eth: stmmac: avoid rx queue overrun
- eth: icssg-prueth: fix error cleanup on failing initialization
- eth: hns3: fix out-of-bounds access may occur when coalesce info is
read via debugfs
- eth: cortina: handle large frames
Misc:
- selftests: gso: support CONFIG_MAX_SKB_FRAGS up to 45"
* tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (78 commits)
macvlan: Don't propagate promisc change to lower dev in passthru
net: sched: do not offload flows with a helper in act_ct
net/mlx5e: Check return value of snprintf writing to fw_version buffer for representors
net/mlx5e: Check return value of snprintf writing to fw_version buffer
net/mlx5e: Reduce the size of icosq_str
net/mlx5: Increase size of irq name buffer
net/mlx5e: Update doorbell for port timestamping CQ before the software counter
net/mlx5e: Track xmit submission to PTP WQ after populating metadata map
net/mlx5e: Avoid referencing skb after free-ing in drop path of mlx5e_sq_xmit_wqe
net/mlx5e: Don't modify the peer sent-to-vport rules for IPSec offload
net/mlx5e: Fix pedit endianness
net/mlx5e: fix double free of encap_header in update funcs
net/mlx5e: fix double free of encap_header
net/mlx5: Decouple PHC .adjtime and .adjphase implementations
net/mlx5: DR, Allow old devices to use multi destination FTE
net/mlx5: Free used cpus mask when an IRQ is released
Revert "net/mlx5: DR, Supporting inline WQE when possible"
bpf: Do not allocate percpu memory at init stage
net: Fix undefined behavior in netdev name allocation
dt-bindings: net: ethernet-controller: Fix formatting error
...
This change doesn't seem to have any effect on selftests and production
BPF object files, but we preemptively try to make it more robust.
First, "learn sign from signed bounds" comment is misleading, as we are
learning not just sign, but also values.
Second, we simplify the check for determining whether entire range is
positive or negative similarly to other checks added earlier, using
appropriate u32/u64 cast and single comparisons. As explain in comments
in __reg64_deduce_bounds(), the checks are equivalent.
Last but not least, smin/smax and s32_min/s32_max reassignment based on
min/max of both umin/umax and smin/smax (and 32-bit equivalents) is hard
to explain and justify. We are updating unsigned bounds from signed
bounds, why would we update signed bounds at the same time? This might
be correct, but it's far from obvious why and the code or comments don't
try to justify this. Given we've added a separate deduction of signed
bounds from unsigned bounds earlier, this seems at least redundant, if
not just wrong.
In short, we remove doubtful pieces, and streamline the rest to follow
the logic and approach of the rest of reg_bounds_sync() checks.
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Equivalent checks were recently added in more succinct and, arguably,
safer form in:
- f188765f23a5 ("bpf: derive smin32/smax32 from umin32/umax32 bounds");
- 2e74aef782d3 ("bpf: derive smin/smax from umin/max bounds").
The checks we are removing in this patch set do similar checks to detect
if entire u32/u64 range has signed bit set or not set, but does it with
two separate checks.
Further, we forcefully overwrite either smin or smax (and 32-bit equvalents)
without applying normal min/max intersection logic. It's not clear why
that would be correct in all cases and seems to work by accident. This
logic is also "gated" by previous signed -> unsigned derivation, which
returns early.
All this is quite confusing and seems error-prone, while we already have
at least equivalent checks happening earlier. So remove this duplicate
and error-prone logic to simplify things a bit.
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add simple sanity checks that validate well-formed ranges (min <= max)
across u64, s64, u32, and s32 ranges. Also for cases when the value is
constant (either 64-bit or 32-bit), we validate that ranges and tnums
are in agreement.
These bounds checks are performed at the end of BPF_ALU/BPF_ALU64
operations, on conditional jumps, and for LDX instructions (where subreg
zero/sign extension is probably the most important to check). This
covers most of the interesting cases.
Also, we validate the sanity of the return register when manually
adjusting it for some special helpers.
By default, sanity violation will trigger a warning in verifier log and
resetting register bounds to "unbounded" ones. But to aid development
and debugging, BPF_F_TEST_SANITY_STRICT flag is added, which will
trigger hard failure of verification with -EFAULT on register bounds
violations. This allows selftests to catch such issues. veristat will
also gain a CLI option to enable this behavior.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Use 32-bit subranges to prune some 64-bit BPF_JEQ/BPF_JNE conditions
that otherwise would be "inconclusive" (i.e., is_branch_taken() would
return -1). This can happen, for example, when registers are initialized
as 64-bit u64/s64, then compared for inequality as 32-bit subregisters,
and then followed by 64-bit equality/inequality check. That 32-bit
inequality can establish some pattern for lower 32 bits of a register
(e.g., s< 0 condition determines whether the bit #31 is zero or not),
while overall 64-bit value could be anything (according to a value range
representation).
This is not a fancy quirky special case, but actually a handling that's
necessary to prevent correctness issue with BPF verifier's range
tracking: set_range_min_max() assumes that register ranges are
non-overlapping, and if that condition is not guaranteed by
is_branch_taken() we can end up with invalid ranges, where min > max.
[0] https://lore.kernel.org/bpf/CACkBjsY2q1_fUohD7hRmKGqv1MV=eP2f6XK8kjkYNw7BaiF8iQ@mail.gmail.com/
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Generalize is_branch_taken logic for SCALAR_VALUE register to handle
cases when both registers are not constants. Previously supported
<range> vs <scalar> cases are a natural subset of more generic <range>
vs <range> set of cases.
Generalized logic relies on straightforward segment intersection checks.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Generalize bounds adjustment logic of reg_set_min_max() to handle not
just register vs constant case, but in general any register vs any
register cases. For most of the operations it's trivial extension based
on range vs range comparison logic, we just need to properly pick
min/max of a range to compare against min/max of the other range.
For BPF_JSET we keep the original capabilities, just make sure JSET is
integrated in the common framework. This is manifested in the
internal-only BPF_JSET + BPF_X "opcode" to allow for simpler and more
uniform rev_opcode() handling. See the code for details. This allows to
reuse the same code exactly both for TRUE and FALSE branches without
explicitly handling both conditions with custom code.
Note also that now we don't need a special handling of BPF_JEQ/BPF_JNE
case none of the registers are constants. This is now just a normal
generic case handled by reg_set_min_max().
To make tnum handling cleaner, tnum_with_subreg() helper is added, as
that's a common operator when dealing with 32-bit subregister bounds.
This keeps the overall logic much less noisy when it comes to tnums.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Kirill Shutemov reported significant percpu memory consumption increase after
booting in 288-cpu VM ([1]) due to commit 41a5db8d81 ("bpf: Add support for
non-fix-size percpu mem allocation"). The percpu memory consumption is
increased from 111MB to 969MB. The number is from /proc/meminfo.
I tried to reproduce the issue with my local VM which at most supports upto
255 cpus. With 252 cpus, without the above commit, the percpu memory
consumption immediately after boot is 57MB while with the above commit the
percpu memory consumption is 231MB.
This is not good since so far percpu memory from bpf memory allocator is not
widely used yet. Let us change pre-allocation in init stage to on-demand
allocation when verifier detects there is a need of percpu memory for bpf
program. With this change, percpu memory consumption after boot can be reduced
signicantly.
[1] https://lore.kernel.org/lkml/20231109154934.4saimljtqx625l3v@box.shutemov.name/
Fixes: 41a5db8d81 ("bpf: Add support for non-fix-size percpu mem allocation")
Reported-and-tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231111013928.948838-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Trying to avoid that didn't bring much value after testing, add comment
about this.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231114193840.4041-3-frederic@kernel.org
Add missing explanation concerning IRQs re-enablement constraints in
the cpuidle path against timers.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231114193840.4041-2-frederic@kernel.org
Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).
RT Throttling has been introduced a while ago as a (mostly debug)
countermeasure one can utilize to reserve some CPU time for low priority
tasks (usually background type of work, e.g. workqueues, timers, etc.).
It however has its own problems (see documentation) and the undesired
effect of unconditionally throttling FIFO tasks even when no lower
priority activity needs to run (there are mechanisms to fix this issue
as well, but, again, with their own problems).
Introduce deadline servers to service low priority tasks needs under
starvation conditions. Deadline servers are built extending SCHED_DEADLINE
implementation to allow 2-level scheduling (a sched_deadline entity
becomes a container for lower priority scheduling entities).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/4968601859d920335cf85822eb573a5f179f04b8.1699095159.git.bristot@kernel.org
In preparation of introducing !task sched_dl_entity; move the
bandwidth accounting into {en.de}queue_dl_entity().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/a86dccbbe44e021b8771627e1dae01a69b73466d.1699095159.git.bristot@kernel.org
Create a single function that initializes a sched_dl_entity.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/51acc695eecf0a1a2f78f9a044e11ffd9b316bcf.1699095159.git.bristot@kernel.org
Now that trace_sched_stat_runtime() no longer takes a vruntime
argument, the task specific bits are identical between
update_curr_common() and update_curr().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tracing the runtime delta makes sense, observer can sum over time.
Tracing the absolute vruntime makes less sense, inconsistent:
absolute-vs-delta, but also vruntime delta can be computed from
runtime delta.
Removing the vruntime thing also makes the two tracepoint sites
identical, allowing to unify the code in a later patch.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.
Collapse all that.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/54d148a144f26d9559698c4dd82d8859038a7380.1699095159.git.bristot@kernel.org
Since the RB-tree is now sorted by deadline, let's first try the
leftmost entity which has the earliest virtual deadline. I've done
some benchmarks to see its effectiveness.
All the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, on a dual-CPU Intel Xeon(R)
Platinum 8260 with 2 NUMA nodes each of which has 24C/48T.
hackbench: process/thread + pipe/socket + 1/2/4/8 groups
netperf: TCP/UDP + STREAM/RR + 24/48/72/96/192 threads
tbench: loopback 24/48/72/96/192 threads
schbench: 1/2/4/8 mthreads
direct: cfs_rq has only one entity
parity: RUN_TO_PARITY
fast: O(1) fastpath
slow: heap search
(%) direct parity fast slow
hackbench 92.95 2.02 4.91 0.12
netperf 68.08 6.60 24.18 1.14
tbench 67.55 11.22 20.61 0.62
schbench 69.91 2.65 25.73 1.71
The above results indicate that this fastpath really makes task
selection more efficient.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231115033647.80785-4-wuyun.abel@bytedance.com
Sort the task timeline by virtual deadline and keep the min_vruntime
in the augmented tree, so we can avoid doubling the worst case cost
and make full use of the cached leftmost node to enable O(1) fastpath
picking in next patch.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231115033647.80785-3-wuyun.abel@bytedance.com
Since commit fc137c0dda ("sched/numa: enhance vma scanning logic")
NUMA Balancing allows updating PTEs to trap NUMA hinting faults if the
task had previously accessed VMA. However unconditional scan of VMAs are
allowed during initial phase of VMA creation until process's
mm numa_scan_seq reaches 2 even though current task had not accessed VMA.
Rationale:
- Without initial scan subsequent PTE update may never happen.
- Give fair opportunity to all the VMAs to be scanned and subsequently
understand the access pattern of all the VMAs.
But it has a corner case where, if a VMA is created after some time,
process's mm numa_scan_seq could be already greater than 2.
For e.g., values of mm numa_scan_seq when VMAs are created by running
mmtest autonuma benchmark briefly looks like:
start_seq=0 : 459
start_seq=2 : 138
start_seq=3 : 144
start_seq=4 : 8
start_seq=8 : 1
start_seq=9 : 1
This results in no unconditional PTE updates for those VMAs created after
some time.
Fix:
- Note down the initial value of mm numa_scan_seq in per VMA start_seq.
- Allow unconditional scan till start_seq + 2.
Result:
SUT: AMD EPYC Milan with 2 NUMA nodes 256 cpus.
base kernel: upstream 6.6-rc6 with Mels patches [1] applied.
kernbench
========== base patched %gain
Amean elsp-128 165.09 ( 0.00%) 164.78 * 0.19%*
Duration User 41404.28 41375.08
Duration System 9862.22 9768.48
Duration Elapsed 519.87 518.72
Ops NUMA PTE updates 1041416.00 831536.00
Ops NUMA hint faults 263296.00 220966.00
Ops NUMA pages migrated 258021.00 212769.00
Ops AutoNUMA cost 1328.67 1114.69
autonumabench
NUMA01_THREADLOCAL
==================
Amean elsp-NUMA01_THREADLOCAL 81.79 (0.00%) 67.74 * 17.18%*
Duration User 54832.73 47379.67
Duration System 75.00 185.75
Duration Elapsed 576.72 476.09
Ops NUMA PTE updates 394429.00 11121044.00
Ops NUMA hint faults 1001.00 8906404.00
Ops NUMA pages migrated 288.00 2998694.00
Ops AutoNUMA cost 7.77 44666.84
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/2ea7cbce80ac7c62e90cbfb9653a7972f902439f.1697816692.git.raghavendra.kt@amd.com
Audit of the refcounting turned up that perf_pmu_migrate_context()
fails to migrate the ctx refcount.
Fixes: bd27568117 ("perf: Rewrite core context handling")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20230612093539.085862001@infradead.org
Cc: <stable@vger.kernel.org>
Xi reported that commit 5694289ce1 ("futex: Flag conversion") broke
glibc's robust futex tests.
This was narrowed down to the change of FLAGS_SHARED from 0x01 to
0x10, at which point Florian noted that handle_futex_death() has a
hardcoded flags argument of 1.
Change this to: FLAGS_SIZE_32 | FLAGS_SHARED, matching how
futex_to_flags() unconditionally sets FLAGS_SIZE_32 for all legacy
futex ops.
Reported-by: Xi Ruoyao <xry111@xry111.site>
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20231114201402.GA25315@noisy.programming.kicks-ass.net
Fixes: 5694289ce1 ("futex: Flag conversion")
Cc: <stable@vger.kernel.org>
eBPF can end up calling into the audit code from some odd places, and
some of these places don't have @current set properly so we end up
tripping the `WARN_ON_ONCE(!current->mm)` near the top of
`audit_exe_compare()`. While the basic `!current->mm` check is good,
the `WARN_ON_ONCE()` results in some scary console messages so let's
drop that and just do the regular `!current->mm` check to avoid
problems.
Cc: <stable@vger.kernel.org>
Fixes: 47846d5134 ("audit: don't take task_lock() in audit_exe_compare() code path")
Reported-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
should_we_balance is called for the decision to do load-balancing.
When sched ticks invoke this function, only one CPU should return
true. However, in the current code, two CPUs can return true. The
following situation, where b means busy and i means idle, is an
example, because CPU 0 and CPU 2 return true.
[0, 1] [2, 3]
b b i b
This fix checks if there exists an idle CPU with busy sibling(s)
after looking for a CPU on an idle core. If some idle CPUs with busy
siblings are found, just the first one should do load-balancing.
Fixes: b1bfeab9b0 ("sched/fair: Consider the idle state of the whole core for load balance")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231031133821.1570861-1-keisuke.nishimura@inria.fr
519fabc7aa ("psi: remove 500ms min window size limitation for
triggers") breaks unprivileged psi polling on cgroups.
Historically, we had a privilege check for polling in the open() of a
pressure file in /proc, but were erroneously missing it for the open()
of cgroup pressure files.
When unprivileged polling was introduced in d82caa2735 ("sched/psi:
Allow unprivileged polling of N*2s period"), it needed to filter
privileges depending on the exact polling parameters, and as such
moved the CAP_SYS_RESOURCE check from the proc open() callback to
psi_trigger_create(). Both the proc files as well as cgroup files go
through this during write(). This implicitly added the missing check
for privileges required for HT polling for cgroups.
When 519fabc7aa ("psi: remove 500ms min window size limitation for
triggers") followed right after to remove further restrictions on the
RT polling window, it incorrectly assumed the cgroup privilege check
was still missing and added it to the cgroup open(), mirroring what we
used to do for proc files in the past.
As a result, unprivileged poll requests that would be supported now
get rejected when opening the cgroup pressure file for writing.
Remove the cgroup open() check. psi_trigger_create() handles it.
Fixes: 519fabc7aa ("psi: remove 500ms min window size limitation for triggers")
Reported-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Luca Boccassi <bluca@debian.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Cc: stable@vger.kernel.org # 6.5+
Link: https://lore.kernel.org/r/20231026164114.2488682-1-hannes@cmpxchg.org
vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
it gets re-weighted, and the calculations can be simplified based
on the fact that re-weight won't change the w-average of all the
entities. Please check the proofs in comments.
But adjusting vruntime can also cause position change in RB-tree
hence require re-queue to fix up which might be costly. This might
be avoided by deferring adjustment to the time the entity actually
leaves tree (dequeue/pick), but that will negatively affect task
selection and probably not good enough either.
Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com
A new kfunc is added to acquire cgroup1 of a task:
- bpf_task_get_cgroup1
Acquires the associated cgroup of a task whithin a specific cgroup1
hierarchy. The cgroup1 hierarchy is identified by its hierarchy ID.
This new kfunc enables the tracing of tasks within a designated
container or cgroup directory in BPF programs.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231111090034.4248-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Create a system call to report the list of Linux Security Modules
that are active on the system. The list is provided as an array
of LSM ID numbers.
The calling application can use this list determine what LSM
specific actions it might take. That might include choosing an
output format, determining required privilege or bypassing
security module specific behavior.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Reviewed-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Create a system call lsm_get_self_attr() to provide the security
module maintained attributes of the current process.
Create a system call lsm_set_self_attr() to set a security
module maintained attribute of the current process.
Historically these attributes have been exposed to user space via
entries in procfs under /proc/self/attr.
The attribute value is provided in a lsm_ctx structure. The structure
identifies the size of the attribute, and the attribute value. The format
of the attribute value is defined by the security module. A flags field
is included for LSM specific information. It is currently unused and must
be 0. The total size of the data, including the lsm_ctx structure and any
padding, is maintained as well.
struct lsm_ctx {
__u64 id;
__u64 flags;
__u64 len;
__u64 ctx_len;
__u8 ctx[];
};
Two new LSM hooks are used to interface with the LSMs.
security_getselfattr() collects the lsm_ctx values from the
LSMs that support the hook, accounting for space requirements.
security_setselfattr() identifies which LSM the attribute is
intended for and passes it along.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
When auditd_set sets the auditd_conn pointer, audit messages can
immediately be put on the socket by other kernel threads. If the backlog
is large or the rate is high, this can immediately fill the socket
buffer. If the audit daemon requested an ACK for this operation, a full
socket buffer causes the ACK to get dropped, also setting ENOBUFS on the
socket.
To avoid this race and ensure ACKs get through, fast-track the ACK in
this specific case to ensure it is sent before auditd_conn is set.
Signed-off-by: Chris Riches <chris.riches@nutanix.com>
[PM: fix some tab vs space damage]
Signed-off-by: Paul Moore <paul@paul-moore.com>
When cgroup_rstat_updated() isn't being called concurrently with
cgroup_rstat_flush_locked(), its run time is pretty short. When
both are called concurrently, the cgroup_rstat_updated() run time
can spike to a pretty high value due to high cpu_lock hold time in
cgroup_rstat_flush_locked(). This can be problematic if the task calling
cgroup_rstat_updated() is a realtime task running on an isolated CPU
with a strict latency requirement. The cgroup_rstat_updated() call can
happen when there is a page fault even though the task is running in
user space most of the time.
The percpu cpu_lock is used to protect the update tree -
updated_next and updated_children. This protection is only needed when
cgroup_rstat_cpu_pop_updated() is being called. The subsequent flushing
operation which can take a much longer time does not need that protection
as it is already protected by cgroup_rstat_lock.
To reduce the cpu_lock hold time, we need to perform all the
cgroup_rstat_cpu_pop_updated() calls up front with the lock
released afterward before doing any flushing. This patch adds a new
cgroup_rstat_updated_list() function to return a singly linked list of
cgroups to be flushed.
Some instrumentation code are added to measure the cpu_lock hold time
right after lock acquisition to after releasing the lock. Parallel
kernel build on a 2-socket x86-64 server is used as the benchmarking
tool for measuring the lock hold time.
The maximum cpu_lock hold time before and after the patch are 100us and
29us respectively. So the worst case time is reduced to about 30% of
the original. However, there may be some OS or hardware noises like NMI
or SMI in the test system that can worsen the worst case value. Those
noises are usually tuned out in a real production environment to get
a better result.
OTOH, the lock hold time frequency distribution should give a better
idea of the performance benefit of the patch. Below were the frequency
distribution before and after the patch:
Hold time Before patch After patch
--------- ------------ -----------
0-01 us 804,139 13,738,708
01-05 us 9,772,767 1,177,194
05-10 us 4,595,028 4,984
10-15 us 303,481 3,562
15-20 us 78,971 1,314
20-25 us 24,583 18
25-30 us 6,908 12
30-40 us 8,015
40-50 us 2,192
50-60 us 316
60-70 us 43
70-80 us 7
80-90 us 2
>90 us 3
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
To make CPUs in isolated cpuset partition closer in isolation to
the boot time isolated CPUs specified in the "isolcpus" boot command
line option, we need to take those CPUs out of the workqueue unbound
cpumask so that work functions from the unbound workqueues won't run
on those CPUs. Otherwise, they will interfere the user tasks running
on those isolated CPUs.
With the introduction of the workqueue_unbound_exclude_cpumask() helper
function in an earlier commit, those isolated CPUs can now be taken
out from the workqueue unbound cpumask.
This patch also updates cgroup-v2.rst to mention that isolated
CPUs will be excluded from unbound workqueue cpumask as well as
updating test_cpuset_prs.sh to verify the correctness of the new
*cpuset.cpus.isolated file, if available via cgroup_debug option.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a new internal isolated_cpus mask to keep track of the CPUs that are in
isolated partitions. Expose that new cpumask as a new root-only control file
".cpuset.cpus.isolated".
tj: Updated patch description to reflect dropping __DEBUG__ prefix.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When the "isolcpus" boot command line option is used to add a set
of isolated CPUs, those CPUs will be excluded automatically from
wq_unbound_cpumask to avoid running work functions from unbound
workqueues.
Recently cpuset has been extended to allow the creation of partitions
of isolated CPUs dynamically. To make it closer to the "isolcpus"
in functionality, the CPUs in those isolated cpuset partitions should be
excluded from wq_unbound_cpumask as well. This can be done currently by
explicitly writing to the workqueue's cpumask sysfs file after creating
the isolated partitions. However, this process can be error prone.
Ideally, the cpuset code should be allowed to request the workqueue code
to exclude those isolated CPUs from wq_unbound_cpumask so that this
operation can be done automatically and the isolated CPUs will be returned
back to wq_unbound_cpumask after the destructions of the isolated
cpuset partitions.
This patch adds a new workqueue_unbound_exclude_cpumask() function to
enable that. This new function will exclude the specified isolated
CPUs from wq_unbound_cpumask. To be able to restore those isolated
CPUs back after the destruction of isolated cpuset partitions, a new
wq_requested_unbound_cpumask is added to store the user provided unbound
cpumask either from the boot command line options or from writing to
the cpumask sysfs file. This new cpumask provides the basis for CPU
exclusion.
To enable users to understand how the wq_unbound_cpumask is being
modified internally, this patch also exposes the newly introduced
wq_requested_unbound_cpumask as well as a wq_isolated_cpumask to
store the cpumask to be excluded from wq_unbound_cpumask as read-only
sysfs files.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2b8272ff4a ("cpu/hotplug: Prevent self deadlock on CPU hot-unplug")
solved the straight forward CPU hotplug deadlock vs. the scheduler
bandwidth timer. Yu discovered a more involved variant where a task which
has a bandwidth timer started on the outgoing CPU holds a lock and then
gets throttled. If the lock required by one of the CPU hotplug callbacks
the hotplug operation deadlocks because the unthrottling timer event is not
handled on the dying CPU and can only be recovered once the control CPU
reaches the hotplug state which pulls the pending hrtimers from the dead
CPU.
Solve this by pushing the hrtimers away from the dying CPU in the dying
callbacks. Nothing can queue a hrtimer on the dying CPU at that point because
all other CPUs spin in stop_machine() with interrupts disabled and once the
operation is finished the CPU is marked offline.
Reported-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Liu Tie <liutie4@huawei.com>
Link: https://lore.kernel.org/r/87a5rphara.ffs@tglx
- Documentation update: Add a note about argument and return value
fetching is the best effort because it depends on the type.
- objpool: Fix to make internal global variables static in
test_objpool.c.
- kprobes: Unify kprobes_exceptions_nofify() prototypes. There are
the same prototypes in asm/kprobes.h for some architectures, but
some of them are missing the prototype and it causes a warning.
So move the prototype into linux/kprobes.h.
- tracing: Fix to check the tracepoint event and return event at
parsing stage. The tracepoint event doesn't support %return
but if $retval exists, it will be converted to %return silently.
This finds that case and rejects it.
- tracing: Fix the order of the descriptions about the parameters
of __kprobe_event_gen_cmd_start() to be consistent with the
argument list of the function.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmVOwAQbHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bItMH/0F/vyiirgLrRVvQ+5Tr
Hm32oc1BQzxnQ0+9bjzk3r90KYk5cysBEEqxKzgxq9/RsJdyCczQUpxYehU0BoZT
1B4pB5eQ0DwcdGAVk4TyBRYVBb3uhCyyZNXv+F60AsO8i87fHHoJXT9SoKK+Vgx4
MAklE1gnxFFlRoYCBQpks89NajRx6n3aEL4/oXO3WYSrv+H2WGtZamB+RhpufkDx
Qx5TkIGnjulcW6J5m7Px5N3z9AX00SbfooZHAae3fqsek5RPNecfc1/WiANNXrSm
SYsG/i1jcHVvmk2YmCVokVLPKzhCOsKIuiW91rBu/Tu6lqiJmC+fxWxuZqAdXFUi
+kw=
=uymB
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- Documentation update: Add a note about argument and return value
fetching is the best effort because it depends on the type.
- objpool: Fix to make internal global variables static in
test_objpool.c.
- kprobes: Unify kprobes_exceptions_nofify() prototypes. There are the
same prototypes in asm/kprobes.h for some architectures, but some of
them are missing the prototype and it causes a warning. So move the
prototype into linux/kprobes.h.
- tracing: Fix to check the tracepoint event and return event at
parsing stage. The tracepoint event doesn't support %return but if
$retval exists, it will be converted to %return silently. This finds
that case and rejects it.
- tracing: Fix the order of the descriptions about the parameters of
__kprobe_event_gen_cmd_start() to be consistent with the argument
list of the function.
* tag 'probes-fixes-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/kprobes: Fix the order of argument descriptions
tracing: fprobe-event: Fix to check tracepoint event and return
kprobes: unify kprobes_exceptions_nofify() prototypes
lib: test_objpool: make global variables static
Documentation: tracing: Add a note about argument and retval access
The order of descriptions should be consistent with the argument list of
the function, so "kretprobe" should be the second one.
int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe,
const char *name, const char *loc, ...)
Link: https://lore.kernel.org/all/20231031041305.3363712-1-yujie.liu@intel.com/
Fixes: 2a588dd1d5 ("tracing: Add kprobe event command generation functions")
Suggested-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
- don't leave pages decrypted for DMA in encrypted memory setups linger
around on failure (Petr Tesarik)
- fix an out of bounds access in the new dynamic swiotlb code (Petr Tesarik)
- fix dma_addressing_limited for systems with weird physical memory layouts
(Jia He)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmVN1HQLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMXsg//YYUP27ZfjqOeyRAv5IZ56u5Gci8d32vHEZjvEngI
5wAErzIoGzHXtZIk5nCU9Lrc4+g608gXqqefkU7e0lAMHVSpExHF0ZxktRBG0/bz
OQNLrlT9HpnOJgAKLg4a2rSpomfbtMBd1MNek1ZI8Osz49AagqANOOlfpr13lvw6
kWzZEnoRKJqUW3x8g5u/WnggZzoBYHeMJp9EORutnhxU09DlpJ6pVg5wP7ysKQfT
FUoX4YUoe52pYgluTwNlJkh/Mxe3/oZOPbCIMB0eclVxylLDVEZcqlh9A91BTaQK
rOQv51UGl2eS1DvIDUqgoy3VlB0PQ9FADdGVP0BQfnn9yS1vfo4A7hQS99jLejC1
SnAsASeWVj5Ot/peWMUh5UDoHhJWtlEY6Lfv5Qr1a8Gan21+3CrBLhd67eUvun40
koafsbUzWgmY9qadNNjjebY761WXa2TgLb0LzYo42Asur8Qw1FC8/OHV6QMET/t7
jB+NqQWydIAr6dEzVbqm5ZQ2/r3hXuzJcOKjKhgjhuTzHAGXkeiAkkkuGhPQr5Nq
vqua2m55xwCK8Zucie/tnj4ujRY1hnUgxcs0sm0koDVNcpYm3h1MmoTqzaISJVPh
4edyTESz95MlgiMzion8+Gq/dGVeYzyO0XKWnyMVQ7pCJnJfoWa5Pqhgsg+dXiU8
Txo=
=6pQM
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.7-2023-11-10' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- don't leave pages decrypted for DMA in encrypted memory setups linger
around on failure (Petr Tesarik)
- fix an out of bounds access in the new dynamic swiotlb code (Petr
Tesarik)
- fix dma_addressing_limited for systems with weird physical memory
layouts (Jia He)
* tag 'dma-mapping-6.7-2023-11-10' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: fix out-of-bounds TLB allocations with CONFIG_SWIOTLB_DYNAMIC
dma-mapping: fix dma_addressing_limited() if dma_range_map can't cover all system RAM
dma-mapping: move dma_addressing_limited() out of line
swiotlb: do not free decrypted pages if dynamic
Currently get_perf_callchain only supports user stack walking for
the current task. Passing the correct *crosstask* param will return
0 frames if the task passed to __bpf_get_stack isn't the current
one instead of a single incorrect frame/address. This change
passes the correct *crosstask* param but also does a preemptive
check in __bpf_get_stack if the task is current and returns
-EOPNOTSUPP if it is not.
This issue was found using bpf_get_task_stack inside a BPF
iterator ("iter/task"), which iterates over all tasks.
bpf_get_task_stack works fine for fetching kernel stacks
but because get_perf_callchain relies on the caller to know
if the requested *task* is the current one (via *crosstask*)
it was failing in a confusing way.
It might be possible to get user stacks for all tasks utilizing
something like access_process_vm but that requires the bpf
program calling bpf_get_task_stack to be sleepable and would
therefore be a breaking change.
Fixes: fa28dcb82a ("bpf: Introduce helper bpf_get_task_stack()")
Signed-off-by: Jordan Rome <jordalgo@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231108112334.3433136-1-jordalgo@meta.com
Fix to check the tracepoint event is not valid with $retval.
The commit 08c9306fc2 ("tracing/fprobe-event: Assume fprobe is
a return event by $retval") introduced automatic return probe
conversion with $retval. But since tracepoint event does not
support return probe, $retval is not acceptable.
Without this fix, ftracetest, tprobe_syntax_errors.tc fails;
[22] Tracepoint probe event parser error log check [FAIL]
----
# tail 22-tprobe_syntax_errors.tc-log.mRKroL
+ ftrace_errlog_check trace_fprobe t kfree ^$retval dynamic_events
+ printf %s t kfree
+ wc -c
+ pos=8
+ printf %s t kfree ^$retval
+ tr -d ^
+ command=t kfree $retval
+ echo Test command: t kfree $retval
Test command: t kfree $retval
+ echo
----
So 't kfree $retval' should fail (tracepoint doesn't support
return probe) but passed it.
Link: https://lore.kernel.org/all/169944555933.45057.12831706585287704173.stgit@devnote2/
Fixes: 08c9306fc2 ("tracing/fprobe-event: Assume fprobe is a return event by $retval")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When BPF program is verified in privileged mode, BPF verifier allows
bounded loops. This means that from CFG point of view there are
definitely some back-edges. Original commit adjusted check_cfg() logic
to not detect back-edges in control flow graph if they are resulting
from conditional jumps, which the idea that subsequent full BPF
verification process will determine whether such loops are bounded or
not, and either accept or reject the BPF program. At least that's my
reading of the intent.
Unfortunately, the implementation of this idea doesn't work correctly in
all possible situations. Conditional jump might not result in immediate
back-edge, but just a few unconditional instructions later we can arrive
at back-edge. In such situations check_cfg() would reject BPF program
even in privileged mode, despite it might be bounded loop. Next patch
adds one simple program demonstrating such scenario.
To keep things simple, instead of trying to detect back edges in
privileged mode, just assume every back edge is valid and let subsequent
BPF verification prove or reject bounded loops.
Note a few test changes. For unknown reason, we have a few tests that
are specified to detect a back-edge in a privileged mode, but looking at
their code it seems like the right outcome is passing check_cfg() and
letting subsequent verification to make a decision about bounded or not
bounded looping.
Bounded recursion case is also interesting. The example should pass, as
recursion is limited to just a few levels and so we never reach maximum
number of nested frames and never exhaust maximum stack depth. But the
way that max stack depth logic works today it falsely detects this as
exceeding max nested frame count. This patch series doesn't attempt to
fix this orthogonal problem, so we just adjust expected verifier failure.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 2589726d12 ("bpf: introduce bounded loops")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231110061412.2995786-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fix an edge case in __mark_chain_precision() which prematurely stops
backtracking instructions in a state if it happens that state's first
and last instruction indexes are the same. This situations doesn't
necessarily mean that there were no instructions simulated in a state,
but rather that we starting from the instruction, jumped around a bit,
and then ended up at the same instruction before checkpointing or
marking precision.
To distinguish between these two possible situations, we need to consult
jump history. If it's empty or contain a single record "bridging" parent
state and first instruction of processed state, then we indeed
backtracked all instructions in this state. But if history is not empty,
we are definitely not done yet.
Move this logic inside get_prev_insn_idx() to contain it more nicely.
Use -ENOENT return code to denote "we are out of instructions"
situation.
This bug was exposed by verifier_loop1.c's bounded_recursion subtest, once
the next fix in this patch set is applied.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231110002638.4168352-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
ldimm64 instructions are 16-byte long, and so have to be handled
appropriately in check_cfg(), just like the rest of BPF verifier does.
This has implications in three places:
- when determining next instruction for non-jump instructions;
- when determining next instruction for callback address ldimm64
instructions (in visit_func_call_insn());
- when checking for unreachable instructions, where second half of
ldimm64 is expected to be unreachable;
We take this also as an opportunity to report jump into the middle of
ldimm64. And adjust few test_verifier tests accordingly.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Hao Sun <sunhao.th@gmail.com>
Fixes: 475fb78fbf ("bpf: verifier (add branch/goto checks)")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231110002638.4168352-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch enables the following pattern:
/* mapval contains a __kptr pointing to refcounted local kptr */
mapval = bpf_map_lookup_elem(&map, &idx);
if (!mapval || !mapval->some_kptr) { /* omitted */ }
p = bpf_refcount_acquire(&mapval->some_kptr);
Currently this doesn't work because bpf_refcount_acquire expects an
owning or non-owning ref. The verifier defines non-owning ref as a type:
PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF
while mapval->some_kptr is PTR_TO_BTF_ID | PTR_UNTRUSTED. It's possible
to do the refcount_acquire by first bpf_kptr_xchg'ing mapval->some_kptr
into a temp kptr, refcount_acquiring that, and xchg'ing back into
mapval, but this is unwieldy and shouldn't be necessary.
This patch modifies btf_ld_kptr_type such that user-allocated types are
marked MEM_ALLOC and if those types have a bpf_{rb,list}_node they're
marked NON_OWN_REF as well. Additionally, due to changes to
bpf_obj_drop_impl earlier in this series, rcu_protected_object now
returns true for all user-allocated types, resulting in
mapval->some_kptr being marked MEM_RCU.
After this patch's changes, mapval->some_kptr is now:
PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU
which results in it passing the non-owning ref test, and the motivating
example passing verification.
Future work will likely get rid of special non-owning ref lifetime logic
in the verifier, at which point we'll be able to delete the NON_OWN_REF
flag entirely.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20231107085639.3016113-6-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This refactoring patch removes the unused BPF_GRAPH_NODE_OR_ROOT
btf_field_type and moves BPF_GRAPH_{NODE,ROOT} macros into the
btf_field_type enum. Further patches in the series will use
BPF_GRAPH_NODE, so let's move this useful definition out of btf.c.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20231107085639.3016113-5-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The use of bpf_mem_free_rcu to free refcounted local kptrs was added
in commit 7e26cd12ad ("bpf: Use bpf_mem_free_rcu when
bpf_obj_dropping refcounted nodes"). In the cover letter for the
series containing that patch [0] I commented:
Perhaps it makes sense to move to mem_free_rcu for _all_
non-owning refs in the future, not just refcounted. This might
allow custom non-owning ref lifetime + invalidation logic to be
entirely subsumed by MEM_RCU handling. IMO this needs a bit more
thought and should be tackled outside of a fix series, so it's not
attempted here.
It's time to start moving in the "non-owning refs have MEM_RCU
lifetime" direction. As mentioned in that comment, using
bpf_mem_free_rcu for all local kptrs - not just refcounted - is
necessarily the first step towards that goal. This patch does so.
After this patch the memory pointed to by all local kptrs will not be
reused until RCU grace period elapses. The verifier's understanding of
non-owning ref validity and the clobbering logic it uses to enforce
that understanding are not changed here, that'll happen gradually in
future work, including further patches in the series.
[0]: https://lore.kernel.org/all/20230821193311.3290257-1-davemarchevsky@fb.com/
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20231107085639.3016113-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Refcounted local kptrs are kptrs to user-defined types with a
bpf_refcount field. Recent commits ([0], [1]) modified the lifetime of
refcounted local kptrs such that the underlying memory is not reused
until RCU grace period has elapsed.
Separately, verification of bpf_refcount_acquire calls currently
succeeds for MAYBE_NULL non-owning reference input, which is a problem
as bpf_refcount_acquire_impl has no handling for this case.
This patch takes advantage of aforementioned lifetime changes to tag
bpf_refcount_acquire_impl kfunc KF_RCU, thereby preventing MAYBE_NULL
input to the kfunc. The KF_RCU flag applies to all kfunc params; it's
fine for it to apply to the void *meta__ign param as that's populated by
the verifier and is tagged __ign regardless.
[0]: commit 7e26cd12ad ("bpf: Use bpf_mem_free_rcu when
bpf_obj_dropping refcounted nodes") is the actual change to
allocation behaivor
[1]: commit 0816b8c6bf ("bpf: Consider non-owning refs to refcounted
nodes RCU protected") modified verifier understanding of
refcounted local kptrs to match [0]'s changes
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Fixes: 7c50b1cb76 ("bpf: Add bpf_refcount_acquire kfunc")
Link: https://lore.kernel.org/r/20231107085639.3016113-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The addition of is_reg_const() in commit 171de12646d2 ("bpf: generalize
is_branch_taken to handle all conditional jumps in one place") has made the
register_is_const() redundant. Give the former has more feature, plus the
fact the latter is only used in one place, replace register_is_const() with
is_reg_const(), and remove the definition of register_is_const.
This requires moving the definition of is_reg_const() further up. And since
the comment of reg_const_value() reference is_reg_const(), move it up as
well.
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231108140043.12282-1-shung-hsi.yu@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Similar to ARG_PTR_TO_CONST_STR for BPF helpers, KF_ARG_PTR_TO_CONST_STR
specifies kfunc args that point to const strings. Annotation "__str" is
used to specify kfunc arg of type KF_ARG_PTR_TO_CONST_STR. Also, add
documentation for the "__str" annotation.
bpf_get_file_xattr() will be the first kfunc that uses this type.
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/bpf/20231107045725.2278852-4-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
ARG_PTR_TO_CONST_STR is used to specify constant string args for BPF
helpers. The logic that verifies a reg is ARG_PTR_TO_CONST_STR is
implemented in check_func_arg().
As we introduce kfuncs with constant string args, it is necessary to
do the same check for kfuncs (in check_kfunc_args). Factor out the logic
for ARG_PTR_TO_CONST_STR to a new check_reg_const_str() so that it can be
reused.
check_func_arg() ensures check_reg_const_str() is only called with reg of
type PTR_TO_MAP_VALUE. Add a redundent type check in check_reg_const_str()
to avoid misuse in the future. Other than this redundent check, there is
no change in behavior.
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/bpf/20231107045725.2278852-3-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Different types of bpf dynptr have different internal data storage.
Specifically, SKB and XDP type of dynptr may have non-continuous data.
Therefore, it is not always safe to directly access dynptr->data.
Add __bpf_dynptr_data and __bpf_dynptr_data_rw to replace direct access to
dynptr->data.
Update bpf_verify_pkcs7_signature to use __bpf_dynptr_data instead of
dynptr->data.
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/bpf/20231107045725.2278852-2-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When looking up an element in LPM trie, the condition 'matchlen ==
trie->max_prefixlen' will never return true, if key->prefixlen is larger
than trie->max_prefixlen. Consequently all elements in the LPM trie will
be visited and no element is returned in the end.
To resolve this, check key->prefixlen first before walking the LPM trie.
Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Signed-off-by: Florian Lehner <dev@der-flo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231105085801.3742-1-dev@der-flo.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Change reg_set_min_max() to take FALSE/TRUE sets of two registers each,
instead of assuming that we are always comparing to a constant. For now
we still assume that right-hand side registers are constants (and make
sure that's the case by swapping src/dst regs, if necessary), but
subsequent patches will remove this limitation.
reg_set_min_max() is now called unconditionally for any register
comparison, so that might include pointer vs pointer. This makes it
consistent with is_branch_taken() generality. But we currently only
support adjustments based on SCALAR vs SCALAR comparisons, so
reg_set_min_max() has to guard itself againts pointers.
Taking two by two registers allows to further unify and simplify
check_cond_jmp_op() logic. We utilize fake register for BPF_K
conditional jump case, just like with is_branch_taken() part.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-18-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Similarly to is_branch_taken()-related refactorings, start preparing
reg_set_min_max() to handle more generic case of two non-const
registers. Start with renaming arguments to accommodate later addition
of second register as an input argument.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-17-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Combine 32-bit and 64-bit is_branch_taken logic for SCALAR_VALUE
registers. It makes it easier to see parallels between two domains
(32-bit and 64-bit), and makes subsequent refactoring more
straightforward.
No functional changes.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-16-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Make is_branch_taken() a single entry point for branch pruning decision
making, handling both pointer vs pointer, pointer vs scalar, and scalar
vs scalar cases in one place. This also nicely cleans up check_cond_jmp_op().
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-15-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Move is_branch_taken() slightly down. In subsequent patched we'll need
both flip_opcode() and is_pkt_ptr_branch_taken() for is_branch_taken(),
but instead of sprinkling forward declarations around, it makes more
sense to move is_branch_taken() lower below is_pkt_ptr_branch_taken(),
and also keep it closer to very tightly related reg_set_min_max(), as
they are two critical parts of the same SCALAR range tracking logic.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-14-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While still assuming that second register is a constant, generalize
is_branch_taken-related code to accept two registers instead of register
plus explicit constant value. This also, as a side effect, allows to
simplify check_cond_jmp_op() by unifying BPF_K case with BPF_X case, for
which we use a fake register to represent BPF_K's imm constant as
a register.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231102033759.2541186-13-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Just taking mundane refactoring bits out into a separate patch. No
functional changes.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231102033759.2541186-12-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When performing 32-bit conditional operation operating on lower 32 bits
of a full 64-bit register, register full value isn't changed. We just
potentially gain new knowledge about that register's lower 32 bits.
Unfortunately, __reg_combine_{32,64}_into_{64,32} logic that
reg_set_min_max() performs as a last step, can lose information in some
cases due to __mark_reg64_unbounded() and __reg_assign_32_into_64().
That's bad and completely unnecessary. Especially __reg_assign_32_into_64()
looks completely out of place here, because we are not performing
zero-extending subregister assignment during conditional jump.
So this patch replaced __reg_combine_* with just a normal
reg_bounds_sync() which will do a proper job of deriving u64/s64 bounds
from u32/s32, and vice versa (among all other combinations).
__reg_combine_64_into_32() is also used in one more place,
coerce_reg_to_size(), while handling 1- and 2-byte register loads.
Looking into this, it seems like besides marking subregister as
unbounded before performing reg_bounds_sync(), we were also performing
deduction of smin32/smax32 and umin32/umax32 bounds from respective
smin/smax and umin/umax bounds. It's now redundant as reg_bounds_sync()
performs all the same logic more generically (e.g., without unnecessary
assumption that upper 32 bits of full register should be zero).
Long story short, we remove __reg_combine_64_into_32() completely, and
coerce_reg_to_size() now only does resetting subreg to unbounded and then
performing reg_bounds_sync() to recover as much information as possible
from 64-bit umin/umax and smin/smax bounds, set explicitly in
coerce_reg_to_size() earlier.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231102033759.2541186-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There are cases (caught by subsequent reg_bounds tests in selftests/bpf)
where performing one round of __reg_deduce_bounds() doesn't propagate
all the information from, say, s32 to u32 bounds and than from newly
learned u32 bounds back to u64 and s64. So perform __reg_deduce_bounds()
twice to make sure such derivations are propagated fully after
reg_bounds_sync().
One such example is test `(s64)[0xffffffff00000001; 0] (u64)<
0xffffffff00000000` from selftest patch from this patch set. It demonstrates an
intricate dance of u64 -> s64 -> u64 -> u32 bounds adjustments, which requires
two rounds of __reg_deduce_bounds(). Here are corresponding refinement log from
selftest, showing evolution of knowledge.
REFINING (FALSE R1) (u64)SRC=[0xffffffff00000000; U64_MAX] (u64)DST_OLD=[0; U64_MAX] (u64)DST_NEW=[0xffffffff00000000; U64_MAX]
REFINING (FALSE R1) (u64)SRC=[0xffffffff00000000; U64_MAX] (s64)DST_OLD=[0xffffffff00000001; 0] (s64)DST_NEW=[0xffffffff00000001; -1]
REFINING (FALSE R1) (s64)SRC=[0xffffffff00000001; -1] (u64)DST_OLD=[0xffffffff00000000; U64_MAX] (u64)DST_NEW=[0xffffffff00000001; U64_MAX]
REFINING (FALSE R1) (u64)SRC=[0xffffffff00000001; U64_MAX] (u32)DST_OLD=[0; U32_MAX] (u32)DST_NEW=[1; U32_MAX]
R1 initially has smin/smax set to [0xffffffff00000001; -1], while umin/umax is
unknown. After (u64)< comparison, in FALSE branch we gain knowledge that
umin/umax is [0xffffffff00000000; U64_MAX]. That causes smin/smax to learn that
zero can't happen and upper bound is -1. Then smin/smax is adjusted from
umin/umax improving lower bound from 0xffffffff00000000 to 0xffffffff00000001.
And then eventually umin32/umax32 bounds are drived from umin/umax and become
[1; U32_MAX].
Selftest in the last patch is actually implementing a multi-round fixed-point
convergence logic, but so far all the tests are handled by two rounds of
reg_bounds_sync() on the verifier state, so we keep it simple for now.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a few interesting cases in which we can tighten 64-bit bounds based
on newly learnt information about 32-bit bounds. E.g., when full u64/s64
registers are used in BPF program, and then eventually compared as
u32/s32. The latter comparison doesn't change the value of full
register, but it does impose new restrictions on possible lower 32 bits
of such full registers. And we can use that to derive additional full
register bounds information.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231102033759.2541186-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a special case where we can derive valid s32 bounds from umin/umax
or smin/smax by stitching together negative s32 subrange and
non-negative s32 subrange. That requires upper 32 bits to form a [N, N+1]
range in u32 domain (taking into account wrap around, so 0xffffffff
to 0x00000000 is a valid [N, N+1] range in this sense). See code comment
for concrete examples.
Eduard Zingerman also provided an alternative explanation ([0]) for more
mathematically inclined readers:
Suppose:
. there are numbers a, b, c
. 2**31 <= b < 2**32
. 0 <= c < 2**31
. umin = 2**32 * a + b
. umax = 2**32 * (a + 1) + c
The number of values in the range represented by [umin; umax] is:
. N = umax - umin + 1 = 2**32 + c - b + 1
. min(N) = 2**32 + 0 - (2**32-1) + 1 = 2, with b = 2**32-1, c = 0
. max(N) = 2**32 + (2**31 - 1) - 2**31 + 1 = 2**32, with b = 2**31, c = 2**31-1
Hence [(s32)b; (s32)c] forms a valid range.
[0] https://lore.kernel.org/bpf/d7af631802f0cfae20df77fe70068702d24bbd31.camel@gmail.com/
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Comments in code try to explain the idea behind why this is correct.
Please check the code and comments.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
All the logic that applies to u64 vs s64, equally applies for u32 vs s32
relationships (just taken in a smaller 32-bit numeric space). So do the
same deduction of smin32/smax32 from umin32/umax32, if we can.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add smin/smax derivation from appropriate umin/umax values. Previously the
logic was surprisingly asymmetric, trying to derive umin/umax from smin/smax
(if possible), but not trying to do the same in the other direction. A simple
addition to __reg64_deduce_bounds() fixes this.
Added also generic comment about u64/s64 ranges and their relationship.
Hopefully that helps readers to understand all the bounds deductions
a bit better.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231102033759.2541186-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Current release - regressions:
- sched: fix SKB_NOT_DROPPED_YET splat under debug config
Current release - new code bugs:
- tcp: fix usec timestamps with TCP fastopen
- tcp_sigpool: fix some off by one bugs
- tcp: fix possible out-of-bounds reads in tcp_hash_fail()
- tcp: fix SYN option room calculation for TCP-AO
- bpf: fix compilation error without CGROUPS
- ptp:
- ptp_read() should not release queue
- fix tsevqs corruption
Previous releases - regressions:
- llc: verify mac len before reading mac header
Previous releases - always broken:
- bpf:
- fix check_stack_write_fixed_off() to correctly spill imm
- fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END
- check map->usercnt after timer->timer is assigned
- dsa: lan9303: consequently nested-lock physical MDIO
- dccp/tcp: call security_inet_conn_request() after setting IP addr
- tg3: fix the TX ring stall due to incorrect full ring handling
- phylink: initialize carrier state at creation
- ice: fix direction of VF rules in switchdev mode
Misc:
- fill in a bunch of missing MODULE_DESCRIPTION()s, more to come
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmVNRnAACgkQMUZtbf5S
IrsYaA/+IUoYi96/oLtvvrET6HIbXeMaLKef0UlytEicQKy8h5EWlhcTZPhQEY0g
dtaKOemQsO0dQTma4eQBiBDHeCeSkitgD9p7fh0i+//QFYWSFqHrBiF2mlToc/ZQ
T1p4BlVL7D2Xsr1Lki93zk+EhFGEy2KroYgrWbZc9TWE5ap9PtSVF9eqeHAVCmZ7
ocre/eo4pqUM9rAHIAyhoL+0xtVQ59dBevbJC0qYcmflhafr82Gtdveo6pBBKuYm
GhwbRrAXER3Neav9c6NHqat4zsMwGpC27SiN9dYWm6dlkeS9U9t2PUu71OkJGVfw
VaSE+utkC/WmzGbuiUIjqQLBrRe372ItHCr78BfSRMshS+RBTHtoK7njeH8Iv67E
RsMeCyVNj9dtGlOQG5JAv8IoCQ1WbMw9B36Yzw3ip/MmDX/ntXz7Dcr4ZMZ6VURS
CHhHFZPnmMykMXkT6SIlxeAg2r8ELtESzkvLimdTVFPAlk3cPkibKJbh3F/tEqXS
PDb3y0uoEgRQBAsWXXx9FQEvv9rTL6YrzbMhmJBIIEoNxppQYQ7FZBJX9utAVp5B
1GdyqhR6IRTaKb9cMRj/K1xPwm2KgCw9xj9pjKdAA7QUMslXbFp8blv1rIkFGshg
hiNXmPcI8wo0j+0lZYktEcIERL5y6c8BgK2NnPU6RULua96tuQ4=
=k6Wk
-----END PGP SIGNATURE-----
Merge tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter and bpf.
Current release - regressions:
- sched: fix SKB_NOT_DROPPED_YET splat under debug config
Current release - new code bugs:
- tcp:
- fix usec timestamps with TCP fastopen
- fix possible out-of-bounds reads in tcp_hash_fail()
- fix SYN option room calculation for TCP-AO
- tcp_sigpool: fix some off by one bugs
- bpf: fix compilation error without CGROUPS
- ptp:
- ptp_read() should not release queue
- fix tsevqs corruption
Previous releases - regressions:
- llc: verify mac len before reading mac header
Previous releases - always broken:
- bpf:
- fix check_stack_write_fixed_off() to correctly spill imm
- fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END
- check map->usercnt after timer->timer is assigned
- dsa: lan9303: consequently nested-lock physical MDIO
- dccp/tcp: call security_inet_conn_request() after setting IP addr
- tg3: fix the TX ring stall due to incorrect full ring handling
- phylink: initialize carrier state at creation
- ice: fix direction of VF rules in switchdev mode
Misc:
- fill in a bunch of missing MODULE_DESCRIPTION()s, more to come"
* tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
net: ti: icss-iep: fix setting counter value
ptp: fix corrupted list in ptp_open
ptp: ptp_read should not release queue
net_sched: sch_fq: better validate TCA_FQ_WEIGHTS and TCA_FQ_PRIOMAP
net: kcm: fill in MODULE_DESCRIPTION()
net/sched: act_ct: Always fill offloading tuple iifidx
netfilter: nat: fix ipv6 nat redirect with mapped and scoped addresses
netfilter: xt_recent: fix (increase) ipv6 literal buffer length
ipvs: add missing module descriptions
netfilter: nf_tables: remove catchall element in GC sync path
netfilter: add missing module descriptions
drivers/net/ppp: use standard array-copy-function
net: enetc: shorten enetc_setup_xdp_prog() error message to fit NETLINK_MAX_FMTMSG_LEN
virtio/vsock: Fix uninit-value in virtio_transport_recv_pkt()
r8169: respect userspace disabling IFF_MULTICAST
selftests/bpf: get trusted cgrp from bpf_iter__cgroup directly
bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg
net: phylink: initialize carrier state at creation
test/vsock: add dobule bind connect test
test/vsock: refactor vsock_accept
...
A new helper is added for cgroup1 hierarchy:
- task_get_cgroup1
Acquires the associated cgroup of a task within a specific cgroup1
hierarchy. The cgroup1 hierarchy is identified by its hierarchy ID.
This helper function is added to facilitate the tracing of tasks within
a particular container or cgroup dir in BPF programs. It's important to
note that this helper is designed specifically for cgroup1 only.
tj: Use irsqsave/restore as suggested by Hou Tao <houtao@huaweicloud.com>.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Hou Tao <houtao@huaweicloud.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When I initially examined the function current_cgns_cgroup_from_root(), I
was perplexed by its lack of holding cgroup_mutex. However, after Michal
explained the reason[0] to me, I realized that it already holds the
namespace_sem. I believe this intricacy could also confuse others, so it
would be advisable to include an annotation for clarification.
After we replace the cgroup_mutex with RCU read lock, if current doesn't
hold the namespace_sem, the root cgroup will be NULL. So let's add a
WARN_ON_ONCE() for it.
[0]. https://lore.kernel.org/bpf/afdnpo3jz2ic2ampud7swd6so5carkilts2mkygcaw67vbw6yh@5b5mncf7qyet
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Michal Koutny <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The cgroup root_list is already RCU-safe. Therefore, we can replace the
cgroup_mutex with the RCU read lock in some particular paths. This change
will be particularly beneficial for frequent operations, such as
`cat /proc/self/cgroup`, in a cgroup1-based container environment.
I did stress tests with this change, as outlined below
(with CONFIG_PROVE_RCU_LIST enabled):
- Continuously mounting and unmounting named cgroups in some tasks,
for example:
cgrp_name=$1
while true
do
mount -t cgroup -o none,name=$cgrp_name none /$cgrp_name
umount /$cgrp_name
done
- Continuously triggering proc_cgroup_show() in some tasks concurrently,
for example:
while true; do cat /proc/self/cgroup > /dev/null; done
They can ran successfully after implementing this change, with no RCU
warnings in dmesg.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
At present, when we perform operations on the cgroup root_list, we must
hold the cgroup_mutex, which is a relatively heavyweight lock. In reality,
we can make operations on this list RCU-safe, eliminating the need to hold
the cgroup_mutex during traversal. Modifications to the list only occur in
the cgroup root setup and destroy paths, which should be infrequent in a
production environment. In contrast, traversal may occur frequently.
Therefore, making it RCU-safe would be beneficial.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The root hasn't been removed from the root_list, so the list can't be NULL.
However, if it had been removed, attempting to destroy it once more is not
possible. Let's replace this with WARN_ON_ONCE() for clarity.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
* Fix a lock inversion between scheduler and RCU introduced
in v6.2-rc4. The scenario could trigger on any user of RCU_NOCB
(mostly Android and also nohz_full).
* Fix PF_IDLE semantic changes introduced in v6.6-rc3 breaking some
RCU-Tasks and RCU-Tasks-Trace expectations as to what exactly is
an idle task. This resulted in potential spurious stalls and
warnings.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmVFB/QACgkQhSRUR1CO
jHctUg/+LjAHNvT8YR99qxP3v8Ylqf191ADhDla1Zx1LYkhcrApdRKQU5G1dzfiF
arJ/KSSgCLCYMecwdh6iOL5s3S3DtQGsh7Y7f/dJ6vwJbNe2NbPaMYK2WQ171zT+
ht7oAugFrOpWn5H6i1eC170N+rG7nHYuCXrzsta0fx1fOp9G0OeemvweD7HjpaB7
66tp0KnuJWLfPLoZucscxKRtC7ZG0zDztFUXcfcv+TQE6w5fXuhtLX7E7fE1jPMr
Pa3x9Su5fjFCRJCCCFNHzKqbf8VKG/teGOONFBwt8AwwU1/7TuIuwSOjrISOaG91
eEeIwWyiAw+aP+7CMMGca0oYPOR+b3Xq7vmN85KIEFUHpNls6MvR5/itR1d3/FsA
9EkiNLIwbRV9ICi+JeGeJsSmm6TRJL4XToiw9fJKfC05IHjotMSAtUOqHS/VmdjL
o1kDdiWBSHUy7VVelu5wRHb0WINc9Zrq/tc3MT2W0EpTDRSL2r04MsP8UDYre1oY
KDdTLSPsy2D+5YWY484Z+bCOHl0bXSgZe4PycGVrP9HEus6MGHyxU6uDCXwWbqQz
R0QdZ60AdquwWTdK1K0PsU5VKQw3oZKKwj8hyKOr31NpSZ1FvLJa8Ct8D2UZPbMS
WNscgGE4ZNjCYFq8wtnL5LkUym8is1Y6/nLifBV1scx5GtyHYiU=
=a5L7
-----END PGP SIGNATURE-----
Merge tag 'rcu-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull RCU fixes from Frederic Weisbecker:
- Fix a lock inversion between scheduler and RCU introduced in
v6.2-rc4. The scenario could trigger on any user of RCU_NOCB
(mostly Android but also nohz_full)
- Fix PF_IDLE semantic changes introduced in v6.6-rc3 breaking
some RCU-Tasks and RCU-Tasks-Trace expectations as to what
exactly is an idle task. This resulted in potential spurious
stalls and warnings.
* tag 'rcu-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
rcu/tasks-trace: Handle new PF_IDLE semantics
rcu/tasks: Handle new PF_IDLE semantics
rcu: Introduce rcu_cpu_online()
rcu: Break rcu_node_0 --> &rq->__lock order
Just two patches for you this time!
* During a panic, flush the console before entering kgdb. This makes
things a little easier to comprehend, especially if an NMI backtrace
was triggered on all CPUs just before we enter the panic routines.
* Correcting a couple of misleading (a.k.a. plain wrong) comments.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmVJIdkACgkQfOMlXTn3
iKEetQ//Xy1LKXtxJOCY+uPEvFOr5NRSjGez41ImknbnR+njNh5/eb6g9rX9gNXf
DpVHc0Yac61qP687kNCJw56rbizb9ETasJ/skdcEcGewvPL3ebvC64JjBAwDJOp/
V+jJljW6ccJnoxMUl/WdjCeK7hZo2+X6lzMmAjctj+NkTyVks3C/uqczuYKhsAhc
z4WwQ8fSCjopuau8aDyn9fc9zsem+rMV9RWlMuqvl66URtnbFQhA3FjO/W8H8oPA
SP7WX0HfKrtwctu0GQqbvKyba5eH1g69bndo0LxGVBDhFvK90sI3QPKID3UifPDq
lpTCnOAjkDZuzjI6vLh8u+px1Z8mmXC4zi7mKy0ceTZeX5+DOJRuKi5XsnoBM7R1
24TPUY1kv1hVFyxBDHQ+IV493fjOLNY/qljhYNn635QaLhe/kdKOb7nC2IXCl2sR
zhGmAzmEteQ2keuNuRMsKncnow3qLxxLmCfZgSZ9XS/8rM4H8YZ6LnQeIQ7VHeKi
gKqV3qc5eD//hkJ8LtL4BCdTUuctMslltvCfhh9GvEpSMBsDxk8BszDMyDX5SOtM
xzhBdI4K6iTdZhHfPD8BMx4iIC9HdS3v6Qx8f6b1ZC3gEUTWGikXkLlr9k3Ae1Qe
YPcwnWfg4jImS2f4w/AQ+ajIU3i3ob0F/VRTC0r1oY4cM1z8RGs=
=HBws
-----END PGP SIGNATURE-----
Merge tag 'kgdb-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb updates from Daniel Thompson:
"Just two patches for you this time!
- During a panic, flush the console before entering kgdb.
This makes things a little easier to comprehend, especially if an
NMI backtrace was triggered on all CPUs just before we enter the
panic routines
- Correcting a couple of misleading (a.k.a. plain wrong) comments"
* tag 'kgdb-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: Corrects comment for kdballocenv
kgdb: Flush console before entering kgdb on panic
Limit the free list length to the size of the IO TLB. Transient pool can be
smaller than IO_TLB_SEGSIZE, but the free list is initialized with the
assumption that the total number of slots is a multiple of IO_TLB_SEGSIZE.
As a result, swiotlb_area_find_slots() may allocate slots past the end of
a transient IO TLB buffer.
Reported-by: Niklas Schnelle <schnelle@linux.ibm.com>
Closes: https://lore.kernel.org/linux-iommu/104a8c8fedffd1ff8a2890983e2ec1c26bff6810.camel@linux.ibm.com/
Fixes: 79636caad3 ("swiotlb: if swiotlb is full, fall back to a transient memory pool")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
BTF_TYPE_SAFE_TRUSTED(struct bpf_iter__task) in verifier.c wanted to
teach BPF verifier that bpf_iter__task -> task is a trusted ptr. But it
doesn't work well.
The reason is, bpf_iter__task -> task would go through btf_ctx_access()
which enforces the reg_type of 'task' is ctx_arg_info->reg_type, and in
task_iter.c, we actually explicitly declare that the
ctx_arg_info->reg_type is PTR_TO_BTF_ID_OR_NULL.
Actually we have a previous case like this[1] where PTR_TRUSTED is added to
the arg flag for map_iter.
This patch sets ctx_arg_info->reg_type is PTR_TO_BTF_ID_OR_NULL |
PTR_TRUSTED in task_reg_info.
Similarly, bpf_cgroup_reg_info -> cgroup is also PTR_TRUSTED since we are
under the protection of cgroup_mutex and we would check cgroup_is_dead()
in __cgroup_iter_seq_show().
This patch is to improve the user experience of the newly introduced
bpf_iter_css_task kfunc before hitting the mainline. The Fixes tag is
pointing to the commit introduced the bpf_iter_css_task kfunc.
Link[1]:https://lore.kernel.org/all/20230706133932.45883-3-aspsk@isovalent.com/
Fixes: 9c66dc94b6 ("bpf: Introduce css_task open-coded iterator kfuncs")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231107132204.912120-2-zhouchuyi@bytedance.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This patch corrects the comment for the kdballocenv function.
The previous comment incorrectly described the function's
parameters and return values.
Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com>
Link: https://lore.kernel.org/r/DB3PR10MB6835B383B596133EDECEA98AE8ABA@DB3PR10MB6835.EURPRD10.PROD.OUTLOOK.COM
[daniel.thompson@linaro.org: fixed whitespace alignment in new lines]
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
There is an unusual case that the range map covers right up to the top
of system RAM, but leaves a hole somewhere lower down. Then it prevents
the nvme device dma mapping in the checking path of phys_to_dma() and
causes the hangs at boot.
E.g. On an Armv8 Ampere server, the dsdt ACPI table is:
Method (_DMA, 0, Serialized) // _DMA: Direct Memory Access
{
Name (RBUF, ResourceTemplate ()
{
QWordMemory (ResourceConsumer, PosDecode, MinFixed,
MaxFixed, Cacheable, ReadWrite,
0x0000000000000000, // Granularity
0x0000000000000000, // Range Minimum
0x00000000FFFFFFFF, // Range Maximum
0x0000000000000000, // Translation Offset
0x0000000100000000, // Length
,, , AddressRangeMemory, TypeStatic)
QWordMemory (ResourceConsumer, PosDecode, MinFixed,
MaxFixed, Cacheable, ReadWrite,
0x0000000000000000, // Granularity
0x0000006010200000, // Range Minimum
0x000000602FFFFFFF, // Range Maximum
0x0000000000000000, // Translation Offset
0x000000001FE00000, // Length
,, , AddressRangeMemory, TypeStatic)
QWordMemory (ResourceConsumer, PosDecode, MinFixed,
MaxFixed, Cacheable, ReadWrite,
0x0000000000000000, // Granularity
0x00000060F0000000, // Range Minimum
0x00000060FFFFFFFF, // Range Maximum
0x0000000000000000, // Translation Offset
0x0000000010000000, // Length
,, , AddressRangeMemory, TypeStatic)
QWordMemory (ResourceConsumer, PosDecode, MinFixed,
MaxFixed, Cacheable, ReadWrite,
0x0000000000000000, // Granularity
0x0000007000000000, // Range Minimum
0x000003FFFFFFFFFF, // Range Maximum
0x0000000000000000, // Translation Offset
0x0000039000000000, // Length
,, , AddressRangeMemory, TypeStatic)
})
But the System RAM ranges are:
cat /proc/iomem |grep -i ram
90000000-91ffffff : System RAM
92900000-fffbffff : System RAM
880000000-fffffffff : System RAM
8800000000-bff5990fff : System RAM
bff59d0000-bff5a4ffff : System RAM
bff8000000-bfffffffff : System RAM
So some RAM ranges are out of dma_range_map.
Fix it by checking whether each of the system RAM resources can be
properly encompassed within the dma_range_map.
Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This patch moves dma_addressing_limited() out of line, serving as a
preliminary step to prevent the introduction of a new publicly accessible
low-level helper when validating whether all system RAM is mapped within
the DMA mapping range.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Here is the big set of tty/serial driver changes for 6.7-rc1. Included
in here are:
- console/vgacon cleanups and removals from Arnd
- tty core and n_tty cleanups from Jiri
- lots of 8250 driver updates and cleanups
- sc16is7xx serial driver updates
- dt binding updates
- first set of port lock wrapers from Thomas for the printk fixes
coming in future releases
- other small serial and tty core cleanups and updates
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZUTbaw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yk9+gCeKdoRb8FDwGCO/GaoHwR4EzwQXhQAoKXZRmN5
LTtw9sbfGIiBdOTtgLPb
=6PJr
-----END PGP SIGNATURE-----
Merge tag 'tty-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty and serial updates from Greg KH:
"Here is the big set of tty/serial driver changes for 6.7-rc1. Included
in here are:
- console/vgacon cleanups and removals from Arnd
- tty core and n_tty cleanups from Jiri
- lots of 8250 driver updates and cleanups
- sc16is7xx serial driver updates
- dt binding updates
- first set of port lock wrapers from Thomas for the printk fixes
coming in future releases
- other small serial and tty core cleanups and updates
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (193 commits)
serdev: Replace custom code with device_match_acpi_handle()
serdev: Simplify devm_serdev_device_open() function
serdev: Make use of device_set_node()
tty: n_gsm: add copyright Siemens Mobility GmbH
tty: n_gsm: fix race condition in status line change on dead connections
serial: core: Fix runtime PM handling for pending tx
vgacon: fix mips/sibyte build regression
dt-bindings: serial: drop unsupported samsung bindings
tty: serial: samsung: drop earlycon support for unsupported platforms
tty: 8250: Add note for PX-835
tty: 8250: Fix IS-200 PCI ID comment
tty: 8250: Add Brainboxes Oxford Semiconductor-based quirks
tty: 8250: Add support for Intashield IX cards
tty: 8250: Add support for additional Brainboxes PX cards
tty: 8250: Fix up PX-803/PX-857
tty: 8250: Fix port count of PX-257
tty: 8250: Add support for Intashield IS-100
tty: 8250: Add support for Brainboxes UP cards
tty: 8250: Add support for additional Brainboxes UC cards
tty: 8250: Remove UC-257 and UC-431
...
Here is the set of driver core updates for 6.7-rc1. Nothing major in
here at all, just a small number of changes including:
- minor cleanups and updates from Andy Shevchenko
- __counted_by addition
- firmware_loader update for aborting loads cleaner
- other minor changes, details in the shortlog
- documentation update
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZUTe8A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynP0QCfT2jQx3OcL22MoqCvdTuZJKPiHSIAoMxrliJF
d4cUeICW17ywlTFzsKg8
=nTeu
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core updates for 6.7-rc1. Nothing major in
here at all, just a small number of changes including:
- minor cleanups and updates from Andy Shevchenko
- __counted_by addition
- firmware_loader update for aborting loads cleaner
- other minor changes, details in the shortlog
- documentation update
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits)
firmware_loader: Abort all upcoming firmware load request once reboot triggered
firmware_loader: Refactor kill_pending_fw_fallback_reqs()
Documentation: security-bugs.rst: linux-distros relaxed their rules
driver core: Release all resources during unbind before updating device links
driver core: class: remove boilerplate code
driver core: platform: Annotate struct irq_affinity_devres with __counted_by
resource: Constify resource crosscheck APIs
resource: Unify next_resource() and next_resource_skip_children()
resource: Reuse for_each_resource() macro
PCI: Implement custom llseek for sysfs resource entries
kernfs: sysfs: support custom llseek method for sysfs entries
debugfs: Fix __rcu type comparison warning
device property: Replace custom implementation of COUNT_ARGS()
drivers: base: test: Make property entry API test modular
driver core: Add missing parameter description to __fwnode_link_add()
device property: Clarify usage scope of some struct fwnode_handle members
devres: rename the first parameter of devm_add_action(_or_reset)
driver core: platform: Unify the firmware node type check
driver core: platform: Use temporary variable in platform_device_add()
driver core: platform: Refactor error path in a couple places
...
- Remove eventfs_file descriptor
This is the biggest change, and the second part of making eventfs
create its files dynamically.
In 6.6 the first part was added, and that maintained a one to one
mapping between eventfs meta descriptors and the directories and
file inodes and dentries that were dynamically created. The
directories were represented by a eventfs_inode and the files
were represented by a eventfs_file.
In v6.7 the eventfs_file is removed. As all events have the same
directory make up (sched_switch has an "enable", "id", "format",
etc files), the handing of what files are underneath each leaf
eventfs directory is moved back to the tracing subsystem via a
callback. When a event is added to the eventfs, it registers
an array of evenfs_entry's. These hold the names of the files and
the callbacks to call when the file is referenced. The callback gets
the name so that the same callback may be used by multiple files.
The callback then supplies the filesystem_operations structure needed
to create this file.
This has brought the memory footprint of creating multiple eventfs
instances down by 2 megs each!
- User events now has persistent events that are not associated
to a single processes. These are privileged events that hang around
even if no process is attached to them.
- Clean up of seq_buf.
There's talk about using seq_buf more to replace strscpy() and friends.
But this also requires some minor modifications of seq_buf to be
able to do this.
- Expand instance ring buffers individually
Currently if boot up creates an instance, and a trace event is
enabled on that instance, the ring buffer for that instance and the
top level ring buffer are expanded (1.4 MB per CPU). This wastes
memory as this happens when nothing is using the top level instance.
- Other minor clean ups and fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZUMrBBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6quzVAQCed/kPM7X9j2QZamJVDruMf2CmVxpu
/TOvKvSKV584GgEAxLntf5VKx1Q98bc68y3Zkg+OCi8jSgORos1ROmURhws=
=iIgb
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Remove eventfs_file descriptor
This is the biggest change, and the second part of making eventfs
create its files dynamically.
In 6.6 the first part was added, and that maintained a one to one
mapping between eventfs meta descriptors and the directories and file
inodes and dentries that were dynamically created. The directories
were represented by a eventfs_inode and the files were represented by
a eventfs_file.
In v6.7 the eventfs_file is removed. As all events have the same
directory make up (sched_switch has an "enable", "id", "format", etc
files), the handing of what files are underneath each leaf eventfs
directory is moved back to the tracing subsystem via a callback.
When an event is added to the eventfs, it registers an array of
evenfs_entry's. These hold the names of the files and the callbacks
to call when the file is referenced. The callback gets the name so
that the same callback may be used by multiple files. The callback
then supplies the filesystem_operations structure needed to create
this file.
This has brought the memory footprint of creating multiple eventfs
instances down by 2 megs each!
- User events now has persistent events that are not associated to a
single processes. These are privileged events that hang around even
if no process is attached to them
- Clean up of seq_buf
There's talk about using seq_buf more to replace strscpy() and
friends. But this also requires some minor modifications of seq_buf
to be able to do this
- Expand instance ring buffers individually
Currently if boot up creates an instance, and a trace event is
enabled on that instance, the ring buffer for that instance and the
top level ring buffer are expanded (1.4 MB per CPU). This wastes
memory as this happens when nothing is using the top level instance
- Other minor clean ups and fixes
* tag 'trace-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (34 commits)
seq_buf: Export seq_buf_puts()
seq_buf: Export seq_buf_putc()
eventfs: Use simple_recursive_removal() to clean up dentries
eventfs: Remove special processing of dput() of events directory
eventfs: Delete eventfs_inode when the last dentry is freed
eventfs: Hold eventfs_mutex when calling callback functions
eventfs: Save ownership and mode
eventfs: Test for ei->is_freed when accessing ei->dentry
eventfs: Have a free_ei() that just frees the eventfs_inode
eventfs: Remove "is_freed" union with rcu head
eventfs: Fix kerneldoc of eventfs_remove_rec()
tracing: Have the user copy of synthetic event address use correct context
eventfs: Remove extra dget() in eventfs_create_events_dir()
tracing: Have trace_event_file have ref counters
seq_buf: Introduce DECLARE_SEQ_BUF and seq_buf_str()
eventfs: Fix typo in eventfs_inode union comment
eventfs: Fix WARN_ON() in create_file_dentry()
powerpc: Remove initialisation of readpos
tracing/histograms: Simplify last_cmd_set()
seq_buf: fix a misleading comment
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmVDk4QACgkQUqAMR0iA
lPLKtg/9FXMuIRtyiZqHARIjRprxSWyiwVu+ZsDfp8JznP1+WnoWp8E+xf824dbW
xNnQS1qKecvH+1Nw9jMhXNAlViI5re7ft01yDraA0POjP3P8ar+4CXiB3e8CXTLZ
VUwMegBlztkcskV0L5EpRRyOa728UF63V3FN41SuR81GaAGOVyZNE4RmlwlsS9xg
uCj7XEMgVuZpVjnM6Cx/YzUfuWGwFp0eWn31Vc7Dp5Bab/yDvSRdD8H1foo23/3k
nLh6wDV9UbYbAsgHwLWck18kmn0xsnzK8G08bzFwP8FASHjVuEaxSgJqFBuZxY1j
O4XFR1zedVOB2Q08fzwi/rVB7jibw9mZZNWceOnrDM8PUx7m/m/YVOO0aNVmBh+f
uztptyoMw8o93dXwR05Dc5JGyYh4k4eMN9eS0MbALNziFwvc80ln4g1goNYpzjyb
vfytDoacwuRYD3BkEL5t5BsAK4ULqpTyOZiv0sXUC+cERH1HkUGm5KCPbQawnvaK
fEkGJTHMtE+I+Cui83jYCdxJVfE1LFcM9Yl7ZDfrVisRQ8/KM9+L68XceRe4E30U
1YiSJIopeYi5+ABysfE4RPumHWsG6d8FQPtXZyS+/K0cl5j49t36r3IOy0QUh6bn
G9iZSvO4oxUHz1sXja+X+EKzN1r8fRf/j+ovTrBUpCcZZRqKk9s=
=Cmex
-----END PGP SIGNATURE-----
Merge tag 'printk-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- Another preparation step for introducing printk kthreads. The main
piece is a per-console lock with several features:
- Support three priorities: normal, emergency, and panic. They will
be defined by a context where the lock is taken. A context with a
higher priority is allowed to take over the lock from a context
with a lower one.
The plan is to use the emergency context for Oops and WARN()
messages, and also by watchdogs.
The panic() context will be used on panic CPU.
- The owner might enter/exit regions where it is not safe to take
over the lock. It allows the take over the lock a safe way in the
middle of a message.
For example, serial drivers emit characters one by one. And the
serial port is in a safe state in between.
Only the final console_flush_in_panic() will be allowed to take
over the lock even in the unsafe state (last chance, pray, and
hope).
- A higher priority context might busy wait with a timeout. The
current owner is informed about the waiter and releases the lock
on exit from the unsafe state.
- The new lock is safe even in atomic contexts, including NMI.
Another change is a safe manipulation of per-console sequence number
counter under the new lock.
- simple_strntoull() micro-optimization
- Reduce pr_flush() pooling time.
- Calm down false warning about possible buffer invalid access to
console buffers when CONFIG_PRINTK is disabled.
[ .. and Thomas Gleixner wants to point out that while several of the
commits are attributed to him, he only authored the early versions of
said commits, and that John Ogness and Petr Mladek have been the ones
who sorted out the details and really should be those who get the
credit - Linus ]
* tag 'printk-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
vsprintf: uninline simple_strntoull(), reorder arguments
printk: printk: Remove unnecessary statements'len = 0;'
printk: Reduce pr_flush() pooling time
printk: fix illegal pbufs access for !CONFIG_PRINTK
printk: nbcon: Allow drivers to mark unsafe regions and check state
printk: nbcon: Add emit function and callback function for atomic printing
printk: nbcon: Add sequence handling
printk: nbcon: Add ownership state functions
printk: nbcon: Add buffer management
printk: Make static printk buffers available to nbcon
printk: nbcon: Add acquire/release logic
printk: Add non-BKL (nbcon) console basic infrastructure
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmVDgs8ACgkQUqAMR0iA
lPK6aA//a+NCWNep46zyKv7Iq6Ek/VDdGyui042R+UQ0KI1sSOeC2lojtVfRx5OB
pGQruPqX/nflziJ5lduBy01uabQ/djj+Fgr5YEHwchck+QDxaxidWJsgUhEtG+jX
PbqKsJJ6teBOT/rtPHLHzBQcHN09Go3BjpmTdKLQhSP5ql7Da8IocKk4oqmjLcuj
vkc7D1nHPvDKMkTzsO2cJv4Y9PNi4WzeV16myC8jSWOOb3Ja2Ouaeb/DVCrydl2L
4ovT+RMz5m1dcjLnDTDwHk5mKgp/c3zCts5igJQRvam53OKL9S2DMptMQLJLfCSL
1wjryOjEpou5sf9vxbi1c87QmvhK+TiLCPU7I4FlPBbrA65atggIwCVFOzfF9stf
r0MFF4L6jwoutcHpCONEJOYMm5rsAjhB+Ylqfm5324wUL86UbNZi9hKI9d6nBHV2
3si/9lhQhajOS2hKzrR01QWAN4wpacp6ua69WMkirrTl1/dvNOzluo5bFncz6vmB
msYfUKb4+yGdmug0MVfjBtw8RbhcYfraaX7KSuQWf9y5VQfy0q7is1gIi9lea30e
VlyahO2vUY8accn6g9HgZuhncdYSeZWv3i1OeJSxFp0zkO9JtLNqKLF8/lawchlJ
+prKKFS6yBw9EQ3IF+F5SbpWwQoyXb0nTpzPfsTfPushiiJF7So=
=injt
-----END PGP SIGNATURE-----
Merge tag 'livepatching-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching update from Petr Mladek:
- Add missing newline character to avoid waiting for a continuous
message
* tag 'livepatching-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
livepatch: Fix missing newline character in klp_resolve_symbols()
Fix these two error paths:
1. When set_memory_decrypted() fails, pages may be left fully or partially
decrypted.
2. Decrypted pages may be freed if swiotlb_alloc_tlb() determines that the
physical address is too high.
To fix the first issue, call set_memory_encrypted() on the allocated region
after a failed decryption attempt. If that also fails, leak the pages.
To fix the second issue, check that the TLB physical address is below the
requested limit before decrypting.
Let the caller differentiate between unsuitable physical address (=> retry
from a lower zone) and allocation failures (=> no point in retrying).
Cc: stable@vger.kernel.org
Fixes: 79636caad3 ("swiotlb: if swiotlb is full, fall back to a transient memory pool")
Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
there's little I can say which isn't in the individual changelogs.
The lengthier patch series are
- "kdump: use generic functions to simplify crashkernel reservation in
arch", from Baoquan He. This is mainly cleanups and consolidation of
the "crashkernel=" kernel parameter handling.
- After much discussion, David Laight's "minmax: Relax type checks in
min() and max()" is here. Hopefully reduces some typecasting and the
use of min_t() and max_t().
- A group of patches from Oleg Nesterov which clean up and slightly fix
our handling of reads from /proc/PID/task/... and which remove
task_struct.therad_group.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZUQP9wAKCRDdBJ7gKXxA
jmOAAQDh8sxagQYocoVsSm28ICqXFeaY9Co1jzBIDdNesAvYVwD/c2DHRqJHEiS4
63BNcG3+hM9nwGJHb5lyh5m79nBMRg0=
=On4u
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"As usual, lots of singleton and doubleton patches all over the tree
and there's little I can say which isn't in the individual changelogs.
The lengthier patch series are
- 'kdump: use generic functions to simplify crashkernel reservation
in arch', from Baoquan He. This is mainly cleanups and
consolidation of the 'crashkernel=' kernel parameter handling
- After much discussion, David Laight's 'minmax: Relax type checks in
min() and max()' is here. Hopefully reduces some typecasting and
the use of min_t() and max_t()
- A group of patches from Oleg Nesterov which clean up and slightly
fix our handling of reads from /proc/PID/task/... and which remove
task_struct.thread_group"
* tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits)
scripts/gdb/vmalloc: disable on no-MMU
scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n
.mailmap: add address mapping for Tomeu Vizoso
mailmap: update email address for Claudiu Beznea
tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions
.mailmap: map Benjamin Poirier's address
scripts/gdb: add lx_current support for riscv
ocfs2: fix a spelling typo in comment
proc: test ProtectionKey in proc-empty-vm test
proc: fix proc-empty-vm test with vsyscall
fs/proc/base.c: remove unneeded semicolon
do_io_accounting: use sig->stats_lock
do_io_accounting: use __for_each_thread()
ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error()
ocfs2: fix a typo in a comment
scripts/show_delta: add __main__ judgement before main code
treewide: mark stuff as __ro_after_init
fs: ocfs2: check status values
proc: test /proc/${pid}/statm
compiler.h: move __is_constexpr() to compiler.h
...
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series "Fixes and cleanups to compaction".
- Joel Fernandes has a patchset ("Optimize mremap during mutual
alignment within PMD") which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested.
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series "Do not try to access unaccepted memory" Adrian Hunter
provides some fixups for the recently-added "unaccepted memory' feature.
To increase the feature's checking coverage. "Plug a few gaps where
RAM is exposed without checking if it is unaccepted memory".
- In the series "cleanups for lockless slab shrink" Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code.
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series "use refcount+RCU method to implement
lockless slab shrink".
- David Hildenbrand contributes some maintenance work for the rmap code
in the series "Anon rmap cleanups".
- Kefeng Wang does more folio conversions and some maintenance work in
the migration code. Series "mm: migrate: more folio conversion and
unification".
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series "Add and use bdev_getblk()".
- In the series "Use nth_page() in place of direct struct page
manipulation" Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames.
- In the series "mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO" has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of gigantic
pages are in use.
- Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
rationalization and folio conversions in the hugetlb code.
- Yin Fengwei has improved mlock()'s handling of large folios in the
series "support large folio for mlock"
- In the series "Expose swapcache stat for memcg v1" Liu Shixin has
added statistics for memcg v1 users which are available (and useful)
under memcg v2.
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named "MDWE
without inheritance".
- Kefeng Wang has provided the series "mm: convert numa balancing
functions to use a folio" which does what it says.
- In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
makes is possible for a process to propagate KSM treatment across
exec().
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use "high
bandwidth memory" in addition to Optane Data Center Persistent Memory
Modules (DCPMM). The series is named "memory tiering: calculate
abstract distance based on ACPI HMAT"
- In the series "Smart scanning mode for KSM" Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans.
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
series "mm: memcg: fix tracking of pending stats updates values".
- In the series "Implement IOCTL to get and optionally clear info about
PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
us to atomically read-then-clear page softdirty state. This is mainly
used by CRIU.
- Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
- a bunch of relatively minor maintenance tweaks to this code.
- Matthew Wilcox has increased the use of the VMA lock over file-backed
page faults in the series "Handle more faults under the VMA lock". Some
rationalizations of the fault path became possible as a result.
- In the series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
and folio conversions.
- In the series "various improvements to the GUP interface" Lorenzo
Stoakes has simplified and improved the GUP interface with an eye to
providing groundwork for future improvements.
- Andrey Konovalov has sent along the series "kasan: assorted fixes and
improvements" which does those things.
- Some page allocator maintenance work from Kemeng Shi in the series
"Two minor cleanups to break_down_buddy_pages".
- In thes series "New selftest for mm" Breno Leitao has developed
another MM self test which tickles a race we had between madvise() and
page faults.
- In the series "Add folio_end_read" Matthew Wilcox provides cleanups
and an optimization to the core pagecache code.
- Nhat Pham has added memcg accounting for hugetlb memory in the series
"hugetlb memcg accounting".
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series "Abstract vma_merge() and split_vma()".
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series "Fix page_owner's use of free timestamps".
- Lorenzo Stoakes has fixed the handling of new mappings of sealed files
in the series "permit write-sealed memfd read-only shared mappings".
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series "Batch hugetlb vmemmap modification operations".
- Some buffer_head folio conversions and cleanups from Matthew Wilcox in
the series "Finish the create_empty_buffers() transition".
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the series
"mm: PCP high auto-tuning".
- Roman Gushchin has contributed the patchset "mm: improve performance
of accounted kernel memory allocations" which improves their performance
by ~30% as measured by a micro-benchmark.
- folio conversions from Kefeng Wang in the series "mm: convert page
cpupid functions to folios".
- Some kmemleak fixups in Liu Shixin's series "Some bugfix about
kmemleak".
- Qi Zheng has improved our handling of memoryless nodes by keeping them
off the allocation fallback list. This is done in the series "handle
memoryless nodes more appropriately".
- khugepaged conversions from Vishal Moola in the series "Some
khugepaged folio conversions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
FgeUPAD1oasg6CP+INZvCj34waNxwAc=
=E+Y4
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series 'Fixes and cleanups to compaction'
- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i
the following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature. To increase the feature's checking coverage. 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'
- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'
- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'
- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'
- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames
- In the series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use
- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code
- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'
- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'
- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says
- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes is possible for a process to propagate KSM treatment
across exec()
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'
- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'
- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU
- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code
- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result
- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions
- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements
- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things
- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'
- In thes series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults
- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code
- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series 'Fix page_owner's use of free timestamps'
- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'
- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'
- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark
- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'
- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'
- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'
- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"
[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in
https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/
with help from Qi Zheng.
The clone3 test filtering conflict was half-arsed by yours truly ]
* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...
API:
- Add virtual-address based lskcipher interface.
- Optimise ahash/shash performance in light of costly indirect calls.
- Remove ahash alignmask attribute.
Algorithms:
- Improve AES/XTS performance of 6-way unrolling for ppc.
- Remove some uses of obsolete algorithms (md4, md5, sha1).
- Add FIPS 202 SHA-3 support in pkcs1pad.
- Add fast path for single-page messages in adiantum.
- Remove zlib-deflate.
Drivers:
- Add support for S4 in meson RNG driver.
- Add STM32MP13x support in stm32.
- Add hwrng interface support in qcom-rng.
- Add support for deflate algorithm in hisilicon/zip.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmVB3vgACgkQxycdCkmx
i6dsOBAAykbnX8BpnpnOXYywE9ZWrl98rAk51MK0N9olZNfg78zRPIv7fFxFdC20
SDJrDSNPmn0Qvaa5e0EfoAdklsm0k2GkXL/BwPKMKWUsyIoJVYI3WrBMnjBy9xMp
yfME+h0bKoXJCZKnYkIUSGUejmUPSyRlEylrXoFlH/VWYwAaii/x9zwreQoF+0LR
KI24A1q8AYs6Dw9HSfndaAub9GOzrqKYs6fSaMG+77Y4UC5aoi5J9Bp2G3uVyHay
x/0bZtIxKXS9wn+LeG/3GspX23x/I5VwBOdAoMigrYmAIaIg5qgyMszudltTAs4R
zF1Kh7WsnM5+vpnBSeigzo+/GGOU3QTz8y3tBTg+3ZR7GWGOwQLiizhOYqCyOfAH
pIm6c++sZw/OOHiL69Nt4HeLKzGNYYWk3s4X/B/6cqoouPfOsfBaQobZNx9zfy7q
ZNEvSVBjrFX/L6wDSotny1LTWLUNjHbmLaMV5uQZ/SQKEtv19fp2Dl7SsLkHH+3v
ldOAwfoJR6QcSwz3Ez02TUAvQhtP172Hnxi7u44eiZu2aUboLhCFr7aEU6kVdBCx
1rIRVHD1oqlOEDRwPRXzhF3I8R4QDORJIxZ6UUhg7yueuI+XCGDsBNC+LqBrBmSR
IbdjqmSDUBhJyM5yMnt1VFYhqKQ/ZzwZ3JQviwW76Es9pwEIolM=
=IZmR
-----END PGP SIGNATURE-----
Merge tag 'v6.7-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Add virtual-address based lskcipher interface
- Optimise ahash/shash performance in light of costly indirect calls
- Remove ahash alignmask attribute
Algorithms:
- Improve AES/XTS performance of 6-way unrolling for ppc
- Remove some uses of obsolete algorithms (md4, md5, sha1)
- Add FIPS 202 SHA-3 support in pkcs1pad
- Add fast path for single-page messages in adiantum
- Remove zlib-deflate
Drivers:
- Add support for S4 in meson RNG driver
- Add STM32MP13x support in stm32
- Add hwrng interface support in qcom-rng
- Add support for deflate algorithm in hisilicon/zip"
* tag 'v6.7-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (283 commits)
crypto: adiantum - flush destination page before unmapping
crypto: testmgr - move pkcs1pad(rsa,sha3-*) to correct place
Documentation/module-signing.txt: bring up to date
module: enable automatic module signing with FIPS 202 SHA-3
crypto: asymmetric_keys - allow FIPS 202 SHA-3 signatures
crypto: rsa-pkcs1pad - Add FIPS 202 SHA-3 support
crypto: FIPS 202 SHA-3 register in hash info for IMA
x509: Add OIDs for FIPS 202 SHA-3 hash and signatures
crypto: ahash - optimize performance when wrapping shash
crypto: ahash - check for shash type instead of not ahash type
crypto: hash - move "ahash wrapping shash" functions to ahash.c
crypto: talitos - stop using crypto_ahash::init
crypto: chelsio - stop using crypto_ahash::init
crypto: ahash - improve file comment
crypto: ahash - remove struct ahash_request_priv
crypto: ahash - remove crypto_ahash_alignmask
crypto: gcm - stop using alignmask of ahash
crypto: chacha20poly1305 - stop using alignmask of ahash
crypto: ccm - stop using alignmask of ahash
net: ipv6: stop checking crypto_ahash_alignmask
...
- Allow all MFD Cell properties to be filled in dynamically at runtime
- Skip disabled device nodes and continue to look for subsequent devices
- New Device Support
- Add support for Lunar Lake-M PCI to Intel LPSS PCI
- Add support for Denverton to Intel ICH LPC
- New Functionality
- Add support for Clocks to Texas Instruments TWL* Core
- Add support for Interrupts to STMicroelectronics STM32 Timers
- Fix-ups
- Convert to new devm-* (managed) power-off API
- Remove superfluous code
- Bunch of Device Tree additions, conversions and adaptions
- Simplify obtaining resources (memory, device data) using unified API helpers
- Trivial coding-style / spelling type clean-ups
- Constify / staticify changes
- Expand or edit on existing documentation
- Convert some Regmap configurations to use the Maple Tree cache
- Apply new __counted_by() annotation to several data structures containing flexible arrays
- Replace strncpy() with strscpy()
- Bug Fixes
- Remove double put creating reference imbalances
- Ensure headphone/lineout detection sets set when booting with ACPI
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEdrbJNaO+IJqU8IdIUa+KL4f8d2EFAmVDsQcACgkQUa+KL4f8
d2FvXRAAjCwKTtj1cUOw7aiCknPuwHj5NoOvannEZN47/kzzbi35YX2oTOE3mSqc
11U6OLLZ8kpLAvKEy6l82puvDmXRDkQ7eXTFicDgLM2c+FNt3RnfffH+Njr6L8fx
a5ncAMTesnCXJAfS/8PfsONvRylGl/zQ/zmeSWvukfVa4BVAWIYcJiRnjjOL/jGf
/POTf8ihUjScCeNlRbsx28jOHDZo6RWCMauKywShuSweX/wMuRD8FwBXp8YmcsLH
LsYng06Xm+pNtMXv7VB4MQRztRAW7oHduvh/OQ0HkjzlxN8M+wpeZveyq3/i6ut2
q54TlnlLsmmOh42tmgC7sSwmVmegLTnsoEpNJeYl0AJzNvuJ7W+VdDRuUe6mYwDV
5MBThe0MGtLtthglNRR1s7pII18ffz4hDQQFExQ5Ai6ZvYu4b57TB+mKas4cTt9T
7WnoLuPiQW0SPNPWQAYtUDAF16pQmIRME2KYaNIUxGqfDK6GX4EaEQ7//0PHOchE
kdip5vDFhiTunHLOjf1Se7ZJO0KFEg/hECTq1tcYUDHSopO5hTwhy6Wcd56xyrKP
dkFn+6dl0bGqBCgxDjlJ7tPJ1m2PEDv1MbV1yO3vIU89PPTPotfUPBud1I/H2IN5
wDWTULgWxyCJqlWXZ3HjwCsRXF6H4F4A57ffJukmvivqDcXam9Y=
=rwPH
-----END PGP SIGNATURE-----
Merge tag 'mfd-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Pull MFD updates from Lee Jones:
"Core Frameworks:
- Allow all MFD Cell properties to be filled in dynamically at
runtime
- Skip disabled device nodes and continue to look for subsequent
devices
New Device Support:
- Add support for Lunar Lake-M PCI to Intel LPSS PCI
- Add support for Denverton to Intel ICH LPC
New Functionality:
- Add support for Clocks to Texas Instruments TWL* Core
- Add support for Interrupts to STMicroelectronics STM32 Timers
Fix-ups:
- Convert to new devm-* (managed) power-off API
- Remove superfluous code
- Bunch of Device Tree additions, conversions and adaptions
- Simplify obtaining resources (memory, device data) using unified
API helpers
- Trivial coding-style / spelling type clean-ups
- Constify / staticify changes
- Expand or edit on existing documentation
- Convert some Regmap configurations to use the Maple Tree cache
- Apply new __counted_by() annotation to several data structures
containing flexible arrays
- Replace strncpy() with strscpy()
Bug Fixes:
- Remove double put creating reference imbalances
- Ensure headphone/lineout detection gets set when booting with ACPI"
* tag 'mfd-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (73 commits)
mfd: lpc_ich: Mark *_gpio_offsets data with const
spmi: rename spmi device lookup helper
spmi: document spmi_device_from_of() refcounting
dt-bindings: mfd: armltd: Move Arm board syscon's to separate schema
mfd: rk8xx: Add support for RK806 power off
mfd: rk8xx: Add support for standard system-power-controller property
dt-bindings: mfd: rk806: Allow system-power-controller property
dt-bindings: mfd: rk8xx: Deprecate rockchip,system-power-controller
dt-bindings: mfd: max8925: Convert to DT schema format
mfd: Use i2c_get_match_data() in a selection of drivers
mfd: Use device_get_match_data() in a bunch of drivers
mfd: mc13xxx-spi/wm831x-spi: Use spi_get_device_match_data()
mfd: motorola-cpcap: Drop unnecessary of_match_device() call
mfd: arizona-spi: Set pdata.hpdet_channel for ACPI enumerated devs
mfd: qcom-spmi-pmic: Switch to EXPORT_SYMBOL_GPL()
mfd: qcom-spmi-pmic: Fix revid implementation
mfd: qcom-spmi-pmic: Fix reference leaks in revid helper
mfd: intel-m10-bmc: Change contact for ABI docs
mfd: max8907: Convert to use maple tree register cache
mfd: max77686: Convert to use maple tree register cache
...
We consistently switched from kmalloc() to vmalloc() in module
decompression to prevent potential memory allocation failures with large
modules, however vmalloc() is not as memory-efficient and fast as
kmalloc().
Since we don't know in general the size of the workspace required by the
decompression algorithm, it is more reasonable to use kvmalloc()
consistently, also considering that we don't have special memory
requirements here.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The only thing worth highligthing is that gzip moves to use vmalloc() instead of
kmalloc just as we had a fix for this for zstd on v6.6-rc1. The rest is regular
house keeping, keeping things neat, tidy, and boring.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmVCsGcSHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinbdgP+gKY/a3GGDDkQbNkd3VBiT4h56pixXzG
Q8kOdRHHfmSIGfnaRvfOs7GPRWClyRcWC5m3TUrObDK7or2BOJATvY9eLg6ax9s2
1z7AMuH+Bw8D7e5XjOz/UIHc9PbWdVi+nmKY7OXQw/lFPTtts9zdpxq862VAvexW
AZ3u/gfIm1b8VgyP9U/iXEt8cLb0JYOK0cjqIkdHuOa0EOf3tt6k6pjIaY7jJNLx
a4IFbp0NiA2ms2F2XOSl9x4dnKIzAA4PYbr5bDpREKywFJYsrw5p4m5zttsVIbtM
pc3KzyjSuQ+dx0aIeFrKzshuKXaNsLvMIcWIgrcxYnHzBLgF6hgyLcyb1uO2E5bT
Ig4FF5agE35Hq+gfn1az24kN+9NQm7Mab2OMXA54JWif/YTjLKAMqEA4UPLGeJ7+
+GBwvvHSEdiw37FmjNQyH9/6Ey7NqG5yiSblyufQqSZjLp/VI0u2qKj4SRHlTYyT
lIXV/EYT9855PXhWjLRHkUSdBWenKXyrxugBd8/EjsfebsewXhL+ciyPyUnzh8o4
hlokC+DfBy9znV33uGRo7qj+YHEsd4u5IKTmHzL8EWGBWqeCWxbQPHhysCfvO804
lvpz44qdDXoNqOPvFkt6JdKt6iO0jwEx7Jk1veK8XXsD8bsDogOs+gemSA5f/wKZ
4WrEjhptG6yf
=+tZ8
-----END PGP SIGNATURE-----
Merge tag 'modules-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull modules updates from Luis Chamberlain:
"The only thing worth highligthing is that gzip moves to use vmalloc()
instead of kmalloc just as we had a fix for this for zstd on v6.6-rc1.
The rest is regular house keeping, keeping things neat, tidy, and
boring"
[ The kmalloc -> vmalloc conversion is not the right approach.
Unless you know you need huge areas or know you need to use virtual
mappings for some reason (playing with protection bits or whatever),
you should use kvmalloc()/kvfree, which automatically picks the right
allocation model - Linus ]
* tag 'modules-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
module: Annotate struct module_notes_attrs with __counted_by
module: Fix comment typo
module: Make is_valid_name() return bool
module: Make is_mapping_symbol() return bool
module/decompress: use vmalloc() for gzip decompression workspace
MAINTAINERS: add include/linux/module*.h to modules
module: Clarify documentation of module_param_call()
To help make the move of sysctls out of kernel/sysctl.c not incur a size
penalty sysctl has been changed to allow us to not require the sentinel, the
final empty element on the sysctl array. Joel Granados has been doing all this
work. On the v6.6 kernel we got the major infrastructure changes required to
support this. For v6.7-rc1 we have all arch/ and drivers/ modified to remove
the sentinel. Both arch and driver changes have been on linux-next for a bit
less than a month. It is worth re-iterating the value:
- this helps reduce the overall build time size of the kernel and run time
memory consumed by the kernel by about ~64 bytes per array
- the extra 64-byte penalty is no longer inncurred now when we move sysctls
out from kernel/sysctl.c to their own files
For v6.8-rc1 expect removal of all the sentinels and also then the unneeded
check for procname == NULL.
The last 2 patches are fixes recently merged by Krister Johansen which allow
us again to use softlockup_panic early on boot. This used to work but the
alias work broke it. This is useful for folks who want to detect softlockups
super early rather than wait and spend money on cloud solutions with nothing
but an eventual hung kernel. Although this hadn't gone through linux-next it's
also a stable fix, so we might as well roll through the fixes now.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmVCqKsSHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinEgYQAIpkqRL85DBwems19Uk9A27lkctwZ6Fc
HdslQCObQTsbuKVimZFP4IL2beUfUE0cfLZCXlzp+4nRDOf6vyhyf3w19jPQtI0Q
YdqwTk9y6G5VjDsb35QK0+UBloY/kZ1H3/LW4uCwjXTuksUGmWW2Qvey35696Scv
hDMLADqKQmdpYxLUaNi9QyYbEAjYtOai2ezg3+i7hTG168t1k/Ab2BxIFrPVsCR2
FAiq05L4ugWjNskdsWBjck05JZsx9SK/qcAxpIPoUm4nGiFNHApXE0E0hs3vsnmn
WIHIbxCQw8ZlUDlmw4S+0YH3NFFzFbWfmW8k2b0f2qZTJm/rU4KiJfcJVknkAUVF
raFox6XDW0AUQ9L/NOUJ9ip5rup57GcFrMYocdJ3PPAvvmHKOb1D1O741p75RRcc
9j7zwfIRrzjPUqzhsQS/GFjdJu3lJNmEBK1AcgrVry6WoItrAzJHKPPDC7TwaNmD
eXpjxMl1sYzzHqtVh4hn+xkUYphj/6gTGMV8zdo+/FopFswgeJW9G8kHtlEWKDPk
MRIKwACmfetP6f3ngHunBg+BOipbjCANL7JI0nOhVOQoaULxCCPx+IPJ6GfSyiuH
AbcjH8DGI7fJbUkBFoF0dsRFZ2gH8ds1PYMbWUJ6x3FtuCuv5iIuvQYoaWU6itm7
6f0KvCogg0fU
=Qf50
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull sysctl updates from Luis Chamberlain:
"To help make the move of sysctls out of kernel/sysctl.c not incur a
size penalty sysctl has been changed to allow us to not require the
sentinel, the final empty element on the sysctl array. Joel Granados
has been doing all this work. On the v6.6 kernel we got the major
infrastructure changes required to support this. For v6.7-rc1 we have
all arch/ and drivers/ modified to remove the sentinel. Both arch and
driver changes have been on linux-next for a bit less than a month. It
is worth re-iterating the value:
- this helps reduce the overall build time size of the kernel and run
time memory consumed by the kernel by about ~64 bytes per array
- the extra 64-byte penalty is no longer inncurred now when we move
sysctls out from kernel/sysctl.c to their own files
For v6.8-rc1 expect removal of all the sentinels and also then the
unneeded check for procname == NULL.
The last two patches are fixes recently merged by Krister Johansen
which allow us again to use softlockup_panic early on boot. This used
to work but the alias work broke it. This is useful for folks who want
to detect softlockups super early rather than wait and spend money on
cloud solutions with nothing but an eventual hung kernel. Although
this hadn't gone through linux-next it's also a stable fix, so we
might as well roll through the fixes now"
* tag 'sysctl-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (23 commits)
watchdog: move softlockup_panic back to early_param
proc: sysctl: prevent aliased sysctls from getting passed to init
intel drm: Remove now superfluous sentinel element from ctl_table array
Drivers: hv: Remove now superfluous sentinel element from ctl_table array
raid: Remove now superfluous sentinel element from ctl_table array
fw loader: Remove the now superfluous sentinel element from ctl_table array
sgi-xp: Remove the now superfluous sentinel element from ctl_table array
vrf: Remove the now superfluous sentinel element from ctl_table array
char-misc: Remove the now superfluous sentinel element from ctl_table array
infiniband: Remove the now superfluous sentinel element from ctl_table array
macintosh: Remove the now superfluous sentinel element from ctl_table array
parport: Remove the now superfluous sentinel element from ctl_table array
scsi: Remove now superfluous sentinel element from ctl_table array
tty: Remove now superfluous sentinel element from ctl_table array
xen: Remove now superfluous sentinel element from ctl_table array
hpet: Remove now superfluous sentinel element from ctl_table array
c-sky: Remove now superfluous sentinel element from ctl_talbe array
powerpc: Remove now superfluous sentinel element from ctl_table arrays
riscv: Remove now superfluous sentinel element from ctl_table array
x86/vdso: Remove now superfluous sentinel element from ctl_table array
...
BPF_END and BPF_NEG has a different specification for the source bit in
the opcode compared to other ALU/ALU64 instructions, and is either
reserved or use to specify the byte swap endianness. In both cases the
source bit does not encode source operand location, and src_reg is a
reserved field.
backtrack_insn() currently does not differentiate BPF_END and BPF_NEG
from other ALU/ALU64 instructions, which leads to r0 being incorrectly
marked as precise when processing BPF_ALU | BPF_TO_BE | BPF_END
instructions. This commit teaches backtrack_insn() to correctly mark
precision for such case.
While precise tracking of BPF_NEG and other BPF_END instructions are
correct and does not need fixing, this commit opt to process all BPF_NEG
and BPF_END instructions within the same if-clause to better align with
current convention used in the verifier (e.g. check_alu_op).
Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Cc: stable@vger.kernel.org
Reported-by: Mohamed Mahmoud <mmahmoud@redhat.com>
Closes: https://lore.kernel.org/r/87jzrrwptf.fsf@toke.dk
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Tao Lyu <tao.lyu@epfl.ch>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231102053913.12004-2-shung-hsi.yu@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The newly added open-coded css_task iter would try to hold the global
css_set_lock in bpf_iter_css_task_new, so the bpf side has to be careful in
where it allows to use this iter. The mainly concern is dead locking on
css_set_lock. check_css_task_iter_allowlist() in verifier enforced css_task
can only be used in bpf_lsm hooks and sleepable bpf_iter.
This patch relax the allowlist for css_task iter. Any lsm and any iter
(even non-sleepable) and any sleepable are safe since they would not hold
the css_set_lock before entering BPF progs context.
This patch also fixes the misused BPF_TRACE_ITER in
check_css_task_iter_allowlist which compared bpf_prog_type with
bpf_attach_type.
Fixes: 9c66dc94b6 ("bpf: Introduce css_task open-coded iterator kfuncs")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231031050438.93297-2-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When there are concurrent uref release and bpf timer init operations,
the following sequence diagram is possible. It will break the guarantee
provided by bpf_timer: bpf_timer will still be alive after userspace
application releases or unpins the map. It also will lead to kmemleak
for old kernel version which doesn't release bpf_timer when map is
released.
bpf program X:
bpf_timer_init()
lock timer->lock
read timer->timer as NULL
read map->usercnt != 0
process Y:
close(map_fd)
// put last uref
bpf_map_put_uref()
atomic_dec_and_test(map->usercnt)
array_map_free_timers()
bpf_timer_cancel_and_free()
// just return
read timer->timer is NULL
t = bpf_map_kmalloc_node()
timer->timer = t
unlock timer->lock
Fix the problem by checking map->usercnt after timer->timer is assigned,
so when there are concurrent uref release and bpf timer init, either
bpf_timer_cancel_and_free() from uref release reads a no-NULL timer
or the newly-added atomic64_read() returns a zero usercnt.
Because atomic_dec_and_test(map->usercnt) and READ_ONCE(timer->timer)
in bpf_timer_cancel_and_free() are not protected by a lock, so add
a memory barrier to guarantee the order between map->usercnt and
timer->timer. Also use WRITE_ONCE(timer->timer, x) to match the lockless
read of timer->timer in bpf_timer_cancel_and_free().
Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
Closes: https://lore.kernel.org/bpf/CABcoxUaT2k9hWsS1tNgXyoU3E-=PuOgMn737qK984fbFmfYixQ@mail.gmail.com
Fixes: b00628b1c7 ("bpf: Introduce bpf timers.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231030063616.1653024-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Not all uses of __diag_ignore_all(...) in BPF-related code in order to
suppress warnings are wrapping kfunc definitions. Some "hook point"
definitions - small functions meant to be used as attach points for
fentry and similar BPF progs - need to suppress -Wmissing-declarations.
We could use __bpf_kfunc_{start,end}_defs added in the previous patch in
such cases, but this might be confusing to someone unfamiliar with BPF
internals. Instead, this patch adds __bpf_hook_{start,end} macros,
currently having the same effect as __bpf_kfunc_{start,end}_defs, then
uses them to suppress warnings for two hook points in the kernel itself
and some bpf_testmod hook points as well.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20231031215625.2343848-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF kfuncs are meant to be called from BPF programs. Accordingly, most
kfuncs are not called from anywhere in the kernel, which the
-Wmissing-prototypes warning is unhappy about. We've peppered
__diag_ignore_all("-Wmissing-prototypes", ... everywhere kfuncs are
defined in the codebase to suppress this warning.
This patch adds two macros meant to bound one or many kfunc definitions.
All existing kfunc definitions which use these __diag calls to suppress
-Wmissing-prototypes are migrated to use the newly-introduced macros.
A new __diag_ignore_all - for "-Wmissing-declarations" - is added to the
__bpf_kfunc_start_defs macro based on feedback from Andrii on an earlier
version of this patch [0] and another recent mailing list thread [1].
In the future we might need to ignore different warnings or do other
kfunc-specific things. This change will make it easier to make such
modifications for all kfunc defs.
[0]: https://lore.kernel.org/bpf/CAEf4BzaE5dRWtK6RPLnjTW-MW9sx9K3Fn6uwqCTChK2Dcb1Xig@mail.gmail.com/
[1]: https://lore.kernel.org/bpf/ZT+2qCc%2FaXep0%2FLf@krava/
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20231031215625.2343848-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Our MPTCP CI complained [1] -- and KBuild too -- that it was no longer
possible to build the kernel without CONFIG_CGROUPS:
kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_new':
kernel/bpf/task_iter.c:919:14: error: 'CSS_TASK_ITER_PROCS' undeclared (first use in this function)
919 | case CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED:
| ^~~~~~~~~~~~~~~~~~~
kernel/bpf/task_iter.c:919:14: note: each undeclared identifier is reported only once for each function it appears in
kernel/bpf/task_iter.c:919:36: error: 'CSS_TASK_ITER_THREADED' undeclared (first use in this function)
919 | case CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED:
| ^~~~~~~~~~~~~~~~~~~~~~
kernel/bpf/task_iter.c:927:60: error: invalid application of 'sizeof' to incomplete type 'struct css_task_iter'
927 | kit->css_it = bpf_mem_alloc(&bpf_global_ma, sizeof(struct css_task_iter));
| ^~~~~~
kernel/bpf/task_iter.c:930:9: error: implicit declaration of function 'css_task_iter_start'; did you mean 'task_seq_start'? [-Werror=implicit-function-declaration]
930 | css_task_iter_start(css, flags, kit->css_it);
| ^~~~~~~~~~~~~~~~~~~
| task_seq_start
kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_next':
kernel/bpf/task_iter.c:940:16: error: implicit declaration of function 'css_task_iter_next'; did you mean 'class_dev_iter_next'? [-Werror=implicit-function-declaration]
940 | return css_task_iter_next(kit->css_it);
| ^~~~~~~~~~~~~~~~~~
| class_dev_iter_next
kernel/bpf/task_iter.c:940:16: error: returning 'int' from a function with return type 'struct task_struct *' makes pointer from integer without a cast [-Werror=int-conversion]
940 | return css_task_iter_next(kit->css_it);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_destroy':
kernel/bpf/task_iter.c:949:9: error: implicit declaration of function 'css_task_iter_end' [-Werror=implicit-function-declaration]
949 | css_task_iter_end(kit->css_it);
| ^~~~~~~~~~~~~~~~~
This patch simply surrounds with a #ifdef the new code requiring CGroups
support. It seems enough for the compiler and this is similar to
bpf_iter_css_{new,next,destroy}() functions where no other #ifdef have
been added in kernel/bpf/helpers.c and in the selftests.
Fixes: 9c66dc94b6 ("bpf: Introduce css_task open-coded iterator kfuncs")
Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/6665206927
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310260528.aHWgVFqq-lkp@intel.com/
Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
[ added missing ifdefs for BTF_ID cgroup definitions ]
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20231101181601.1493271-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A synthetic event is created by the synthetic event interface that can
read both user or kernel address memory. In reality, it reads any
arbitrary memory location from within the kernel. If the address space is
in USER (where CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE is set) then
it uses strncpy_from_user_nofault() to copy strings otherwise it uses
strncpy_from_kernel_nofault().
But since both functions use the same variable there's no annotation to
what that variable is (ie. __user). This makes sparse complain.
Quiet sparse by typecasting the strncpy_from_user_nofault() variable to
a __user pointer.
Link: https://lore.kernel.org/linux-trace-kernel/20231031151033.73c42e23@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: 0934ae9977 ("tracing: Fix reading strings from synthetic events");
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311010013.fm8WTxa5-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The following can crash the kernel:
# cd /sys/kernel/tracing
# echo 'p:sched schedule' > kprobe_events
# exec 5>>events/kprobes/sched/enable
# > kprobe_events
# exec 5>&-
The above commands:
1. Change directory to the tracefs directory
2. Create a kprobe event (doesn't matter what one)
3. Open bash file descriptor 5 on the enable file of the kprobe event
4. Delete the kprobe event (removes the files too)
5. Close the bash file descriptor 5
The above causes a crash!
BUG: kernel NULL pointer dereference, address: 0000000000000028
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 6 PID: 877 Comm: bash Not tainted 6.5.0-rc4-test-00008-g2c6b6b1029d4-dirty #186
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:tracing_release_file_tr+0xc/0x50
What happens here is that the kprobe event creates a trace_event_file
"file" descriptor that represents the file in tracefs to the event. It
maintains state of the event (is it enabled for the given instance?).
Opening the "enable" file gets a reference to the event "file" descriptor
via the open file descriptor. When the kprobe event is deleted, the file is
also deleted from the tracefs system which also frees the event "file"
descriptor.
But as the tracefs file is still opened by user space, it will not be
totally removed until the final dput() is called on it. But this is not
true with the event "file" descriptor that is already freed. If the user
does a write to or simply closes the file descriptor it will reference the
event "file" descriptor that was just freed, causing a use-after-free bug.
To solve this, add a ref count to the event "file" descriptor as well as a
new flag called "FREED". The "file" will not be freed until the last
reference is released. But the FREE flag will be set when the event is
removed to prevent any more modifications to that event from happening,
even if there's still a reference to the event "file" descriptor.
Link: https://lore.kernel.org/linux-trace-kernel/20231031000031.1e705592@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231031122453.7a48b923@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: f5ca233e2e ("tracing: Increase trace array ref count on enable and filter files")
Reported-by: Beau Belgrave <beaub@linux.microsoft.com>
Tested-by: Beau Belgrave <beaub@linux.microsoft.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
there are some significant changes nonetheless:
- Some more Spanish-language and Chinese translations.
- The much-discussed documentation of the confidential-computing threat
model.
- Powerpc and RISCV documentation move under Documentation/arch - these
complete this particular bit of documentation churn.
- A large traditional-Chinese documentation update.
- A new document on backporting and conflict resolution.
- Some kernel-doc and Sphinx fixes.
Plus the usual smattering of smaller updates and typo fixes.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmVBNv8PHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5Y0JkH/36MOpkaDnsY69/dMRKSuD4mAAP2H6LS8V63
SsMgH5VCj8lcy/Tz1+J89t14pbcX8l0viKxSo4UxvzoJ5snrz8A8gZ9oqY7NCcNs
nMtolnN5IwdbgGnEGqASSLsl07lnabhRK0VYv9ZO7lHjYQp97VsJ/qrjJn385HFE
vYW8iRcxcKdwtuuwOtbPcdAMjP54saJdNC5wMLsfMR0csKcGbzaSNpqpiGovzT7l
phG2DSxrJH0gUZyeGPryroNppaf+mVKSDSiwRdI8mzm0J67p6dZYYwBS1Iw6Awbf
8iYoj6W63/FVQbXffPx5d6ffOSQh4JkAskxgBUOzluSGusSDc+4=
=9HU5
-----END PGP SIGNATURE-----
Merge tag 'docs-6.7' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"The number of commits for documentation is not huge this time around,
but there are some significant changes nonetheless:
- Some more Spanish-language and Chinese translations
- The much-discussed documentation of the confidential-computing
threat model
- Powerpc and RISCV documentation move under Documentation/arch -
these complete this particular bit of documentation churn
- A large traditional-Chinese documentation update
- A new document on backporting and conflict resolution
- Some kernel-doc and Sphinx fixes
Plus the usual smattering of smaller updates and typo fixes"
* tag 'docs-6.7' of git://git.lwn.net/linux: (40 commits)
scripts/kernel-doc: Fix the regex for matching -Werror flag
docs: backporting: address feedback
Documentation: driver-api: pps: Update PPS generator documentation
speakup: Document USB support
doc: blk-ioprio: Bring the doc in line with the implementation
docs: usb: fix reference to nonexistent file in UVC Gadget
docs: doc-guide: mention 'make refcheckdocs'
Documentation: fix typo in dynamic-debug howto
scripts/kernel-doc: match -Werror flag strictly
Documentation/sphinx: Remove the repeated word "the" in comments.
docs: sparse: add SPDX-License-Identifier
docs/zh_CN: Add subsystem-apis Chinese translation
docs/zh_TW: update contents for zh_TW
docs: submitting-patches: encourage direct notifications to commenters
docs: add backporting and conflict resolution document
docs: move riscv under arch
docs: update link to powerpc/vmemmap_dedup.rst
mm/memory-hotplug: fix typo in documentation
docs: move powerpc under arch
PCI: Update the devres documentation regarding to pcim_*()
...
- cleanups:
. kprobes: Fixes typo in kprobes samples.
. tracing/eprobes: Remove 'break' after return.
- kretprobe/fprobe performance improvements:
. lib: Introduce new `objpool`, which is a high performance lockless
object queue. This uses per-cpu ring array to allocate/release
objects from the pre-allocated object pool. Since the index of ring
array is a 32bit sequential counter, we can retry to push/pop the
object pointer from the ring without lock (as seq-lock does).
. lib: Add an objpool test module to test the functionality and
evaluate the performance under some circumstances.
. kprobes/fprobe: Improve kretprobe and rethook scalability
performance with objpool.
This improves both legacy kretprobe and fprobe exit handler (which
is based on rethook) to be scalable on SMP systems. Even with
8-threads parallel test, it shows a great scalability improvement.
. Remove unneeded freelist.h which is replaced by objpool.
. objpool: Add maintainers entry for the objpool.
. objpool: Fix to remove unused include header lines.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmVA54obHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8busoH/3mG/rJwVVJw70zTLlfs
ko4U1wn16aImYQYYLXkZLlYsKr6Y2dzNkb5C4CEI2r47EZjTamHatGZ6MSwvAtPb
u9oloHEbRbE6yM+EjrE1JAKT9FwC+21/yZCN2zACZKJRwCwQRzxGIXUwGTWtDNdE
NySLBDyMoR6zZJsFy8YueFBAJxcZdWIPK6mQH2Y5awVQA4tV7tQEe92KFqUYWTd5
exbfBbcVG8MBWmrPqRI46Hxh0NWOnPCqFwGqX8Q7hE/yrQnTPzJ+2ZsbYFkGRk6A
pM5wRCdwO5+OlcHEcEHBMQSGCmFgk6m1UMG8RvbCKyF3cwHbxzlelbjzHosKQvSh
EKQ=
=/vZK
-----END PGP SIGNATURE-----
Merge tag 'probes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes updates from Masami Hiramatsu:
"Cleanups:
- kprobes: Fixes typo in kprobes samples
- tracing/eprobes: Remove 'break' after return
kretprobe/fprobe performance improvements:
- lib: Introduce new `objpool`, which is a high performance lockless
object queue. This uses per-cpu ring array to allocate/release
objects from the pre-allocated object pool.
Since the index of ring array is a 32bit sequential counter, we can
retry to push/pop the object pointer from the ring without lock (as
seq-lock does)
- lib: Add an objpool test module to test the functionality and
evaluate the performance under some circumstances
- kprobes/fprobe: Improve kretprobe and rethook scalability
performance with objpool.
This improves both legacy kretprobe and fprobe exit handler (which
is based on rethook) to be scalable on SMP systems. Even with
8-threads parallel test, it shows a great scalability improvement
- Remove unneeded freelist.h which is replaced by objpool
- objpool: Add maintainers entry for the objpool
- objpool: Fix to remove unused include header lines"
* tag 'probes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
kprobes: unused header files removed
MAINTAINERS: objpool added
kprobes: freelist.h removed
kprobes: kretprobe scalability improvement
lib: objpool test module added
lib: objpool added: ring-array based lockless MPMC
tracing/eprobe: drop unneeded breaks
samples: kprobes: Fixes a typo
The ia64 architecture gets its well-earned retirement as planned,
now that there is one last (mostly) working release that will
be maintained as an LTS kernel.
The architecture specific system call tables are updated for
the added map_shadow_stack() syscall and to remove references
to the long-gone sys_lookup_dcookie() syscall.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmVC40IACgkQYKtH/8kJ
Uidhmw/9EX+aWSXGoObJ3fngaNSMw+PmrEuP8qEKBHxfKHcCdX3hc451Oh4GlhaQ
tru91pPwgNvN2/rfoKusxT+V4PemGIzfNni/04rp+P0kvmdw5otQ2yNhsQNsfVmq
XGWvkxF4P2GO6bkjjfR/1dDq7GtlyXtwwPDKeLbYb6TnJOZjtx+EAN27kkfSn1Ms
R4Sa3zJ+DfHUmHL5S9g+7UD/CZ5GfKNmIskI4Mz5GsfoUz/0iiU+Bge/9sdcdSJQ
kmbLy5YnVzfooLZ3TQmBFsO3iAMWb0s/mDdtyhqhTVmTUshLolkPYyKnPFvdupyv
shXcpEST2XJNeaDRnL2K4zSCdxdbnCZHDpjfl9wfioBg7I8NfhXKpf1jYZHH1de4
LXq8ndEFEOVQw/zSpYWfQq1sux8Jiqr+UK/ukbVeFWiGGIUs91gEWtPAf8T0AZo9
ujkJvaWGl98O1g5wmBu0/dAR6QcFJMDfVwbmlIFpU8O+MEaz6X8mM+O5/T0IyTcD
eMbAUjj4uYcU7ihKzHEv/0SS9Of38kzff67CLN5k8wOP/9NlaGZ78o1bVle9b52A
BdhrsAefFiWHp1jT6Y9Rg4HOO/TguQ9e6EWSKOYFulsiLH9LEFaB9RwZLeLytV0W
vlAgY9rUW77g1OJcb7DoNv33nRFuxsKqsnz3DEIXtgozo9CzbYI=
=H1vH
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull ia64 removal and asm-generic updates from Arnd Bergmann:
- The ia64 architecture gets its well-earned retirement as planned,
now that there is one last (mostly) working release that will be
maintained as an LTS kernel.
- The architecture specific system call tables are updated for the
added map_shadow_stack() syscall and to remove references to the
long-gone sys_lookup_dcookie() syscall.
* tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
hexagon: Remove unusable symbols from the ptrace.h uapi
asm-generic: Fix spelling of architecture
arch: Reserve map_shadow_stack() syscall number for all architectures
syscalls: Cleanup references to sys_lookup_dcookie()
Documentation: Drop or replace remaining mentions of IA64
lib/raid6: Drop IA64 support
Documentation: Drop IA64 from feature descriptions
kernel: Drop IA64 support from sig_fault handlers
arch: Remove Itanium (IA-64) architecture
- get rid of the fake support for coherent DMA allocation on coldfire with
caches (Christoph Hellwig)
- add a few Kconfig dependencies so that Kconfig catches the use of
invalid configurations (Christoph Hellwig)
- fix a type in dma-debug output (Chuck Lever)
- rewrite a comment in swiotlb (Sean Christopherson)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmU/t/8LHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYNRsA/9GurDhfwje9qOaMIOfrmrB+mppEJ67pi0dgAXKgGN
HpZJwHEJCoM3zrAmvq58tCCI4r8kOjqkfKkPZNHaqSLF+fAPzI7YhSD+Y28GClM4
cutrYovJVGeOTXJMwINMRo/r6n3nBZ4fG17YflGnuZHL27H7+dmaxwXusLvwBTwv
7rFr8WRuqEpnMb7OktHIG9fnsy6oxWNhxBvG8Vu93yiZqprv3xbhI/BaRaOtZM2W
zQA7OqM5YxQCH5gNnfcx25f5bkfkDoxUYh8gDd4JSwTUJz0ZlIL8/ROPJScjpFvh
M3ur/NXdFfaqfDYWzO40wxmF6N0moHLvppOaEzM/tmGvtZBzqKmpNCkVBQCoxNAS
1jwW4kh1ZhoW4RbPEKX6kcfJjn97o+RE9pY5t956a9sDd3DBqPNaPIOqlwmeB8Sd
bh2ekwuNmxwsZXqTv5c5vvN4a95RNhZMvS2ma9o6lnsLTaeog7x4mnU0cf69tQuT
850JexGcM0fzD2nMqrmfyyLgUjPN6k+Z71Ay6FiTWjnK4mLRN8zmVgF8tXtQuexH
4HAJ70LJ2OxfEkW5nD3yUc2S/RwyVR6HeGG9bciYQbob3hqb4glzALNpB9C02Cf1
/iOwAMdUgsj4MYaeOOwk2u3+ZMsuF3DlaoJ+8Gr/M60C92SCkMAYIYJDh61b6qk1
i04=
=D8Eq
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.7-2023-10-30' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- get rid of the fake support for coherent DMA allocation on coldfire
with caches (Christoph Hellwig)
- add a few Kconfig dependencies so that Kconfig catches the use of
invalid configurations (Christoph Hellwig)
- fix a type in dma-debug output (Chuck Lever)
- rewrite a comment in swiotlb (Sean Christopherson)
* tag 'dma-mapping-6.7-2023-10-30' of git://git.infradead.org/users/hch/dma-mapping:
dma-debug: Fix a typo in a debugging eye-catcher
swiotlb: rewrite comment explaining why the source is preserved on DMA_FROM_DEVICE
m68k: remove unused includes from dma.c
m68k: don't provide arch_dma_alloc for nommu/coldfire
net: fec: use dma_alloc_noncoherent for data cache enabled coldfire
m68k: use the coherent DMA code for coldfire without data cache
dma-direct: warn when coherent allocations aren't supported
dma-direct: simplify the use atomic pool logic in dma_direct_alloc
dma-direct: add a CONFIG_ARCH_HAS_DMA_ALLOC symbol
dma-direct: add dependencies to CONFIG_DMA_GLOBAL_POOL
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmVAUXUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuGsEADEs0/4uXb8kLUF/y0B0bY9jmwiw5id14g5
TkAH9lbceV0Yv0E1tPeWYIz7Y7s83UOduFVZo4hRH8EysH3IYFZCI/ny3v2nJ1av
lN7F7YegVOu6qx77e/CwLo7on14awHkSo8pUdCOm6tYLunLg42miRf+xTpSAL0Mg
ONnt0WxWDOgdNvTaGwBPaVE78FAWK8nc2ACzonQGfzCl2VXOsSy9JaJJMv8eyXOf
VVZCNcSvHh/zVznlC1YPoZh/bgS2UUJmIGL/XMQnM5qzbK1IPpzlN0cu8rje3s9b
TUKBKqr6xhC9nyAS1qAjgZ98RfjVnzcbMX+aWEb/Z0y9XFJVSSQQdW+f9A/0KLZm
jAejHJpNuqwEdB9MplHTXdeSDTkJH3YNbXvtwA6cc/KpZ1FVQXlhSJPp/mbOa7qe
IIeg6SYt84uZ2HxflTtm+I1uVE9QMcsesy3FIK4kxhA8jSximQw+hPZ3xrv4AHLd
cTkRAzfXPUFsJJQCgpv289QXobV/vsFhCFTHFxv63H+EGpJ7e1EaW6Eq0pAHG0Ai
8kk5Ns29jzTVer1W3sMMeDaZ7S8hGRAyRC+Zb/0QxtGsmvxikB0qY1GpdRGPFueQ
gOawhLZdhkigIsq0U1UGMpHKY0G1Sl9wvHuH2qzUKeWk+vFRv5RwR6zQuVJr2Jo/
j3HgyYDs7Q==
=Z0L0
-----END PGP SIGNATURE-----
Merge tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linux
Pull io_uring futex support from Jens Axboe:
"This adds support for using futexes through io_uring - first futex
wake and wait, and then the vectored variant of waiting, futex waitv.
For both wait/wake/waitv, we support the bitset variant, as the
'normal' variants can be easily implemented on top of that.
PI and requeue are not supported through io_uring, just the above
mentioned parts. This may change in the future, but in the spirit of
keeping this small (and based on what people have been asking for),
this is what we currently have.
Wake support is pretty straight forward, most of the thought has gone
into the wait side to avoid needing to offload wait operations to a
blocking context. Instead, we rely on the usual callbacks to retry and
post a completion event, when appropriate.
As far as I can recall, the first request for futex support with
io_uring came from Andres Freund, working on postgres. His aio rework
of postgres was one of the early adopters of io_uring, and futex
support was a natural extension for that. This is relevant from both a
usability point of view, as well as for effiency and performance. In
Andres's words, for the former:
Futex wait support in io_uring makes it a lot easier to avoid
deadlocks in concurrent programs that have their own buffer pool:
Obviously pages in the application buffer pool have to be locked
during IO. If the initiator of IO A needs to wait for a held lock
B, the holder of lock B might wait for the IO A to complete. The
ability to wait for a lock and IO completions at the same time
provides an efficient way to avoid such deadlocks
and in terms of effiency, even without unlocking the full potential
yet, Andres says:
Futex wake support in io_uring is useful because it allows for more
efficient directed wakeups. For some "locks" postgres has queues
implemented in userspace, with wakeup logic that cannot easily be
implemented with FUTEX_WAKE_BITSET on a single "futex word"
(imagine waiting for journal flushes to have completed up to a
certain point).
Thus a "lock release" sometimes need to wake up many processes in a
row. A quick-and-dirty conversion to doing these wakeups via
io_uring lead to a 3% throughput increase, with 12% fewer context
switches, albeit in a fairly extreme workload"
* tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linux:
io_uring: add support for vectored futex waits
futex: make the vectored futex operations available
futex: make futex_parse_waitv() available as a helper
futex: add wake_data to struct futex_q
io_uring: add support for futex wake and wait
futex: abstract out a __futex_wake_mark() helper
futex: factor out the futex wake handling
futex: move FUTEX2_VALID_MASK to futex.h
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmU/vdwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpr2rD/0astIsj/AACVSPzHARg9lnhkIvUeweMSSl
CjifLTzK3a9E3R2IrC4sflObUKIEL3fste0Lva141eNULZvBJ6cQJDvY7Bp72Bkc
CTPEwEQiwDJKLhTzQh3gY0H0+nFMWwEm1uc4dyeNAft/R9bPP/qOq62ttCoCp9+S
1UoFmTlJE3bhejyS7fytoGZvKqhkpdR7rtbR4ya7CXWPoAG+v9amo8fputbxm0dj
WECpKdd65JHWwYV4rbPA69T7jZ9V0oUsLen9RJ9BmjMLOFggHYqQdvEwG0Htirhw
t5uaXqSvc8pXsJhKXMS3tXCrLNtBha5nlWHBpSE+6ovcmKiRzFjUaRXkRbcIrOAx
ljIm0HHto1+xv0pDrNl3/lIjv5dpNOEauqqgMeYytQJIHa0JpSWbYzvjwQ8EZXQv
WWDiRfH5Z0/3BsFdOCVqd8mTt4Pbksp2VFcxGkojRtSqSr4CML3mPZSmqGcs3nE6
Fc16XXw7oLEWoF1tQYMP6KG0cVLem4on28c8CcVMJ/pRvcun3jBCif2gmMHJkWyA
a6Uq116amqQ61f1p+EQ3ChqyTA5uALrXPmovu6Ne3Y/btW5yG4+Vu7AsPLjPHdFN
oGHjOPV77XQzEqzUWRXmXPecZ+QifkcCV/8kbqtEHQqk5n+HUKQZmpC8+014ms3V
Af6LYI/vYg==
=sk8+
-----END PGP SIGNATURE-----
Merge tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux
Pull io_uring {get,set}sockopt support from Jens Axboe:
"This adds support for using getsockopt and setsockopt via io_uring.
The main use cases for this is to enable use of direct descriptors,
rather than first instantiating a normal file descriptor, doing the
option tweaking needed, then turning it into a direct descriptor. With
this support, we can avoid needing a regular file descriptor
completely.
The net and bpf bits have been signed off on their side"
* tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux:
selftests/bpf/sockopt: Add io_uring support
io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT
io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT
io_uring/cmd: return -EOPNOTSUPP if net is disabled
selftests/net: Extract uring helpers to be reusable
tools headers: Grab copy of io_uring.h
io_uring/cmd: Pass compat mode in issue_flags
net/socket: Break down __sys_getsockopt
net/socket: Break down __sys_setsockopt
bpf: Add sockptr support for setsockopt
bpf: Add sockptr support for getsockopt
The commit:
cff9b2332a ("kernel/sched: Modify initial boot task idle setup")
has changed the semantics of what is to be considered an idle task in
such a way that the idle task of an offline CPU may not carry the
PF_IDLE flag anymore.
However RCU-tasks-trace tests the opposite assertion, still assuming
that idle tasks carry the PF_IDLE flag during their whole lifecycle.
Remove this assumption to avoid spurious warnings but keep the initial
test verifying that the idle task is the current task on any offline
CPU.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Fixes: cff9b2332a ("kernel/sched: Modify initial boot task idle setup")
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E . McKenney" <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmU/vcMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmnaD/4spcYjSSdeHVh3J60QuWjMYOM//E/BNb6e
3I2L6Is2RLuDGhVhHKfRfkJQy1UPKYKu5TZewUnwC3bz12kWGc8CZBF4WgM0159T
0uBm2ZtsstSCONA16tQdmE7gt5MJ6KFO0rsubm/AxNWxTnpyrbrX512TkkJTBrfC
ZluAKxGviZOcrl9ROoVMc/FeMmaKVcT79mDuLp0y+Pmb2KO3y9bWTs/wpmEPNVro
P7n/j9B4dBQC3Saij/wCdcsodkHUaCfCnRK3g34JKeACb+Kclg7QSzinb3TZjeEw
o98l1XMiejkPJDIxYmWPTmdzqu6AUnT3Geq6eL463/PUOjgkzet6idYfk6XQgRyz
AhFzA6KruMJ+IhOs974KtmDJj+7LbGkMUpW0kEqKWpXFEO2t+yG6Ue4cdC2FtsqV
m/ojTTeejVqJ1RLng9IqVMT/X6sqpTtBOikNIJeWyDZQGpOOBxkG9qyoYxNQTOAr
280UwcFMgsRDQMpi9uIsc7uE7QvN/RYL9nqm49bxJTRm/sRsABPb71yWcbrHSAjh
y2tprYqG0V4qK7ogCiqDt8qdq/nZS6d1mN/th33yGAHtWEStTyFKNuYmPOrzLtWb
tvnmYGA7YxcpSMEPHQbYG5TlmoWoTlzUlwJ1OWGzqdlPw7USCwjFfTZVJuKm6wkR
u0uTkYhn4A==
=okQ8
-----END PGP SIGNATURE-----
Merge tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
"This contains the core io_uring updates, of which there are not many,
and adds support for using WAITID through io_uring and hence not
needing to block on these kinds of events.
Outside of that, tweaks to the legacy provided buffer handling and
some cleanups related to cancelations for uring_cmd support"
* tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linux:
io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups
io_uring/kbuf: Use slab for struct io_buffer objects
io_uring/kbuf: Allow the full buffer id space for provided buffers
io_uring/kbuf: Fix check of BID wrapping in provided buffers
io_uring/rsrc: cleanup io_pin_pages()
io_uring: cancelable uring_cmd
io_uring: retain top 8bits of uring_cmd flags for kernel internal use
io_uring: add IORING_OP_WAITID support
exit: add internal include file with helpers
exit: add kernel_waitid_prepare() helper
exit: move core of do_wait() into helper
exit: abstract out should_wake helper for child_wait_callback()
io_uring/rw: add support for IORING_OP_READ_MULTISHOT
io_uring/rw: mark readv/writev as vectored in the opcode definition
io_uring/rw: split io_read() into a helper
The commit:
cff9b2332a ("kernel/sched: Modify initial boot task idle setup")
has changed the semantics of what is to be considered an idle task in
such a way that CPU boot code preceding the actual idle loop is excluded
from it.
This has however introduced new potential RCU-tasks stalls when either:
1) Grace period is started before init/0 had a chance to set PF_IDLE,
keeping it stuck in the holdout list until idle ever schedules.
2) Grace period is started when some possible CPUs have never been
online, keeping their idle tasks stuck in the holdout list until the
CPU ever boots up.
3) Similar to 1) but with secondary CPUs: Grace period is started
concurrently with secondary CPU booting, putting its idle task in
the holdout list because PF_IDLE isn't yet observed on it. It stays
then stuck in the holdout list until that CPU ever schedules. The
effect is mitigated here by the hotplug AP thread that must run to
bring the CPU up.
Fix this with handling the new semantics of PF_IDLE, keeping in mind
that it may or may not be set on an idle task. Take advantage of that to
strengthen the coverage of an RCU-tasks quiescent state within an idle
task, excluding the CPU boot code from it. Only the code running within
the idle loop is now a quiescent state, along with offline CPUs.
Fixes: cff9b2332a ("kernel/sched: Modify initial boot task idle setup")
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E . McKenney" <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Export the RCU point of view as to when a CPU is considered offline
(ie: when does RCU consider that a CPU is sufficiently down in the
hotplug process to not feature any possible read side).
This will be used by RCU-tasks whose vision of an offline CPU should
reasonably match the one of RCU core.
Fixes: cff9b2332a ("kernel/sched: Modify initial boot task idle setup")
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Commit 851a723e45 ("sched: Always clear user_cpus_ptr in
do_set_cpus_allowed()") added a kfree() call to free any user
provided affinity mask, if present. It was changed later to use
kfree_rcu() in commit 9a5418bc48 ("sched/core: Use kfree_rcu()
in do_set_cpus_allowed()") to avoid a circular locking dependency
problem.
It turns out that even kfree_rcu() isn't safe for avoiding
circular locking problem. As reported by kernel test robot,
the following circular locking dependency now exists:
&rdp->nocb_lock --> rcu_node_0 --> &rq->__lock
Solve this by breaking the rcu_node_0 --> &rq->__lock chain by moving
the resched_cpu() out from under rcu_node lock.
[peterz: heavily borrowed from Waiman's Changelog]
[paulmck: applied Z qiang feedback]
Fixes: 851a723e45 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/oe-lkp/202310302207.a25f1a30-oliver.sang@intel.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct module_notes_attrs.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: linux-modules@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Use a similar approach as commit a419beac4a ("module/decompress: use
vmalloc() for zstd decompression workspace") and replace kmalloc() with
vmalloc() also for the gzip module decompression workspace.
In this case the workspace is represented by struct inflate_workspace
that can be fairly large for kmalloc() and it can potentially lead to
allocation errors on certain systems:
$ pahole inflate_workspace
struct inflate_workspace {
struct inflate_state inflate_state; /* 0 9544 */
/* --- cacheline 149 boundary (9536 bytes) was 8 bytes ago --- */
unsigned char working_window[32768]; /* 9544 32768 */
/* size: 42312, cachelines: 662, members: 2 */
/* last cacheline: 8 bytes */
};
Considering that there is no need to use continuous physical memory,
simply switch to vmalloc() to provide a more reliable in-kernel module
decompression.
Fixes: b1ae6dc41e ("module: add in-kernel support for decompressing")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Setting softlockup_panic from do_sysctl_args() causes it to take effect
later in boot. The lockup detector is enabled before SMP is brought
online, but do_sysctl_args runs afterwards. If a user wants to set
softlockup_panic on boot and have it trigger should a softlockup occur
during onlining of the non-boot processors, they could do this prior to
commit f117955a22 ("kernel/watchdog.c: convert {soft/hard}lockup boot
parameters to sysctl aliases"). However, after this commit the value
of softlockup_panic is set too late to be of help for this type of
problem. Restore the prior behavior.
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Cc: stable@vger.kernel.org
Fixes: f117955a22 ("kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
kernel:
- add initial vmemdup-user-array
core:
- fix platform remove() to return void
- drm_file owner updated to reflect owner
- move size calcs to drm buddy allocator
- let GPUVM build as a module
- allow variable number of run-queues in scheduler
edid:
- handle bad h/v sync_end in EDIDs
panfrost:
- add Boris as maintainer
fbdev:
- use fb_ops helpers more
- only allow logo use from fbcon
- rename fb_pgproto to pgprot_framebuffer
- add HPD state to drm_connector_oob_hotplug_event
- convert to fbdev i/o mem helpers
i915:
- Enable meteorlake by default
- Early Xe2 LPD/Lunarlake display enablement
- Rework subplatforms into IP version checks
- GuC based TLB invalidation for Meteorlake
- Display rework for future Xe driver integration
- LNL FBC features
- LNL display feature capability reads
- update recommended fw versions for DG2+
- drop fastboot module parameter
- added deviceid for Arrowlake-S
- drop preproduction workarounds
- don't disable preemption for resets
- cleanup inlines in headers
- PXP firmware loading fix
- Fix sg list lengths
- DSC PPS state readout/verification
- Add more RPL P/U PCI IDs
- Add new DG2-G12 stepping
- DP enhanced framing support to state checker
- Improve shared link bandwidth management
- stop using GEM macros in display code
- refactor related code into display code
- locally enable W=1 warnings
- remove PSR watchdog timers on LNL
amdgpu:
- RAS/FRU EEPROM updatse
- IP discovery updatses
- GC 11.5 support
- DCN 3.5 support
- VPE 6.1 support
- NBIO 7.11 support
- DML2 support
- lots of IP updates
- use flexible arrays for bo list handling
- W=1 fixes
- Enable seamless boot in more cases
- Enable context type property for HDMI
- Rework GPUVM TLB flushing
- VCN IB start/size alignment fixes
amdkfd:
- GC 10/11 fixes
- GC 11.5 support
- use partial migration in GPU faults
radeon:
- W=1 Fixes
- fix some possible buffer overflow/NULL derefs
nouveau:
- update uapi for NO_PREFETCH
- scheduler/fence fixes
- rework suspend/resume for GSP-RM
- rework display in preparation for GSP-RM
habanalabs:
- uapi: expose tsc clock
- uapi: block access to eventfd through control device
- uapi: force dma-buf export to PAGE_SIZE alignments
- complete move to accel subsystem
- move firmware interface include files
- perform hard reset on PCIe AXI drain event
- optimise user interrupt handling
msm:
- DP: use existing helpers for DPCD
- DPU: interrupts reworked
- gpu: a7xx (a730/a740) support
- decouple msm_drv from kms for headless devices
mediatek:
- MT8188 dsi/dp/edp support
- DDP GAMMA - 12 bit LUT support
- connector dynamic selection capability
rockchip:
- rv1126 mipi-dsi/vop support
- add planar formats
ast:
- rename constants
panels:
- Mitsubishi AA084XE01
- JDI LPM102A188A
- LTK050H3148W-CTA6
ivpu:
- power management fixes
qaic:
- add detach slice bo api
komeda:
- add NV12 writeback
tegra:
- support NVSYNC/NHSYNC
- host1x suspend fixes
ili9882t:
- separate into own driver
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmVAgzYACgkQDHTzWXnE
hr7ZEQ//UXne3tyGOsU3X8r+lstLFDMa90a3hvTg6hX+Q0MjHd/clwkKFkLpkipL
n7gIZlaHl11dRs0FzrIZA5EVAAgjMLKmIl10NBDFec6ZFA3VERcggx8y61uifI15
VviMR1VbLHYZaCdyrQOK0A4wcktWnKXyoXp7cwy9crdc2GOBMUZkdIqtvD7jHxQx
UMIFnzi1CyKUX/Fjt/JceYcNk9y2ZGkzakYO3sHcUdv4DPu9qX4kNzpjF691AZBP
UeKWvCswTRVg2M0kuo/RYIBzqaTmOlk6dHLWBognIeZPyuyhCcaGC2d64c6tShwQ
dtHdi+IgyQ8s2qb350ymKTQUP7xA/DfZBwH7LvrZALBxeQGYQN1CnsgDMOS2wcUc
XrRFiS7PxEOtMMBctcPBnnoV5ttnsLLlPpzM9puh9sUFMn6CgLzcAMqXdqxzMajH
+dz2aD1N0vMqq4varozOg9SC2QamgUiPN/TQfrulhCTCfQaXczy5x1OYiIz65+Sl
mKoe2WASuP9Ve8do4N/wEwH5SZY2ItipBdUTRxttY9NTanmV0X5DjZBXH5b9XGci
Zl5Ar613f9zwm5T5BVA5k6s3ZbGY6QcP5pDNTCPaSgitfFXIdReBZ2CaYzK3MPg/
Wit/TXrud9yT6VPpI1igboMyasf5QubV1MY1K83kOCWr9u8R2CM=
=l79u
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"Highlights:
- AMD adds some more upcoming HW platforms
- Intel made Meteorlake stable and started adding Lunarlake
- nouveau has a bunch of display rework in prepartion for the NVIDIA
GSP firmware support
- msm adds a7xx support
- habanalabs has finished migration to accel subsystem
Detail summary:
kernel:
- add initial vmemdup-user-array
core:
- fix platform remove() to return void
- drm_file owner updated to reflect owner
- move size calcs to drm buddy allocator
- let GPUVM build as a module
- allow variable number of run-queues in scheduler
edid:
- handle bad h/v sync_end in EDIDs
panfrost:
- add Boris as maintainer
fbdev:
- use fb_ops helpers more
- only allow logo use from fbcon
- rename fb_pgproto to pgprot_framebuffer
- add HPD state to drm_connector_oob_hotplug_event
- convert to fbdev i/o mem helpers
i915:
- Enable meteorlake by default
- Early Xe2 LPD/Lunarlake display enablement
- Rework subplatforms into IP version checks
- GuC based TLB invalidation for Meteorlake
- Display rework for future Xe driver integration
- LNL FBC features
- LNL display feature capability reads
- update recommended fw versions for DG2+
- drop fastboot module parameter
- added deviceid for Arrowlake-S
- drop preproduction workarounds
- don't disable preemption for resets
- cleanup inlines in headers
- PXP firmware loading fix
- Fix sg list lengths
- DSC PPS state readout/verification
- Add more RPL P/U PCI IDs
- Add new DG2-G12 stepping
- DP enhanced framing support to state checker
- Improve shared link bandwidth management
- stop using GEM macros in display code
- refactor related code into display code
- locally enable W=1 warnings
- remove PSR watchdog timers on LNL
amdgpu:
- RAS/FRU EEPROM updatse
- IP discovery updatses
- GC 11.5 support
- DCN 3.5 support
- VPE 6.1 support
- NBIO 7.11 support
- DML2 support
- lots of IP updates
- use flexible arrays for bo list handling
- W=1 fixes
- Enable seamless boot in more cases
- Enable context type property for HDMI
- Rework GPUVM TLB flushing
- VCN IB start/size alignment fixes
amdkfd:
- GC 10/11 fixes
- GC 11.5 support
- use partial migration in GPU faults
radeon:
- W=1 Fixes
- fix some possible buffer overflow/NULL derefs
nouveau:
- update uapi for NO_PREFETCH
- scheduler/fence fixes
- rework suspend/resume for GSP-RM
- rework display in preparation for GSP-RM
habanalabs:
- uapi: expose tsc clock
- uapi: block access to eventfd through control device
- uapi: force dma-buf export to PAGE_SIZE alignments
- complete move to accel subsystem
- move firmware interface include files
- perform hard reset on PCIe AXI drain event
- optimise user interrupt handling
msm:
- DP: use existing helpers for DPCD
- DPU: interrupts reworked
- gpu: a7xx (a730/a740) support
- decouple msm_drv from kms for headless devices
mediatek:
- MT8188 dsi/dp/edp support
- DDP GAMMA - 12 bit LUT support
- connector dynamic selection capability
rockchip:
- rv1126 mipi-dsi/vop support
- add planar formats
ast:
- rename constants
panels:
- Mitsubishi AA084XE01
- JDI LPM102A188A
- LTK050H3148W-CTA6
ivpu:
- power management fixes
qaic:
- add detach slice bo api
komeda:
- add NV12 writeback
tegra:
- support NVSYNC/NHSYNC
- host1x suspend fixes
ili9882t:
- separate into own driver"
* tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm: (1803 commits)
drm/amdgpu: Remove unused variables from amdgpu_show_fdinfo
drm/amdgpu: Remove duplicate fdinfo fields
drm/amd/amdgpu: avoid to disable gfxhub interrupt when driver is unloaded
drm/amdgpu: Add EXT_COHERENT support for APU and NUMA systems
drm/amdgpu: Retrieve CE count from ce_count_lo_chip in EccInfo table
drm/amdgpu: Identify data parity error corrected in replay mode
drm/amdgpu: Fix typo in IP discovery parsing
drm/amd/display: fix S/G display enablement
drm/amdxcp: fix amdxcp unloads incompletely
drm/amd/amdgpu: fix the GPU power print error in pm info
drm/amdgpu: Use pcie domain of xcc acpi objects
drm/amd: check num of link levels when update pcie param
drm/amdgpu: Add a read to GFX v9.4.3 ring test
drm/amd/pm: call smu_cmn_get_smc_version in is_mode1_reset_supported.
drm/amdgpu: get RAS poison status from DF v4_6_2
drm/amdgpu: Use discovery table's subrevision
drm/amd/display: 3.2.256
drm/amd/display: add interface to query SubVP status
drm/amd/display: Read before writing Backlight Mode Set Register
drm/amd/display: Disable SYMCLK32_SE RCO on DCN314
...
- Add support for several Qualcomm SoC versions and other similar
changes (Christian Marangi, Dmitry Baryshkov, Luca Weiss, Neil
Armstrong, Richard Acayan, Robert Marko, Rohit Agarwal, Stephan
Gerhold and Varadarajan Narayanan).
- Clean up the tegra cpufreq driver (Sumit Gupta).
- Use of_property_read_reg() to parse "reg" in pmac32 driver (Rob
Herring).
- Add support for TI's am62p5 Soc (Bryan Brattlof).
- Make ARM_BRCMSTB_AVS_CPUFREQ depends on !ARM_SCMI_CPUFREQ (Florian
Fainelli).
- Update Kconfig to mention i.MX7 as well (Alexander Stein).
- Revise global turbo disable check in intel_pstate (Srinivas
Pandruvada).
- Carry out initialization of sg_cpu in the schedutil cpufreq governor
in one loop (Liao Chang).
- Simplify the condition for storing 'down_threshold' in the
conservative cpufreq governor (Liao Chang).
- Use fine-grained mutex in the userspace cpufreq governor (Liao
Chang).
- Move is_managed indicator in the userspace cpufreq governor into a
per-policy structure (Liao Chang).
- Rebuild sched-domains when removing cpufreq driver (Pierre Gondois).
- Fix buffer overflow detection in trans_stats() (Christian Marangi).
- Switch to dev_pm_opp_find_freq_(ceil/floor)_indexed() APIs to support
specific devices like UFS which handle multiple clocks through OPP
(Operating Performance Point) framework (Manivannan Sadhasivam).
- Add perf support to the Rockchip DFI (DDR Monitor Module) devfreq-
event driver:
* Generalize rockchip-dfi.c to support new RK3568/RK3588 using
different DDR type (Sascha Hauer).
* Convert DT binding document format to yaml (Sascha Hauer).
* Add perf support for DFI (a unit suitable for measuring DDR
utilization) to rockchip-dfi.c to extend DFI usage (Sascha Hauer).
- Add locking to the OPP handling code in the Mediatek CCI devfreq
driver, because the voltage of shared OPP might be changed by
multiple drivers (Mark Tseng, Dan Carpenter).
- Use device_get_match_data() in the Samsung Exynos PPMU devfreq-event
driver (Rob Herring).
- Extend support for the opp-level beyond required-opps (Ulf Hansson).
- Add dev_pm_opp_find_level_floor() (Krishna chaitanya chundru).
- dt-bindings: Allow opp-peak-kBpsfor kryo CPUs, support Qualcomm Krait
SoCs and document named opp-microvolt property (Bjorn Andersson,
Dmitry Baryshkov and Christian Marangi).
- Fix -Wunsequenced warning _of_add_opp_table_v1() (Nathan Chancellor).
- General cleanup of OPP code (Viresh Kumar).
- Use __get_safe_page() rather than touching the list in hibernation
snapshot code (Brian Geffon).
- Fix symbol export for _SIMPLE_ variants of _PM_OPS() (Raag Jadav).
- Clean up sync_read handling in snapshot_write_next() (Brian Geffon).
- Fix kerneldoc comments for swsusp_check() and swsusp_close() to
better match code (Christoph Hellwig).
- Downgrade BIOS locked limits pr_warn() in the Intel RAPL power
capping driver to pr_debug() (Ville Syrjälä).
- Change the minimum python version for the intel_pstate_tracer utility
from 2.7 to 3.6 (Doug Smythies).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmU6bqYSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxzLcP/Avv9PgVRVqZlJ1Rs2fqIcyOY+j5qrvx
xRiO3TBwdAzRy49ItnQY4W/CHk/skGY4vFhiluZE+OTUlx1fmPKeQLFpel+1+PvW
vLezQ9v18sH7d2Kd6gJO5k9xsyu5ZMHEkwiejA/tmS2vTs5ne4wB7ONTJObYx5iB
9Hrg6jLnk7MmolQqvQB6vmpej1eeWmuu7AXlg2OsXqYsCEnhS5iGBq86E35LvlKA
Pnef/B2ZP9RaFg2dVapSZwubn0FkUtd29ifhtGC7Fw5LM8WCRc/KHAWZwMe4dcMf
38uKux28xIEalGZm9zMhKO8gHGdfF/v1C46/hBvgjavwVJF3AUNXnsfc+v5SerDp
tXx1xghGyM/blbHUdTfzZc4l5TyqsjhkBMSCMEQcj9QYjsCY0pTZmwLz8F0BAv4D
0FukGf5jK987RBGvaHY90UCE+NvokOyJDckuSHQffrAZWghnhSgbZxMD5oiIjRYR
BioM5wQsL+wOxWdUGAOVhK6wKj32kf2XjBqWdEBk70qcpbvEmc0N8t1BSd+TzzoK
qM2hnyo+yxvv98wi/cglcJeZ1mbL+s1agTh7jFTkC23ap/GrZEw0EB5xdj4NbzOk
hO1OXas8J1LA1GFwL0WoLDyY0gvGDYFWkh0yeu0SUgxTVwKapyG03OMPQATN5M/y
cp+PK3ibS8Mb
=h8my
-----END PGP SIGNATURE-----
Merge tag 'pm-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add new hardware support (new Qualcomm SoC versions in cpufreq,
RK3568/RK3588 in devfreq), extend the OPP (operating performance
points) framework, improve cpufreq governors, fix issues and clean up
code (most of the changes are in cpufreq and devfreq).
Specifics:
- Add support for several Qualcomm SoC versions and other similar
changes (Christian Marangi, Dmitry Baryshkov, Luca Weiss, Neil
Armstrong, Richard Acayan, Robert Marko, Rohit Agarwal, Stephan
Gerhold and Varadarajan Narayanan)
- Clean up the tegra cpufreq driver (Sumit Gupta)
- Use of_property_read_reg() to parse "reg" in pmac32 driver (Rob
Herring)
- Add support for TI's am62p5 Soc (Bryan Brattlof)
- Make ARM_BRCMSTB_AVS_CPUFREQ depends on !ARM_SCMI_CPUFREQ (Florian
Fainelli)
- Update Kconfig to mention i.MX7 as well (Alexander Stein)
- Revise global turbo disable check in intel_pstate (Srinivas
Pandruvada)
- Carry out initialization of sg_cpu in the schedutil cpufreq
governor in one loop (Liao Chang)
- Simplify the condition for storing 'down_threshold' in the
conservative cpufreq governor (Liao Chang)
- Use fine-grained mutex in the userspace cpufreq governor (Liao
Chang)
- Move is_managed indicator in the userspace cpufreq governor into a
per-policy structure (Liao Chang)
- Rebuild sched-domains when removing cpufreq driver (Pierre Gondois)
- Fix buffer overflow detection in trans_stats() (Christian Marangi)
- Switch to dev_pm_opp_find_freq_(ceil/floor)_indexed() APIs to
support specific devices like UFS which handle multiple clocks
through OPP (Operating Performance Point) framework (Manivannan
Sadhasivam)
- Add perf support to the Rockchip DFI (DDR Monitor Module) devfreq-
event driver:
* Generalize rockchip-dfi.c to support new RK3568/RK3588 using
different DDR type (Sascha Hauer).
* Convert DT binding document format to yaml (Sascha Hauer).
* Add perf support for DFI (a unit suitable for measuring DDR
utilization) to rockchip-dfi.c to extend DFI usage (Sascha
Hauer)
- Add locking to the OPP handling code in the Mediatek CCI devfreq
driver, because the voltage of shared OPP might be changed by
multiple drivers (Mark Tseng, Dan Carpenter)
- Use device_get_match_data() in the Samsung Exynos PPMU
devfreq-event driver (Rob Herring)
- Extend support for the opp-level beyond required-opps (Ulf Hansson)
- Add dev_pm_opp_find_level_floor() (Krishna chaitanya chundru)
- dt-bindings: Allow opp-peak-kBpsfor kryo CPUs, support Qualcomm
Krait SoCs and document named opp-microvolt property (Bjorn
Andersson, Dmitry Baryshkov and Christian Marangi)
- Fix -Wunsequenced warning _of_add_opp_table_v1() (Nathan
Chancellor)
- General cleanup of OPP code (Viresh Kumar)
- Use __get_safe_page() rather than touching the list in hibernation
snapshot code (Brian Geffon)
- Fix symbol export for _SIMPLE_ variants of _PM_OPS() (Raag Jadav)
- Clean up sync_read handling in snapshot_write_next() (Brian Geffon)
- Fix kerneldoc comments for swsusp_check() and swsusp_close() to
better match code (Christoph Hellwig)
- Downgrade BIOS locked limits pr_warn() in the Intel RAPL power
capping driver to pr_debug() (Ville Syrjälä)
- Change the minimum python version for the intel_pstate_tracer
utility from 2.7 to 3.6 (Doug Smythies)"
* tag 'pm-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (82 commits)
dt-bindings: cpufreq: qcom-hw: document SM8650 CPUFREQ Hardware
cpufreq: arm: Kconfig: Add i.MX7 to supported SoC for ARM_IMX_CPUFREQ_DT
cpufreq: qcom-nvmem: add support for IPQ8064
cpufreq: qcom-nvmem: also accept operating-points-v2-krait-cpu
cpufreq: qcom-nvmem: drop pvs_ver for format a fuses
dt-bindings: cpufreq: qcom-cpufreq-nvmem: Document krait-cpu
cpufreq: qcom-nvmem: add support for IPQ6018
dt-bindings: cpufreq: qcom-cpufreq-nvmem: document IPQ6018
cpufreq: qcom-nvmem: Add MSM8909
cpufreq: qcom-nvmem: Simplify driver data allocation
powercap: intel_rapl: Downgrade BIOS locked limits pr_warn() to pr_debug()
cpufreq: stats: Fix buffer overflow detection in trans_stats()
dt-bindings: devfreq: event: rockchip,dfi: Add rk3588 support
dt-bindings: devfreq: event: rockchip,dfi: Add rk3568 support
dt-bindings: devfreq: event: convert Rockchip DFI binding to yaml
PM / devfreq: rockchip-dfi: add support for RK3588
PM / devfreq: rockchip-dfi: account for multiple DDRMON_CTRL registers
PM / devfreq: rockchip-dfi: make register stride SoC specific
PM / devfreq: rockchip-dfi: Add perf support
PM / devfreq: rockchip-dfi: give variable a better name
...
Core & protocols
----------------
- Support usec resolution of TCP timestamps, enabled selectively by
a route attribute.
- Defer regular TCP ACK while processing socket backlog, try to send
a cumulative ACK at the end. Increase single TCP flow performance
on a 200Gbit NIC by 20% (100Gbit -> 120Gbit).
- The Fair Queuing (FQ) packet scheduler:
- add built-in 3 band prio / WRR scheduling
- support bypass if the qdisc is mostly idle (5% speed up for TCP RR)
- improve inactive flow reporting
- optimize the layout of structures for better cache locality
- Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern
replacement for the old MD5 option.
- Add more retransmission timeout (RTO) related statistics to TCP_INFO.
- Support sending fragmented skbs over vsock sockets.
- Make sure we send SIGPIPE for vsock sockets if socket was shutdown().
- Add sysctl for ignoring lower limit on lifetime in Router
Advertisement PIO, based on an in-progress IETF draft.
- Add sysctl to control activation of TCP ping-pong mode.
- Add sysctl to make connection timeout in MPTCP configurable.
- Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps
limit the number of wakeups.
- Support netlink GET for MDB (multicast forwarding), allowing user
space to request a single MDB entry instead of dumping the entire
table.
- Support selective FDB flushing in the VXLAN tunnel driver.
- Allow limiting learned FDB entries in bridges, prevent OOM attacks.
- Allow controlling via configfs netconsole targets which were created
via the kernel cmdline at boot, rather than via configfs at runtime.
- Support multiple PTP timestamp event queue readers with different
filters.
- MCTP over I3C.
BPF
---
- Add new veth-like netdevice where BPF program defines the logic
of the xmit routine. It can operate in L3 and L2 mode.
- Support exceptions - allow asserting conditions which should
never be true but are hard for the verifier to infer.
With some extra flexibility around handling of the exit / failure.
https://lwn.net/Articles/938435/
- Add support for local per-cpu kptr, allow allocating and storing
per-cpu objects in maps. Access to those objects operates on
the value for the current CPU. This allows to deprecate local
one-off implementations of per-CPU storage like
BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps.
- Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is
for systemd to re-implement the LogNamespace feature which allows
running multiple instances of systemd-journald to process the logs
of different services.
- Enable open-coded task_vma iteration, after maple tree conversion
made it hard to directly walk VMAs in tracing programs.
- Add open-coded task, css_task and css iterator support.
One of the use cases is customizable OOM victim selection via BPF.
- Allow source address selection with bpf_*_fib_lookup().
- Add ability to pin BPF timer to the current CPU.
- Prevent creation of infinite loops by combining tail calls and
fentry/fexit programs.
- Add missed stats for kprobes to retrieve the number of missed kprobe
executions and subsequent executions of BPF programs.
- Inherit system settings for CPU security mitigations.
- Add BPF v4 CPU instruction support for arm32 and s390x.
Changes to common code
----------------------
- overflow: add DEFINE_FLEX() for on-stack definition of structs
with flexible array members.
- Process doc update with more guidance for reviewers.
Driver API
----------
- Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy
mutex in most places and remove a lot of smaller locks.
- Create a common DPLL configuration API. Allow configuring
and querying state of PLL circuits used for clock syntonization,
in network time distribution.
- Unify fragmented and full page allocation APIs in page pool code.
Let drivers be ignorant of PAGE_SIZE.
- Rework PHY state machine to avoid races with calls to phy_stop().
- Notify DSA drivers of MAC address changes on user ports, improve
correctness of offloads which depend on matching port MAC addresses.
- Allow antenna control on injected WiFi frames.
- Reduce the number of variants of napi_schedule().
- Simplify error handling when composing devlink health messages.
Misc
----
- A lot of KCSAN data race "fixes", from Eric.
- A lot of __counted_by() annotations, from Kees.
- A lot of strncpy -> strscpy and printf format fixes.
- Replace master/slave terminology with conduit/user in DSA drivers.
- Handful of KUnit tests for netdev and WiFi core.
Removed
-------
- AppleTalk COPS.
- AppleTalk ipddp.
- TI AR7 CPMAC Ethernet driver.
Drivers
-------
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- add a driver for the Intel E2000 IPUs
- make CRC/FCS stripping configurable
- cross-timestamping for E823 devices
- basic support for E830 devices
- use aux-bus for managing client drivers
- i40e: report firmware versions via devlink
- nVidia/Mellanox:
- support 4-port NICs
- increase max number of channels to 256
- optimize / parallelize SF creation flow
- Broadcom (bnxt):
- enhance NIC temperature reporting
- support PAM4 speeds and lane configuration
- Marvell OcteonTX2:
- PTP pulse-per-second output support
- enable hardware timestamping for VFs
- Solarflare/AMD:
- conntrack NAT offload and offload for tunnels
- Wangxun (ngbe/txgbe):
- expose HW statistics
- Pensando/AMD:
- support PCI level reset
- narrow down the condition under which skbs are linearized
- Netronome/Corigine (nfp):
- support CHACHA20-POLY1305 crypto in IPsec offload
- Ethernet NICs embedded, slower, virtual:
- Synopsys (stmmac):
- add Loongson-1 SoC support
- enable use of HW queues with no offload capabilities
- enable PPS input support on all 5 channels
- increase TX coalesce timer to 5ms
- RealTek USB (r8152): improve efficiency of Rx by using GRO frags
- xen: support SW packet timestamping
- add drivers for implementations based on TI's PRUSS (AM64x EVM)
- nVidia/Mellanox Ethernet datacenter switches:
- avoid poor HW resource use on Spectrum-4 by better block selection
for IPv6 multicast forwarding and ordering of blocks in ACL region
- Ethernet embedded switches:
- Microchip:
- support configuring the drive strength for EMI compliance
- ksz9477: partial ACL support
- ksz9477: HSR offload
- ksz9477: Wake on LAN
- Realtek:
- rtl8366rb: respect device tree config of the CPU port
- Ethernet PHYs:
- support Broadcom BCM5221 PHYs
- TI dp83867: support hardware LED blinking
- CAN:
- add support for Linux-PHY based CAN transceivers
- at91_can: clean up and use rx-offload helpers
- WiFi:
- MediaTek (mt76):
- new sub-driver for mt7925 USB/PCIe devices
- HW wireless <> Ethernet bridging in MT7988 chips
- mt7603/mt7628 stability improvements
- Qualcomm (ath12k):
- WCN7850:
- enable 320 MHz channels in 6 GHz band
- hardware rfkill support
- enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS
to make scan faster
- read board data variant name from SMBIOS
- QCN9274: mesh support
- RealTek (rtw89):
- TDMA-based multi-channel concurrency (MCC)
- Silicon Labs (wfx):
- Remain-On-Channel (ROC) support
- Bluetooth:
- ISO: many improvements for broadcast support
- mark BCM4378/BCM4387 as BROKEN_LE_CODED
- add support for QCA2066
- btmtksdio: enable Bluetooth wakeup from suspend
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmU8XsYACgkQMUZtbf5S
Irv19RAAnud/24OOF5XMEJkIcYlnfqximh4XO6PujRSYkSkOUJdZTF6iJPgf3pSP
YpwoHYbYKHYfeOf8+3bTNESiQNSnoVmvmvwiS6/7lZ3behHUrGLQzW9Htc3EZyWH
2h6QkDZ5OOjfg0bwYSfp3vXkmMH2k8WE9Y0NvCkhcohqZi13Rmp14RnyPmNb2d1V
yZRYDMSM133KqE6gnBr1Ct65IEvnKeGlCUN2mTGqOJgdn6DZMsyxvtt0y4rmN7Ab
41+CgPU5SfxfbYpW+Dl2HJpgfte3WrC57KC6AM0PAPJzPmQWgeB/m9mjz/apj6Bg
bhsEIo7FdvbCnQm3yWPhK2OgCAcSwLr8jfGMU+Q+W4VnL5SRRR3Rm0zjsze+kHNP
OfqJgxzl3DpvoJqVBy1h5FGcZt0XHwhksm4cTxWqIahsF+veY0ECBXbuBBQx9XTF
Y7INfI8ulg7wISJs+CJfIClYkgOibTw2u8taBS5ikbtgxNqp5D4QqODn7UefQap1
PR/IDYODF+zRgmMJLeBqSa6fij6BkfOEDiOWak5kggBoZdtbtmeKI6tzze06CNdW
lWv1WEhRufxnwK+IuWsEkjhiMbs2WGLvkJ5JbgQV9BfqHfIfiqBCrcWtT/WbQnGt
lmU46CXh1t/FZEqbmK9h+8vsIIfrcDl6jb5npEiKPRG00vDKRTM=
=46nS
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Support usec resolution of TCP timestamps, enabled selectively by a
route attribute.
- Defer regular TCP ACK while processing socket backlog, try to send
a cumulative ACK at the end. Increase single TCP flow performance
on a 200Gbit NIC by 20% (100Gbit -> 120Gbit).
- The Fair Queuing (FQ) packet scheduler:
- add built-in 3 band prio / WRR scheduling
- support bypass if the qdisc is mostly idle (5% speed up for TCP RR)
- improve inactive flow reporting
- optimize the layout of structures for better cache locality
- Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern
replacement for the old MD5 option.
- Add more retransmission timeout (RTO) related statistics to
TCP_INFO.
- Support sending fragmented skbs over vsock sockets.
- Make sure we send SIGPIPE for vsock sockets if socket was
shutdown().
- Add sysctl for ignoring lower limit on lifetime in Router
Advertisement PIO, based on an in-progress IETF draft.
- Add sysctl to control activation of TCP ping-pong mode.
- Add sysctl to make connection timeout in MPTCP configurable.
- Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps
limit the number of wakeups.
- Support netlink GET for MDB (multicast forwarding), allowing user
space to request a single MDB entry instead of dumping the entire
table.
- Support selective FDB flushing in the VXLAN tunnel driver.
- Allow limiting learned FDB entries in bridges, prevent OOM attacks.
- Allow controlling via configfs netconsole targets which were
created via the kernel cmdline at boot, rather than via configfs at
runtime.
- Support multiple PTP timestamp event queue readers with different
filters.
- MCTP over I3C.
BPF:
- Add new veth-like netdevice where BPF program defines the logic of
the xmit routine. It can operate in L3 and L2 mode.
- Support exceptions - allow asserting conditions which should never
be true but are hard for the verifier to infer. With some extra
flexibility around handling of the exit / failure:
https://lwn.net/Articles/938435/
- Add support for local per-cpu kptr, allow allocating and storing
per-cpu objects in maps. Access to those objects operates on the
value for the current CPU.
This allows to deprecate local one-off implementations of per-CPU
storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps.
- Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is
for systemd to re-implement the LogNamespace feature which allows
running multiple instances of systemd-journald to process the logs
of different services.
- Enable open-coded task_vma iteration, after maple tree conversion
made it hard to directly walk VMAs in tracing programs.
- Add open-coded task, css_task and css iterator support. One of the
use cases is customizable OOM victim selection via BPF.
- Allow source address selection with bpf_*_fib_lookup().
- Add ability to pin BPF timer to the current CPU.
- Prevent creation of infinite loops by combining tail calls and
fentry/fexit programs.
- Add missed stats for kprobes to retrieve the number of missed
kprobe executions and subsequent executions of BPF programs.
- Inherit system settings for CPU security mitigations.
- Add BPF v4 CPU instruction support for arm32 and s390x.
Changes to common code:
- overflow: add DEFINE_FLEX() for on-stack definition of structs with
flexible array members.
- Process doc update with more guidance for reviewers.
Driver API:
- Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy
mutex in most places and remove a lot of smaller locks.
- Create a common DPLL configuration API. Allow configuring and
querying state of PLL circuits used for clock syntonization, in
network time distribution.
- Unify fragmented and full page allocation APIs in page pool code.
Let drivers be ignorant of PAGE_SIZE.
- Rework PHY state machine to avoid races with calls to phy_stop().
- Notify DSA drivers of MAC address changes on user ports, improve
correctness of offloads which depend on matching port MAC
addresses.
- Allow antenna control on injected WiFi frames.
- Reduce the number of variants of napi_schedule().
- Simplify error handling when composing devlink health messages.
Misc:
- A lot of KCSAN data race "fixes", from Eric.
- A lot of __counted_by() annotations, from Kees.
- A lot of strncpy -> strscpy and printf format fixes.
- Replace master/slave terminology with conduit/user in DSA drivers.
- Handful of KUnit tests for netdev and WiFi core.
Removed:
- AppleTalk COPS.
- AppleTalk ipddp.
- TI AR7 CPMAC Ethernet driver.
Drivers:
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- add a driver for the Intel E2000 IPUs
- make CRC/FCS stripping configurable
- cross-timestamping for E823 devices
- basic support for E830 devices
- use aux-bus for managing client drivers
- i40e: report firmware versions via devlink
- nVidia/Mellanox:
- support 4-port NICs
- increase max number of channels to 256
- optimize / parallelize SF creation flow
- Broadcom (bnxt):
- enhance NIC temperature reporting
- support PAM4 speeds and lane configuration
- Marvell OcteonTX2:
- PTP pulse-per-second output support
- enable hardware timestamping for VFs
- Solarflare/AMD:
- conntrack NAT offload and offload for tunnels
- Wangxun (ngbe/txgbe):
- expose HW statistics
- Pensando/AMD:
- support PCI level reset
- narrow down the condition under which skbs are linearized
- Netronome/Corigine (nfp):
- support CHACHA20-POLY1305 crypto in IPsec offload
- Ethernet NICs embedded, slower, virtual:
- Synopsys (stmmac):
- add Loongson-1 SoC support
- enable use of HW queues with no offload capabilities
- enable PPS input support on all 5 channels
- increase TX coalesce timer to 5ms
- RealTek USB (r8152): improve efficiency of Rx by using GRO frags
- xen: support SW packet timestamping
- add drivers for implementations based on TI's PRUSS (AM64x EVM)
- nVidia/Mellanox Ethernet datacenter switches:
- avoid poor HW resource use on Spectrum-4 by better block
selection for IPv6 multicast forwarding and ordering of blocks
in ACL region
- Ethernet embedded switches:
- Microchip:
- support configuring the drive strength for EMI compliance
- ksz9477: partial ACL support
- ksz9477: HSR offload
- ksz9477: Wake on LAN
- Realtek:
- rtl8366rb: respect device tree config of the CPU port
- Ethernet PHYs:
- support Broadcom BCM5221 PHYs
- TI dp83867: support hardware LED blinking
- CAN:
- add support for Linux-PHY based CAN transceivers
- at91_can: clean up and use rx-offload helpers
- WiFi:
- MediaTek (mt76):
- new sub-driver for mt7925 USB/PCIe devices
- HW wireless <> Ethernet bridging in MT7988 chips
- mt7603/mt7628 stability improvements
- Qualcomm (ath12k):
- WCN7850:
- enable 320 MHz channels in 6 GHz band
- hardware rfkill support
- enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to
make scan faster
- read board data variant name from SMBIOS
- QCN9274: mesh support
- RealTek (rtw89):
- TDMA-based multi-channel concurrency (MCC)
- Silicon Labs (wfx):
- Remain-On-Channel (ROC) support
- Bluetooth:
- ISO: many improvements for broadcast support
- mark BCM4378/BCM4387 as BROKEN_LE_CODED
- add support for QCA2066
- btmtksdio: enable Bluetooth wakeup from suspend"
* tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1816 commits)
net: pcs: xpcs: Add 2500BASE-X case in get state for XPCS drivers
net: bpf: Use sockopt_lock_sock() in ip_sock_set_tos()
net: mana: Use xdp_set_features_flag instead of direct assignment
vxlan: Cleanup IFLA_VXLAN_PORT_RANGE entry in vxlan_get_size()
iavf: delete the iavf client interface
iavf: add a common function for undoing the interrupt scheme
iavf: use unregister_netdev
iavf: rely on netdev's own registered state
iavf: fix the waiting time for initial reset
iavf: in iavf_down, don't queue watchdog_task if comms failed
iavf: simplify mutex_trylock+sleep loops
iavf: fix comments about old bit locks
doc/netlink: Update schema to support cmd-cnt-name and cmd-max-name
tools: ynl: introduce option to process unknown attributes or types
ipvlan: properly track tx_errors
netdevsim: Block until all devices are released
nfp: using napi_build_skb() to replace build_skb()
net: dsa: microchip: ksz9477: Fix spelling mistake "Enery" -> "Energy"
net: dsa: microchip: Ensure Stable PME Pin State for Wake-on-LAN
net: dsa: microchip: Refactor switch shutdown routine for WoL preparation
...
* cpuset now supports remote partitions where CPUs can be reserved for
exclusive use down the tree without requiring all the intermediate nodes
to be partitions. This makes it easier to use partitions without modifying
existing cgroup hierarchy.
* cpuset partition configuration behavior improvement.
* cgroup_favordynmods= boot param added to allow setting the flag on boot on
cgroup1.
* Misc code and doc updates.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZUBUKA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGWfMAP9WP+Z21qzzL2bY5I5kOxu+rD2fF9ORk7azILrI
c3gXSQD/bdxNWcdtrCQMvs+ToNEJvqDFxrmgG5uRUF8Ea+FCtQ8=
=gZj2
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cpuset now supports remote partitions where CPUs can be reserved for
exclusive use down the tree without requiring all the intermediate
nodes to be partitions. This makes it easier to use partitions
without modifying existing cgroup hierarchy.
- cpuset partition configuration behavior improvement
- cgroup_favordynmods= boot param added to allow setting the flag on
boot on cgroup1
- Misc code and doc updates
* tag 'cgroup-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs/cgroup: Add the list of threaded controllers to cgroup-v2.rst
cgroup: use legacy_name for cgroup v1 disable info
cgroup/cpuset: Cleanup signedness issue in cpu_exclusive_check()
cgroup/cpuset: Enable invalid to valid local partition transition
cgroup: add cgroup_favordynmods= command-line option
cgroup/cpuset: Extend test_cpuset_prs.sh to test remote partition
cgroup/cpuset: Documentation update for partition
cgroup/cpuset: Check partition conflict with housekeeping setup
cgroup/cpuset: Introduce remote partition
cgroup/cpuset: Add cpuset.cpus.exclusive for v2
cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2
cgroup/cpuset: Fix load balance state in update_partition_sd_lb()
cgroup: Avoid extra dereference in css_populate_dir()
cgroup: Check for ret during cgroup1_base_files cft addition
Just one commit to improve lockdep annotation for work_on_cpu() to avoid
spurious warnings. I'll send another pull request for workqueue rust
binding.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZUBJqg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGb0pAQCmql98CmUazFhn/8W2zMo+GaU9CuMf/Drchmx+
VeTAfgEAhn2RQGnhhq/G3jfShlBq/9xSA2N1d8OohLfza5ZmfQs=
=0gh4
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue update from Tejun Heo:
"Just one commit to improve lockdep annotation for work_on_cpu() to
avoid spurious warnings"
* tag 'wq-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Provide one lock class key per work_on_cpu() callsite
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmVAJpsUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOvghAAzzIu0KzvcDci1AkEvS/6MD0ChS/K
d6cJPnf65bJ98M8PSIJz5bik2t6TYGDJZqAXTqyb6Bb7XM0ITVY5vPgLKWy1VXCr
xHBQOSFzYWMV+cqo+17l55iRKPFQb+2rNHfIqKfwxrAWhLL0zeZR4M7AQQVMrT5Q
WZXleW5xwvMA9My7ny+n3jTqxaiDIZTYyi0noF89j7NNfteehaKAkwZ6Phkdx5cM
UeSTkv9fO2/O80UcBs63h2rAhBM7BAvSDV8PicDN6VyWWdE3njQUBkWK+aGt4Ab+
KW2S1su0cPc636cmRyFqUvAdHNaJb2Uh+0OWRYWPd0fhAxI9kzYVQaMvl84ngwQZ
g2z8VHaCE8pO4zCuBVmxOBhl0pVoQfSXOG1wXPvEw4Me5hEudda3Fa3SjCeD1u2j
vEi/w2yl7Rb0oMg7MlpCZLSnH1fhwPxmnn+SJD1C0lCuINsB04sTuO0YbwJOcoUc
30bbzZt4oNJmTeNfuDiYDeQQExdIaOhACEnJ0c4TiMU26FfoAWnQVnz2KSqWvy+g
EKcQvqQD0Jl1e6adqyNDA7cexPkcC26KkUBJzK8g7pI6YqOG/Ev3dwygcM6Nesi8
iSxfnjixU84hHRq+u/7wQ72ga5oRw0qwL6iXLyFlV/O6vuvlFRtGSofluXkZ2LNL
m82igNveVpd9/Mc=
=a9QY
-----END PGP SIGNATURE-----
Merge tag 'lsm-pr-20231030' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull LSM updates from Paul Moore:
- Add new credential functions, get_cred_many() and put_cred_many() to
save some atomic_t operations for a few operations.
While not strictly LSM related, this patchset had been rotting on the
mailing lists for some time and since the LSMs do care a lot about
credentials I thought it reasonable to give this patch a home.
- Five patches to constify different LSM hook parameters.
- Fix a spelling mistake.
* tag 'lsm-pr-20231030' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
lsm: fix a spelling mistake
cred: add get_cred_many and put_cred_many
lsm: constify 'sb' parameter in security_sb_kern_mount()
lsm: constify 'bprm' parameter in security_bprm_committed_creds()
lsm: constify 'bprm' parameter in security_bprm_committing_creds()
lsm: constify 'file' parameter in security_bprm_creds_from_file()
lsm: constify 'sb' parameter in security_quotactl()
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmVAJZ4UHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNtmg/9GZcLIK5LSxqHNf++8xoRgJGSL0Ku
wdKKWalRmW7Ka54SmU8Cj2FpJSL+LZ6JyKBkFylu17+Wi/Ym8zdNgKulZGTPomLb
P0J+5rjKf8SgEgVrCyU1APFQhcvbMK7r0aJ7yu+xsV2+7t+bqCHG4YOGX906U8bs
TAMcMKe1giNUBQbp2WcdQ2XPptSsiD4SjZfF00hoC8vWLr5nEB8W+EsHNhS6lUwx
7ziOT4CRtxMF/k74jb7uVzvjR8fU73KNTUp+Kf+QJuolOZPrMph2xPmOHkys8eJR
uwqac7GcpP8M+9tb0W43yM+AgWsYS63ZoPMXX+72RP1oEeAarTu/U21MsB7aSnxT
Lir26psS02RtjUlq6SPGs+ubU2Q3Y7TlG9N+Z/V80/D7K0TYWkXoOBfUm8XPp1xa
IymvW5jiFDEvC+4aa4zH0oKVeO3FX3rsb9GkhS2h0PgFOO6FurCgrudyiUc0lxbb
TNNvksTFNCelNsv/Vi8WXFFL93VzeX10ZcoKHUCM46kW7xtbsYpZNt15Fv43rHuo
0mTZNU+iMXcarNOH/C3p9DWTXy8Z3rrFR2eBC1oXnCJax/qvK7rodUazMWMy8edj
893+n2ZYE6AxJbammPuG6bgTtv2+YfJv7m9jSY/l1U9tay2exSTR3OWnr/tBZ6S7
J2mS/mIIhPNPJ4w=
=xLC2
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20231030' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit update from Paul Moore:
"Only two audit patches for v6.7, both fairly small with a combined 11
lines of changes.
The first patch is a simple __counted_by annontation, and the second
fixes a a problem where audit could deadlock on task_lock() when an
exe filter is configured. More information is available in the commit
description and the patch is tagged for stable"
* tag 'audit-pr-20231030' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: don't take task_lock() in audit_exe_compare() code path
audit: Annotate struct audit_chunk with __counted_by
- Support non-BSS ELF segments with 0 filesz (Eric W. Biederman, Kees Cook)
- Enable namespaced binfmt_misc (Christian Brauner)
- Remove struct tag 'dynamic' from ELF UAPI (Alejandro Colomar)
- Clean up binfmt_elf_fdpic debug output (Greg Ungerer)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmU/40AWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJh1rD/9g8iQ77KvC/l/GUjt9WNCsbwMR
4Ro15U6PP9TbafxEUTgYGrwpVPcWrOTz3zrLZ/NsR5GtgKolxLry94oeUPCpFRUP
v+4cQWpcQQtkAiRw+vc4/XfUivWmZuNGiLOt2egZUP6tQhJocmRNW7XbGF1ZDrSA
ASZknaA7qVLx/hnghm+bCXjNOx6hN/Md35QBPuyAclpD3sbUJDPSODZZb9Bcz+4w
0qD3acA3Nulug9k/5j7Ed0MzV8I/WfgZQQhGMl4K7yBQv06vcrRV6Eon4D9KvJVm
bjK3zFE/zILkY1BHIUZZT3h2DjdUwHrGr82u5y6u3buj88IcNyFfSaGyYYBqn3Ux
P7Y+dD9zZXQuMbqmhWbdK8UoSYiJ9isOB02lt0oHipONR5PqRocTsA6gseMsO9cv
TwvGL279WlfZIj+2pvn0VJv/7DOCKGjZfc2AXhgPSkjICSO9mEOlVcFv1v3ZuXAn
Cb/6/BMZyNqh/UIGWdPRyDVHEdswpJVcecewnJwmrG1vmvYyfyP8U+VoRE4ItELz
fMpZskAb7SKV+McHLDauV+9eCgqaF5DIM3/zgws5iayRcGZQfengXqIajL7Ujlwf
RKlnfhtRxkfgpF8vEmQDs0y5AVsU/l48dOSrb/0Vg9oXKBdBa9ozyhr1Ok5kwkiN
LfDZDjSMyERgO/UZHQ==
=hxk8
-----END PGP SIGNATURE-----
Merge tag 'execve-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:
- Support non-BSS ELF segments with zero filesz
Eric Biederman and I refactored ELF segment loading to handle the
case where a segment has a smaller filesz than memsz. Traditionally
linkers only did this for .bss and it was always the last segment. As
a result, the kernel only handled this case when it was the last
segment. We've had two recent cases where linkers were trying to use
these kinds of segments for other reasons, and the were in the middle
of the segment list. There was no good reason for the kernel not to
support this, and the refactor actually ends up making things more
readable too.
- Enable namespaced binfmt_misc
Christian Brauner has made it possible to use binfmt_misc with mount
namespaces. This means some traditionally root-only interfaces (for
adding/removing formats) are now more exposed (but believed to be
safe).
- Remove struct tag 'dynamic' from ELF UAPI
Alejandro Colomar noticed that the ELF UAPI has been polluting the
struct namespace with an unused and overly generic tag named
"dynamic" for no discernible reason for many many years. After
double-checking various distro source repositories, it has been
removed.
- Clean up binfmt_elf_fdpic debug output (Greg Ungerer)
* tag 'execve-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt_misc: enable sandboxed mounts
binfmt_misc: cleanup on filesystem umount
binfmt_elf_fdpic: clean up debug warnings
mm: Remove unused vm_brk()
binfmt_elf: Only report padzero() errors when PROT_WRITE
binfmt_elf: Use elf_load() for library
binfmt_elf: Use elf_load() for interpreter
binfmt_elf: elf_bss no longer used by load_elf_binary()
binfmt_elf: Support segments with 0 filesz and misaligned starts
elf, uapi: Remove struct tag 'dynamic'
- Add LKDTM test for stuck CPUs (Mark Rutland)
- Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
- Refactor more 1-element arrays into flexible arrays (Gustavo A. R. Silva)
- Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem Shaikh)
- Convert group_info.usage to refcount_t (Elena Reshetova)
- Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
- Add Kconfig fragment for basic hardening options (Kees Cook, Lukas Bulwahn)
- Fix randstruct GCC plugin performance mode to stay in groups (Kees Cook)
- Fix strtomem() compile-time check for small sources (Kees Cook)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmU/3cUWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsEoEACBGPSiOmfSWdH3TOnIG270PD24
jGjg8KFv7RC/JTOdYmpLl0okdlGT9LvjN/ToSSDEw3PIayxoXUdhkbYy0MYtiV3m
yz2ozDTzJuplQX/W2fPE+nXSzIwHao2zjPPFjHnT7lt8IIjhgjiOtLfZ2gGUkW99
Mdu2aWh3u0r4tC8OS23++yN5ibRc5l72efsjDWjZ0aPXnxE1bjmLMiIPiizpndIf
beasPuDBs98sJVYouemCwnsPXuXOPz3Q1Cpo/fTd+TMTJCLSemCQZCTuOBU0acI/
ZjLCgCaJU1yIYKBMtrIN4G9kITZniXX3/Nm4o6NQMVlcCqMeNaHuflomqWoqWfhE
UPbRo2eghZOaMNiCKLLvZDIqPrh1IcsiEl6Ef3W4hICc42GTK96IuGisIvDXwQ4N
/SzTOupJuN42noh3z1M3XuZy5RoXJ99IYDNY5CTKf9IdqvA0bbGkU3nb1gZH/xw9
BjTqKzR/7K1kTXuSgagDZ1Wceej9pZxhX7E3IHYsP8ZOvKug3EeL4yybVwQ3HRfq
Qnzcp/qPB9cOkLSQXveRTFTsj2mX28Gixct/iDuc1jIYwGQlY1gI6dcUcqby6ptM
BrQti7eR2NH2+T3aE2UVCIWsZVhx7NaSF+z8JxfAuu56jicc4xJVsi8zrNveWX5M
m2VXyBl3121BVtKi4w==
=0iVF
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"One of the more voluminous set of changes is for adding the new
__counted_by annotation[1] to gain run-time bounds checking of
dynamically sized arrays with UBSan.
- Add LKDTM test for stuck CPUs (Mark Rutland)
- Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
- Refactor more 1-element arrays into flexible arrays (Gustavo A. R.
Silva)
- Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem
Shaikh)
- Convert group_info.usage to refcount_t (Elena Reshetova)
- Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
- Add Kconfig fragment for basic hardening options (Kees Cook, Lukas
Bulwahn)
- Fix randstruct GCC plugin performance mode to stay in groups (Kees
Cook)
- Fix strtomem() compile-time check for small sources (Kees Cook)"
* tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (56 commits)
hwmon: (acpi_power_meter) replace open-coded kmemdup_nul
reset: Annotate struct reset_control_array with __counted_by
kexec: Annotate struct crash_mem with __counted_by
virtio_console: Annotate struct port_buffer with __counted_by
ima: Add __counted_by for struct modsig and use struct_size()
MAINTAINERS: Include stackleak paths in hardening entry
string: Adjust strtomem() logic to allow for smaller sources
hardening: x86: drop reference to removed config AMD_IOMMU_V2
randstruct: Fix gcc-plugin performance mode to stay in group
mailbox: zynqmp: Annotate struct zynqmp_ipi_pdata with __counted_by
drivers: thermal: tsens: Annotate struct tsens_priv with __counted_by
irqchip/imx-intmux: Annotate struct intmux_data with __counted_by
KVM: Annotate struct kvm_irq_routing_table with __counted_by
virt: acrn: Annotate struct vm_memory_region_batch with __counted_by
hwmon: Annotate struct gsc_hwmon_platform_data with __counted_by
sparc: Annotate struct cpuinfo_tree with __counted_by
isdn: kcapi: replace deprecated strncpy with strscpy_pad
isdn: replace deprecated strncpy with strscpy
NFS/flexfiles: Annotate struct nfs4_ff_layout_segment with __counted_by
nfs41: Annotate struct nfs4_file_layout_dsaddr with __counted_by
...
This pull request contains the following branches:
rcu/torture: RCU torture, locktorture and generic torture infrastructure
updates that include various fixes, cleanups and consolidations.
Among the user visible things, ftrace dumps can now be found into
their own file, and module parameters get better documented and
reported on dumps.
rcu/fixes: Generic and misc fixes all over the place. Some highlights:
* Hotplug handling has seen some light cleanups and comments.
* An RCU barrier can now be triggered through sysfs to serialize
memory stress testing and avoid OOM.
* Object information is now dumped in case of invalid callback
invocation.
* Also various SRCU issues, too hard to trigger to deserve urgent
pull requests, have been fixed.
rcu/docs: RCU documentation updates
rcu/refscale: RCU reference scalability test minor fixes and doc
improvements.
rcu/tasks: RCU tasks minor fixes
rcu/stall: Stall detection updates. Introduce RCU CPU Stall notifiers
that allows a subsystem to provide informations to help debugging.
Also cure some false positive stalls.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmU21h0ACgkQhSRUR1CO
jHdUgA/+Myy5K5OxNrqlF/gIK+flOSg635RyZ0DBx8OMXZ/fAg9qRI+PKt5I4Lha
eXAg6EtmwSgHmIbjcg8WzsvwniEsqqjOF+n1qil447fHUI2Qqw6c7fIm/MXQkeHJ
qA7CODDRtsAnwnjmTteasmMeGV0bmXDENxhNrAZBFnVkRgTqfyDbFcn+nxOaPK6b
fmbKvnB07WUg1KOV8/MbEtAZPb8QgHo58bXSZRKjKkiqRQWB/D3On+tShFK7SYJi
wIqQ96MLyUXLaIWQ47v6xEO4PZO+3o1wAryvP1DRdb5UrPjO6yKFfQaoo5Mza92G
zhBJhnXkVvCoNoCU7GKJIDV54SgDHaB6Sf1GN5cjwfujOkLuGCyg0CpKktCGm7uH
n3X66PVep608Uj2Y/pAo/hv3Hbv7lCu4nfrERvVLG9YoxUvTJDsKmBv+SF/g2mxF
rHqFa39HUPr1yHA5WjqOQS3lLdqCXEGKvNi6zXCvOceiDbHbiJFkBo6p8TVrbSMX
FCOWZ3LoE+6uiLu/lLOEroTjeBd8GhDh1LgWgyVK7o0LhP1018DSBolrpcSwnmOo
Q/E4G2x+aPWs+5NTOmMGOIPY70khKQIM3c8YZelSRffJBo6O3yV68h6X45NQxYvx
keLvrDaza8h4hKwaof/QaX4ZJgTOZ0xjpawr1vR0hbK8LNtPrUw=
=cVD7
-----END PGP SIGNATURE-----
Merge tag 'rcu-next-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull RCU updates from Frederic Weisbecker:
- RCU torture, locktorture and generic torture infrastructure updates
that include various fixes, cleanups and consolidations.
Among the user visible things, ftrace dumps can now be found into
their own file, and module parameters get better documented and
reported on dumps.
- Generic and misc fixes all over the place. Some highlights:
* Hotplug handling has seen some light cleanups and comments
* An RCU barrier can now be triggered through sysfs to serialize
memory stress testing and avoid OOM
* Object information is now dumped in case of invalid callback
invocation
* Also various SRCU issues, too hard to trigger to deserve urgent
pull requests, have been fixed
- RCU documentation updates
- RCU reference scalability test minor fixes and doc improvements.
- RCU tasks minor fixes
- Stall detection updates. Introduce RCU CPU Stall notifiers that
allows a subsystem to provide informations to help debugging. Also
cure some false positive stalls.
* tag 'rcu-next-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (56 commits)
srcu: Only accelerate on enqueue time
locktorture: Check the correct variable for allocation failure
srcu: Fix callbacks acceleration mishandling
rcu: Comment why callbacks migration can't wait for CPUHP_RCUTREE_PREP
rcu: Standardize explicit CPU-hotplug calls
rcu: Conditionally build CPU-hotplug teardown callbacks
rcu: Remove references to rcu_migrate_callbacks() from diagrams
rcu: Assume rcu_report_dead() is always called locally
rcu: Assume IRQS disabled from rcu_report_dead()
rcu: Use rcu_segcblist_segempty() instead of open coding it
rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects
srcu: Fix srcu_struct node grpmask overflow on 64-bit systems
torture: Convert parse-console.sh to mktemp
rcutorture: Traverse possible cpu to set maxcpu in rcu_nocb_toggle()
rcutorture: Replace schedule_timeout*() 1-jiffy waits with HZ/20
torture: Add kvm.sh --debug-info argument
locktorture: Rename readers_bind/writers_bind to bind_readers/bind_writers
doc: Catch-up update for locktorture module parameters
locktorture: Add call_rcu_chains module parameter
locktorture: Add new module parameters to lock_torture_print_module_parms()
...
This series adds a kernel boot parameter that causes the kernel to
panic if one of the call_smp_function() APIs is stalled for more than
the specified duration. This is useful in deployments in which a clean
panic is preferable to an indefinite stall.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmU9plITHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jB91D/9OiOMtV03TXN2K+zGmJMjTFgLuVnug
OqG4mrCv7jTJ3k6fTpu7hih/BCmG1Mu7byyPV6BUSfKsYony7L4yTPFJsjK8lNmq
MHh847DErGieuURCDnsvqBVpYIRfXnvW9ptlf+BMCjbzz4FuUu1XhJTm+U2nab3i
BEIEMORxDCyghh7yluAVG6sULRXOqjv5VcypwOXUbavDf0JyJTns4QlXFD85yiBr
nvfHvLUrzu5EblA3m09lTnCaKrAlz5pwD7fWQq7bS3rz2gndR/DcVcZknQ68FMsj
Mcf0Zf45TzyWWfMcE8LCulhMlZ2GYsIm2YkIbgwlsOAndjBrV55rsgaVbSZDke37
QMHPGUZ7m7AjuDqpWZITrJjQHkWCtn/5tFSazHSlMwWg44pgOqvc3OgZn04tzn01
L/guq3yBIKBJiAjdgzdx8/H9S7cSH8TLEWFY2utEAMIjef9Xzi6qwQ4X5p4K9QYJ
Sm1hTTCbF9NeyMk0o2rokR58f13+9ewxE4RIYcwP9loo6nbBcTVN/fvdHN6jQywa
7UIui508AUf55zQgO+5z8LsmjyNiDmDIeE8CLeVDicllWN+ne7F//AXAuA5V8m+y
G0Mphn/lZpnddDSFlR/QFuMvGqtpIu2y8w93awq8g/VzirbkiiCia3iFcFtss4ip
n0Y50CCu/W70fw==
=ZD0S
-----END PGP SIGNATURE-----
Merge tag 'csd-lock.2023.10.23a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull CSD lock update from Paul McKenney:
"This adds a kernel boot parameter that causes the kernel to panic if
one of the call_smp_function() APIs is stalled for more than the
specified duration.
This is useful in deployments in which a clean panic is preferable to
an indefinite stall"
* tag 'csd-lock.2023.10.23a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
smp,csd: Throw an error if a CSD lock is stuck for too long
- Limit the hardcoded topology quirk for Hygon CPUs to those which have a
model ID less than 4. The newer models have the topology CPUID leaf 0xB
correctly implemented and are not affected.
- Make SMT control more robust against enumeration failures
SMT control was added to allow controlling SMT at boottime or
runtime. The primary purpose was to provide a simple mechanism to
disable SMT in the light of speculation attack vectors.
It turned out that the code is sensible to enumeration failures and
worked only by chance for XEN/PV. XEN/PV has no real APIC enumeration
which means the primary thread mask is not set up correctly. By chance
a XEN/PV boot ends up with smp_num_siblings == 2, which makes the
hotplug control stay at its default value "enabled". So the mask is
never evaluated.
The ongoing rework of the topology evaluation caused XEN/PV to end up
with smp_num_siblings == 1, which sets the SMT control to "not
supported" and the empty primary thread mask causes the hotplug core to
deny the bringup of the APS.
Make the decision logic more robust and take 'not supported' and 'not
implemented' into account for the decision whether a CPU should be
booted or not.
- Fake primary thread mask for XEN/PV
Pretend that all XEN/PV vCPUs are primary threads, which makes the
usage of the primary thread mask valid on XEN/PV. That is consistent
with because all of the topology information on XEN/PV is fake or even
non-existent.
- Encapsulate topology information in cpuinfo_x86
Move the randomly scattered topology data into a separate data
structure for readability and as a preparatory step for the topology
evaluation overhaul.
- Consolidate APIC ID data type to u32
It's fixed width hardware data and not randomly u16, int, unsigned long
or whatever developers decided to use.
- Cure the abuse of cpuinfo for persisting logical IDs.
Per CPU cpuinfo is used to persist the logical package and die
IDs. That's really not the right place simply because cpuinfo is
subject to be reinitialized when a CPU goes through an offline/online
cycle.
Use separate per CPU data for the persisting to enable the further
topology management rework. It will be removed once the new topology
management is in place.
- Provide a debug interface for inspecting topology information
Useful in general and extremly helpful for validating the topology
management rework in terms of correctness or "bug" compatibility.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmU+yX0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoROUD/4vlvKEcpm9rbI5DzLcaq4DFHKbyEZF
cQtzuOSM/9vTc9DHnuoNNLl9TWSYxiVYnejf3E21evfsqspYlzbTH8bId9XBCUid
6B68AJW842M2erNuwj0b0HwF1z++zpDmBDyhGOty/KQhoM8pYOHMvntAmbzJbuso
Dgx6BLVFcboTy6RwlfRa0EE8f9W5V+JbmG/VBDpdyCInal7VrudoVFZmWQnPIft7
zwOJpAoehkp8OKq7geKDf79yWxu9a1sNPd62HtaVEvfHwehHqE6OaMLss1us+0vT
SJ/D6gmRQBOwcXaZL0wL1dG7Km9Et4AisOvzhXGvTa5b2D5oljVoqJ7V7FTf5g3u
y3aqWbeUJzERUbeJt1HoGVAKyA4GtZOvg+TNIysf6F1Z4khl9alfa9jiqjj4g1au
zgItq/ZMBEBmJ7X4FxQUEUVBG2CDsEidyNBDRcimWQUDfBakV/iCs0suD8uu8ZOD
K5jMx8Hi2+xFx7r1YqsfsyMBYOf/zUZw65RbNe+kI992JbJ9nhcODbnbo5MlAsyv
vcqlK5FwXgZ4YAC8dZHU/tyTiqAW7oaOSkqKwTP5gcyNEqsjQHV//q6v+uqtjfYn
1C4oUsRHT2vJiV9ktNJTA4GQHIYF4geGgpG8Ih2SjXsSzdGtUd3DtX1iq0YiLEOk
eHhYsnniqsYB5g==
=xrz8
-----END PGP SIGNATURE-----
Merge tag 'x86-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Thomas Gleixner:
- Limit the hardcoded topology quirk for Hygon CPUs to those which have
a model ID less than 4.
The newer models have the topology CPUID leaf 0xB correctly
implemented and are not affected.
- Make SMT control more robust against enumeration failures
SMT control was added to allow controlling SMT at boottime or
runtime. The primary purpose was to provide a simple mechanism to
disable SMT in the light of speculation attack vectors.
It turned out that the code is sensible to enumeration failures and
worked only by chance for XEN/PV. XEN/PV has no real APIC enumeration
which means the primary thread mask is not set up correctly. By
chance a XEN/PV boot ends up with smp_num_siblings == 2, which makes
the hotplug control stay at its default value "enabled". So the mask
is never evaluated.
The ongoing rework of the topology evaluation caused XEN/PV to end up
with smp_num_siblings == 1, which sets the SMT control to "not
supported" and the empty primary thread mask causes the hotplug core
to deny the bringup of the APS.
Make the decision logic more robust and take 'not supported' and 'not
implemented' into account for the decision whether a CPU should be
booted or not.
- Fake primary thread mask for XEN/PV
Pretend that all XEN/PV vCPUs are primary threads, which makes the
usage of the primary thread mask valid on XEN/PV. That is consistent
with because all of the topology information on XEN/PV is fake or
even non-existent.
- Encapsulate topology information in cpuinfo_x86
Move the randomly scattered topology data into a separate data
structure for readability and as a preparatory step for the topology
evaluation overhaul.
- Consolidate APIC ID data type to u32
It's fixed width hardware data and not randomly u16, int, unsigned
long or whatever developers decided to use.
- Cure the abuse of cpuinfo for persisting logical IDs.
Per CPU cpuinfo is used to persist the logical package and die IDs.
That's really not the right place simply because cpuinfo is subject
to be reinitialized when a CPU goes through an offline/online cycle.
Use separate per CPU data for the persisting to enable the further
topology management rework. It will be removed once the new topology
management is in place.
- Provide a debug interface for inspecting topology information
Useful in general and extremly helpful for validating the topology
management rework in terms of correctness or "bug" compatibility.
* tag 'x86-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/apic, x86/hyperv: Use u32 in hv_snp_boot_ap() too
x86/cpu: Provide debug interface
x86/cpu/topology: Cure the abuse of cpuinfo for persisting logical ids
x86/apic: Use u32 for wakeup_secondary_cpu[_64]()
x86/apic: Use u32 for [gs]et_apic_id()
x86/apic: Use u32 for phys_pkg_id()
x86/apic: Use u32 for cpu_present_to_apicid()
x86/apic: Use u32 for check_apicid_used()
x86/apic: Use u32 for APIC IDs in global data
x86/apic: Use BAD_APICID consistently
x86/cpu: Move cpu_l[l2]c_id into topology info
x86/cpu: Move logical package and die IDs into topology info
x86/cpu: Remove pointless evaluation of x86_coreid_bits
x86/cpu: Move cu_id into topology info
x86/cpu: Move cpu_core_id into topology info
hwmon: (fam15h_power) Use topology_core_id()
scsi: lpfc: Use topology_core_id()
x86/cpu: Move cpu_die_id into topology info
x86/cpu: Move phys_proc_id into topology info
x86/cpu: Encapsulate topology information in cpuinfo_x86
...
- Make the quirk for non-maskable MSI interrupts in the affinity setter
functional again.
It was broken by a MSI core code update, which restructured the code in
a way that the quirk flag was not longer set correctly.
Trying to restore the core logic caused a deeper inspection and it
turned out that the extra quirk flag is not required at all because
it's the inverse of the reservation mode bit, which only can be set
when the MSI interrupt is maskable.
So the trivial fix is to use the reservation mode check in the affinity
setter function and remove almost 40 lines of code related to the
no-mask quirk flag.
- Cure a Kconfig dependency issue which causes compile fails by correcting
the conditionals in the affected heaer files.
- Clean up coding style in the UV APIC driver.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmU+w2ETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodRIEACugQpAiND53Cz9MTJE1XWGheBn8n+y
WZtbVckK9LXHfnf99dtigtq3ycPBg2Mx/pItT1c71iN/NyVPNCC/3mM+ntP53etX
06v7VoIgRpF+GRQsIbjvVfBkKWS3G5zqHq6hazE0lcPhPEiOIMSBcusSQaA3C3sp
BT23rrx/59JKCb/187pum0Obx+n1oH+MG91wJ1v0zNOGgfx1u5gBPyKDZ0JAsMPx
r05991Lp4K80ooEsk/PKgpcZuPZD3QRwFDLHJwh2ITjsC22ItOnhN/c/1J8MZtk9
5YBaIzy+vyQd+nksqmDtB48FD0ioW/WLUR5zV9MMpD3RQAGgrxQmXeTqScu/j8tn
I4QZGi80HZSWLwFjWL2DoAw5LHZDh8ksxIYZBHE+8LHx0ymyX+7//naZnNFn9cXM
K5orouZFQi+dvfD7MAla73fiibPs6cHjGzRKfsfNlLBNNj+/ffcEolLlKGhEX9fx
R1w7gbtNs/RDnfoa45cvmez8UxJB5zei7l2HWbSecR2DDTQ9aGfEU/0NEr9muBzK
cuZIpmgD09VvS84jCxdYbaXA5Cau3NLOE5Os6UN9Wa3dUjWP5PArWkc8RzxP3Bgx
PANtytesqJ9YeDDg9SyylPw6d3EYNldtFZDXlD32K2O2SDmUZxxJa+0og8K9YNJN
aIUcqiSaYhg5rQ==
=2lhV
-----END PGP SIGNATURE-----
Merge tag 'x86-apic-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
- Make the quirk for non-maskable MSI interrupts in the affinity setter
functional again.
It was broken by a MSI core code update, which restructured the code
in a way that the quirk flag was not longer set correctly.
Trying to restore the core logic caused a deeper inspection and it
turned out that the extra quirk flag is not required at all because
it's the inverse of the reservation mode bit, which only can be set
when the MSI interrupt is maskable.
So the trivial fix is to use the reservation mode check in the
affinity setter function and remove almost 40 lines of code related
to the no-mask quirk flag.
- Cure a Kconfig dependency issue which causes compile failures by
correcting the conditionals in the affected header files.
- Clean up coding style in the UV APIC driver.
* tag 'x86-apic-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic/msi: Fix misconfigured non-maskable MSI quirk
x86/msi: Fix compile error caused by CONFIG_GENERIC_MSI_IRQ=y && !CONFIG_X86_LOCAL_APIC
x86/platform/uv/apic: Clean up inconsistent indenting
Core:
- Avoid superfluous deactivation of the tick in the low resolution tick
NOHZ interrupt handler as the deactivation is handled already in the
idle loop and on interrupt exit.
- Update stale comments in the tick NOHZ code and rename the tick
handler functions to be self-explanatory.
- Remove an unused function in the tick NOHZ code, which was forgotten
when the last user went away.
- Handle RTC alarms which exceed the maximum alarm time of the
underlying RTC hardware gracefully.
Setting RTC alarms which exceed the maximum alarm time of the RTC
hardware failed so far and caused suspend operations to abort.
Cure this by limiting the alarm to the maximum alarm time of the RTC
hardware, which is provided by the driver. This causes early resume
wakeups, but that's way better than not suspending at all.
Drivers:
- Add a proper clocksource/event driver for the ancient Cirrus Logic
EP93xx SoC family, which is one of the last non device-tree holdouts
in arch/arm.
- The usual boring device tree bindings updates and small fixes and
enhancements all over the place.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmU+wSYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYB8D/4uSv1IV76IwAzZ3heDw4DYDeERIpxw
fnpdLHExJtuJaGNa3KCGHyiWEsN5DQfWOgYzEgGP8Uf3Y8GX0xSmi+b1fl+DwnaJ
fdVlWjWEw8EpuQjl9k78x6l+hi6UQ0Cdsk1pGJDulYelmQRCuIAQMCIX5uSc/vRw
T7Ee0ve8LHNr5IZROTv+LVJqGpcNQMwGlQRnIVIkmYJR3Lq4LPOdtmzhgbZjNGow
m/keJNIKiqEpjxHdifea6OVQRPQ1Axblyd7M9LMXtJ3kiH0RGfqp1KK5APXRYiJL
P5wOzoIwlteAV/nRPfWfR+i/C+eO0fp/fGYrPP3ih70gi21qEguyuYUuOaR/llfG
F5/IwPE6loa/5FZ0bPgs+yGR7nCpfx3SY8wNvk39gFwF8O8CwNVNuUXhJyyh0FLt
2fwo6e94b+XofyTy3MpX0ISoM3Ccn5HsF6hM1yk98UEM9iFtVC2zMxrH8LDVhZHt
JlU8jPqlOOVpEtbKBDoRSR7IGAvwPANqowLXHz1F2jbwJOnA5V7sUN5wWggu6eqq
S/kB0uEcFf3GjR1p3DolXOn8LrVX2TEmp6rtxifJaGyM2oMOVdilZf/GTpvg57+B
a+pUhG8j6AN7FCP1hOy238/U1/G2RH13J4GxdICZaAz5s96mjZNA74xbuY1SgO9M
OhCOhgJUtAMWHQ==
=7Yqn
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for time, timekeeping and timers:
Core:
- Avoid superfluous deactivation of the tick in the low resolution
tick NOHZ interrupt handler as the deactivation is handled already
in the idle loop and on interrupt exit.
- Update stale comments in the tick NOHZ code and rename the tick
handler functions to be self-explanatory.
- Remove an unused function in the tick NOHZ code, which was
forgotten when the last user went away.
- Handle RTC alarms which exceed the maximum alarm time of the
underlying RTC hardware gracefully.
Setting RTC alarms which exceed the maximum alarm time of the RTC
hardware failed so far and caused suspend operations to abort.
Cure this by limiting the alarm to the maximum alarm time of the
RTC hardware, which is provided by the driver. This causes early
resume wakeups, but that's way better than not suspending at all.
Drivers:
- Add a proper clocksource/event driver for the ancient Cirrus Logic
EP93xx SoC family, which is one of the last non device-tree
holdouts in arch/arm.
- The usual boring device tree bindings updates and small fixes and
enhancements all over the place"
* tag 'timers-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: ep93xx: Add driver for Cirrus Logic EP93xx
dt-bindings: timers: Add Cirrus EP93xx
clocksource/drivers/timer-atmel-tcb: Fix initialization on SAM9 hardware
clocksource/timer-riscv: ACPI: Add timer_cannot_wakeup_cpu
clocksource/drivers/sun5i: Remove surplus dev_err() when using platform_get_irq()
drivers/clocksource/timer-ti-dm: Don't call clk_get_rate() in stop function
clocksource/drivers/timer-imx-gpt: Fix potential memory leak
dt-bindings: timer: renesas,rz-mtu3: Document RZ/{G2UL,Five} SoCs
dt-bindings: timer: renesas,rz-mtu3: Improve documentation
dt-bindings: timer: renesas,rz-mtu3: Fix overflow/underflow interrupt names
alarmtimer: Use maximum alarm time for suspend
rtc: Add API function to return alarm time bound by hardware limit
tick/nohz: Update comments some more
tick/nohz: Remove unused tick_nohz_idle_stop_tick_protected()
tick/nohz: Don't shutdown the lowres tick from itself
tick/nohz: Update obsolete comments
tick/nohz: Rename the tick handlers to more self-explanatory names
- Switch the smp_call_function*() @csd argument to call_single_data_t
type, which is a cache-line aligned typedef of the underlying struct
__call_single_data.
This ensures that the call data is not crossing a cacheline which
avoids bouncing an extra cache-line for the SMP function call
- Prevent offlining of the last housekeeping CPU when CPU isolation is
active.
Offlining the last housekeeping CPU makes no sense in general, but also
caused the scheduler to panic due to the empty CPU mask when rebuilding
the scheduler domains.
- Remove an unused CPU hotplug state
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmU+vdYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYocb3EAChhdVZCBm3UoDcrWtGuS7mMkTuLuUK
rheS9OtUt/uDEI0YZT5bD2R7KqdDVVNqbG1RLyICIWpQMvuMojZyu6fMCUjkONzS
iioun/2lZP4Q9TyAn0rdr9/GrPxb1/glYnLuz8ZJcY+jC63skNIDVfzJdhJ81/sx
t4BSswVsi75qZtBWWYFIzMQaJ1nUGJ5SZPYJV/WuQLf+pronoeWu+2VZHnaDqr3h
1N3oTQRbg0syPBg6trRuLEnn3384LYtdq7CHjeREX2jn2sU8yr+xzjKup5ShtSCR
7Amka/IlCTe2+FNS0F+6e3RGCH9Man1W593DqjUeIQT/Z0O2u4l9vNuVTv5GjX6C
fqDVd4hwVRd7/OGmaSPPY+pn9QK6B1WYU3BaAxACcGE6GaY69PU2jREnuCpt/pu9
Pg4xYDqClVwzvq3YYoU7YISya2TXjyJticxg3FtPUzrpVu0LIIq3IAcO7Nej+AzS
uSwhIRkqyT20CO/fRXhn5KQ2h66G6QNLPnEMtK/35K24Am7MGqwJd7wnGxKPu3RO
zAcRkQofouS0UcVbNY4UbV4vD6lpEAvy1RdxNPWt5DOIk5f83E176Yyc+vB8jAjG
YEM8ZnS3gFd7jvNC37rk9FfjlAIL9Z9QcrhtHJJ/h5y9sgCqzsV96B8c2KR3Ggs0
BQbaSJhdB89BwQ==
=tP7K
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP and CPU hotplug updates from Thomas Gleixner:
- Switch the smp_call_function*() @csd argument to call_single_data_t
type, which is a cache-line aligned typedef of the underlying struct
__call_single_data.
This ensures that the call data is not crossing a cacheline which
avoids bouncing an extra cache-line for the SMP function call
- Prevent offlining of the last housekeeping CPU when CPU isolation is
active.
Offlining the last housekeeping CPU makes no sense in general, but
also caused the scheduler to panic due to the empty CPU mask when
rebuilding the scheduler domains.
- Remove an unused CPU hotplug state
* tag 'smp-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Don't offline the last non-isolated CPU
cpu/hotplug: Remove unused cpuhp_state CPUHP_AP_X86_VDSO_VMA_ONLINE
smp: Change function signatures to use call_single_data_t
Core:
- Exclude managed interrupts in the calculation of interrupts which are
targeted to a CPU which is about to be offlined to ensure that there
are enough free vectors on the still online CPUs to migrate them over.
Managed interrupts do not need to be accounted because they are
either shut down on offline or migrated to an already reserved and
guaranteed slot on a still online CPU in the interrupts affinity
mask.
Including managed interrupts is overaccounting and can result in
needlessly aborting hibernation on large server machines.
- The usual set of small improvements
Drivers:
- Make the generic interrupt chip implementation handle interrupt
domains correctly and initialize the name pointers correctly
- Add interrupt affinity setting support to the Renesas RZG2L chip
driver.
- Prevent registering syscore operations multiple times in the SiFive
PLIC chip driver.
- Update device tree handling in the NXP Layerscape MSI chip driver
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmU+uawTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZJGD/9ZdvT7OWZVLbqPX7eNPxGXUbLImD8H
lBgPEKi4oQGzNK+CANpicQa2YclZmprN8a9v6jjW/1HEY97WSUS8mNZymrwYjMsX
hTWiFpF3CIknPQAVPgZG7Zu7iVLLeY/MYN7/H+outCAuncwkw/RVYFDJG4oe6jy3
mNsV5f22xAsL4yR1eRzkbvJwwq4iqdrk8JPE9YoYN3Oo1/OUU1LIW/mUK7CaClfZ
+6mEuSYUCyhrYjYN68OZRyVQ+8ZuemOOwIQPTNXbx5tYOY+dsFOwNbOweXH8s4Gx
4IJ9u2QtvOXZRNvfWqz0zwhjPhaYNBHHWW2cDLM57v/Os3Fe0aYals6ElF/1SGGE
KIITCP6Qi9IJlOtO44Zlz+7to5n+zgn6aQTxSkPxM1pC9J0AgDBSrL8GWbrpTjR+
bMbXDkOMkZLYVJEGrD07UiAsl1xwHjN+uQAKqq6SUXQcVIdQQCDAm37km8cf+URx
hH7oWQCDRKCkrcjLvGAjRs9TxiD/UfAwZ8eWTg1L2gIMx5ugctr2RWLl87t/y25c
DDFXn9Y9SmMBOeA4J99neCZfXeHo8iiVGYD8Wv3AVFAETAZE81XmiiGaVRoyty2N
rbRgdmYZKUJu/XVvcHL+wxvQ9W7Y4p3qQI3fcgFXGx6cIco6cqcS4QrlE8di3SWf
KLLhOCNUGRtcSQ==
=Ok6p
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Core:
- Exclude managed interrupts in the calculation of interrupts which
are targeted to a CPU which is about to be offlined to ensure that
there are enough free vectors on the still online CPUs to migrate
them over.
Managed interrupts do not need to be accounted because they are
either shut down on offline or migrated to an already reserved and
guaranteed slot on a still online CPU in the interrupts affinity
mask.
Including managed interrupts is overaccounting and can result in
needlessly aborting hibernation on large server machines.
- The usual set of small improvements
Drivers:
- Make the generic interrupt chip implementation handle interrupt
domains correctly and initialize the name pointers correctly
- Add interrupt affinity setting support to the Renesas RZG2L chip
driver.
- Prevent registering syscore operations multiple times in the SiFive
PLIC chip driver.
- Update device tree handling in the NXP Layerscape MSI chip driver"
* tag 'irq-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/sifive-plic: Fix syscore registration for multi-socket systems
irqchip/ls-scfg-msi: Use device_get_match_data()
genirq/generic_chip: Make irq_remove_generic_chip() irqdomain aware
genirq/matrix: Exclude managed interrupts in irq_matrix_allocated()
PCI/MSI: Provide stubs for IMS functions
irqchip/renesas-rzg2l: Enhance driver to support interrupt affinity setting
genirq/generic-chip: Fix the irq_chip name for /proc/interrupts
irqdomain: Annotate struct irq_domain with __counted_by
- Add a comment to explain that the preempt_disable() before unlocking
tasklist lock is not a correctness problem and just avoids the tracer
to preempt the tracee before the tracee schedules out.
- Make that preempt_disable() conditional on PREEMPT_RT=n.
RT enabled kernels cannot disable preemption at this point because
cgroup_enter_frozen() and sched_submit_work() acquire spinlocks or
rwlocks which are substituted by sleeping locks on RT. Acquiring a
sleeping lock in a preemption disable region is obviously not possible.
This obviously brings back the potential slowdown of ptrace() for RT
enabled kernels, but that's a price to be payed for latency guarantees.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmU+tkETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoSYtD/9O9VkmH0kHPEzYHrRGd7JT/T4249Mw
ezYvO/OWQqaeC2Et758Gi89QdC9u7fKqlTKGasNXlVCyVQ5i/8/3HkxjzUaiX0g0
y1bIEtyUK2bBOO8rveSJAbMUuy21Ap2XTOTI9x6Au+iRY7JJASjWDfjrmh92+dN7
YTIhDWb6Fe8P6fFBSzdcYnsIX1gBEu9eJfVH885sFDFjHbWwD9TPmiQOx7rFsiOW
yq5qsGARE71k8KB6URlCfB6gB4lOw5d8CFDlafTWwfbu2/m2/RpfoZw1kdsnhsxV
XQnHyKYkRzimG+82xQFZEx0a1izWRg+9ANReWVo02hpolxWxOjJeZm97K8cEeabV
iRAVGegYMkaIHkxWIukZkXEx9dz18jxUryWaFslC6om43TiZMhGhZ8se8BPljiwN
8LxGzw4lVwTqB7+l/PVc1oZKh9851nQNsiEG4dMrwcPJd6HvMFXDTn/vVwWcZhRg
YWCd7u3bc9xZXwm5xhFyApWV8M32bpe7mNxPO4johUSf7HzK89ABrNg85OrKTlwm
vGjOI6q7Dh5YbyM8xmIV0jxeIyWMKmwPUpesTSvoZI0ApQ+heK2lMXvrxN6oN4bH
14JcEG5gRAmYuNZmdyQRGa/6yCgEXd8TvSxgYrufvcfnVAdyZWbPM2g09Qa761bC
ZXZ/ZM4LMNnENg==
=3ai1
-----END PGP SIGNATURE-----
Merge tag 'core-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core updates from Thomas Gleixner:
"Two small updates to ptrace_stop():
- Add a comment to explain that the preempt_disable() before
unlocking tasklist lock is not a correctness problem and just
avoids the tracer to preempt the tracee before the tracee schedules
out.
- Make that preempt_disable() conditional on PREEMPT_RT=n.
RT enabled kernels cannot disable preemption at this point because
cgroup_enter_frozen() and sched_submit_work() acquire spinlocks or
rwlocks which are substituted by sleeping locks on RT. Acquiring a
sleeping lock in a preemption disable region is obviously not
possible.
This obviously brings back the potential slowdown of ptrace() for
RT enabled kernels, but that's a price to be paid for latency
guarantees"
* tag 'core-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
signal: Don't disable preemption in ptrace_stop() on PREEMPT_RT
signal: Add a proper comment about preempt_disable() in ptrace_stop()
- Add AMD Unified Memory Controller (UMC) events introduced with Zen 4
- Simplify & clean up the uncore management code
- Fall back from RDPMC to RDMSR on certain uncore PMUs
- Improve per-package and cstate event reading
- Extend the Intel ref-cycles event to GP counters
- Fix Intel MTL event constraints
- Improve the Intel hybrid CPU handling code
- Micro-optimize the RAPL code
- Optimize perf_cgroup_switch()
- Improve large AUX area error handling
- Misc fixes and cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU89YsRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iQqQ/9EF9mG4te5By4qN+B7jADCmE71xG5ViKz
sp4Thl86SHxhwFuiHn8dMUixrp+qbcemi5yTbQ9TF8cKl4s3Ju2CihU8jaauUp0a
iS5W0IliMqLD1pxQoXAPLuPVInVYgrNOCbR4l6l7D6ervh5Z6PVEf7SVeAP3L5wo
QV/V3NKkrYeNQL+FoKhCH8Vhxw0HxUmKJO7UhW6yuCt7BAok9Es18h3OVnn+7es4
BB7VI/JvdmXf2ioKhTPnDXJjC+vh5vnwiBoTcdQ2W9ADhWUvfL4ozxOXT6z7oC3A
nwBOdXf8w8Rqnqqd8hduop1QUrusMxlEVgOMCk27qHx97uWgPceZWdoxDXGHBiRK
fqJAwXERf9wp5/M57NDlPwyf/43Hocdx2CdLkQBpfD78/k/sB5hW0KxnzY0FUI9x
jBRQyWD05IDJATBaMHz+VbrexS+Itvjp2QvSiSm9zislYD4zA9fQ3lAgFhEpcUbA
ZA/nN4t+CbiGEAsJEuBPlvSC1ahUwVP/0nz3PFlVWFDqAx0mXgVNKBe083A9yh7I
dVisVY6KPAVDzyOc1LqzU8WFXNFnIkIIaLrb6fRHJVEM8MDfpLPS/a+7AHdRcDP4
yq6fjVVjyP7e9lSQLYBUP3/3uiVnWQj92l6V6CrcgDMX5rDOb0VN+BQrmhPR6fWY
WEim6WZrZj4=
=OLsv
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance event updates from Ingo Molnar:
- Add AMD Unified Memory Controller (UMC) events introduced with Zen 4
- Simplify & clean up the uncore management code
- Fall back from RDPMC to RDMSR on certain uncore PMUs
- Improve per-package and cstate event reading
- Extend the Intel ref-cycles event to GP counters
- Fix Intel MTL event constraints
- Improve the Intel hybrid CPU handling code
- Micro-optimize the RAPL code
- Optimize perf_cgroup_switch()
- Improve large AUX area error handling
- Misc fixes and cleanups
* tag 'perf-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
perf/x86/amd/uncore: Pass through error code for initialization failures, instead of -ENODEV
perf/x86/amd/uncore: Fix uninitialized return value in amd_uncore_init()
x86/cpu: Fix the AMD Fam 17h, Fam 19h, Zen2 and Zen4 MSR enumerations
perf: Optimize perf_cgroup_switch()
perf/x86/amd/uncore: Add memory controller support
perf/x86/amd/uncore: Add group exclusivity
perf/x86/amd/uncore: Use rdmsr if rdpmc is unavailable
perf/x86/amd/uncore: Move discovery and registration
perf/x86/amd/uncore: Refactor uncore management
perf/core: Allow reading package events from perf_event_read_local
perf/x86/cstate: Allow reading the package statistics from local CPU
perf/x86/intel/pt: Fix kernel-doc comments
perf/x86/rapl: Annotate 'struct rapl_pmus' with __counted_by
perf/core: Rename perf_proc_update_handler() -> perf_event_max_sample_rate_handler(), for readability
perf/x86/rapl: Fix "Using plain integer as NULL pointer" Sparse warning
perf/x86/rapl: Use local64_try_cmpxchg in rapl_event_update()
perf/x86/rapl: Stop doing cpu_relax() in the local64_cmpxchg() loop in rapl_event_update()
perf/core: Bail out early if the request AUX area is out of bound
perf/x86/intel: Extend the ref-cycles event to GP counters
perf/x86/intel: Fix broken fixed event constraints extension
...
- Fair scheduler (SCHED_OTHER) improvements:
- Remove the old and now unused SIS_PROP code & option
- Scan cluster before LLC in the wake-up path
- Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
- NUMA scheduling improvements:
- Improve the VMA access-PID code to better skip/scan VMAs
- Extend tracing to cover VMA-skipping decisions
- Improve/fix the recently introduced sched_numa_find_nth_cpu() code
- Generalize numa_map_to_online_node()
- Energy scheduling improvements:
- Remove the EM_MAX_COMPLEXITY limit
- Add tracepoints to track energy computation
- Make the behavior of the 'sched_energy_aware' sysctl more consistent
- Consolidate and clean up access to a CPU's max compute capacity
- Fix uclamp code corner cases
- RT scheduling improvements:
- Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
- Drive the ->rto_mask with rt_rq->pushable_tasks updates
- Scheduler scalability improvements:
- Rate-limit updates to tg->load_avg
- On x86 disable IBRS when CPU is offline to improve single-threaded performance
- Micro-optimize in_task() and in_interrupt()
- Micro-optimize the PSI code
- Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
- Core scheduler infrastructure improvements:
- Use saved_state to reduce some spurious freezer wakeups
- Bring in a handful of fast-headers improvements to scheduler headers
- Make the scheduler UAPI headers more widely usable by user-space
- Simplify the control flow of scheduler syscalls by using lock guards
- Fix sched_setaffinity() vs. CPU hotplug race
- Scheduler debuggability improvements:
- Disallow writing invalid values to sched_rt_period_us
- Fix a race in the rq-clock debugging code triggering warnings
- Fix a warning in the bandwidth distribution code
- Micro-optimize in_atomic_preempt_off() checks
- Enforce that the tasklist_lock is held in for_each_thread()
- Print the TGID in sched_show_task()
- Remove the /proc/sys/kernel/sched_child_runs_first sysctl
- Misc cleanups & fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU8/NoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gN+xAAvKGYNZBCBG4jowxccgqAbCx81KOhhsy/
KUaOmdLPg9WaXuqjZ5sggXQCMT0wUqBYAmqV7ts53VhWcma2I1ap4dCM6Jj+RLrc
vNwkeNetsikiZtarMoCJs5NahL8ULh3liBaoAkkToPjQ5r43aZ/eKwDovEdIKc+g
+Vgn7jUY8ssIrAOKT1midSwY1y8kAU2AzWOSFDTgedkJP4PgOu9/lBl9jSJ2sYaX
N4XqONYPXTwOHUtvmzkYILxLz0k0GgJ7hmt78E8Xy2rC4taGCRwCfCMBYxREuwiP
huo3O1P/iIe5svm4/EBUvcpvf44eAWTV+CD0dnJPwOc9IvFhpSzqSZZAsyy/JQKt
Lnzmc/xmyc1PnXCYJfHuXrw2/m+MyUHaegPzh5iLJFrlqa79GavOElj0jNTAMzbZ
39fybzPtuFP+64faRfu0BBlQZfORPBNc/oWMpPKqgP58YGuveKTWaUF5rl5lM7Ne
nm07uOmq02JVR8YzPl/FcfhU2dPMawWuMwUjEr2eU+lAunY3PF88vu0FALj7iOBd
66F8qrtpDHJanOxrdEUwSJ7hgw79qY1iw66Db7cQYjMazFKZONxArQPqFUZ0ngLI
n9hVa7brg1bAQKrQflqjcIAIbpVu3SjPEl15cKpAJTB/gn5H66TQgw8uQ6HfG+h2
GtOsn1nlvuk=
=GDqb
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduler (SCHED_OTHER) improvements:
- Remove the old and now unused SIS_PROP code & option
- Scan cluster before LLC in the wake-up path
- Use candidate prev/recent_used CPU if scanning failed for cluster
wakeup
NUMA scheduling improvements:
- Improve the VMA access-PID code to better skip/scan VMAs
- Extend tracing to cover VMA-skipping decisions
- Improve/fix the recently introduced sched_numa_find_nth_cpu() code
- Generalize numa_map_to_online_node()
Energy scheduling improvements:
- Remove the EM_MAX_COMPLEXITY limit
- Add tracepoints to track energy computation
- Make the behavior of the 'sched_energy_aware' sysctl more
consistent
- Consolidate and clean up access to a CPU's max compute capacity
- Fix uclamp code corner cases
RT scheduling improvements:
- Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
- Drive the ->rto_mask with rt_rq->pushable_tasks updates
Scheduler scalability improvements:
- Rate-limit updates to tg->load_avg
- On x86 disable IBRS when CPU is offline to improve single-threaded
performance
- Micro-optimize in_task() and in_interrupt()
- Micro-optimize the PSI code
- Avoid updating PSI triggers and ->rtpoll_total when there are no
state changes
Core scheduler infrastructure improvements:
- Use saved_state to reduce some spurious freezer wakeups
- Bring in a handful of fast-headers improvements to scheduler
headers
- Make the scheduler UAPI headers more widely usable by user-space
- Simplify the control flow of scheduler syscalls by using lock
guards
- Fix sched_setaffinity() vs. CPU hotplug race
Scheduler debuggability improvements:
- Disallow writing invalid values to sched_rt_period_us
- Fix a race in the rq-clock debugging code triggering warnings
- Fix a warning in the bandwidth distribution code
- Micro-optimize in_atomic_preempt_off() checks
- Enforce that the tasklist_lock is held in for_each_thread()
- Print the TGID in sched_show_task()
- Remove the /proc/sys/kernel/sched_child_runs_first sysctl
... and misc cleanups & fixes"
* tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
sched/fair: Remove SIS_PROP
sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
sched/fair: Scan cluster before scanning LLC in wake-up path
sched: Add cpus_share_resources API
sched/core: Fix RQCF_ACT_SKIP leak
sched/fair: Remove unused 'curr' argument from pick_next_entity()
sched/nohz: Update comments about NEWILB_KICK
sched/fair: Remove duplicate #include
sched/psi: Update poll => rtpoll in relevant comments
sched: Make PELT acronym definition searchable
sched: Fix stop_one_cpu_nowait() vs hotplug
sched/psi: Bail out early from irq time accounting
sched/topology: Rename 'DIE' domain to 'PKG'
sched/psi: Delete the 'update_total' function parameter from update_triggers()
sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
sched/headers: Remove comment referring to rq::cpu_load, since this has been removed
sched/numa: Complete scanning of inactive VMAs when there is no alternative
sched/numa: Complete scanning of partial VMAs regardless of PID activity
sched/numa: Move up the access pid reset logic
sched/numa: Trace decisions related to skipping VMAs
...
- Futex improvements:
- Add the 'futex2' syscall ABI, which is an attempt to get away from the
multiplex syscall and adds a little room for extentions, while lifting
some limitations.
- Fix futex PI recursive rt_mutex waiter state bug
- Fix inter-process shared futexes on no-MMU systems
- Use folios instead of pages
- Micro-optimizations of locking primitives:
- Improve arch_spin_value_unlocked() on asm-generic ticket spinlock
architectures, to improve lockref code generation.
- Improve the x86-32 lockref_get_not_zero() main loop by adding
build-time CMPXCHG8B support detection for the relevant lockref code,
and by better interfacing the CMPXCHG8B assembly code with the compiler.
- Introduce arch_sync_try_cmpxchg() on x86 to improve sync_try_cmpxchg()
code generation. Convert some sync_cmpxchg() users to sync_try_cmpxchg().
- Micro-optimize rcuref_put_slowpath()
- Locking debuggability improvements:
- Improve CONFIG_DEBUG_RT_MUTEXES=y to have a fast-path as well
- Enforce atomicity of sched_submit_work(), which is de-facto atomic but
was un-enforced previously.
- Extend <linux/cleanup.h>'s no_free_ptr() with __must_check semantics
- Fix ww_mutex self-tests
- Clean up const-propagation in <linux/seqlock.h> and simplify
the API-instantiation macros a bit.
- RT locking improvements:
- Provide the rt_mutex_*_schedule() primitives/helpers and use them
in the rtmutex code to avoid recursion vs. rtlock on the PI state.
- Add nested blocking lockdep asserts to rt_mutex_lock(), rtlock_lock()
and rwbase_read_lock().
- Plus misc fixes & cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU877IRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g9jw/+N7rxQ78dmFCYh4UWnLCYvuKP0/ivHErG
493JcB8MupuA2tfJHIkDdr4aM2mNq2E61w69/WlZAQWWD6pdOhwgF5Xf5eoEcJm0
vsAhWBGLxihXdtevPuMAx0dEpg3AMp2wc6i5PkN831KdPUgCNsrKq9Bfnfef7/G8
MQTSHjmtba6jxleyxfEa4tE2xe5PJX825nRfkX2e1cf+stkYua+uJFxVxUfxFWGE
4pBy70D9OC7MsJ44WWOA1gwkVtMMiBTmRPNjlP8Gz2GQ0f3ERHRwYk3jDHOPHZI6
0GNt7pE3IMXQn2UuDtfkvv9IFTd+U5qD+APnWIn2ntWXqzGLFqOlmovMrobVn7El
olYDCyweWPG71m1Qblsb1VK2QjRPQVJ9NAEg8RlDHIu2ThxHbMysDVGPVOYnPFq4
S8QFpmldzbNoPU4rDJyT1fAmoUIrusBHkl+Us3yGfC74iM+fHnDEvaSoMZbzEdY1
x/Nocj9XgKEgfXdYzrCWFmZ9xXqHkO25/wDL6yKqBdQtvaEalXuHTT6mQcYxrUPm
Xx1BPan2Jg7p4u2oOFcVtKewUtRH9KBx8qytr5S+JK4PJbrBsixMnr84HLd/3X2V
ykYkO+367T5MTYv4TnJDE5vdurzUqekKSCFPY3skPujPJfdLj1vsPzYf9iMkCLdo
hU2f/R+Wpdk=
=36Ff
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Info Molnar:
"Futex improvements:
- Add the 'futex2' syscall ABI, which is an attempt to get away from
the multiplex syscall and adds a little room for extentions, while
lifting some limitations.
- Fix futex PI recursive rt_mutex waiter state bug
- Fix inter-process shared futexes on no-MMU systems
- Use folios instead of pages
Micro-optimizations of locking primitives:
- Improve arch_spin_value_unlocked() on asm-generic ticket spinlock
architectures, to improve lockref code generation
- Improve the x86-32 lockref_get_not_zero() main loop by adding
build-time CMPXCHG8B support detection for the relevant lockref
code, and by better interfacing the CMPXCHG8B assembly code with
the compiler
- Introduce arch_sync_try_cmpxchg() on x86 to improve
sync_try_cmpxchg() code generation. Convert some sync_cmpxchg()
users to sync_try_cmpxchg().
- Micro-optimize rcuref_put_slowpath()
Locking debuggability improvements:
- Improve CONFIG_DEBUG_RT_MUTEXES=y to have a fast-path as well
- Enforce atomicity of sched_submit_work(), which is de-facto atomic
but was un-enforced previously.
- Extend <linux/cleanup.h>'s no_free_ptr() with __must_check
semantics
- Fix ww_mutex self-tests
- Clean up const-propagation in <linux/seqlock.h> and simplify the
API-instantiation macros a bit
RT locking improvements:
- Provide the rt_mutex_*_schedule() primitives/helpers and use them
in the rtmutex code to avoid recursion vs. rtlock on the PI state.
- Add nested blocking lockdep asserts to rt_mutex_lock(),
rtlock_lock() and rwbase_read_lock()
.. plus misc fixes & cleanups"
* tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
futex: Don't include process MM in futex key on no-MMU
locking/seqlock: Fix grammar in comment
alpha: Fix up new futex syscall numbers
locking/seqlock: Propagate 'const' pointers within read-only methods, remove forced type casts
locking/lockdep: Fix string sizing bug that triggers a format-truncation compiler-warning
locking/seqlock: Change __seqprop() to return the function pointer
locking/seqlock: Simplify SEQCOUNT_LOCKNAME()
locking/atomics: Use atomic_try_cmpxchg_release() to micro-optimize rcuref_put_slowpath()
locking/atomic, xen: Use sync_try_cmpxchg() instead of sync_cmpxchg()
locking/atomic/x86: Introduce arch_sync_try_cmpxchg()
locking/atomic: Add generic support for sync_try_cmpxchg() and its fallback
locking/seqlock: Fix typo in comment
futex/requeue: Remove unnecessary ‘NULL’ initialization from futex_proxy_trylock_atomic()
locking/local, arch: Rewrite local_add_unless() as a static inline function
locking/debug: Fix debugfs API return value checks to use IS_ERR()
locking/ww_mutex/test: Make sure we bail out instead of livelock
locking/ww_mutex/test: Fix potential workqueue corruption
locking/ww_mutex/test: Use prng instead of rng to avoid hangs at bootup
futex: Add sys_futex_requeue()
futex: Add flags2 argument to futex_requeue()
...
Here's the bcachefs filesystem pull request.
One new patch since last week: the exportfs constants ended up
conflicting with other filesystems that are also getting added to the
global enum, so switched to new constants picked by Amir.
I'll also be sending another pull request later on in the cycle bringing
things up to date my master branch that people are currently running;
that will be restricted to fs/bcachefs/, naturally.
Testing - fstests as well as the bcachefs specific tests in ktest:
https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs-for-upstream
It's also been soaking in linux-next, which resulted in a whole bunch of
smatch complaints and fixes and a patch or two from Kees.
The only new non fs/bcachefs/ patch is the objtool patch that adds
bcachefs functions to the list of noreturns. The patch that exports
osq_lock() has been dropped for now, per Ingo.
Prereq patch list:
faf1dce852 objtool: Add bcachefs noreturns
73badee428 lib/generic-radix-tree.c: Add peek_prev()
9492261ff2 lib/generic-radix-tree.c: Don't overflow in peek()
0fb5d567f5 MAINTAINERS: Add entry for generic-radix-tree
b414e8ecd4 closures: Add a missing include
48b7935722 closures: closure_nr_remaining()
ced58fc7ab closures: closure_wait_event()
bd0d22e41e MAINTAINERS: Add entry for closures
8c8d2d9670 bcache: move closures to lib/
957e48087d locking: export contention tracepoints for bcachefs six locks
21db931445 lib: Export errname
83feeb1955 lib/string_helpers: string_get_size() now returns characters wrote
7d672f4094 stacktrace: Export stack_trace_save_tsk
771eb4fe8b fs: factor out d_mark_tmpfile()
2b69987be5 sched: Add task_struct->faults_disabled_mapping
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmU/wyIACgkQE6szbY3K
bnZc1xAAqjQBGXdtgtKQvk0/ru0WaMZguMsOHd3BUXIbm30F6eJqnoXQ/ahALofc
Ju6NrOgcy9wmdPKWpbeF+aK3WnkAW9jShDd0QieVH6PkhyYyh5r11iR/EVtjjLu5
6Teodn8fyTqn9WSDtKG15QreTCJrEasAoGFQKQDA8oiXC7zc+RSpLUkkTWD/pxyW
zVqkGGiAUG4x6FON+X2a3QBa9WCahIgV6XzHstGLsmOECxKO/LopGR5jThuIhv9t
Yo0wodQTKAgb9QviG6V3f2dJLQKKUVDmVEGTXv+8Hl3d8CiYBJeIh+icp+VESBo1
m8ev0y2xbTPLwgm5v0Uj4o/G8ISZ+qmcexV2zQ9xUWUAd2AjEBzhCh9BrNXM5qSg
o7mphH+Pt6bJXgzxb2RkYJixU11yG3yuHPOCrRGGFpVHiNYhdHuJeDZOqChWZB8x
6kY0uvU0X0tqVfWKxMwTwuqG8mJ5BkJNvnEvYi05QEZG0dDcUhgOqYlNNaL8vGkl
qVixOwE4aH4kscdmW2gXY1c76VSebheyN8n6Wj1zrmTw4hTJH7ZWXPtmbRqQzpB6
U6w3NjVyopbIjuF+syWeGqitTT/8fpvgZU4E9MpKGmHX4ADgecp6YSZQzzxTJn7D
cbVX7YQxhmsM50C1PW7A8yLCspD/uRNiKLvzb/g9gFSInk4rV+U=
=g+ia
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2023-10-30' of https://evilpiepirate.org/git/bcachefs
Pull initial bcachefs updates from Kent Overstreet:
"Here's the bcachefs filesystem pull request.
One new patch since last week: the exportfs constants ended up
conflicting with other filesystems that are also getting added to the
global enum, so switched to new constants picked by Amir.
The only new non fs/bcachefs/ patch is the objtool patch that adds
bcachefs functions to the list of noreturns. The patch that exports
osq_lock() has been dropped for now, per Ingo"
* tag 'bcachefs-2023-10-30' of https://evilpiepirate.org/git/bcachefs: (2781 commits)
exportfs: Change bcachefs fid_type enum to avoid conflicts
bcachefs: Refactor memcpy into direct assignment
bcachefs: Fix drop_alloc_keys()
bcachefs: snapshot_create_lock
bcachefs: Fix snapshot skiplists during snapshot deletion
bcachefs: bch2_sb_field_get() refactoring
bcachefs: KEY_TYPE_error now counts towards i_sectors
bcachefs: Fix handling of unknown bkey types
bcachefs: Switch to unsafe_memcpy() in a few places
bcachefs: Use struct_size()
bcachefs: Correctly initialize new buckets on device resize
bcachefs: Fix another smatch complaint
bcachefs: Use strsep() in split_devs()
bcachefs: Add iops fields to bch_member
bcachefs: Rename bch_sb_field_members -> bch_sb_field_members_v1
bcachefs: New superblock section members_v2
bcachefs: Add new helper to retrieve bch_member from sb
bcachefs: bucket_lock() is now a sleepable lock
bcachefs: fix crc32c checksum merge byte order problem
bcachefs: Fix bch2_inode_delete_keys()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
=m4pB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode time accessor updates from Christian Brauner:
"This finishes the conversion of all inode time fields to accessor
functions as discussed on list. Changing timestamps manually as we
used to do before is error prone. Using accessors function makes this
robust.
It does not contain the switch of the time fields to discrete 64 bit
integers to replace struct timespec and free up space in struct inode.
But after this, the switch can be trivially made and the patch should
only affect the vfs if we decide to do it"
* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
fs: rename inode i_atime and i_mtime fields
security: convert to new timestamp accessors
selinux: convert to new timestamp accessors
apparmor: convert to new timestamp accessors
sunrpc: convert to new timestamp accessors
mm: convert to new timestamp accessors
bpf: convert to new timestamp accessors
ipc: convert to new timestamp accessors
linux: convert to new timestamp accessors
zonefs: convert to new timestamp accessors
xfs: convert to new timestamp accessors
vboxsf: convert to new timestamp accessors
ufs: convert to new timestamp accessors
udf: convert to new timestamp accessors
ubifs: convert to new timestamp accessors
tracefs: convert to new timestamp accessors
sysv: convert to new timestamp accessors
squashfs: convert to new timestamp accessors
server: convert to new timestamp accessors
client: convert to new timestamp accessors
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTpoQAAKCRCRxhvAZXjc
ovFNAQDgIRjXfZ1Ku+USxsRRdqp8geJVaNc3PuMmYhOYhUenqgEAmC1m+p0y31dS
P6+HlL16Mqgu0tpLCcJK9BibpDZ0Ew4=
=7yD1
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Rename and export helpers that get write access to a mount. They
are used in overlayfs to get write access to the upper mount.
- Print the pretty name of the root device on boot failure. This
helps in scenarios where we would usually only print
"unknown-block(1,2)".
- Add an internal SB_I_NOUMASK flag. This is another part in the
endless POSIX ACL saga in a way.
When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
the umask because if the relevant inode has POSIX ACLs set it might
take the umask from there. But if the inode doesn't have any POSIX
ACLs set then we apply the umask in the filesytem itself. So we end
up with:
(1) no SB_POSIXACL -> strip umask in vfs
(2) SB_POSIXACL -> strip umask in filesystem
The umask semantics associated with SB_POSIXACL allowed filesystems
that don't even support POSIX ACLs at all to raise SB_POSIXACL
purely to avoid umask stripping. That specifically means NFS v4 and
Overlayfs. NFS v4 does it because it delegates this to the server
and Overlayfs because it needs to delegate umask stripping to the
upper filesystem, i.e., the filesystem used as the writable layer.
This went so far that SB_POSIXACL is raised eve on kernels that
don't even have POSIX ACL support at all.
Stop this blatant abuse and add SB_I_NOUMASK which is an internal
superblock flag that filesystems can raise to opt out of umask
handling. That should really only be the two mentioned above. It's
not that we want any filesystems to do this. Ideally we have all
umask handling always in the vfs.
- Make overlayfs use SB_I_NOUMASK too.
- Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
IS_POSIXACL() if the kernel doesn't have support for it. This is a
very old patch but it's only possible to do this now with the wider
cleanup that was done.
- Follow-up work on fake path handling from last cycle. Citing mostly
from Amir:
When overlayfs was first merged, overlayfs files of regular files
and directories, the ones that are installed in file table, had a
"fake" path, namely, f_path is the overlayfs path and f_inode is
the "real" inode on the underlying filesystem.
In v6.5, we took another small step by introducing of the
backing_file container and the file_real_path() helper. This change
allowed vfs and filesystem code to get the "real" path of an
overlayfs backing file. With this change, we were able to make
fsnotify work correctly and report events on the "real" filesystem
objects that were accessed via overlayfs.
This method works fine, but it still leaves the vfs vulnerable to
new code that is not aware of files with fake path. A recent
example is commit db1d1e8b98 ("IMA: use vfs_getattr_nosec to get
the i_version"). This commit uses direct referencing to f_path in
IMA code that otherwise uses file_inode() and file_dentry() to
reference the filesystem objects that it is measuring.
This contains work to switch things around: instead of having
filesystem code opt-in to get the "real" path, have generic code
opt-in for the "fake" path in the few places that it is needed.
Is it far more likely that new filesystems code that does not use
the file_dentry() and file_real_path() helpers will end up causing
crashes or averting LSM/audit rules if we keep the "fake" path
exposed by default.
This change already makes file_dentry() moot, but for now we did
not change this helper just added a WARN_ON() in ovl_d_real() to
catch if we have made any wrong assumptions.
After the dust settles on this change, we can make file_dentry() a
plain accessor and we can drop the inode argument to ->d_real().
- Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small
change but it really isn't and I would like to see everyone on
their tippie toes for any possible bugs from this work.
Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for
files since a very long time because of the nasty interactions
between the SCM_RIGHTS file descriptor garbage collection. So
extending it makes a lot of sense but it is a subtle change. There
are almost no places that fiddle with file rcu semantics directly
and the ones that did mess around with struct file internal under
rcu have been made to stop doing that because it really was always
dodgy.
I forgot to put in the link tag for this change and the discussion
in the commit so adding it into the merge message:
https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com
Cleanups:
- Various smaller pipe cleanups including the removal of a spin lock
that was only used to protect against writes without pipe_lock()
from O_NOTIFICATION_PIPE aka watch queues. As that was never
implemented remove the additional locking from pipe_write().
- Annotate struct watch_filter with the new __counted_by attribute.
- Clarify do_unlinkat() cleanup so that it doesn't look like an extra
iput() is done that would cause issues.
- Simplify file cleanup when the file has never been opened.
- Use module helper instead of open-coding it.
- Predict error unlikely for stale retry.
- Use WRITE_ONCE() for mount expiry field instead of just commenting
that one hopes the compiler doesn't get smart.
Fixes:
- Fix readahead on block devices.
- Fix writeback when layztime is enabled and inodes whose timestamp
is the only thing that changed reside on wb->b_dirty_time. This
caused excessively large zombie memory cgroup when lazytime was
enabled as such inodes weren't handled fast enough.
- Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"
* tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
file, i915: fix file reference for mmap_singleton()
vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
chardev: Simplify usage of try_module_get()
ovl: rely on SB_I_NOUMASK
fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
fs: store real path instead of fake path in backing file f_path
fs: create helper file_user_path() for user displayed mapped file path
fs: get mnt_writers count for an open backing file's real path
vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
vfs: predict the error in retry_estale as unlikely
backing file: free directly
vfs: fix readahead(2) on block devices
io_uring: use files_lookup_fd_locked()
file: convert to SLAB_TYPESAFE_BY_RCU
vfs: shave work on failed file open
fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
watch_queue: Annotate struct watch_filter with __counted_by
fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
fs/pipe: remove unnecessary spinlock from pipe_write()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZT0C2gAKCRCRxhvAZXjc
otV8AQCK5F9ONoQ7ISpdrKyUJiswySGXx0CYPfXbSg5gHH87zgEAua3vwVKeGXXF
5iVsdiNzIIQDwGDx7FyxufL4ggcN6gQ=
=E1kV
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs superblock updates from Christian Brauner:
"This contains the work to make block device opening functions return a
struct bdev_handle instead of just a struct block_device. The same
struct bdev_handle is then also passed to block device closing
functions.
This allows us to propagate context from opening to closing a block
device without having to modify all users everytime.
Sidenote, in the future we might even want to try and have block
device opening functions return a struct file directly but that's a
series on top of this.
These are further preparatory changes to be able to count writable
opens and blocking writes to mounted block devices. That's a separate
piece of work for next cycle and for that we absolutely need the
changes to btrfs that have been quietly dropped somehow.
Originally the series contained a patch that removed the old
blkdev_*() helpers. But since this would've caused needles churn in
-next for bcachefs we ended up delaying it.
The second piece of work addresses one of the major annoyances about
the work last cycle, namely that we required dropping s_umount
whenever we used the superblock and fs_holder_ops for a block device.
The reason for that requirement had been that in some codepaths
s_umount could've been taken under disk->open_mutex (that's always
been the case, at least theoretically). For example, on surprise block
device removal or media change. And opening and closing block devices
required grabbing disk->open_mutex as well.
So we did the work and went through the block layer and fixed all
those places so that s_umount is never taken under disk->open_mutex.
This means no more brittle games where we yield and reacquire s_umount
during block device opening and closing and no more requirements where
block devices need to be closed. Filesystems don't need to care about
this.
There's a bunch of other follow-up work such as moving block device
freezing and thawing to holder operations which makes it work for all
block devices and not just the main block device just as we did for
surprise removal. But that is for next cycle.
Tested with fstests for all major fses, blktests, LTP"
* tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
porting: update locking requirements
fs: assert that open_mutex isn't held over holder ops
block: assert that we're not holding open_mutex over blk_report_disk_dead
block: move bdev_mark_dead out of disk_check_media_change
block: WARN_ON_ONCE() when we remove active partitions
block: simplify bdev_del_partition()
fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
jfs: fix log->bdev_handle null ptr deref in lbmStartIO
bcache: Fixup error handling in register_cache()
xfs: Convert to bdev_open_by_path()
reiserfs: Convert to bdev_open_by_dev/path()
ocfs2: Convert to use bdev_open_by_dev()
nfs/blocklayout: Convert to use bdev_open_by_dev/path()
jfs: Convert to bdev_open_by_dev()
f2fs: Convert to bdev_open_by_dev/path()
ext4: Convert to bdev_open_by_dev()
erofs: Convert to use bdev_open_by_path()
btrfs: Convert to bdev_open_by_path()
fs: Convert to bdev_open_by_dev()
mm/swap: Convert to use bdev_open_by_dev()
...
Solve two ergonomic issues with struct seq_buf;
1) Too much boilerplate is required to initialize:
struct seq_buf s;
char buf[32];
seq_buf_init(s, buf, sizeof(buf));
Instead, we can build this directly on the stack. Provide
DECLARE_SEQ_BUF() macro to do this:
DECLARE_SEQ_BUF(s, 32);
2) %NUL termination is fragile and requires 2 steps to get a valid
C String (and is a layering violation exposing the "internals" of
seq_buf):
seq_buf_terminate(s);
do_something(s->buffer);
Instead, we can just return s->buffer directly after terminating it in
the refactored seq_buf_terminate(), now known as seq_buf_str():
do_something(seq_buf_str(s));
Link: https://lore.kernel.org/linux-trace-kernel/20231027155634.make.260-kees@kernel.org
Link: https://lore.kernel.org/linux-trace-kernel/20231026194033.it.702-kees@kernel.org/
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Yun Zhou <yun.zhou@windriver.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- tracing/kprobes: Fix kernel-doc warnings for the variable length
arguments.
- tracing/kprobes: Fix to count the symbols in modules even if the
module name is not specified so that user can probe the symbols in
the modules without module name.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmU82MUbHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bMZ0H+wZHWVUsmqGLGNCt3gfi
m2EJX83VMwY8PzpwZ5ezrx4ibAcUyo7Dhh8OniGgEazC3BNeggoUu/HwpirS22gI
Tx0EMlgLOJQykauiUe6FPem0IbrlbQMI1gLplx6cVd8lgIYZQfMIM5gI0kuCywT3
Ka9sCgp6y3UKQNtHKFwtPRLYFTF3Afyy2C01wdsa800SEqeOAeTD9+8yz7ZnuFt+
bNgu6vJGFfJHkEkvYCwFFqZ1eIfXON6lUFpijNpCGvMN2h1XArLexSk8JRBf6j2+
8+1FrRQsTXRk3G6v9uQABeK7z5W2F8gufmSFyBlXajbZp2HT6j4s2S86u5lP9P9J
l1U=
=etyx
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- tracing/kprobes: Fix kernel-doc warnings for the variable length
arguments
- tracing/kprobes: Fix to count the symbols in modules even if the
module name is not specified so that user can probe the symbols in
the modules without module name
* tag 'probes-fixes-v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/kprobes: Fix symbol counting logic by looking at modules as well
tracing/kprobes: Fix the description of variable length arguments
snapshot_test argument is now unused in swsusp_close() and
load_image_and_restore(). Drop it
CC: linux-pm@vger.kernel.org
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-17-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert hibernation code to use bdev_open_by_dev().
CC: linux-pm@vger.kernel.org
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-16-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
Recent changes to count number of matching symbols when creating
a kprobe event failed to take into account kernel modules. As such, it
breaks kprobes on kernel module symbols, by assuming there is no match.
Fix this my calling module_kallsyms_on_each_symbol() in addition to
kallsyms_on_each_match_symbol() to perform a proper counting.
Link: https://lore.kernel.org/all/20231027233126.2073148-1-andrii@kernel.org/
Cc: Francis Laniel <flaniel@linux.microsoft.com>
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Fixes: b022f0c7e4 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fix the following kernel-doc warnings:
kernel/trace/trace_kprobe.c:1029: warning: Excess function parameter 'args' description in '__kprobe_event_gen_cmd_start'
kernel/trace/trace_kprobe.c:1097: warning: Excess function parameter 'args' description in '__kprobe_event_add_fields'
Refer to the usage of variable length arguments elsewhere in the kernel
code, "@..." is the proper way to express it in the description.
Link: https://lore.kernel.org/all/20231027041315.2613166-1-yujie.liu@intel.com/
Fixes: 2a588dd1d5 ("tracing: Add kprobe event command generation functions")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310190437.paI6LYJF-lkp@intel.com/
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Currently, the additional information of a branch entry is stored in a
u64 space. With more and more information added, the space is running
out. For example, the information of occurrences of events will be added
for each branch.
Two places were suggested to append the counters.
https://lore.kernel.org/lkml/20230802215814.GH231007@hirez.programming.kicks-ass.net/
One place is right after the flags of each branch entry. It changes the
existing struct perf_branch_entry. The later ARCH specific
implementation has to be really careful to consistently pick
the right struct.
The other place is right after the entire struct perf_branch_stack.
The disadvantage is that the pointer of the extra space has to be
recorded. The common interface perf_sample_save_brstack() has to be
updated.
The latter is much straightforward, and should be easily understood and
maintained. It is implemented in the patch.
Add a new branch sample type, PERF_SAMPLE_BRANCH_COUNTERS, to indicate
the event which is recorded in the branch info.
The "u64 counters" may store the occurrences of several events. The
information regarding the number of events/counters and the width of
each counter should be exposed via sysfs as a reference for the perf
tool. Define the branch_counter_nr and branch_counter_width ABI here.
The support will be implemented later in the Intel-specific patch.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-1-kan.liang@linux.intel.com
Add Kconfig options to use SHA-3 for kernel module signing. 256 size
for RSA only, and higher sizes for RSA and NIST P-384.
Signed-off-by: Dimitri John Ledkov <dimitri.ledkov@canonical.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In a high-load arm64 environment, the pcrypt_aead01 test in LTP can lead
to system UAF (Use-After-Free) issues. Due to the lengthy analysis of
the pcrypt_aead01 function call, I'll describe the problem scenario
using a simplified model:
Suppose there's a user of padata named `user_function` that adheres to
the padata requirement of calling `padata_free_shell` after `serial()`
has been invoked, as demonstrated in the following code:
```c
struct request {
struct padata_priv padata;
struct completion *done;
};
void parallel(struct padata_priv *padata) {
do_something();
}
void serial(struct padata_priv *padata) {
struct request *request = container_of(padata,
struct request,
padata);
complete(request->done);
}
void user_function() {
DECLARE_COMPLETION(done)
padata->parallel = parallel;
padata->serial = serial;
padata_do_parallel();
wait_for_completion(&done);
padata_free_shell();
}
```
In the corresponding padata.c file, there's the following code:
```c
static void padata_serial_worker(struct work_struct *serial_work) {
...
cnt = 0;
while (!list_empty(&local_list)) {
...
padata->serial(padata);
cnt++;
}
local_bh_enable();
if (refcount_sub_and_test(cnt, &pd->refcnt))
padata_free_pd(pd);
}
```
Because of the high system load and the accumulation of unexecuted
softirq at this moment, `local_bh_enable()` in padata takes longer
to execute than usual. Subsequently, when accessing `pd->refcnt`,
`pd` has already been released by `padata_free_shell()`, resulting
in a UAF issue with `pd->refcnt`.
The fix is straightforward: add `refcount_dec_and_test` before calling
`padata_free_pd` in `padata_free_shell`.
Fixes: 07928d9bfc ("padata: Remove broken queue flushing")
Signed-off-by: WangJinchao <wangjinchao@xfusion.com>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
On no-MMU, all futexes are treated as private because there is no need
to map a virtual address to physical to match the futex across
processes. This doesn't quite work though, because private futexes
include the current process's mm_struct as part of their key. This makes
it impossible for one process to wake up a shared futex being waited on
in another process.
Fix this bug by excluding the mm_struct from the key. With
a single address space, the futex address is already a unique key.
Fixes: 784bdf3bb6 ("futex: Assume all mappings are private on !MMU systems")
Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20231019204548.1236437-2-ben.wolsieffer@hefring.com
irq_remove_generic_chip() calculates the Linux interrupt number for removing the
handler and interrupt chip based on gc::irq_base as a linear function of
the bit positions of set bits in the @msk argument.
When the generic chip is present in an irq domain, i.e. created with a call
to irq_alloc_domain_generic_chips(), gc::irq_base contains not the base
Linux interrupt number. It contains the base hardware interrupt for this
chip. It is set to 0 for the first chip in the domain, 0 + N for the next
chip, where $N is the number of hardware interrupts per chip.
That means the Linux interrupt number cannot be calculated based on
gc::irq_base for irqdomain based chips without a domain map lookup, which
is currently missing.
Rework the code to take the irqdomain case into account and calculate the
Linux interrupt number by a irqdomain lookup of the domain specific
hardware interrupt number.
[ tglx: Massage changelog. Reshuffle the logic and add a proper comment. ]
Fixes: cfefd21e69 ("genirq: Add chip suspend and resume callbacks")
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20231024150335.322282-1-herve.codina@bootlin.com
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZTp12QAKCRDbK58LschI
g8BrAQDifqp5liEEdXV8jdReBwJtqInjrL5tzy5LcyHUMQbTaAEA6Ph3Ct3B+3oA
mFnIW/y6UJiJrby0Xz4+vV5BXI/5WQg=
=pLCV
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-10-26
We've added 51 non-merge commits during the last 10 day(s) which contain
a total of 75 files changed, 5037 insertions(+), 200 deletions(-).
The main changes are:
1) Add open-coded task, css_task and css iterator support.
One of the use cases is customizable OOM victim selection via BPF,
from Chuyi Zhou.
2) Fix BPF verifier's iterator convergence logic to use exact states
comparison for convergence checks, from Eduard Zingerman,
Andrii Nakryiko and Alexei Starovoitov.
3) Add BPF programmable net device where bpf_mprog defines the logic
of its xmit routine. It can operate in L3 and L2 mode,
from Daniel Borkmann and Nikolay Aleksandrov.
4) Batch of fixes for BPF per-CPU kptr and re-enable unit_size checking
for global per-CPU allocator, from Hou Tao.
5) Fix libbpf which eagerly assumed that SHT_GNU_verdef ELF section
was going to be present whenever a binary has SHT_GNU_versym section,
from Andrii Nakryiko.
6) Fix BPF ringbuf correctness to fold smp_mb__before_atomic() into
atomic_set_release(), from Paul E. McKenney.
7) Add a warning if NAPI callback missed xdp_do_flush() under
CONFIG_DEBUG_NET which helps checking if drivers were missing
the former, from Sebastian Andrzej Siewior.
8) Fix missed RCU read-lock in bpf_task_under_cgroup() which was throwing
a warning under sleepable programs, from Yafang Shao.
9) Avoid unnecessary -EBUSY from htab_lock_bucket by disabling IRQ before
checking map_locked, from Song Liu.
10) Make BPF CI linked_list failure test more robust,
from Kumar Kartikeya Dwivedi.
11) Enable samples/bpf to be built as PIE in Fedora, from Viktor Malik.
12) Fix xsk starving when multiple xsk sockets were associated with
a single xsk_buff_pool, from Albert Huang.
13) Clarify the signed modulo implementation for the BPF ISA standardization
document that it uses truncated division, from Dave Thaler.
14) Improve BPF verifier's JEQ/JNE branch taken logic to also consider
signed bounds knowledge, from Andrii Nakryiko.
15) Add an option to XDP selftests to use multi-buffer AF_XDP
xdp_hw_metadata and mark used XDP programs as capable to use frags,
from Larysa Zaremba.
16) Fix bpftool's BTF dumper wrt printing a pointer value and another
one to fix struct_ops dump in an array, from Manu Bretelle.
* tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (51 commits)
netkit: Remove explicit active/peer ptr initialization
selftests/bpf: Fix selftests broken by mitigations=off
samples/bpf: Allow building with custom bpftool
samples/bpf: Fix passing LDFLAGS to libbpf
samples/bpf: Allow building with custom CFLAGS/LDFLAGS
bpf: Add more WARN_ON_ONCE checks for mismatched alloc and free
selftests/bpf: Add selftests for netkit
selftests/bpf: Add netlink helper library
bpftool: Extend net dump with netkit progs
bpftool: Implement link show support for netkit
libbpf: Add link-based API for netkit
tools: Sync if_link uapi header
netkit, bpf: Add bpf programmable net device
bpf: Improve JEQ/JNE branch taken logic
bpf: Fold smp_mb__before_atomic() into atomic_set_release()
bpf: Fix unnecessary -EBUSY from htab_lock_bucket
xsk: Avoid starving the xsk further down the list
bpf: print full verifier states on infinite loop detection
selftests/bpf: test if state loops are detected in a tricky case
bpf: correct loop detection for iterators convergence
...
====================
Link: https://lore.kernel.org/r/20231026150509.2824-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge updates related to system sleep handling, one power capping update
and one PM utility update for 6.7-rc1:
- Use __get_safe_page() rather than touching the list in hibernation
snapshot code (Brian Geffon).
- Fix symbol export for _SIMPLE_ variants of _PM_OPS() (Raag Jadav).
- Clean up sync_read handling in snapshot_write_next() (Brian Geffon).
- Fix kerneldoc comments for swsusp_check() and swsusp_close() to
better match code (Christoph Hellwig).
- Downgrade BIOS locked limits pr_warn() in the Intel RAPL power
capping driver to pr_debug() (Ville Syrjälä).
- Change the minimum python version for the intel_pstate_tracer utility
from 2.7 to 3.6 (Doug Smythies).
* pm-sleep:
PM: hibernate: fix the kerneldoc comment for swsusp_check() and swsusp_close()
PM: hibernate: Clean up sync_read handling in snapshot_write_next()
PM: sleep: Fix symbol export for _SIMPLE_ variants of _PM_OPS()
PM: hibernate: Use __get_safe_page() rather than touching the list
* powercap:
powercap: intel_rapl: Downgrade BIOS locked limits pr_warn() to pr_debug()
* pm-tools:
tools/power/x86/intel_pstate_tracer: python minimum version
Merge cpufreq updates for 6.7-rc1:
- Add support for several Qualcomm SoC versions and other similar
changes (Christian Marangi, Dmitry Baryshkov, Luca Weiss, Neil
Armstrong, Richard Acayan, Robert Marko, Rohit Agarwal, Stephan
Gerhold and Varadarajan Narayanan).
- Clean up the tegra cpufreq driver (Sumit Gupta).
- Use of_property_read_reg() to parse "reg" in pmac32 driver (Rob
Herring).
- Add support for TI's am62p5 Soc (Bryan Brattlof).
- Make ARM_BRCMSTB_AVS_CPUFREQ depends on !ARM_SCMI_CPUFREQ (Florian
Fainelli).
- Update Kconfig to mention i.MX7 as well (Alexander Stein).
- Revise global turbo disable check in intel_pstate (Srinivas
Pandruvada).
- Carry out initialization of sg_cpu in the schedutil cpufreq governor
in one loop (Liao Chang).
- Simplify the condition for storing 'down_threshold' in the
conservative cpufreq governor (Liao Chang).
- Use fine-grained mutex in the userspace cpufreq governor (Liao
Chang).
- Move is_managed indicator in the userspace cpufreq governor into a
per-policy structure (Liao Chang).
- Rebuild sched-domains when removing cpufreq driver (Pierre Gondois).
- Fix buffer overflow detection in trans_stats() (Christian Marangi).
* pm-cpufreq: (32 commits)
dt-bindings: cpufreq: qcom-hw: document SM8650 CPUFREQ Hardware
cpufreq: arm: Kconfig: Add i.MX7 to supported SoC for ARM_IMX_CPUFREQ_DT
cpufreq: qcom-nvmem: add support for IPQ8064
cpufreq: qcom-nvmem: also accept operating-points-v2-krait-cpu
cpufreq: qcom-nvmem: drop pvs_ver for format a fuses
dt-bindings: cpufreq: qcom-cpufreq-nvmem: Document krait-cpu
cpufreq: qcom-nvmem: add support for IPQ6018
dt-bindings: cpufreq: qcom-cpufreq-nvmem: document IPQ6018
cpufreq: qcom-nvmem: Add MSM8909
cpufreq: qcom-nvmem: Simplify driver data allocation
cpufreq: stats: Fix buffer overflow detection in trans_stats()
dt-bindings: cpufreq: cpufreq-qcom-hw: Add SDX75 compatible
cpufreq: ARM_BRCMSTB_AVS_CPUFREQ cannot be used with ARM_SCMI_CPUFREQ
cpufreq: ti-cpufreq: Add opp support for am62p5 SoCs
cpufreq: dt-platdev: add am62p5 to blocklist
cpufreq: tegra194: remove redundant AND with cpu_online_mask
cpufreq: tegra194: use refclk delta based loop instead of udelay
cpufreq: tegra194: save CPU data to avoid repeated SMP calls
cpufreq: Rebuild sched-domains when removing cpufreq driver
cpufreq: userspace: Move is_managed indicator into per-policy structure
...
There are two possible mismatched alloc and free cases in BPF memory
allocator:
1) allocate from cache X but free by cache Y with a different unit_size
2) allocate from per-cpu cache but free by kmalloc cache or vice versa
So add more WARN_ON_ONCE checks in free_bulk() and __free_by_rcu() to
spot these mismatched alloc and free early.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231021014959.3563841-1-houtao@huaweicloud.com
commit ef8dd01538 ("genirq/msi: Make interrupt allocation less
convoluted"), reworked the code so that the x86 specific quirk for affinity
setting of non-maskable PCI/MSI interrupts is not longer activated if
necessary.
This could be solved by restoring the original logic in the core MSI code,
but after a deeper analysis it turned out that the quirk flag is not
required at all.
The quirk is only required when the PCI/MSI device cannot mask the MSI
interrupts, which in turn also prevents reservation mode from being enabled
for the affected interrupt.
This allows ot remove the NOMASK quirk bit completely as msi_set_affinity()
can instead check whether reservation mode is enabled for the interrupt,
which gives exactly the same answer.
Even in the momentary non-existing case that the reservation mode would be
not set for a maskable MSI interrupt this would not cause any harm as it
just would cause msi_set_affinity() to go needlessly through the
functionaly equivalent slow path, which works perfectly fine with maskable
interrupts as well.
Rework msi_set_affinity() to query the reservation mode and remove all
NOMASK quirk logic from the core code.
[ tglx: Massaged changelog ]
Fixes: ef8dd01538 ("genirq/msi: Make interrupt allocation less convoluted")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Koichiro Den <den@valinux.co.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20231026032036.2462428-1-den@valinux.co.jp
The get_task_exe_file() function locks the given task with task_lock()
which when used inside audit_exe_compare() can cause deadlocks on
systems that generate audit records when the task_lock() is held. We
resolve this problem with two changes: ignoring those cases where the
task being audited is not the current task, and changing our approach
to obtaining the executable file struct to not require task_lock().
With the intent of the audit exe filter being to filter on audit events
generated by processes started by the specified executable, it makes
sense that we would only want to use the exe filter on audit records
associated with the currently executing process, e.g. @current. If
we are asked to filter records using a non-@current task_struct we can
safely ignore the exe filter without negatively impacting the admin's
expectations for the exe filter.
Knowing that we only have to worry about filtering the currently
executing task in audit_exe_compare() we can do away with the
task_lock() and call get_mm_exe_file() with @current->mm directly.
Cc: <stable@vger.kernel.org>
Fixes: 5efc244346 ("audit: fix exe_file access in audit_exe_compare")
Reported-by: Andreas Steinmetz <anstein99@googlemail.com>
Reviewed-by: John Johansen <john.johanse@canonical.com>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Convert to use folio_xchg_last_cpupid() in should_numa_migrate_memory().
Link: https://lkml.kernel.org/r/20231018140806.2783514-14-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Convert to use folio_xchg_access_time() in numa_hint_fault_latency().
Link: https://lkml.kernel.org/r/20231018140806.2783514-9-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When a CPU is about to be offlined, x86 validates that all active
interrupts which are targeted to this CPU can be migrated to the remaining
online CPUs. If not, the offline operation is aborted.
The validation uses irq_matrix_allocated() to retrieve the number of
vectors which are allocated on the outgoing CPU. The returned number of
allocated vectors includes also vectors which are associated to managed
interrupts.
That's overaccounting because managed interrupts are:
- not migrated when the affinity mask of the interrupt targets only
the outgoing CPU
- migrated to another CPU, but in that case the vector is already
pre-allocated on the potential target CPUs and must not be taken into
account.
As a consequence the check whether the remaining online CPUs have enough
capacity for migrating the allocated vectors from the outgoing CPU might
fail incorrectly.
Let irq_matrix_allocated() return only the number of allocated non-managed
interrupts to make this validation check correct.
[ tglx: Amend changelog and fixup kernel-doc comment ]
Fixes: 2f75d9e1c9 ("genirq: Implement bitmap matrix allocator")
Reported-by: Wendy Wang <wendy.wang@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231020072522.557846-1-yu.c.chen@intel.com
When allocating a new pool at runtime, reduce the number of slabs so
that the allocation order is at most MAX_ORDER. This avoids a kernel
warning in __alloc_pages().
The warning is relatively benign, because the pool size is subsequently
reduced when allocation fails, but it is silly to start with a request
that is known to fail, especially since this is the default behavior if
the kernel is built with CONFIG_SWIOTLB_DYNAMIC=y and booted without any
swiotlb= parameter.
Reported-by: Ben Greear <greearb@candelatech.com>
Closes: https://lore.kernel.org/netdev/4f173dd2-324a-0240-ff8d-abf5c191be18@candelatech.com/
Fixes: 1aaa736815 ("swiotlb: allocate a new memory pool when existing pools are full")
Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This work adds a new, minimal BPF-programmable device called "netkit"
(former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
core idea is that BPF programs are executed within the drivers xmit routine
and therefore e.g. in case of containers/Pods moving BPF processing closer
to the source.
One of the goals was that in case of Pod egress traffic, this allows to
move BPF programs from hostns tcx ingress into the device itself, providing
earlier drop or forward mechanisms, for example, if the BPF program
determines that the skb must be sent out of the node, then a redirect to
the physical device can take place directly without going through per-CPU
backlog queue. This helps to shift processing for such traffic from softirq
to process context, leading to better scheduling decisions/performance (see
measurements in the slides).
In this initial version, the netkit device ships as a pair, but we plan to
extend this further so it can also operate in single device mode. The pair
comes with a primary and a peer device. Only the primary device, typically
residing in hostns, can manage BPF programs for itself and its peer. The
peer device is designated for containers/Pods and cannot attach/detach
BPF programs. Upon the device creation, the user can set the default policy
to 'pass' or 'drop' for the case when no BPF program is attached.
Additionally, the device can be operated in L3 (default) or L2 mode. The
management of BPF programs is done via bpf_mprog, so that multi-attach is
supported right from the beginning with similar API and dependency controls
as tcx. For details on the latter see commit 053c8e1f23 ("bpf: Add generic
attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
so that existing programs can be easily migrated.
Going forward, we plan to use netkit devices in Cilium as the main device
type for connecting Pods. They will be operated in L3 mode in order to
simplify a Pod's neighbor management and the peer will operate in default
drop mode, so that no traffic is leaving between the time when a Pod is
brought up by the CNI plugin and programs attached by the agent.
Additionally, the programs we attach via tcx on the physical devices are
using bpf_redirect_peer() for inbound traffic into netkit device, hence the
latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
directly. Also, BIG TCP is supported on netkit device. For the follow-up
work in single device mode, we plan to convert Cilium's cilium_host/_net
devices into a single one.
An extensive test suite for checking device operations and the BPF program
and link management API comes as BPF selftests in this series.
Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://github.com/borkmann/iproute2/tree/pr/netkit
Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
When determining if an if/else branch will always or never be taken, use
signed range knowledge in addition to currently used unsigned range knowledge.
If either signed or unsigned range suggests that condition is always/never
taken, return corresponding branch_taken verdict.
Current use of unsigned range for this seems arbitrary and unnecessarily
incomplete. It is possible for *signed* operations to be performed on
register, which could "invalidate" unsigned range for that register. In such
case branch_taken will be artificially useless, even if we can still tell
that some constant is outside of register value range based on its signed
bounds.
veristat-based validation shows zero differences across selftests, Cilium,
and Meta-internal BPF object files.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/bpf/20231022205743.72352-2-andrii@kernel.org
The bpf_user_ringbuf_drain() BPF_CALL function uses an atomic_set()
immediately preceded by smp_mb__before_atomic() so as to order storing
of ring-buffer consumer and producer positions prior to the atomic_set()
call's clearing of the ->busy flag, as follows:
smp_mb__before_atomic();
atomic_set(&rb->busy, 0);
Although this works given current architectures and implementations, and
given that this only needs to order prior writes against a later write.
However, it does so by accident because the smp_mb__before_atomic()
is only guaranteed to work with read-modify-write atomic operations, and
not at all with things like atomic_set() and atomic_read().
Note especially that smp_mb__before_atomic() will not, repeat *not*,
order the prior write to "a" before the subsequent non-read-modify-write
atomic read from "b", even on strongly ordered systems such as x86:
WRITE_ONCE(a, 1);
smp_mb__before_atomic();
r1 = atomic_read(&b);
Therefore, replace the smp_mb__before_atomic() and atomic_set() with
atomic_set_release() as follows:
atomic_set_release(&rb->busy, 0);
This is no slower (and sometimes is faster) than the original, and also
provides a formal guarantee of ordering that the original lacks.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/ec86d38e-cfb4-44aa-8fdb-6c925922d93c@paulmck-laptop
htab_lock_bucket uses the following logic to avoid recursion:
1. preempt_disable();
2. check percpu counter htab->map_locked[hash] for recursion;
2.1. if map_lock[hash] is already taken, return -BUSY;
3. raw_spin_lock_irqsave();
However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ
logic will not able to access the same hash of the hashtab and get -EBUSY.
This -EBUSY is not really necessary. Fix it by disabling IRQ before
checking map_locked:
1. preempt_disable();
2. local_irq_save();
3. check percpu counter htab->map_locked[hash] for recursion;
3.1. if map_lock[hash] is already taken, return -BUSY;
4. raw_spin_lock().
Similarly, use raw_spin_unlock() and local_irq_restore() in
htab_unlock_bucket().
Fixes: 20b6cc34ea ("bpf: Avoid hashtab deadlock with map_locked")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/7a9576222aa40b1c84ad3a9ba3e64011d1a04d41.camel@linux.ibm.com
Link: https://lore.kernel.org/bpf/20231012055741.3375999-1-song@kernel.org
Chen Yu reports a hackbench regression of cluster wakeup when
hackbench threads equal to the CPU number [1]. Analysis shows
it's because we wake up more on the target CPU even if the
prev_cpu is a good wakeup candidate and leads to the decrease
of the CPU utilization.
Generally if the task's prev_cpu is idle we'll wake up the task
on it without scanning. On cluster machines we'll try to wake up
the task in the same cluster of the target for better cache
affinity, so if the prev_cpu is idle but not sharing the same
cluster with the target we'll still try to find an idle CPU within
the cluster. This will improve the performance at low loads on
cluster machines. But in the issue above, if the prev_cpu is idle
but not in the cluster with the target CPU, we'll try to scan an
idle one in the cluster. But since the system is busy, we're
likely to fail the scanning and use target instead, even if
the prev_cpu is idle. Then leads to the regression.
This patch solves this in 2 steps:
o record the prev_cpu/recent_used_cpu if they're good wakeup
candidates but not sharing the cluster with the target.
o on scanning failure use the prev_cpu/recent_used_cpu if
they're recorded as idle
[1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/
Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231019033323.54147-4-yangyicong@huawei.com
For platforms having clusters like Kunpeng920, CPUs within the same cluster
have lower latency when synchronizing and accessing shared resources like
cache. Thus, this patch tries to find an idle cpu within the cluster of the
target CPU before scanning the whole LLC to gain lower latency. This
will be implemented in 2 steps in select_idle_sibling():
1. When the prev_cpu/recent_used_cpu are good wakeup candidates, use them
if they're sharing cluster with the target CPU. Otherwise trying to
scan for an idle CPU in the target's cluster.
2. Scanning the cluster prior to the LLC of the target CPU for an
idle CPU to wakeup.
Testing has been done on Kunpeng920 by pinning tasks to one numa and two
numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs.
With this patch, We noticed enhancement on tbench and netperf within one
numa or cross two numa on top of tip-sched-core commit
9b46f1abc6d4 ("sched/debug: Print 'tgid' in sched_show_task()")
tbench results (node 0):
baseline patched
1: 327.2833 372.4623 ( 13.80%)
4: 1320.5933 1479.8833 ( 12.06%)
8: 2638.4867 2921.5267 ( 10.73%)
16: 5282.7133 5891.5633 ( 11.53%)
32: 9810.6733 9877.3400 ( 0.68%)
64: 7408.9367 7447.9900 ( 0.53%)
128: 6203.2600 6191.6500 ( -0.19%)
tbench results (node 0-1):
baseline patched
1: 332.0433 372.7223 ( 12.25%)
4: 1325.4667 1477.6733 ( 11.48%)
8: 2622.9433 2897.9967 ( 10.49%)
16: 5218.6100 5878.2967 ( 12.64%)
32: 10211.7000 11494.4000 ( 12.56%)
64: 13313.7333 16740.0333 ( 25.74%)
128: 13959.1000 14533.9000 ( 4.12%)
netperf results TCP_RR (node 0):
baseline patched
1: 76546.5033 90649.9867 ( 18.42%)
4: 77292.4450 90932.7175 ( 17.65%)
8: 77367.7254 90882.3467 ( 17.47%)
16: 78519.9048 90938.8344 ( 15.82%)
32: 72169.5035 72851.6730 ( 0.95%)
64: 25911.2457 25882.2315 ( -0.11%)
128: 10752.6572 10768.6038 ( 0.15%)
netperf results TCP_RR (node 0-1):
baseline patched
1: 76857.6667 90892.2767 ( 18.26%)
4: 78236.6475 90767.3017 ( 16.02%)
8: 77929.6096 90684.1633 ( 16.37%)
16: 77438.5873 90502.5787 ( 16.87%)
32: 74205.6635 88301.5612 ( 19.00%)
64: 69827.8535 71787.6706 ( 2.81%)
128: 25281.4366 25771.3023 ( 1.94%)
netperf results UDP_RR (node 0):
baseline patched
1: 96869.8400 110800.8467 ( 14.38%)
4: 97744.9750 109680.5425 ( 12.21%)
8: 98783.9863 110409.9637 ( 11.77%)
16: 99575.0235 110636.2435 ( 11.11%)
32: 95044.7250 97622.8887 ( 2.71%)
64: 32925.2146 32644.4991 ( -0.85%)
128: 12859.2343 12824.0051 ( -0.27%)
netperf results UDP_RR (node 0-1):
baseline patched
1: 97202.4733 110190.1200 ( 13.36%)
4: 95954.0558 106245.7258 ( 10.73%)
8: 96277.1958 105206.5304 ( 9.27%)
16: 97692.7810 107927.2125 ( 10.48%)
32: 79999.6702 103550.2999 ( 29.44%)
64: 80592.7413 87284.0856 ( 8.30%)
128: 27701.5770 29914.5820 ( 7.99%)
Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch
in the code has not been tested but it supposed to work.
Chen Yu also noticed this will improve the performance of tbench and
netperf on a 24 CPUs Jacobsville machine, there are 4 CPUs in one
cluster sharing L2 Cache.
[https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-3-yangyicong@huawei.com
Add cpus_share_resources() API. This is the preparation for the
optimization of select_idle_cpu() on platforms with cluster scheduler
level.
On a machine with clusters cpus_share_resources() will test whether
two cpus are within the same cluster. On a non-cluster machine it
will behaves the same as cpus_share_cache(). So we use "resources"
here for cache resources.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com
Igor Raits and Bagas Sanjaya report a RQCF_ACT_SKIP leak warning.
This warning may be triggered in the following situations:
CPU0 CPU1
__schedule()
*rq->clock_update_flags <<= 1;* unregister_fair_sched_group()
pick_next_task_fair+0x4a/0x410 destroy_cfs_bandwidth()
newidle_balance+0x115/0x3e0 for_each_possible_cpu(i) *i=0*
rq_unpin_lock(this_rq, rf) __cfsb_csd_unthrottle()
raw_spin_rq_unlock(this_rq)
rq_lock(*CPU0_rq*, &rf)
rq_clock_start_loop_update()
rq->clock_update_flags & RQCF_ACT_SKIP <--
raw_spin_rq_lock(this_rq)
The purpose of RQCF_ACT_SKIP is to skip the update rq clock,
but the update is very early in __schedule(), but we clear
RQCF_*_SKIP very late, causing it to span that gap above
and triggering this warning.
In __schedule() we can clear the RQCF_*_SKIP flag immediately
after update_rq_clock() to avoid this RQCF_ACT_SKIP leak warning.
And set rq->clock_update_flags to RQCF_UPDATED to avoid
rq->clock_update_flags < RQCF_ACT_SKIP warning that may be triggered later.
Fixes: ebb83d84e4 ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
Closes: https://lore.kernel.org/all/20230913082424.73252-1-jiahao.os@bytedance.com
Reported-by: Igor Raits <igor.raits@gmail.com>
Reported-by: Bagas Sanjaya <bagasdotme@gmail.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/a5dd536d-041a-2ce9-f4b7-64d8d85c86dc@gmail.com
In the following two functions, len has already been assigned a value of
0 when defining the variable, so remove 'len=0;'.
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20231023062359.130633-1-kunyu@nfschina.com
Additional logging in is_state_visited(): if infinite loop is detected
print full verifier state for both current and equivalent states.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-8-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It turns out that .branches > 0 in is_state_visited() is not a
sufficient condition to identify if two verifier states form a loop
when iterators convergence is computed. This commit adds logic to
distinguish situations like below:
(I) initial (II) initial
| |
V V
.---------> hdr ..
| | |
| V V
| .------... .------..
| | | | |
| V V V V
| ... ... .-> hdr ..
| | | | | |
| V V | V V
| succ <- cur | succ <- cur
| | | |
| V | V
| ... | ...
| | | |
'----' '----'
For both (I) and (II) successor 'succ' of the current state 'cur' was
previously explored and has branches count at 0. However, loop entry
'hdr' corresponding to 'succ' might be a part of current DFS path.
If that is the case 'succ' and 'cur' are members of the same loop
and have to be compared exactly.
Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Convergence for open coded iterators is computed in is_state_visited()
by examining states with branches count > 1 and using states_equal().
states_equal() computes sub-state relation using read and precision marks.
Read and precision marks are propagated from children states,
thus are not guaranteed to be complete inside a loop when branches
count > 1. This could be demonstrated using the following unsafe program:
1. r7 = -16
2. r6 = bpf_get_prandom_u32()
3. while (bpf_iter_num_next(&fp[-8])) {
4. if (r6 != 42) {
5. r7 = -32
6. r6 = bpf_get_prandom_u32()
7. continue
8. }
9. r0 = r10
10. r0 += r7
11. r8 = *(u64 *)(r0 + 0)
12. r6 = bpf_get_prandom_u32()
13. }
Here verifier would first visit path 1-3, create a checkpoint at 3
with r7=-16, continue to 4-7,3 with r7=-32.
Because instructions at 9-12 had not been visitied yet existing
checkpoint at 3 does not have read or precision mark for r7.
Thus states_equal() would return true and verifier would discard
current state, thus unsafe memory access at 11 would not be caught.
This commit fixes this loophole by introducing exact state comparisons
for iterator convergence logic:
- registers are compared using regs_exact() regardless of read or
precision marks;
- stack slots have to have identical type.
Unfortunately, this is too strict even for simple programs like below:
i = 0;
while(iter_next(&it))
i++;
At each iteration step i++ would produce a new distinct state and
eventually instruction processing limit would be reached.
To avoid such behavior speculatively forget (widen) range for
imprecise scalar registers, if those registers were not precise at the
end of the previous iteration and do not match exactly.
This a conservative heuristic that allows to verify wide range of
programs, however it precludes verification of programs that conjure
an imprecise value on the first loop iteration and use it as precise
on the second.
Test case iter_task_vma_for_each() presents one of such cases:
unsigned int seen = 0;
...
bpf_for_each(task_vma, vma, task, 0) {
if (seen >= 1000)
break;
...
seen++;
}
Here clang generates the following code:
<LBB0_4>:
24: r8 = r6 ; stash current value of
... body ... 'seen'
29: r1 = r10
30: r1 += -0x8
31: call bpf_iter_task_vma_next
32: r6 += 0x1 ; seen++;
33: if r0 == 0x0 goto +0x2 <LBB0_6> ; exit on next() == NULL
34: r7 += 0x10
35: if r8 < 0x3e7 goto -0xc <LBB0_4> ; loop on seen < 1000
<LBB0_6>:
... exit ...
Note that counter in r6 is copied to r8 and then incremented,
conditional jump is done using r8. Because of this precision mark for
r6 lags one state behind of precision mark on r8 and widening logic
kicks in.
Adding barrier_var(seen) after conditional is sufficient to force
clang use the same register for both counting and conditional jump.
This issue was discussed in the thread [1] which was started by
Andrew Werner <awerner32@gmail.com> demonstrating a similar bug
in callback functions handling. The callbacks would be addressed
in a followup patch.
[1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/
Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Extract same_callsites() from clean_live_states() as a utility function.
This function would be used by the next patch in the set.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Subsequent patches would make use of explored_state() function.
Move it up to avoid adding unnecessary prototype.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This is just a process PR to merge the topic branch into drm-next, this contains some core kernel and drm changes.
Signed-off-by: Dave Airlie <airlied@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmU3GKAACgkQDHTzWXnE
hr4niA//SBthUNEGmzYtYVnpL4O+6nFgC+Pd4HN+xgdobQUob5TfIGpsoeD+Z4KT
hi7XcNdR+UElOIcNJOvzAYFZ9Joev3W2lUdwNpPtO/Oltaj5RSZr/2RqDLDKBsPl
BXFKpkACUsuSLoHWNpgTmz4PvTFUJJMPUKzdQ09e+IXhM7J85AffpAHVnKFFxzjm
mvcwmru64bElmlp0rySDN6RaeRqNQq3hcCfCtkRFiZgIZ+L59y+0kBsopJXHjJNX
9V3rvS0JRy5eoNkcwGUHr2EY/bO/hqIuHLv3aNbgqp4GfQAsFQ189cjXP8GmUxhX
MUgOG0+/zwBieobgMTHddNI85pTDP4LNnXHzPpi3AKVkP/ByQLdeYBJe42Ea2nNS
GOhA/0jRixfm2jqIc2cyZ+Ed9jyQSBs4gEV8hJOzlG6ex5qKeTLCG+9GLhxO3ZMM
Ze1+iQCuFHke/9lpGVsY/gzASAZ9BPIaESaLFomH38Yei+tGtEMW1S1GUq26t4vT
6e/65UU9xvawLJVj/XckOuJe2Pjp74ViCRlYXjuO8KCorAJ5rj3CPa2Na/douZ2D
YkjN7qOkUX7qDo1PBSVoIe18QP2+2pg62LXhqCMeIrQBWnViNps1/S+IyeC6VBQ0
HeMT7vFhC/mxEnzT1aiD+BQJpZgzPNNN+1Ug/kFczI0sCnyTOuQ=
=8GVw
-----END PGP SIGNATURE-----
Merge tag 'topic/vmemdup-user-array-2023-10-24-1' of git://anongit.freedesktop.org/drm/drm into drm-next
vmemdup-user-array API and changes with it.
This is just a process PR to merge the topic branch into drm-next, this contains some core kernel and drm changes.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231024010905.646830-1-airlied@redhat.com
As kernel test robot reported, lib/test_objpool.c (trace:probes/for-next)
has linux/version.h included, but version.h is not used at all. Then more
unused headers are found in test_objpool.c and rethook.c, and all of them
should be removed.
Link: https://lore.kernel.org/all/20231023112245.6112-1-wuqiang.matt@bytedance.com/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310191512.vvypKU5Z-lkp@intel.com/
Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Rewrite the comment explaining why swiotlb copies the original buffer to
the TLB buffer before initiating DMA *from* the device, i.e. before the
device DMAs into the TLB buffer. The existing comment's argument that
preserving the original data can prevent a kernel memory leak is bogus.
If the driver that triggered the mapping _knows_ that the device will
overwrite the entire mapping, or the driver will consume only the written
parts, then copying from the original memory is completely pointless.
If neither of the above holds true, then copying from the original adds
value only if preserving the data is necessary for functional
correctness, or the driver explicitly initialized the original memory.
If the driver didn't initialize the memory, then copying the original
buffer to the TLB buffer simply changes what kernel data is leaked to
user space.
Writing the entire TLB buffer _does_ prevent leaking stale TLB buffer
data from a previous bounce, but that can be achieved by simply zeroing
the TLB buffer when grabbing a slot.
The real reason swiotlb ended up initializing the TLB buffer with the
original buffer is that it's necessary to make swiotlb operate as
transparently as possible, i.e. to behave as closely as possible to
hardware, and to avoid corrupting the original buffer, e.g. if the driver
knows the device will do partial writes and is relying on the unwritten
data to be preserved.
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/all/ZN5elYQ5szQndN8n@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Log a warning once when dma_alloc_coherent fails because the platform
does not support coherent allocations at all.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Tested-by: Greg Ungerer <gerg@linux-m68k.org>
The logic in dma_direct_alloc when to use the atomic pool vs remapping
grew a bit unreadable. Consolidate it into a single check, and clean
up the set_uncached vs remap logic a bit as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Tested-by: Greg Ungerer <gerg@linux-m68k.org>
Instead of using arch_dma_alloc if none of the generic coherent
allocators are used, require the architectures to explicitly opt into
providing it. This will used to deal with the case of m68knommu and
coldfire where we can't do any coherent allocations whatsoever, and
also makes it clear that arch_dma_alloc is a last resort.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Tested-by: Greg Ungerer <gerg@linux-m68k.org>
CONFIG_DMA_GLOBAL_POOL can't be combined with other DMA coherent
allocators. Add dependencies to Kconfig to document this, and make
kconfig complain about unmet dependencies if someone tries.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Tested-by: Greg Ungerer <gerg@linux-m68k.org>
- kprobe-events: Fix kprobe events to reject if the attached symbol
is not unique name because it may not the function which the user
want to attach to. (User can attach a probe to such symbol using
the nearest unique symbol + offset.)
- selftest: Add a testcase to ensure the kprobe event rejects non
unique symbol correctly.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmUzdQobHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bMNAH/inFWv8e+rMm8F5Po6ZI
CmBxuZbxy2l+KfYDjXqSHu7TLKngVd6Bhdb5H2K7fgdwiZxrS0i6qvdppo+Cxgop
Yod06peDTM80IKavioCcOJOwLPGXXpZkMlK5fdC48HN6vrf9km4vws5ZAagfc1ng
YhnYm1HHeXcIYwtLkE2dCr6HkwkaOebWTLdZ8c70d1OPw0L9rzxH+edjhKCq8uIw
6WUg9ERxJYPUuCkQxOxVJrTdzNMRXsgf28FHc0LyYRm8kDpECT2BP6e/Y+TBbsX5
2pN5cUY5qfI6t3Pc1HDs2KX8ui2QCmj0mCvT0VixhdjThdHpRf0VjIFFAANf3LNO
XVA=
=O1Aa
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.6-rc6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- kprobe-events: Fix kprobe events to reject if the attached symbol is
not unique name because it may not the function which the user want
to attach to. (User can attach a probe to such symbol using the
nearest unique symbol + offset.)
- selftest: Add a testcase to ensure the kprobe event rejects non
unique symbol correctly.
* tag 'probes-fixes-v6.6-rc6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
selftests/ftrace: Add new test case which checks non unique symbol
tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols
The following warning was reported when running "./test_progs -t
test_bpf_ma/percpu_free_through_map_free":
------------[ cut here ]------------
WARNING: CPU: 1 PID: 68 at kernel/bpf/memalloc.c:342
CPU: 1 PID: 68 Comm: kworker/u16:2 Not tainted 6.6.0-rc2+ #222
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
Workqueue: events_unbound bpf_map_free_deferred
RIP: 0010:bpf_mem_refill+0x21c/0x2a0
......
Call Trace:
<IRQ>
? bpf_mem_refill+0x21c/0x2a0
irq_work_single+0x27/0x70
irq_work_run_list+0x2a/0x40
irq_work_run+0x18/0x40
__sysvec_irq_work+0x1c/0xc0
sysvec_irq_work+0x73/0x90
</IRQ>
<TASK>
asm_sysvec_irq_work+0x1b/0x20
RIP: 0010:unit_free+0x50/0x80
......
bpf_mem_free+0x46/0x60
__bpf_obj_drop_impl+0x40/0x90
bpf_obj_free_fields+0x17d/0x1a0
array_map_free+0x6b/0x170
bpf_map_free_deferred+0x54/0xa0
process_scheduled_works+0xba/0x370
worker_thread+0x16d/0x2e0
kthread+0x105/0x140
ret_from_fork+0x39/0x60
ret_from_fork_asm+0x1b/0x30
</TASK>
---[ end trace 0000000000000000 ]---
The reason is simple: __bpf_obj_drop_impl() does not know the freeing
field is a per-cpu pointer and it uses bpf_global_ma to free the
pointer. Because bpf_global_ma is not a per-cpu allocator, so ksize() is
used to select the corresponding cache. The bpf_mem_cache with 16-bytes
unit_size will always be selected to do the unmatched free and it will
trigger the warning in free_bulk() eventually.
Because per-cpu kptr doesn't support list or rb-tree now, so fix the
problem by only checking whether or not the type of kptr is per-cpu in
bpf_obj_free_fields(), and using bpf_global_percpu_ma to these kptrs.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-7-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For bpf_global_percpu_ma, the pointer passed to bpf_mem_free_rcu() is
allocated by kmalloc() and its size is fixed (16-bytes on x86-64). So
no matter which cache allocates the dynamic per-cpu area, on x86-64
cache[2] will always be used to free the per-cpu area.
Fix the unbalance by checking whether the bpf memory allocator is
per-cpu or not and use pcpu_alloc_size() instead of ksize() to
find the correct cache for per-cpu free.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With pcpu_alloc_size() in place, check whether or not the size of
the dynamic per-cpu area is matched with unit_size.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
To make seq_buf more lightweight as a string buf, move the readpos member
from seq_buf to its container, trace_seq. That puts the responsibility
of maintaining the readpos entirely in the tracing code. If some future
users want to package up the readpos with a seq_buf, we can define a
new struct then.
Link: https://lore.kernel.org/linux-trace-kernel/20231020033545.2587554-2-willy@infradead.org
Cc: Kees Cook <keescook@chromium.org>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The eventfs_create_dir() function returns error pointers, it never returns
NULL. Update the check to reflect that.
Link: https://lore.kernel.org/linux-trace-kernel/ff641474-84e2-46a7-9d7a-62b251a1050c@moroto.mountain
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 5790b1fb3d ("eventfs: Remove eventfs_file and just use eventfs_inode")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The 'curr' argument of pick_next_entity() has become unused after
the EEVDF changes.
[ mingo: Updated the changelog. ]
Signed-off-by: Yiwei Lin <s921975628@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231020055617.42064-1-s921975628@gmail.com
When a kprobe is attached to a function that's name is not unique (is
static and shares the name with other functions in the kernel), the
kprobe is attached to the first function it finds. This is a bug as the
function that it is attaching to is not necessarily the one that the
user wants to attach to.
Instead of blindly picking a function to attach to what is ambiguous,
error with EADDRNOTAVAIL to let the user know that this function is not
unique, and that the user must use another unique function with an
address offset to get to the function they want to attach to.
Link: https://lore.kernel.org/all/20231020104250.9537-2-flaniel@linux.microsoft.com/
Cc: stable@vger.kernel.org
Fixes: 413d37d1eb ("tracing: Add kprobe-based event tracer")
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
Link: https://lore.kernel.org/lkml/20230819101105.b0c104ae4494a7d1f2eea742@kernel.org/
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
How ILB is triggered without IPIs is cryptic. Out of mercy for future
code readers, document it in code comments.
The comments are derived from a discussion with Vincent in a past
review.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231020014031.919742-2-joel@joelfernandes.org
sha224 does not provide enough security against collision attacks
relative to the default keys used for signing (RSA 4k & P-384). Also
sha224 never became popular, as sha256 got widely adopter ahead of
sha224 being introduced.
Signed-off-by: Dimitri John Ledkov <dimitri.ledkov@canonical.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Removes support for sha1 signed kernel modules, importing sha1 signed
x.509 certificates.
rsa-pkcs1pad keeps sha1 padding support, which seems to be used by
virtio driver.
sha1 remains available as there are many drivers and subsystems using
it. Note only hmac(sha1) with secret keys remains cryptographically
secure.
In the kernel there are filesystems, IMA, tpm/pcr that appear to be
using sha1. Maybe they can all start to be slowly upgraded to
something else i.e. blake3, ParallelHash, SHAKE256 as needed.
Signed-off-by: Dimitri John Ledkov <dimitri.ledkov@canonical.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When using task_iter to iterate all threads of a specific task, we enforce
that the user must pass a valid task pointer to ensure safety. However,
when iterating all threads/process in the system, BPF verifier still
require a valid ptr instead of "nullable" pointer, even though it's
pointless, which is a kind of surprising from usability standpoint. It
would be nice if we could let that kfunc accept a explicit null pointer
when we are using BPF_TASK_ITER_ALL_{PROCS, THREADS} and a valid pointer
when using BPF_TASK_ITER_THREAD.
Given a trival kfunc:
__bpf_kfunc void FN(struct TYPE_A *obj);
BPF Prog would reject a nullptr for obj. The error info is:
"arg#x pointer type xx xx must point to scalar, or struct with scalar"
reported by get_kfunc_ptr_arg_type(). The reg->type is SCALAR_VALUE and
the btf type of ref_t is not scalar or scalar_struct which leads to the
rejection of get_kfunc_ptr_arg_type.
This patch add "__nullable" annotation:
__bpf_kfunc void FN(struct TYPE_A *obj__nullable);
Here __nullable indicates obj can be optional, user can pass a explicit
nullptr or a normal TYPE_A pointer. In get_kfunc_ptr_arg_type(), we will
detect whether the current arg is optional and register is null, If so,
return a new kfunc_ptr_arg_type KF_ARG_PTR_TO_NULL and skip to the next
arg in check_kfunc_args().
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-7-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
css_iter and task_iter should be used in rcu section. Specifically, in
sleepable progs explicit bpf_rcu_read_lock() is needed before use these
iters. In normal bpf progs that have implicit rcu_read_lock(), it's OK to
use them directly.
This patch adds a new a KF flag KF_RCU_PROTECTED for bpf_iter_task_new and
bpf_iter_css_new. It means the kfunc should be used in RCU CS. We check
whether we are in rcu cs before we want to invoke this kfunc. If the rcu
protection is guaranteed, we would let st->type = PTR_TO_STACK | MEM_RCU.
Once user do rcu_unlock during the iteration, state MEM_RCU of regs would
be cleared. is_iter_reg_valid_init() will reject if reg->type is UNTRUSTED.
It is worth noting that currently, bpf_rcu_read_unlock does not
clear the state of the STACK_ITER reg, since bpf_for_each_spilled_reg
only considers STACK_SPILL. This patch also let bpf_for_each_spilled_reg
search STACK_ITER.
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-6-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This Patch adds kfuncs bpf_iter_css_{new,next,destroy} which allow
creation and manipulation of struct bpf_iter_css in open-coded iterator
style. These kfuncs actually wrapps css_next_descendant_{pre, post}.
css_iter can be used to:
1) iterating a sepcific cgroup tree with pre/post/up order
2) iterating cgroup_subsystem in BPF Prog, like
for_each_mem_cgroup_tree/cpuset_for_each_descendant_pre in kernel.
The API design is consistent with cgroup_iter. bpf_iter_css_new accepts
parameters defining iteration order and starting css. Here we also reuse
BPF_CGROUP_ITER_DESCENDANTS_PRE, BPF_CGROUP_ITER_DESCENDANTS_POST,
BPF_CGROUP_ITER_ANCESTORS_UP enums.
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-5-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds kfuncs bpf_iter_task_{new,next,destroy} which allow
creation and manipulation of struct bpf_iter_task in open-coded iterator
style. BPF programs can use these kfuncs or through bpf_for_each macro to
iterate all processes in the system.
The API design keep consistent with SEC("iter/task"). bpf_iter_task_new()
accepts a specific task and iterating type which allows:
1. iterating all process in the system (BPF_TASK_ITER_ALL_PROCS)
2. iterating all threads in the system (BPF_TASK_ITER_ALL_THREADS)
3. iterating all threads of a specific task (BPF_TASK_ITER_PROC_THREADS)
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Link: https://lore.kernel.org/r/20231018061746.111364-4-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds kfuncs bpf_iter_css_task_{new,next,destroy} which allow
creation and manipulation of struct bpf_iter_css_task in open-coded
iterator style. These kfuncs actually wrapps css_task_iter_{start,next,
end}. BPF programs can use these kfuncs through bpf_for_each macro for
iteration of all tasks under a css.
css_task_iter_*() would try to get the global spin-lock *css_set_lock*, so
the bpf side has to be careful in where it allows to use this iter.
Currently we only allow it in bpf_lsm and bpf iter-s.
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-3-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch makes some preparations for using css_task_iter_*() in BPF
Program.
1. Flags CSS_TASK_ITER_* are #define-s and it's not easy for bpf prog to
use them. Convert them to enum so bpf prog can take them from vmlinux.h.
2. In the next patch we will add css_task_iter_*() in common kfuncs which
is not safe. Since css_task_iter_*() does spin_unlock_irq() which might
screw up irq flags depending on the context where bpf prog is running.
So we should use irqsave/irqrestore here and the switching is harmless.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-2-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The whole network stack uses sockptr, and while it doesn't move to
something more modern, let's use sockptr in setsockptr BPF hooks, so, it
could be used by other callers.
The main motivation for this change is to use it in the io_uring
{g,s}etsockopt(), which will use a userspace pointer for *optval, but, a
kernel value for optlen.
Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231016134750.1381153-3-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The whole network stack uses sockptr, and while it doesn't move to
something more modern, let's use sockptr in getsockptr BPF hooks, so, it
could be used by other callers.
The main motivation for this change is to use it in the io_uring
{g,s}etsockopt(), which will use a userspace pointer for *optval, but, a
kernel value for optlen.
Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231016134750.1381153-2-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bcachefs implementation of six locks is intended to land in
generic locking code in the long term, but has been pulled into the
bcachefs subsystem for internal use for the time being. This code
lift breaks the bcachefs module build as six locks depend a couple
of the generic locking tracepoints. Export these tracepoint symbols
for bcachefs.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTD6IQAKCRCRxhvAZXjc
opXLAQC9X+ECnGUAOy/kvOrEBkBb7G4BuZ8XsrnL976riVNp0gEA85LaJV9Ow7Xk
51k/1ujhYkglQbCsa0zo+mI4ueE3wAQ=
=Dqrj
-----END PGP SIGNATURE-----
Merge tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fix from Christian Brauner:
"An openat() call from io_uring triggering an audit call can apparently
cause the refcount of struct filename to be incremented from multiple
threads concurrently during async execution, triggering a refcount
underflow and hitting a BUG_ON(). That bug has been lurking around
since at least v5.16 apparently.
Switch to an atomic counter to fix that. The underflow check is
downgraded from a BUG_ON() to a WARN_ON_ONCE() but we could easily
remove that check altogether tbh"
* tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
audit,io_uring: io_uring openat triggers audit reference count underflow
Overlayfs uses backing files with "fake" overlayfs f_path and "real"
underlying f_inode, in order to use underlying inode aops for mapped
files and to display the overlayfs path in /proc/<pid>/maps.
In preparation for storing the overlayfs "fake" path instead of the
underlying "real" path in struct backing_file, define a noop helper
file_user_path() that returns f_path for now.
Use the new helper in procfs and kernel logs whenever a path of a
mapped file is displayed to users.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231009153712.1566422-3-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
In recent discussions around some performance improvements in the file
handling area we discussed switching the file cache to rely on
SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based
freeing for files completely. This is a pretty sensitive change overall
but it might actually be worth doing.
The main downside is the subtlety. The other one is that we should
really wait for Jann's patch to land that enables KASAN to handle
SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this
exists.
With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times
which requires a few changes. So it isn't sufficient anymore to just
acquire a reference to the file in question under rcu using
atomic_long_inc_not_zero() since the file might have already been
recycled and someone else might have bumped the reference.
In other words, callers might see reference count bumps from newer
users. For this reason it is necessary to verify that the pointer is the
same before and after the reference count increment. This pattern can be
seen in get_file_rcu() and __files_get_rcu().
In addition, it isn't possible to access or check fields in struct file
without first aqcuiring a reference on it. Not doing that was always
very dodgy and it was only usable for non-pointer data in struct file.
With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a
reference under rcu or they must hold the files_lock of the fdtable.
Failing to do either one of this is a bug.
Thanks to Jann for pointing out that we need to ensure memory ordering
between reallocations and pointer check by ensuring that all subsequent
loads have a dependency on the second load in get_file_rcu() and
providing a fixup that was folded into this patch.
Cc: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Because group consistency is non-atomic between parent (filedesc) and children
(inherited) events, it is possible for PERF_FORMAT_GROUP read() to try and sum
non-matching counter groups -- with non-sensical results.
Add group_generation to distinguish the case where a parent group removes and
adds an event and thus has the same number, but a different configuration of
events as inherited groups.
This became a problem when commit fa8c269353 ("perf/core: Invert
perf_read_group() loops") flipped the order of child_list and sibling_list.
Previously it would iterate the group (sibling_list) first, and for each
sibling traverse the child_list. In this order, only the group composition of
the parent is relevant. By flipping the order the group composition of the
child (inherited) events becomes an issue and the mis-match in group
composition becomes evident.
That said; even prior to this commit, while reading of a group that is not
equally inherited was not broken, it still made no sense.
(Ab)use ECHILD as error return to indicate issues with child process group
composition.
Fixes: fa8c269353 ("perf/core: Invert perf_read_group() loops")
Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231018115654.GK33217@noisy.programming.kicks-ass.net
__read_mostly predates __ro_after_init. Many variables which are marked
__read_mostly should have been __ro_after_init from day 1.
Also, mark some stuff as "const" and "__init" while I'm at it.
[akpm@linux-foundation.org: revert sysctl_nr_open_min, sysctl_nr_open_max changes due to arm warning]
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/4f6bb9c0-abba-4ee4-a7aa-89265e886817@p183
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct gcov_iterator.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Link: https://lkml.kernel.org/r/20230922175220.work.327-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Tom Rix <trix@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ucounts is assigned first, so it does not need to initialize the
assignment.
Link: https://lkml.kernel.org/r/20230926022410.4280-1-kunyu@nfschina.com
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "permit write-sealed memfd read-only shared mappings", v4.
The man page for fcntl() describing memfd file seals states the following
about F_SEAL_WRITE:-
Furthermore, trying to create new shared, writable memory-mappings via
mmap(2) will also fail with EPERM.
With emphasis on 'writable'. In turns out in fact that currently the
kernel simply disallows all new shared memory mappings for a memfd with
F_SEAL_WRITE applied, rendering this documentation inaccurate.
This matters because users are therefore unable to obtain a shared mapping
to a memfd after write sealing altogether, which limits their usefulness.
This was reported in the discussion thread [1] originating from a bug
report [2].
This is a product of both using the struct address_space->i_mmap_writable
atomic counter to determine whether writing may be permitted, and the
kernel adjusting this counter when any VM_SHARED mapping is performed and
more generally implicitly assuming VM_SHARED implies writable.
It seems sensible that we should only update this mapping if VM_MAYWRITE
is specified, i.e. whether it is possible that this mapping could at any
point be written to.
If we do so then all we need to do to permit write seals to function as
documented is to clear VM_MAYWRITE when mapping read-only. It turns out
this functionality already exists for F_SEAL_FUTURE_WRITE - we can
therefore simply adapt this logic to do the same for F_SEAL_WRITE.
We then hit a chicken and egg situation in mmap_region() where the check
for VM_MAYWRITE occurs before we are able to clear this flag. To work
around this, perform this check after we invoke call_mmap(), with careful
consideration of error paths.
Thanks to Andy Lutomirski for the suggestion!
[1]:https://lore.kernel.org/all/20230324133646.16101dfa666f253c4715d965@linux-foundation.org/
[2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238
This patch (of 3):
There is a general assumption that VMAs with the VM_SHARED flag set are
writable. If the VM_MAYWRITE flag is not set, then this is simply not the
case.
Update those checks which affect the struct address_space->i_mmap_writable
field to explicitly test for this by introducing
[vma_]is_shared_maywrite() helper functions.
This remains entirely conservative, as the lack of VM_MAYWRITE guarantees
that the VMA cannot be written to.
Link: https://lkml.kernel.org/r/cover.1697116581.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/d978aefefa83ec42d18dfa964ad180dbcde34795.1697116581.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There are no users of wait bookmarks left, so simplify the wait
code by removing them.
Link: https://lkml.kernel.org/r/20231010035829.544242-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Benjamin Segall <bsegall@google.com>
Cc: Bin Lai <sclaibin@gmail.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc. fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.
Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when it is used. Host overcommit management
has to consider it when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages (but the cgroup
limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. low, min limits tuning must take into account
hugetlb memory.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).
Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Architectures which don't define their own use the one in
asm-generic/bitops/lock.h. Get rid of all the ifdefs around "maybe we
don't have it".
Link: https://lkml.kernel.org/r/20231004165317.1061855-15-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace clear_bit_and_unlock_is_negative_byte() with
xor_unlock_is_negative_byte(). We have a few places that like to lock a
folio, set a flag and unlock it again. Allow for the possibility of
combining the latter two operations for efficiency. We are guaranteed
that the caller holds the lock, so it is safe to unlock it with the xor.
The caller must guarantee that nobody else will set the flag without
holding the lock; it is not safe to do this with the PG_dirty flag, for
example.
Link: https://lkml.kernel.org/r/20231004165317.1061855-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
get_user_pages_remote() will never return 0 except in the case of
FOLL_NOWAIT being specified, which we explicitly disallow.
This simplifies error handling for the caller and avoids the awkwardness
of dealing with both errors and failing to pin. Failing to pin here is an
error.
Link: https://lkml.kernel.org/r/00319ce292d27b3aae76a0eb220ce3f528187508.1696288092.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "various improvements to the GUP interface", v2.
A series of fixes to simplify and improve the GUP interface with an eye to
providing groundwork to future improvements:-
* __access_remote_vm() and access_remote_vm() are functionally identical,
so make the former static such that in future we can potentially change
the external-facing implementation details of this function.
* Extend is_valid_gup_args() to cover the missing FOLL_TOUCH case, and
simplify things by defining INTERNAL_GUP_FLAGS to check against.
* Adjust __get_user_pages_locked() to explicitly treat a failure to pin any
pages as an error in all circumstances other than FOLL_NOWAIT being
specified, bringing it in line with the nommu implementation of this
function.
* (With many thanks to Arnd who suggested this in the first instance)
Update get_user_page_vma_remote() to explicitly only return a page or an
error, simplifying the interface and avoiding the questionable
IS_ERR_OR_NULL() pattern.
This patch (of 4):
access_remote_vm() passes through parameters to __access_remote_vm()
directly, so remove the __access_remote_vm() function from mm.h and use
access_remote_vm() in the one caller that needs it (ptrace_access_vm()).
This allows future adjustments to the GUP-internal __access_remote_vm()
function while keeping the access_remote_vm() function stable.
Link: https://lkml.kernel.org/r/cover.1696288092.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/f7877c5039ce1c202a514a8aeeefc5cdd5e32d19.1696288092.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
All callers of work_on_cpu() share the same lock class key for all the
functions queued. As a result the workqueue related locking scenario for
a function A may be spuriously accounted as an inversion against the
locking scenario of function B such as in the following model:
long A(void *arg)
{
mutex_lock(&mutex);
mutex_unlock(&mutex);
}
long B(void *arg)
{
}
void launchA(void)
{
work_on_cpu(0, A, NULL);
}
void launchB(void)
{
mutex_lock(&mutex);
work_on_cpu(1, B, NULL);
mutex_unlock(&mutex);
}
launchA and launchB running concurrently have no chance to deadlock.
However the above can be reported by lockdep as a possible locking
inversion because the works containing A() and B() are treated as
belonging to the same locking class.
The following shows an existing example of such a spurious lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted
------------------------------------------------------
kworker/0:1/9 is trying to acquire lock:
ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0
but task is already holding lock:
ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
__flush_work+0x83/0x4e0
work_on_cpu+0x97/0xc0
rcu_nocb_cpu_offload+0x62/0xb0
rcu_nocb_toggle+0xd0/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}:
__mutex_lock+0x81/0xc80
rcu_nocb_cpu_deoffload+0x38/0xb0
rcu_nocb_toggle+0x144/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
percpu_down_write+0x31/0x200
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work)
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock((work_completion)(&wfc.work));
lock(rcu_state.barrier_mutex);
lock((work_completion)(&wfc.work));
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by kworker/0:1/9:
#0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500
#1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
stack backtrace:
CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: events work_for_cpu_fn
Call Trace:
rcu-torture: rcu_torture_read_exit: Start of episode
<TASK>
dump_stack_lvl+0x4a/0x80
check_noncircular+0x132/0x150
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
? _cpu_down+0x57/0x2b0
percpu_down_write+0x31/0x200
? _cpu_down+0x57/0x2b0
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
? __pfx_worker_thread+0x10/0x10
kthread+0xe6/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK
Fix this with providing one lock class key per work_on_cpu() caller.
Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
If a system has isolated CPUs via the "isolcpus=" command line parameter,
then an attempt to offline the last housekeeping CPU will result in a
WARN_ON() when rebuilding the scheduler domains and a subsequent panic due
to and unhandled empty CPU mas in partition_sched_domains_locked().
cpuset_hotplug_workfn()
rebuild_sched_domains_locked()
ndoms = generate_sched_domains(&doms, &attr);
cpumask_and(doms[0], top_cpuset.effective_cpus, housekeeping_cpumask(HK_FLAG_DOMAIN));
Thus results in an empty CPU mask which triggers the warning and then the
subsequent crash:
WARNING: CPU: 4 PID: 80 at kernel/sched/topology.c:2366 build_sched_domains+0x120c/0x1408
Call trace:
build_sched_domains+0x120c/0x1408
partition_sched_domains_locked+0x234/0x880
rebuild_sched_domains_locked+0x37c/0x798
rebuild_sched_domains+0x30/0x58
cpuset_hotplug_workfn+0x2a8/0x930
Unable to handle kernel paging request at virtual address fffe80027ab37080
partition_sched_domains_locked+0x318/0x880
rebuild_sched_domains_locked+0x37c/0x798
Aside of the resulting crash, it does not make any sense to offline the last
last housekeeping CPU.
Prevent this by masking out the non-housekeeping CPUs when selecting a
target CPU for initiating the CPU unplug operation via the work queue.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/202310171709530660462@zte.com.cn
When employed within a sleepable program not under RCU protection, the
use of 'bpf_task_under_cgroup()' may trigger a warning in the kernel log,
particularly when CONFIG_PROVE_RCU is enabled:
[ 1259.662357] WARNING: suspicious RCU usage
[ 1259.662358] 6.5.0+ #33 Not tainted
[ 1259.662360] -----------------------------
[ 1259.662361] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage!
Other info that might help to debug this:
[ 1259.662366] rcu_scheduler_active = 2, debug_locks = 1
[ 1259.662368] 1 lock held by trace/72954:
[ 1259.662369] #0: ffffffffb5e3eda0 (rcu_read_lock_trace){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x0/0xb0
Stack backtrace:
[ 1259.662385] CPU: 50 PID: 72954 Comm: trace Kdump: loaded Not tainted 6.5.0+ #33
[ 1259.662391] Call Trace:
[ 1259.662393] <TASK>
[ 1259.662395] dump_stack_lvl+0x6e/0x90
[ 1259.662401] dump_stack+0x10/0x20
[ 1259.662404] lockdep_rcu_suspicious+0x163/0x1b0
[ 1259.662412] task_css_set.part.0+0x23/0x30
[ 1259.662417] bpf_task_under_cgroup+0xe7/0xf0
[ 1259.662422] bpf_prog_7fffba481a3bcf88_lsm_run+0x5c/0x93
[ 1259.662431] bpf_trampoline_6442505574+0x60/0x1000
[ 1259.662439] bpf_lsm_bpf+0x5/0x20
[ 1259.662443] ? security_bpf+0x32/0x50
[ 1259.662452] __sys_bpf+0xe6/0xdd0
[ 1259.662463] __x64_sys_bpf+0x1a/0x30
[ 1259.662467] do_syscall_64+0x38/0x90
[ 1259.662472] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 1259.662479] RIP: 0033:0x7f487baf8e29
[...]
[ 1259.662504] </TASK>
This issue can be reproduced by executing a straightforward program, as
demonstrated below:
SEC("lsm.s/bpf")
int BPF_PROG(lsm_run, int cmd, union bpf_attr *attr, unsigned int size)
{
struct cgroup *cgrp = NULL;
struct task_struct *task;
int ret = 0;
if (cmd != BPF_LINK_CREATE)
return 0;
// The cgroup2 should be mounted first
cgrp = bpf_cgroup_from_id(1);
if (!cgrp)
goto out;
task = bpf_get_current_task_btf();
if (bpf_task_under_cgroup(task, cgrp))
ret = -1;
bpf_cgroup_release(cgrp);
out:
return ret;
}
After running the program, if you subsequently execute another BPF program,
you will encounter the warning.
It's worth noting that task_under_cgroup_hierarchy() is also utilized by
bpf_current_task_under_cgroup(). However, bpf_current_task_under_cgroup()
doesn't exhibit this issue because it cannot be used in sleepable BPF
programs.
Fixes: b5ad4cdc46 ("bpf: Add bpf_task_under_cgroup() kfunc")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Cc: Feng Zhou <zhoufeng.zf@bytedance.com>
Cc: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20231007135945.4306-1-laoar.shao@gmail.com
A few drivers were missing a xdp_do_flush() invocation after
XDP_REDIRECT.
Add three helper functions each for one of the per-CPU lists. Return
true if the per-CPU list is non-empty and flush the list.
Add xdp_do_check_flushed() which invokes each helper functions and
creates a warning if one of the functions had a non-empty list.
Hide everything behind CONFIG_DEBUG_NET.
Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20231016125738.Yt79p1uF@linutronix.de
While adding a preferred console handling for serial_core for serial port
hardware based device addressing, Jiri suggested we constify name for
add_preferred_console(). The name gets copied anyways. This allows serial
core to add a preferred console using serial drv->dev_name without copying
it.
Note that constifying options causes changes all over the place because of
struct console for match().
Suggested-by: Jiri Slaby <jirislaby@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Link: https://lore.kernel.org/r/20231012064300.50221-2-tony@atomide.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Let's check for valid console index values for preferred console to avoid
bogus console index numbers from kernel command line.
Let's also return an error for negative index numbers for the preferred
console. Unlike for device drivers, a negative index is not valid for the
preferred console.
Let's also constify idx while at it.
Signed-off-by: Tony Lindgren <tony@atomide.com>
Link: https://lore.kernel.org/r/20231012064300.50221-1-tony@atomide.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZS1d4wAKCRDbK58LschI
g4DSAP441CdKh8fd+wNKUSKHFbpCQ6EvocR6Nf+Sj2DFUx/w/QEA7mfju7Abqjc3
xwDEx0BuhrjMrjV5MmEpxc7lYl9XcQU=
=vuWk
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-10-16
We've added 90 non-merge commits during the last 25 day(s) which contain
a total of 120 files changed, 3519 insertions(+), 895 deletions(-).
The main changes are:
1) Add missed stats for kprobes to retrieve the number of missed kprobe
executions and subsequent executions of BPF programs, from Jiri Olsa.
2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is
for systemd to reimplement the LogNamespace feature which allows
running multiple instances of systemd-journald to process the logs
of different services, from Daan De Meyer.
3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich.
4) Improve BPF verifier log output for scalar registers to better
disambiguate their internal state wrt defaults vs min/max values
matching, from Andrii Nakryiko.
5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving
the source IP address with a new BPF_FIB_LOOKUP_SRC flag,
from Martynas Pumputis.
6) Add support for open-coded task_vma iterator to help with symbolization
for BPF-collected user stacks, from Dave Marchevsky.
7) Add libbpf getters for accessing individual BPF ring buffers which
is useful for polling them individually, for example, from Martin Kelly.
8) Extend AF_XDP selftests to validate the SHARED_UMEM feature,
from Tushar Vyavahare.
9) Improve BPF selftests cross-building support for riscv arch,
from Björn Töpel.
10) Add the ability to pin a BPF timer to the same calling CPU,
from David Vernet.
11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic
implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments,
from Alexandre Ghiti.
12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen.
13) Fix bpftool's skeleton code generation to guarantee that ELF data
is 8 byte aligned, from Ian Rogers.
14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4
security mitigations in BPF verifier, from Yafang Shao.
15) Annotate struct bpf_stack_map with __counted_by attribute to prepare
BPF side for upcoming __counted_by compiler support, from Kees Cook.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits)
bpf: Ensure proper register state printing for cond jumps
bpf: Disambiguate SCALAR register state output in verifier logs
selftests/bpf: Make align selftests more robust
selftests/bpf: Improve missed_kprobe_recursion test robustness
selftests/bpf: Improve percpu_alloc test robustness
selftests/bpf: Add tests for open-coded task_vma iter
bpf: Introduce task_vma open-coded iterator kfuncs
selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c
bpf: Don't explicitly emit BTF for struct btf_iter_num
bpf: Change syscall_nr type to int in struct syscall_tp_t
net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set
bpf: Avoid unnecessary audit log for CPU security mitigations
selftests/bpf: Add tests for cgroup unix socket address hooks
selftests/bpf: Make sure mount directory exists
documentation/bpf: Document cgroup unix socket address hooks
bpftool: Add support for cgroup unix socket address hooks
libbpf: Add support for cgroup unix socket address hooks
bpf: Implement cgroup sockaddr hooks for unix sockets
bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf
bpf: Propagate modified uaddrlen from cgroup sockaddr programs
...
====================
Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The CSD lock seems to get stuck in 2 "modes". When it gets stuck
temporarily, it usually gets released in a few seconds, and sometimes
up to one or two minutes.
If the CSD lock stays stuck for more than several minutes, it never
seems to get unstuck, and gradually more and more things in the system
end up also getting stuck.
In the latter case, we should just give up, so the system can dump out
a little more information about what went wrong, and, with panic_on_oops
and a kdump kernel loaded, dump a whole bunch more information about what
might have gone wrong. In addition, there is an smp.panic_on_ipistall
kernel boot parameter that by default retains the old behavior, but when
set enables the panic after the CSD lock has been stuck for more than
the specified number of milliseconds, as in 300,000 for five minutes.
[ paulmck: Apply Imran Khan feedback. ]
[ paulmck: Apply Leonardo Bras feedback. ]
Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Imran Khan <imran.f.khan@oracle.com>
Reviewed-by: Leonardo Bras <leobras@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
The cpupid (or access time) is stored in the head page for THP, so it is
safely to make should_numa_migrate_memory() and numa_hint_fault_latency()
to take a folio. This is in preparation for large folio numa balancing.
Link: https://lkml.kernel.org/r/20230921074417.24004-7-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Verifier emits relevant register state involved in any given instruction
next to it after `;` to the right, if possible. Or, worst case, on the
separate line repeating instruction index.
E.g., a nice and simple case would be:
2: (d5) if r0 s<= 0x0 goto pc+1 ; R0_w=0
But if there is some intervening extra output (e.g., precision
backtracking log) involved, we are supposed to see the state after the
precision backtrack log:
4: (75) if r0 s>= 0x0 goto pc+1
mark_precise: frame0: last_idx 4 first_idx 0 subseq_idx -1
mark_precise: frame0: regs=r0 stack= before 2: (d5) if r0 s<= 0x0 goto pc+1
mark_precise: frame0: regs=r0 stack= before 1: (b7) r0 = 0
6: R0_w=0
First off, note that in `6: R0_w=0` instruction index corresponds to the
next instruction, not to the conditional jump instruction itself, which
is wrong and we'll get to that.
But besides that, the above is a happy case that does work today. Yet,
if it so happens that precision backtracking had to traverse some of the
parent states, this `6: R0_w=0` state output would be missing.
This is due to a quirk of print_verifier_state() routine, which performs
mark_verifier_state_clean(env) at the end. This marks all registers as
"non-scratched", which means that subsequent logic to print *relevant*
registers (that is, "scratched ones") fails and doesn't see anything
relevant to print and skips the output altogether.
print_verifier_state() is used both to print instruction context, but
also to print an **entire** verifier state indiscriminately, e.g.,
during precision backtracking (and in a few other situations, like
during entering or exiting subprogram). Which means if we have to print
entire parent state before getting to printing instruction context
state, instruction context is marked as clean and is omitted.
Long story short, this is definitely not intentional. So we fix this
behavior in this patch by teaching print_verifier_state() to clear
scratch state only if it was used to print instruction state, not the
parent/callback state. This is determined by print_all option, so if
it's not set, we don't clear scratch state. This fixes missing
instruction state for these cases.
As for the mismatched instruction index, we fix that by making sure we
call print_insn_state() early inside check_cond_jmp_op() before we
adjusted insn_idx based on jump branch taken logic. And with that we get
desired correct information:
9: (16) if w4 == 0x1 goto pc+9
mark_precise: frame0: last_idx 9 first_idx 9 subseq_idx -1
mark_precise: frame0: parent state regs=r4 stack=: R2_w=1944 R4_rw=P1 R10=fp0
mark_precise: frame0: last_idx 8 first_idx 0 subseq_idx 9
mark_precise: frame0: regs=r4 stack= before 8: (66) if w4 s> 0x3 goto pc+5
mark_precise: frame0: regs=r4 stack= before 7: (b7) r4 = 1
9: R4=1
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231011223728.3188086-6-andrii@kernel.org
Currently the way that verifier prints SCALAR_VALUE register state (and
PTR_TO_PACKET, which can have var_off and ranges info as well) is very
ambiguous.
In the name of brevity we are trying to eliminate "unnecessary" output
of umin/umax, smin/smax, u32_min/u32_max, and s32_min/s32_max values, if
possible. Current rules are that if any of those have their default
value (which for mins is the minimal value of its respective types: 0,
S32_MIN, or S64_MIN, while for maxs it's U32_MAX, S32_MAX, S64_MAX, or
U64_MAX) *OR* if there is another min/max value that as matching value.
E.g., if smin=100 and umin=100, we'll emit only umin=10, omitting smin
altogether. This approach has a few problems, being both ambiguous and
sort-of incorrect in some cases.
Ambiguity is due to missing value could be either default value or value
of umin/umax or smin/smax. This is especially confusing when we mix
signed and unsigned ranges. Quite often, umin=0 and smin=0, and so we'll
have only `umin=0` leaving anyone reading verifier log to guess whether
smin is actually 0 or it's actually -9223372036854775808 (S64_MIN). And
often times it's important to know, especially when debugging tricky
issues.
"Sort-of incorrectness" comes from mixing negative and positive values.
E.g., if umin is some large positive number, it can be equal to smin
which is, interpreted as signed value, is actually some negative value.
Currently, that smin will be omitted and only umin will be emitted with
a large positive value, giving an impression that smin is also positive.
Anyway, ambiguity is the biggest issue making it impossible to have an
exact understanding of register state, preventing any sort of automated
testing of verifier state based on verifier log. This patch is
attempting to rectify the situation by removing ambiguity, while
minimizing the verboseness of register state output.
The rules are straightforward:
- if some of the values are missing, then it definitely has a default
value. I.e., `umin=0` means that umin is zero, but smin is actually
S64_MIN;
- all the various boundaries that happen to have the same value are
emitted in one equality separated sequence. E.g., if umin and smin are
both 100, we'll emit `smin=umin=100`, making this explicit;
- we do not mix negative and positive values together, and even if
they happen to have the same bit-level value, they will be emitted
separately with proper sign. I.e., if both umax and smax happen to be
0xffffffffffffffff, we'll emit them both separately as
`smax=-1,umax=18446744073709551615`;
- in the name of a bit more uniformity and consistency,
{u32,s32}_{min,max} are renamed to {s,u}{min,max}32, which seems to
improve readability.
The above means that in case of all 4 ranges being, say, [50, 100] range,
we'd previously see hugely ambiguous:
R1=scalar(umin=50,umax=100)
Now, we'll be more explicit:
R1=scalar(smin=umin=smin32=umin32=50,smax=umax=smax32=umax32=100)
This is slightly more verbose, but distinct from the case when we don't
know anything about signed boundaries and 32-bit boundaries, which under
new rules will match the old case:
R1=scalar(umin=50,umax=100)
Also, in the name of simplicity of implementation and consistency, order
for {s,u}32_{min,max} are emitted *before* var_off. Previously they were
emitted afterwards, for unclear reasons.
This patch also includes a few fixes to selftests that expect exact
register state to accommodate slight changes to verifier format. You can
see that the changes are pretty minimal in common cases.
Note, the special case when SCALAR_VALUE register is a known constant
isn't changed, we'll emit constant value once, interpreted as signed
value.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231011223728.3188086-5-andrii@kernel.org
The PSI trigger code is now making a distinction between privileged and
unprivileged triggers, after the following commit:
65457b74aa ("sched/psi: Rename existing poll members in preparation")
But some comments have not been modified along with the code, so they
need to be updated.
This will help readers better understand the code.
Signed-off-by: Fan Yu <fan.yu9@zte.com.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Ziljstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/202310161920399921184@zte.com.cn
Add the necessary structure to support custom private-data per
posix-clock user.
The previous implementation of posix-clock assumed all file open
instances need access to the same clock structure on private_data.
The need for individual data structures per file open instance has been
identified when developing support for multiple timestamp event queue
users for ptp_clock.
Signed-off-by: Xabier Marquiegui <reibax@gmail.com>
Suggested-by: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When entering kdb/kgdb on a kernel panic, it was be observed that the
console isn't flushed before the `kdb` prompt came up. Specifically,
when using the buddy lockup detector on arm64 and running:
echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
I could see:
[ 26.161099] lkdtm: Performing direct entry HARDLOCKUP
[ 32.499881] watchdog: Watchdog detected hard LOCKUP on cpu 6
[ 32.552865] Sending NMI from CPU 5 to CPUs 6:
[ 32.557359] NMI backtrace for cpu 6
... [backtrace for cpu 6] ...
[ 32.558353] NMI backtrace for cpu 5
... [backtrace for cpu 5] ...
[ 32.867471] Sending NMI from CPU 5 to CPUs 0-4,7:
[ 32.872321] NMI backtrace forP cpuANC: Hard LOCKUP
Entering kdb (current=..., pid 0) on processor 5 due to Keyboard Entry
[5]kdb>
As you can see, backtraces for the other CPUs start printing and get
interleaved with the kdb PANIC print.
Let's replicate the commands to flush the console in the kdb panic
entry point to avoid this.
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20230822131945.1.I5b460ae8f954e4c4f628a373d6e74713c06dd26f@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow
creation and manipulation of struct bpf_iter_task_vma in open-coded
iterator style. BPF programs can use these kfuncs directly or through
bpf_for_each macro for natural-looking iteration of all task vmas.
The implementation borrows heavily from bpf_find_vma helper's locking -
differing only in that it holds the mmap_read lock for all iterations
while the helper only executes its provided callback on a maximum of 1
vma. Aside from locking, struct vma_iterator and vma_next do all the
heavy lifting.
A pointer to an inner data struct, struct bpf_iter_task_vma_data, is the
only field in struct bpf_iter_task_vma. This is because the inner data
struct contains a struct vma_iterator (not ptr), whose size is likely to
change under us. If bpf_iter_task_vma_kern contained vma_iterator directly
such a change would require change in opaque bpf_iter_task_vma struct's
size. So better to allocate vma_iterator using BPF allocator, and since
that alloc must already succeed, might as well allocate all iter fields,
thereby freezing struct bpf_iter_task_vma size.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231013204426.1074286-4-davemarchevsky@fb.com
Commit 6018e1f407 ("bpf: implement numbers iterator") added the
BTF_TYPE_EMIT line that this patch is modifying. The struct btf_iter_num
doesn't exist, so only a forward declaration is emitted in BTF:
FWD 'btf_iter_num' fwd_kind=struct
That commit was probably hoping to ensure that struct bpf_iter_num is
emitted in vmlinux BTF. A previous version of this patch changed the
line to emit the correct type, but Yonghong confirmed that it would
definitely be emitted regardless in [0], so this patch simply removes
the line.
This isn't marked "Fixes" because the extraneous btf_iter_num FWD wasn't
causing any issues that I noticed, aside from mild confusion when I
looked through the code.
[0]: https://lore.kernel.org/bpf/25d08207-43e6-36a8-5e0f-47a913d4cda5@linux.dev/
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231013204426.1074286-2-davemarchevsky@fb.com
linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support
for lazy preemption")) that adds an extra member to struct trace_entry.
This causes the offset of args field in struct trace_event_raw_sys_enter
be different from the one in struct syscall_trace_enter:
struct trace_event_raw_sys_enter {
struct trace_entry ent; /* 0 12 */
/* XXX last struct has 3 bytes of padding */
/* XXX 4 bytes hole, try to pack */
long int id; /* 16 8 */
long unsigned int args[6]; /* 24 48 */
/* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
char __data[]; /* 72 0 */
/* size: 72, cachelines: 2, members: 4 */
/* sum members: 68, holes: 1, sum holes: 4 */
/* paddings: 1, sum paddings: 3 */
/* last cacheline: 8 bytes */
};
struct syscall_trace_enter {
struct trace_entry ent; /* 0 12 */
/* XXX last struct has 3 bytes of padding */
int nr; /* 12 4 */
long unsigned int args[]; /* 16 0 */
/* size: 16, cachelines: 1, members: 3 */
/* paddings: 1, sum paddings: 3 */
/* last cacheline: 16 bytes */
};
This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
test_profiler testcase because max_ctx_offset is calculated based on the
former struct, while off on the latter:
10488 if (is_tracepoint || is_syscall_tp) {
10489 int off = trace_event_get_offsets(event->tp_event);
10490
10491 if (prog->aux->max_ctx_offset > off)
10492 return -EACCES;
10493 }
What bpf program is actually getting is a pointer to struct
syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes
the problem by aligning struct syscall_tp_t with struct
syscall_trace_(enter|exit) and changing the tests to use these structs
to dereference context.
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20231013054219.172920-1-asavkov@redhat.com
An io_uring openat operation can update an audit reference count
from multiple threads resulting in the call trace below.
A call to io_uring_submit() with a single openat op with a flag of
IOSQE_ASYNC results in the following reference count updates.
These first part of the system call performs two increments that do not race.
do_syscall_64()
__do_sys_io_uring_enter()
io_submit_sqes()
io_openat_prep()
__io_openat_prep()
getname()
getname_flags() /* update 1 (increment) */
__audit_getname() /* update 2 (increment) */
The openat op is queued to an io_uring worker thread which starts the
opportunity for a race. The system call exit performs one decrement.
do_syscall_64()
syscall_exit_to_user_mode()
syscall_exit_to_user_mode_prepare()
__audit_syscall_exit()
audit_reset_context()
putname() /* update 3 (decrement) */
The io_uring worker thread performs one increment and two decrements.
These updates can race with the system call decrement.
io_wqe_worker()
io_worker_handle_work()
io_wq_submit_work()
io_issue_sqe()
io_openat()
io_openat2()
do_filp_open()
path_openat()
__audit_inode() /* update 4 (increment) */
putname() /* update 5 (decrement) */
__audit_uring_exit()
audit_reset_context()
putname() /* update 6 (decrement) */
The fix is to change the refcnt member of struct audit_names
from int to atomic_t.
kernel BUG at fs/namei.c:262!
Call Trace:
...
? putname+0x68/0x70
audit_reset_context.part.0.constprop.0+0xe1/0x300
__audit_uring_exit+0xda/0x1c0
io_issue_sqe+0x1f3/0x450
? lock_timer_base+0x3b/0xd0
io_wq_submit_work+0x8d/0x2b0
? __try_to_del_timer_sync+0x67/0xa0
io_worker_handle_work+0x17c/0x2b0
io_wqe_worker+0x10a/0x350
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/
Fixes: 5bd2182d58 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Dan Clash <daclash@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Acceleration in SRCU happens on enqueue time for each new callback. This
operation is expected not to fail and therefore any similar attempt
from other places shouldn't find any remaining callbacks to accelerate.
Moreover accelerations performed beyond enqueue time are error prone
because rcu_seq_snap() then may return the snapshot for a new grace
period that is not going to be started.
Remove these dangerous and needless accelerations and introduce instead
assertions reporting leaking unaccelerated callbacks beyond enqueue
time.
Co-developed-by: Yong He <alexyonghe@tencent.com>
Signed-off-by: Yong He <alexyonghe@tencent.com>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Reviewed-by: Like Xu <likexu@tencent.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
irq_init_generic_chip() only sets the name for the first chip type, which
leads to empty names for other chip types. Eventually, these names will be
shown as "-" /proc/interrupts.
Set the name for all chip types by default.
Signed-off-by: Keguang Zhang <keguang.zhang@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230925121734.93017-1-keguang.zhang@gmail.com
The PELT acronym definition can be found right at the top of
kernel/sched/pelt.c (of course), but it cannot be found through use of
grep -r PELT kernel/sched/
Add the acronym "(PELT)" after "Per Entity Load Tracking" at the top of
the source file.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231012125824.1260774-1-mathieu.desnoyers@efficios.com
Kuyo reported sporadic failures on a sched_setaffinity() vs CPU
hotplug stress-test -- notably affine_move_task() remains stuck in
wait_for_completion(), leading to a hung-task detector warning.
Specifically, it was reported that stop_one_cpu_nowait(.fn =
migration_cpu_stop) returns false -- this stopper is responsible for
the matching complete().
The race scenario is:
CPU0 CPU1
// doing _cpu_down()
__set_cpus_allowed_ptr()
task_rq_lock();
takedown_cpu()
stop_machine_cpuslocked(take_cpu_down..)
<PREEMPT: cpu_stopper_thread()
MULTI_STOP_PREPARE
...
__set_cpus_allowed_ptr_locked()
affine_move_task()
task_rq_unlock();
<PREEMPT: cpu_stopper_thread()\>
ack_state()
MULTI_STOP_RUN
take_cpu_down()
__cpu_disable();
stop_machine_park();
stopper->enabled = false;
/>
/>
stop_one_cpu_nowait(.fn = migration_cpu_stop);
if (stopper->enabled) // false!!!
That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the
stopper thread gets a chance to preempt and allows the cpu-down for
the target CPU to complete.
OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to
issue a wakeup, it must not be ran under the scheduler locks.
Solve this apparent contradiction by keeping preemption disabled over
the unlock + queue_stopper combination:
preempt_disable();
task_rq_unlock(...);
if (!stop_pending)
stop_one_cpu_nowait(...)
preempt_enable();
This respects the lock ordering contraints while still avoiding the
above race. That is, if we find the CPU is online under rq-lock, the
targeted stop_one_cpu_nowait() must succeed.
Apply this pattern to all similar stop_one_cpu_nowait() invocations.
Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net
We could bail out early when psi was disabled.
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20230926115722.467833-1-haifeng.xu@shopee.com
- In cgroup1, the `tasks` file could have duplicate pids which can trigger a
warning in seq_file. Fix it by removing duplicate items after sorting.
- Comment update.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZSh+2A4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGfASAP9dgEe1Ay6jkJoCCGROnjPRDj2j7Cm9WWcHV79X
0Pr3zQEA/vFIpUzRZGbisrvnyXwNNLX12Hq/nwRX6DzN4UIkDgE=
=tPdI
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.6-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- In cgroup1, the `tasks` file could have duplicate pids which can
trigger a warning in seq_file. Fix it by removing duplicate items
after sorting
- Comment update
* tag 'cgroup-for-6.6-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Fix incorrect css_set_rwsem reference in comment
cgroup: Remove duplicates in cgroup v1 tasks file
* Fix access-after-free in pwq allocation error path.
* Implicitly ordered unbound workqueues should lose the implicit ordering if
an attribute change which isn't compatible with ordered operation is
requested. However, attribute changes requested through the sysfs
interface weren't doing that leaving no way to override the implicit
ordering through the sysfs interface. Fix it.
* Other doc and misc updates.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZSh9vQ4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGTG4AQCklH7aGqSbzGBPuV19gN6q+BPjkNNLTkEtOzW7
3t1gewEAuwGiGr5FwuxCuGhzTUm5dkELFFsKdzYk+Pt7B2M5Pg0=
=YQtE
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.6-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- Fix access-after-free in pwq allocation error path
- Implicitly ordered unbound workqueues should lose the implicit
ordering if an attribute change which isn't compatible with ordered
operation is requested. However, attribute changes requested through
the sysfs interface weren't doing that leaving no way to override the
implicit ordering through the sysfs interface. Fix it.
- Other doc and misc updates
* tag 'wq-for-6.6-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix -Wformat-truncation in create_worker
workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()
workqueue: Use the kmem_cache_free() instead of kfree() to release pwq
workqueue: doc: Fix function and sysfs path errors
workqueue: Fix UAF report by KASAN in pwq_release_workfn()
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
kernel/bpf/verifier.c
829955981c ("bpf: Fix verifier log for async callback return values")
a923819fb2 ("bpf: Treat first argument as return value for bpf_throw")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Previous releases - regressions:
- af_packet: fix fortified memcpy() without flex array.
- tcp: fix crashes trying to free half-baked MTU probes
- xdp: fix zero-size allocation warning in xskq_create()
- can: sja1000: always restart the tx queue after an overrun
- eth: mlx5e: again mutually exclude RX-FCS and RX-port-timestamp
- eth: nfp: avoid rmmod nfp crash issues
- eth: octeontx2-pf: fix page pool frag allocation warning
Previous releases - always broken:
- mctp: perform route lookups under a RCU read-side lock
- bpf: s390: fix clobbering the caller's backchain in the trampoline
- phy: lynx-28g: cancel the CDR check work item on the remove path
- dsa: qca8k: fix qca8k driver for Turris 1.x
- eth: ravb: fix use-after-free issue in ravb_tx_timeout_work()
- eth: ixgbe: fix crash with empty VF macvlan list
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmUnw0USHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkN0EP/RKl317fLqlm6ZzRUMVP169CNRAgMaBG
7FIwxlCv4hfO2Rx09Mxu2wjDp+tBQKqBKaxfcwh8tEdLMqqCymOW2K5+tWVty8C8
TJJS+zggqLAo7DjXbnT8GBm5owHPLKGNxW6vRmnw9xraCD/nuV1wqolI2+l4IxB+
kqfliltepnJSakg0uXg7/uwAE87slBzX5VgB6K5JKLiiDMD8tYoAUmZzH8bMJd0l
Cl7+L+ucRfQkj0DPfuZM/FncM0el7oFB6imnKd36hD6vfDfCNxpyNBYG1yZ/61/N
7H3E595Hr9PA+YBZjja3UvQGbFXkyMHloQdYxmq4s0T2WHqKwRyjLlwPayMXvavn
OTJh2VAs68ivtti0ry5Nbgz4viiNfr32PLyZr6XySwCZ1/TCLjV4Cq9IYnaP3YeM
KA+CIl3d0asQdZuMXTBivmtF65Buawt9UX/gJzUst2mNdcqhV1RTNWDNWoFLQ0qW
gz8XN68V5LhbaaOq/Lat80krWgNLNZIlTNmSsE/Ie799w7dAHn/xvT6h+h5pF1XX
dhng9NK7RL7KVcI/9walArOnhz9ksGWc2+JPMQohuPM/ITMHW11oOUOX6NwAre5m
hBJKh+Rz7ylLDLn33C4qowUhxnJlqqm+rDCVDTmoYngEFQvhEl19mfndSsC8P/K/
xXQJ+diS/Jug
=orAS
-----END PGP SIGNATURE-----
Merge tag 'net-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from CAN and BPF.
We have a regression in TC currently under investigation, otherwise
the things that stand off most are probably the TCP and AF_PACKET
fixes, with both issues coming from 6.5.
Previous releases - regressions:
- af_packet: fix fortified memcpy() without flex array.
- tcp: fix crashes trying to free half-baked MTU probes
- xdp: fix zero-size allocation warning in xskq_create()
- can: sja1000: always restart the tx queue after an overrun
- eth: mlx5e: again mutually exclude RX-FCS and RX-port-timestamp
- eth: nfp: avoid rmmod nfp crash issues
- eth: octeontx2-pf: fix page pool frag allocation warning
Previous releases - always broken:
- mctp: perform route lookups under a RCU read-side lock
- bpf: s390: fix clobbering the caller's backchain in the trampoline
- phy: lynx-28g: cancel the CDR check work item on the remove path
- dsa: qca8k: fix qca8k driver for Turris 1.x
- eth: ravb: fix use-after-free issue in ravb_tx_timeout_work()
- eth: ixgbe: fix crash with empty VF macvlan list"
* tag 'net-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
rswitch: Fix imbalance phy_power_off() calling
rswitch: Fix renesas_eth_sw_remove() implementation
octeontx2-pf: Fix page pool frag allocation warning
nfc: nci: assert requested protocol is valid
af_packet: Fix fortified memcpy() without flex array.
net: tcp: fix crashes trying to free half-baked MTU probes
net/smc: Fix pos miscalculation in statistics
nfp: flower: avoid rmmod nfp crash issues
net: usb: dm9601: fix uninitialized variable use in dm9601_mdio_read
ethtool: Fix mod state of verbose no_mask bitset
net: nfc: fix races in nfc_llcp_sock_get() and nfc_llcp_sock_get_sn()
mctp: perform route lookups under a RCU read-side lock
net: skbuff: fix kernel-doc typos
s390/bpf: Fix unwinding past the trampoline
s390/bpf: Fix clobbering the caller's backchain in the trampoline
net/mlx5e: Again mutually exclude RX-FCS and RX-port-timestamp
net/smc: Fix dependency of SMC on ISM
ixgbe: fix crash with empty VF macvlan list
net/mlx5e: macsec: use update_pn flag instead of PN comparation
net: phy: mscc: macsec: reject PN update requests
...
Compiling with W=1 emitted the following warning
(Compiler: gcc (x86-64, ver. 13.2.1, .config: result of make allyesconfig,
"Treat warnings as errors" turned off):
kernel/workqueue.c:2188:54: warning: ‘%d’ directive output may be
truncated writing between 1 and 10 bytes into a region of size
between 5 and 14 [-Wformat-truncation=]
kernel/workqueue.c:2188:50: note: directive argument in the range
[0, 2147483647]
kernel/workqueue.c:2188:17: note: ‘snprintf’ output between 4 and 23 bytes
into a destination of size 16
setting "id_buf" to size 23 will silence the warning, since GCC
determines snprintf's output to be max. 23 bytes in line 2188.
Please let me know if there are any mistakes in my patch!
Signed-off-by: Lucy Mielke <lucymielke@icloud.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1
to be ordered") enabled implicit ordered attribute to be added to
WQ_UNBOUND workqueues with max_active of 1. This prevented the changing
of attributes to these workqueues leading to fix commit 0a94efb5ac
("workqueue: implicit ordered attribute should be overridable").
However, workqueue_apply_unbound_cpumask() was not updated at that time.
So sysfs changes to wq_unbound_cpumask has no effect on WQ_UNBOUND
workqueues with implicit ordered attribute. Since not all WQ_UNBOUND
workqueues are visible on sysfs, we are not able to make all the
necessary cpumask changes even if we iterates all the workqueue cpumasks
in sysfs and changing them one by one.
Fix this problem by applying the corresponding change made
to apply_workqueue_attrs_locked() in the fix commit to
workqueue_apply_unbound_cpumask().
Fixes: 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
On an allyesconfig, with "treat warnings as errors" unset, GCC emits
these warnings:
kernel/locking/lockdep_proc.c:438:32: Warning: Format specifier '%lld' may
be truncated when writing 1 to 17 bytes into a region
of size 15 [-Wformat-truncation=]
kernel/locking/lockdep_proc.c:438:31: Note: Format directive argument is
in the range [-9223372036854775, 9223372036854775]
kernel/locking/lockdep_proc.c:438:9: Note: 'snprintf' has output
between 5 and 22 bytes into a target of size 15
In seq_time(), the longest s64 is "-9223372036854775808"-ish, which
converted to the fixed-point float format is "-9223372036854775.80": 21 bytes,
plus termination is another byte: 22. Therefore, a larger buffer size
of 22 is needed here - not 15. The code was safe due to the snprintf().
Fix it.
Signed-off-by: Lucy Mielke <lucymielke@icloud.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/ZSfOEHRkZAWaQr3U@fedora.fritz.box
Currently, the kfree() be used for pwq objects allocated with
kmem_cache_alloc() in alloc_and_link_pwqs(), this isn't wrong.
but usually, use "trace_kmem_cache_alloc/trace_kmem_cache_free"
to track memory allocation and free. this commit therefore use
kmem_cache_free() instead of kfree() in alloc_and_link_pwqs()
and also consistent with release of the pwq in rcu_free_pwq().
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Namhyung reported that bd27568117 ("perf: Rewrite core context handling")
regresses context switch overhead when perf-cgroup is in use together
with 'slow' PMUs like uncore.
Specifically, perf_cgroup_switch()'s perf_ctx_disable() /
ctx_sched_out() etc.. all iterate the full list of active PMUs for
that CPU, even if they don't have cgroup events.
Previously there was cgrp_cpuctx_list which linked the relevant PMUs
together, but that got lost in the rework. Instead of re-instruducing
a similar list, let the perf_event_pmu_context iteration skip those
that do not have cgroup events. This avoids growing multiple versions
of the perf_event_pmu_context iteration.
Measured performance (on a slightly different patch):
Before)
$ taskset -c 0 ./perf bench sched pipe -l 10000 -G AAA,BBB
# Running 'sched/pipe' benchmark:
# Executed 10000 pipe operations between two processes
Total time: 0.901 [sec]
90.128700 usecs/op
11095 ops/sec
After)
$ taskset -c 0 ./perf bench sched pipe -l 10000 -G AAA,BBB
# Running 'sched/pipe' benchmark:
# Executed 10000 pipe operations between two processes
Total time: 0.065 [sec]
6.560100 usecs/op
152436 ops/sec
Fixes: bd27568117 ("perf: Rewrite core context handling")
Reported-by: Namhyung Kim <namhyung@kernel.org>
Debugged-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231009210425.GC6307@noisy.programming.kicks-ass.net
While reworking the x86 topology code Thomas tripped over creating a 'DIE' domain
for the package mask. :-)
Since these names are CONFIG_SCHED_DEBUG=y only, rename them to make the
name less ambiguous.
[ Shrikanth Hegde: rename on s390 as well. ]
[ Valentin Schneider: also rename it in the comments. ]
[ mingo: port to recent kernels & find all remaining occurances. ]
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230712141056.GI3100107@hirez.programming.kicks-ass.net
These hooks allows intercepting connect(), getsockname(),
getpeername(), sendmsg() and recvmsg() for unix sockets. The unix
socket hooks get write access to the address length because the
address length is not fixed when dealing with unix sockets and
needs to be modified when a unix socket address is modified by
the hook. Because abstract socket unix addresses start with a
NUL byte, we cannot recalculate the socket address in kernelspace
after running the hook by calculating the length of the unix socket
path using strlen().
These hooks can be used when users want to multiplex syscall to a
single unix socket to multiple different processes behind the scenes
by redirecting the connect() and other syscalls to process specific
sockets.
We do not implement support for intercepting bind() because when
using bind() with unix sockets with a pathname address, this creates
an inode in the filesystem which must be cleaned up. If we rewrite
the address, the user might try to clean up the wrong file, leaking
the socket in the filesystem where it is never cleaned up. Until we
figure out a solution for this (and a use case for intercepting bind()),
we opt to not allow rewriting the sockaddr in bind() calls.
We also implement recvmsg() support for connected streams so that
after a connect() that is modified by a sockaddr hook, any corresponding
recmvsg() on the connected socket can also be modified to make the
connected program think it is connected to the "intended" remote.
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
As prep for adding unix socket support to the cgroup sockaddr hooks,
let's add a kfunc bpf_sock_addr_set_sun_path() that allows modifying a unix
sockaddr from bpf. While this is already possible for AF_INET and AF_INET6,
we'll need this kfunc when we add unix socket support since modifying the
address for those requires modifying both the address and the sockaddr
length.
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-4-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
As prep for adding unix socket support to the cgroup sockaddr hooks,
let's propagate the sockaddr length back to the caller after running
a bpf cgroup sockaddr hook program. While not important for AF_INET or
AF_INET6, the sockaddr length is important when working with AF_UNIX
sockaddrs as the size of the sockaddr cannot be determined just from the
address family or the sockaddr's contents.
__cgroup_bpf_run_filter_sock_addr() is modified to take the uaddrlen as
an input/output argument. After running the program, the modified sockaddr
length is stored in the uaddrlen pointer.
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-3-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
The 'update_total' parameter of update_triggers() is always true after the
previous commit:
80cc1d1d5e ("sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes")
If the 'changed_states & group->rtpoll_states' condition is true,
'new_stall' in update_triggers() will be true, and then 'update_total'
should also be true.
So update_total is redundant - remove it.
[ mingo: Changelog updates ]
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Ziljstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/202310101645437859599@zte.com.cn
When psimon wakes up and there are no state changes for ->rtpoll_states,
it's unnecessary to update triggers and ->rtpoll_total because the pressures
being monitored by the user have not changed.
This will help to slightly reduce unnecessary computations of PSI.
[ mingo: Changelog updates ]
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Ziljstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/202310101641075436843@zte.com.cn
Enable unprivileged sandboxes to create their own binfmt_misc mounts.
This is based on Laurent's work in [1] but has been significantly
reworked to fix various issues we identified in earlier versions.
While binfmt_misc can currently only be mounted in the initial user
namespace, binary types registered in this binfmt_misc instance are
available to all sandboxes (Either by having them installed in the
sandbox or by registering the binary type with the F flag causing the
interpreter to be opened right away). So binfmt_misc binary types are
already delegated to sandboxes implicitly.
However, while a sandbox has access to all registered binary types in
binfmt_misc a sandbox cannot currently register its own binary types
in binfmt_misc. This has prevented various use-cases some of which were
already outlined in [1] but we have a range of issues associated with
this (cf. [3]-[5] below which are just a small sample).
Extend binfmt_misc to be mountable in non-initial user namespaces.
Similar to other filesystem such as nfsd, mqueue, and sunrpc we use
keyed superblock management. The key determines whether we need to
create a new superblock or can reuse an already existing one. We use the
user namespace of the mount as key. This means a new binfmt_misc
superblock is created once per user namespace creation. Subsequent
mounts of binfmt_misc in the same user namespace will mount the same
binfmt_misc instance. We explicitly do not create a new binfmt_misc
superblock on every binfmt_misc mount as the semantics for
load_misc_binary() line up with the keying model. This also allows us to
retrieve the relevant binfmt_misc instance based on the caller's user
namespace which can be done in a simple (bounded to 32 levels) loop.
Similar to the current binfmt_misc semantics allowing access to the
binary types in the initial binfmt_misc instance we do allow sandboxes
access to their parent's binfmt_misc mounts if they do not have created
a separate binfmt_misc instance.
Overall, this will unblock the use-cases mentioned below and in general
will also allow to support and harden execution of another
architecture's binaries in tight sandboxes. For instance, using the
unshare binary it possible to start a chroot of another architecture and
configure the binfmt_misc interpreter without being root to run the
binaries in this chroot and without requiring the host to modify its
binary type handlers.
Henning had already posted a few experiments in the cover letter at [1].
But here's an additional example where an unprivileged container
registers qemu-user-static binary handlers for various binary types in
its separate binfmt_misc mount and is then seamlessly able to start
containers with a different architecture without affecting the host:
root [lxc monitor] /var/snap/lxd/common/lxd/containers f1
1000000 \_ /sbin/init
1000000 \_ /lib/systemd/systemd-journald
1000000 \_ /lib/systemd/systemd-udevd
1000100 \_ /lib/systemd/systemd-networkd
1000101 \_ /lib/systemd/systemd-resolved
1000000 \_ /usr/sbin/cron -f
1000103 \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
1000000 \_ /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
1000104 \_ /usr/sbin/rsyslogd -n -iNONE
1000000 \_ /lib/systemd/systemd-logind
1000000 \_ /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
1000107 \_ dnsmasq --conf-file=/dev/null -u lxc-dnsmasq --strict-order --bind-interfaces --pid-file=/run/lxc/dnsmasq.pid --liste
1000000 \_ [lxc monitor] /var/lib/lxc f1-s390x
1100000 \_ /usr/bin/qemu-s390x-static /sbin/init
1100000 \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-journald
1100000 \_ /usr/bin/qemu-s390x-static /usr/sbin/cron -f
1100103 \_ /usr/bin/qemu-s390x-static /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-ac
1100000 \_ /usr/bin/qemu-s390x-static /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
1100104 \_ /usr/bin/qemu-s390x-static /usr/sbin/rsyslogd -n -iNONE
1100000 \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-logind
1100000 \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
1100000 \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/0 115200,38400,9600 vt220
1100000 \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/1 115200,38400,9600 vt220
1100000 \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/2 115200,38400,9600 vt220
1100000 \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/3 115200,38400,9600 vt220
1100000 \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-udevd
[1]: https://lore.kernel.org/all/20191216091220.465626-1-laurent@vivier.eu
[2]: https://discuss.linuxcontainers.org/t/binfmt-misc-permission-denied
[3]: https://discuss.linuxcontainers.org/t/lxd-binfmt-support-for-qemu-static-interpreters
[4]: https://discuss.linuxcontainers.org/t/3-1-0-binfmt-support-service-in-unprivileged-guest-requires-write-access-on-hosts-proc-sys-fs-binfmt-misc
[5]: https://discuss.linuxcontainers.org/t/qemu-user-static-not-working-4-11
Link: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu (origin)
Link: https://lore.kernel.org/r/20211028103114.2849140-2-brauner@kernel.org (v1)
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Henning Schild <henning.schild@siemens.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Laurent Vivier <laurent@vivier.eu>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
/* v2 */
- Serge Hallyn <serge@hallyn.com>:
- Use GFP_KERNEL_ACCOUNT for userspace triggered allocations when a
new binary type handler is registered.
- Christian Brauner <christian.brauner@ubuntu.com>:
- Switch authorship to me. I refused to do that earlier even though
Laurent said I should do so because I think it's genuinely bad form.
But by now I have changed so many things that it'd be unfair to
blame Laurent for any potential bugs in here.
- Add more comments that explain what's going on.
- Rename functions while changing them to better reflect what they are
doing to make the code easier to understand.
- In the first version when a specific binary type handler was removed
either through a write to the entry's file or all binary type
handlers were removed by a write to the binfmt_misc mount's status
file all cleanup work happened during inode eviction.
That includes removal of the relevant entries from entry list. While
that works fine I disliked that model after thinking about it for a
bit. Because it means that there was a window were someone has
already removed a or all binary handlers but they could still be
safely reached from load_misc_binary() when it has managed to take
the read_lock() on the entries list while inode eviction was already
happening. Again, that perfectly benign but it's cleaner to remove
the binary handler from the list immediately meaning that ones the
write to then entry's file or the binfmt_misc status file returns
the binary type cannot be executed anymore. That gives stronger
guarantees to the user.
There is a typo so this checks the wrong variable. "chains" plural vs
"chain" singular. We already know that "chains" is non-zero.
Fixes: 7f993623e9 ("locktorture: Add call_rcu_chains module parameter")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
pr_flush() does not guarantee that all messages would really get flushed
to the console. The best it could do is to wait with a given timeout.[*]
The current interval 100ms for checking the progress might seem too
long in some situations. For example, such delays are not appreciated
during suspend and resume especially when the consoles have been flushed
"long" time before the check.
On the other hand, the sleeping wait might be useful in other situations.
Especially, it would allow flushing the messages using printk kthreads
on the same CPU[*].
Use msleep(1) as a compromise.
Also measure the time using jiffies. msleep() does not guarantee
precise wakeup after the given delay. It might be much longer,
especially for times < 20s. See Documentation/timers/timers-howto.rst
for more details.
Note that msecs_to_jiffies() already translates a negative value into
an infinite timeout.
[*] console_unlock() does not guarantee flushing the consoles since
the commit dbdda842fe ("printk: Add console owner and waiter
logic to load balance console writes").
It would be possible to guarantee it another way. For example,
the spinning might be enabled only when the console_lock has been
taken via console_trylock().
But the load balancing is helpful. And more importantly, the flush
with a timeout has been added as a preparation step for introducing
printk kthreads.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20231006082151.6969-3-pmladek@suse.com
There is a comment that refers to cpu_load, however, this cpu_load was
removed with:
55627e3cd2 ("sched/core: Remove rq->cpu_load[]")
... back in 2019. The comment does not make sense with respect to this
removed array, so remove the comment.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231010155744.1381065-1-colin.i.king@gmail.com
VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.
Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial or not to the workload. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that inactive while the workload executes.
The VMA skipping activity frequency with and without the patch:
6.6.0-rc2-sched-numabtrace-v1
=============================
649 reason=scan_delay
9,094 reason=unsuitable
48,915 reason=shared_ro
143,919 reason=inaccessible
193,050 reason=pid_inactive
6.6.0-rc2-sched-numabselective-v1
=============================
146 reason=seq_completed
622 reason=ignore_pid_inactive
624 reason=scan_delay
6,570 reason=unsuitable
16,101 reason=shared_ro
27,608 reason=inaccessible
41,939 reason=pid_inactive
Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase the number of PTEs updated for NUMA hinting faults
as well as hinting faults but these represent PTEs that would otherwise
have been missed. The tradeoff is scan+fault overhead versus improving
locality due to migration.
On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows;
6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1
Min elsp-NUMA01_THREADLOCAL 174.22 ( 0.00%) 117.64 ( 32.48%)
Amean elsp-NUMA01_THREADLOCAL 175.68 ( 0.00%) 123.34 * 29.79%*
Stddev elsp-NUMA01_THREADLOCAL 1.20 ( 0.00%) 4.06 (-238.20%)
CoeffVar elsp-NUMA01_THREADLOCAL 0.68 ( 0.00%) 3.29 (-381.70%)
Max elsp-NUMA01_THREADLOCAL 177.18 ( 0.00%) 128.03 ( 27.74%)
The time to complete the workload is reduced by almost 30%:
6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1 /
Duration User 91201.80 63506.64
Duration System 2015.53 1819.78
Duration Elapsed 1234.77 868.37
In this specific case, system CPU time was not increased but it's not
universally true.
From vmstat, the NUMA scanning and fault activity is as follows;
6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1
Ops NUMA base-page range updates 64272.00 26374386.00
Ops NUMA PTE updates 36624.00 55538.00
Ops NUMA PMD updates 54.00 51404.00
Ops NUMA hint faults 15504.00 75786.00
Ops NUMA hint local faults % 14860.00 56763.00
Ops NUMA hint local percent 95.85 74.90
Ops NUMA pages migrated 1629.00 6469222.00
Both the number of PTE updates and hint faults is dramatically
increased. While this is superficially unfortunate, it represents
ranges that were simply skipped without the patch. As a result
of the scanning and hinting faults, many more pages were also
migrated but as the time to completion is reduced, the overhead
is offset by the gain.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Link: https://lore.kernel.org/r/20231010083143.19593-7-mgorman@techsingularity.net
NUMA Balancing skips VMAs when the current task has not trapped a NUMA
fault within the VMA. If the VMA is skipped then mm->numa_scan_offset
advances and a task that is trapping faults within the VMA may never
fully update PTEs within the VMA.
Force tasks to update PTEs for partially scanned PTEs. The VMA will
be tagged for NUMA hints by some task but this removes some of the
benefit of tracking PID activity within a VMA. A follow-on patch
will mitigate this problem.
The test cases and machines evaluated did not trigger the corner case so
the performance results are neutral with only small changes within the
noise from normal test-to-test variance. However, the next patch makes
the corner case easier to trigger.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Link: https://lore.kernel.org/r/20231010083143.19593-6-mgorman@techsingularity.net
Fix the various warnings from kernel-doc in kernel/fork.c
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230824193644.3029141-1-willy@infradead.org
SRCU callbacks acceleration might fail if the preceding callbacks
advance also fails. This can happen when the following steps are met:
1) The RCU_WAIT_TAIL segment has callbacks (say for gp_num 8) and the
RCU_NEXT_READY_TAIL also has callbacks (say for gp_num 12).
2) The grace period for RCU_WAIT_TAIL is observed as started but not yet
completed so rcu_seq_current() returns 4 + SRCU_STATE_SCAN1 = 5.
3) This value is passed to rcu_segcblist_advance() which can't move
any segment forward and fails.
4) srcu_gp_start_if_needed() still proceeds with callback acceleration.
But then the call to rcu_seq_snap() observes the grace period for the
RCU_WAIT_TAIL segment (gp_num 8) as completed and the subsequent one
for the RCU_NEXT_READY_TAIL segment as started
(ie: 8 + SRCU_STATE_SCAN1 = 9) so it returns a snapshot of the
next grace period, which is 16.
5) The value of 16 is passed to rcu_segcblist_accelerate() but the
freshly enqueued callback in RCU_NEXT_TAIL can't move to
RCU_NEXT_READY_TAIL which already has callbacks for a previous grace
period (gp_num = 12). So acceleration fails.
6) Note in all these steps, srcu_invoke_callbacks() hadn't had a chance
to run srcu_invoke_callbacks().
Then some very bad outcome may happen if the following happens:
7) Some other CPU races and starts the grace period number 16 before the
CPU handling previous steps had a chance. Therefore srcu_gp_start()
isn't called on the latter sdp to fix the acceleration leak from
previous steps with a new pair of call to advance/accelerate.
8) The grace period 16 completes and srcu_invoke_callbacks() is finally
called. All the callbacks from previous grace periods (8 and 12) are
correctly advanced and executed but callbacks in RCU_NEXT_READY_TAIL
still remain. Then rcu_segcblist_accelerate() is called with a
snaphot of 20.
9) Since nothing started the grace period number 20, callbacks stay
unhandled.
This has been reported in real load:
[3144162.608392] INFO: task kworker/136:12:252684 blocked for more
than 122 seconds.
[3144162.615986] Tainted: G O K 5.4.203-1-tlinux4-0011.1 #1
[3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[3144162.631162] kworker/136:12 D 0 252684 2 0x90004000
[3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
[3144162.631192] Call Trace:
[3144162.631202] __schedule+0x2ee/0x660
[3144162.631206] schedule+0x33/0xa0
[3144162.631209] schedule_timeout+0x1c4/0x340
[3144162.631214] ? update_load_avg+0x82/0x660
[3144162.631217] ? raw_spin_rq_lock_nested+0x1f/0x30
[3144162.631218] wait_for_completion+0x119/0x180
[3144162.631220] ? wake_up_q+0x80/0x80
[3144162.631224] __synchronize_srcu.part.19+0x81/0xb0
[3144162.631226] ? __bpf_trace_rcu_utilization+0x10/0x10
[3144162.631227] synchronize_srcu+0x5f/0xc0
[3144162.631236] irqfd_shutdown+0x3c/0xb0 [kvm]
[3144162.631239] ? __schedule+0x2f6/0x660
[3144162.631243] process_one_work+0x19a/0x3a0
[3144162.631244] worker_thread+0x37/0x3a0
[3144162.631247] kthread+0x117/0x140
[3144162.631247] ? process_one_work+0x3a0/0x3a0
[3144162.631248] ? __kthread_cancel_work+0x40/0x40
[3144162.631250] ret_from_fork+0x1f/0x30
Fix this with taking the snapshot for acceleration _before_ the read
of the current grace period number.
The only side effect of this solution is that callbacks advancing happen
then _after_ the full barrier in rcu_seq_snap(). This is not a problem
because that barrier only cares about:
1) Ordering accesses of the update side before call_srcu() so they don't
bleed.
2) See all the accesses prior to the grace period of the current gp_num
The only things callbacks advancing need to be ordered against are
carried by snp locking.
Reported-by: Yong He <alexyonghe@tencent.com>
Co-developed-by:: Yong He <alexyonghe@tencent.com>
Signed-off-by: Yong He <alexyonghe@tencent.com>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Link: http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
Fixes: da915ad5cf ("srcu: Parallelize callback handling")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
The SMT control mechanism got added as speculation attack vector
mitigation. The implemented logic relies on the primary thread mask to
be set up properly.
This turns out to be an issue with XEN/PV guests because their CPU hotplug
mechanics do not enumerate APICs and therefore the mask is never correctly
populated.
This went unnoticed so far because by chance XEN/PV ends up with
smp_num_siblings == 2. So smt_hotplug_control stays at its default value
CPU_SMT_ENABLED and the primary thread mask is never evaluated in the
context of CPU hotplug.
This stopped "working" with the upcoming overhaul of the topology
evaluation which legitimately provides a fake topology for XEN/PV. That
sets smp_num_siblings to 1, which causes the core CPU hot-plug core to
refuse to bring up the APs.
This happens because smt_hotplug_control is set to CPU_SMT_NOT_SUPPORTED
which causes cpu_smt_allowed() to evaluate the unpopulated primary thread
mask with the conclusion that all non-boot CPUs are not valid to be
plugged.
Make cpu_smt_allowed() more robust and take CPU_SMT_NOT_SUPPORTED and
CPU_SMT_NOT_IMPLEMENTED into account. Rename it to cpu_bootable() while at
it as that makes it more clear what the function is about.
The primary mask issue on x86 XEN/PV needs to be addressed separately as
there are users outside of the CPU hotplug code too.
Fixes: 05736e4ac1 ("cpu/hotplug: Provide knobs to control SMT")
Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.149440843@linutronix.de
Recent NUMA hinting faulting activity is reset approximately every
VMA_PID_RESET_PERIOD milliseconds. However, if the current task has not
accessed a VMA then the reset check is missed and the reset is potentially
deferred forever. Check if the PID activity information should be reset
before checking if the current task recently trapped a NUMA hinting fault.
[ mgorman@techsingularity.net: Rewrite changelog ]
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231010083143.19593-5-mgorman@techsingularity.net
NUMA balancing skips or scans VMAs for a variety of reasons. In preparation
for completing scans of VMAs regardless of PID access, trace the reasons
why a VMA was skipped. In a later patch, the tracing will be used to track
if a VMA was forcibly scanned.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231010083143.19593-4-mgorman@techsingularity.net
The access_pids[] field name is somewhat ambiguous as no PIDs are accessed.
Similarly, it's not clear that next_pid_reset is related to access_pids[].
Rename the fields to more accurately reflect their purpose.
[ mingo: Rename in the comments too. ]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231010083143.19593-3-mgorman@techsingularity.net
The verifier, as part of check_return_code(), verifies that async
callbacks such as from e.g. timers, will return 0. It does this by
correctly checking that R0->var_off is in tnum_const(0), which
effectively checks that it's in a range of 0. If this condition fails,
however, it prints an error message which says that the value should
have been in (0x0; 0x1). This results in possibly confusing output such
as the following in which an async callback returns 1:
At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x1)
The fix is easy -- we should just pass the tnum_const(0) as the correct
range to verbose_invalid_scalar(), which will then print the following:
At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x0)
Fixes: bfc6bb74e4 ("bpf: Implement verifier support for validation of async callbacks.")
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231009161414.235829-1-void@manifault.com
cgroup v1 or v2 or both controller names can be passed as arguments to
the 'cgroup_no_v1' kernel parameter, though most of the controller's
names are the same for both cgroup versions. This can be confusing when
both versions are used interchangeably, i.e., passing cgroup_no_v1=io
$ sudo dmesg |grep cgroup
...
cgroup: Disabling io control group subsystem in v1 mounts
cgroup: Disabled controller 'blkio'
Make it consistent across the pr_info()'s, by using ss->legacy_name, as
the subsystem name, while printing the cgroup v1 controller disabling
information in cgroup_init().
Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
One PID may appear multiple times in a preloaded pidlist.
(Possibly due to PID recycling but we have reports of the same
task_struct appearing with different PIDs, thus possibly involving
transfer of PID via de_thread().)
Because v1 seq_file iterator uses PIDs as position, it leads to
a message:
> seq_file: buggy .next function kernfs_seq_next did not update position index
Conservative and quick fix consists of removing duplicates from `tasks`
file (as opposed to removing pidlists altogether). It doesn't affect
correctness (it's sufficient to show a PID once), performance impact
would be hidden by unconditional sorting of the pidlist already in place
(asymptotically).
Link: https://lore.kernel.org/r/20230823174804.23632-1-mkoutny@suse.com/
Suggested-by: Firo Yang <firo.yang@suse.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Move it out of the .c file into the shared scheduler-internal header file,
to gain type-checking.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com
The 'sched_energy_aware' sysctl is available for the admin to disable/enable
energy aware scheduling(EAS). EAS is enabled only if few conditions are
met by the platform. They are, asymmetric CPU capacity, no SMT,
schedutil CPUfreq governor, frequency invariant load tracking etc.
A platform may boot without EAS capability, but could gain such
capability at runtime. For example, changing/registering the cpufreq
governor to schedutil.
At present, though platform doesn't support EAS, this sysctl returns 1
and it ends up calling build_perf_domains on write to 1 and
NOP when writing to 0. That is confusing and un-necessary.
Desired behavior would be to have this sysctl to enable/disable the EAS
on supported platform. On non-supported platform write to the sysctl
would return not supported error and read of the sysctl would return
empty. So sched_energy_aware returns empty - EAS is not possible at this moment
This will include EAS capable platforms which have at least one EAS
condition false during startup, e.g. not using the schedutil cpufreq governor
sched_energy_aware returns 0 - EAS is supported but disabled by admin.
sched_energy_aware returns 1 - EAS is supported and enabled.
User can find out the reason why EAS is not possible by checking
info messages. sched_is_eas_possible returns true if the platform
can do EAS at this moment.
Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com
BPF supports creating high resolution timers using bpf_timer_* helper
functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which
specifies that the timeout should be interpreted as absolute time. It
would also be useful to be able to pin that timer to a core. For
example, if you wanted to make a subset of cores run without timer
interrupts, and only have the timer be invoked on a single core.
This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag.
When specified, the HRTIMER_MODE_PINNED flag is passed to
hrtimer_start(). A subsequent patch will update selftests to validate.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20231004162339.200702-2-void@manifault.com
Per-package perf events are typically registered with a single CPU only,
however they can be read across all the CPUs within the package.
Currently perf_event_read maps the event CPU according to the topology
information to avoid an unnecessary SMP call, however
perf_event_read_local deals with hard values and rejects a read with a
failure if the CPU is not the one exactly registered. Allow similar
mapping within the perf_event_read_local if the perf event in question
can support this.
This allows users like BPF code to read the package perf events properly
across different CPUs within a package.
Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230913125956.3652667-1-tero.kristo@linux.intel.com
Some userspace applications use timerfd_create() to request wakeups after
a long period of time. For example, a backup application may request a
wakeup once per week. This is perfectly fine as long as the system does
not try to suspend. However, if the system tries to suspend and the
system's RTC does not support the required alarm timeout, the suspend
operation will fail with an error such as
rtc_cmos 00:01: Alarms can be up to one day in the future
PM: dpm_run_callback(): platform_pm_suspend+0x0/0x4a returns -22
alarmtimer alarmtimer.4.auto: platform_pm_suspend+0x0/0x4a returned -22 after 117 usecs
PM: Device alarmtimer.4.auto failed to suspend: error -22
This results in a refusal to suspend the system, causing substantial
battery drain on affected systems.
To fix the problem, use the maximum alarm time offset as reported by RTC
drivers to set the maximum alarm time. While this may result in early
wakeups from suspend, it is still much better than not suspending at all.
Standardize system behavior if the requested alarm timeout is larger than
the alarm timeout supported by the rtc chip. Currently, in this situation,
the RTC driver will do one of the following:
- It may return an error.
- It may limit the alarm timeout to the maximum supported by the rtc chip.
- It may mask the timeout by the maximum alarm timeout supported by the RTC
chip (i.e. a requested timeout of 1 day + 1 minute may result in a 1
minute timeout).
With this in place, if the RTC driver reports the maximum alarm timeout
supported by the RTC chip, the system will always limit the alarm timeout
to the maximum supported by the RTC chip.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20230915152238.1144706-3-linux@roeck-us.net
Update_triggers() always returns now + group->rtpoll_min_period, and the
return value is only used by psi_rtpoll_work(), so change update_triggers()
to a void function, let group->rtpoll_next_update = now +
group->rtpoll_min_period directly.
This will avoid unnecessary function return value passing & simplifies
the function.
[ mingo: Updated changelog ]
Suggested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/202310092024289721617@zte.com.cn
The Energy Aware Scheduler (EAS) estimates the energy consumption
of placing a task on different CPUs. The goal is to minimize this
energy consumption. Estimating the energy of different task placements
is increasingly complex with the size of the platform.
To avoid having a slow wake-up path, EAS is only enabled if this
complexity is low enough.
The current complexity limit was set in:
b68a4c0dba ("sched/topology: Disable EAS on inappropriate platforms")
... based on the first implementation of EAS, which was re-computing
the power of the whole platform for each task placement scenario, see:
390031e4c3 ("sched/fair: Introduce an energy estimation helper function")
... but the complexity of EAS was reduced in:
eb92692b25 ("sched/fair: Speed-up energy-aware wake-ups")
... and find_energy_efficient_cpu() (feec) algorithm was updated in:
3e8c6c9aac ("sched/fair: Remove task_util from effective utilization in feec()")
find_energy_efficient_cpu() (feec) is now doing:
feec()
\_ for_each_pd(pd) [0]
// get max_spare_cap_cpu and compute_prev_delta
\_ for_each_cpu(pd) [1]
\_ eenv_pd_busy_time(pd) [2]
\_ for_each_cpu(pd)
// compute_energy(pd) without the task
\_ eenv_pd_max_util(pd, -1) [3.0]
\_ for_each_cpu(pd)
\_ em_cpu_energy(pd, -1)
\_ for_each_ps(pd)
// compute_energy(pd) with the task on prev_cpu
\_ eenv_pd_max_util(pd, prev_cpu) [3.1]
\_ for_each_cpu(pd)
\_ em_cpu_energy(pd, prev_cpu)
\_ for_each_ps(pd)
// compute_energy(pd) with the task on max_spare_cap_cpu
\_ eenv_pd_max_util(pd, max_spare_cap_cpu) [3.2]
\_ for_each_cpu(pd)
\_ em_cpu_energy(pd, max_spare_cap_cpu)
\_ for_each_ps(pd)
[3.1] happens only once since prev_cpu is unique. With the same
definitions for nr_pd, nr_cpus and nr_ps, the complexity is of:
nr_pd * (2 * [nr_cpus in pd] + 2 * ([nr_cpus in pd] + [nr_ps in pd]))
+ ([nr_cpus in pd] + [nr_ps in pd])
[0] * ( [1] + [2] + [3.0] + [3.2] )
+ [3.1]
= nr_pd * (4 * [nr_cpus in pd] + 2 * [nr_ps in pd])
+ [nr_cpus in prev pd] + nr_ps
The complexity limit was set to 2048 in:
b68a4c0dba ("sched/topology: Disable EAS on inappropriate platforms")
... to make "EAS usable up to 16 CPUs with per-CPU DVFS and less than 8
performance states each". For the same platform, the complexity would
actually be of:
16 * (4 + 2 * 7) + 1 + 7 = 296
Since the EAS complexity was greatly reduced since the limit was
introduced, bigger platforms can handle EAS.
For instance, a platform with 112 CPUs with 7 performance states
each would not reach it:
112 * (4 + 2 * 7) + 1 + 7 = 2024
To reflect this improvement in the underlying EAS code, remove
the EAS complexity check.
Note that a limit on the number of CPUs still holds against
EM_MAX_NUM_CPUS to avoid overflows during the energy estimation.
[ mingo: Updates to the changelog. ]
Signed-off-by: Pierre Gondois <Pierre.Gondois@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20231009060037.170765-2-sshegde@linux.vnet.ibm.com
Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity()
instead.
The scheduler uses 3 methods to get access to a CPU's max compute capacity:
- arch_scale_cpu_capacity(cpu) which is the default way to get a CPU's capacity.
- cpu_capacity_orig field which is periodically updated with
arch_scale_cpu_capacity().
- capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.
There is no real need to save the value returned by arch_scale_cpu_capacity()
in struct rq. arch_scale_cpu_capacity() returns:
- either a per_cpu variable.
- or a const value for systems which have only one capacity.
Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.
No functional changes.
Some performance tests on Arm64:
- small SMP device (hikey): no noticeable changes
- HMP device (RB5): hackbench shows minor improvement (1-2%)
- large smp (thx2): hackbench and tbench shows minor improvement (1%)
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20231009103621.374412-2-vincent.guittot@linaro.org
Doing this matches the natural type of 'int' based calculus
in sched_rt_handler(), and also enables the adding in of a
correct upper bounds check on the sysctl interface.
[ mingo: Rewrote the changelog. ]
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231008021538.3063250-1-yajun.deng@linux.dev
find_new_ilb() returns nr_cpu_ids on failure - which is the usual
cpumask bitops return pattern, but is weird & unnecessary in this
context: not only is it a global variable, it it is a +1 out of
bounds CPU index and also has different signedness ...
Its only user, kick_ilb(), then checks the return against nr_cpu_ids
to decide to return. There's no other use.
So instead of this, use a standard -1 return on failure to find an
idle CPU, as the argument is signed already.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20231006102518.2452758-4-mingo@kernel.org
Commit 9e70a5e109 ("printk: Add per-console suspended state")
removed console lock usage during resume and replaced it with
the clearly defined console_list_lock and srcu mechanisms.
However, the console lock usage had an important side-effect
of flushing the consoles. After its removal, consoles were no
longer flushed before checking their progress.
Add the console_lock/console_unlock dance to the beginning
of __pr_flush() to actually flush the consoles before checking
their progress. Also add comments to clarify this additional
usage of the console lock.
Note that console_unlock() does not guarantee flushing all messages
since the commit dbdda842fe ("printk: Add console owner and waiter
logic to load balance console writes").
Reported-by: Todd Brandt <todd.e.brandt@intel.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217955
Fixes: 9e70a5e109 ("printk: Add per-console suspended state")
Co-developed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20231006082151.6969-2-pmladek@suse.com
The old pick_eevdf() could fail to find the actual earliest eligible
deadline when it descended to the right looking for min_deadline, but
it turned out that that min_deadline wasn't actually eligible. In that
case we need to go back and search through any left branches we
skipped looking for the actual best _eligible_ min_deadline.
This is more expensive, but still O(log n), and at worst should only
involve descending two branches of the rbtree.
I've run this through a userspace stress test (thank you
tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
corner cases.
Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/xm261qego72d.fsf_-_@google.com
Marek and Biju reported instances of:
"EEVDF scheduling fail, picking leftmost"
which Mike correlated with cgroup scheduling and the min_deadline heap
getting corrupted; some trace output confirms:
> And yeah, min_deadline is hosed somehow:
>
> validate_cfs_rq: --- /
> __print_se: ffff88845cf48080 w: 1024 ve: -58857638 lag: 870381 vd: -55861854 vmd: -66302085 E (11372/tr)
> __print_se: ffff88810d165800 w: 25 ve: -80323686 lag: 22336429 vd: -41496434 vmd: -66302085 E (-1//autogroup-31)
> __print_se: ffff888108379000 w: 25 ve: 0 lag: -57987257 vd: 114632828 vmd: 114632828 N (-1//autogroup-33)
> validate_cfs_rq: min_deadline: -55861854 avg_vruntime: -62278313462 / 1074 = -57987256
Turns out that reweight_entity(), which tries really hard to be fast,
does not do the normal dequeue+update+enqueue pattern but *does* scale
the deadline.
However, it then fails to propagate the updated deadline value up the
heap.
Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Biju Das <biju.das.jz@bp.renesas.com>
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Link: https://lkml.kernel.org/r/20231006192445.GE743@noisy.programming.kicks-ass.net
Currently, there is no overflow-check with memdup_user().
Use the new function memdup_array_user() instead of memdup_user() for
duplicating the user-space array safely.
Suggested-by: David Airlie <airlied@redhat.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Zack Rusin <zackr@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230920123612.16914-5-pstanner@redhat.com
Currently, there is no overflow-check with memdup_user().
Use the new function memdup_array_user() instead of memdup_user() for
duplicating the user-space array safely.
Suggested-by: David Airlie <airlied@redhat.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Zack Rusin <zackr@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230920123612.16914-4-pstanner@redhat.com
- Two EEVDF fixes: one to fix sysctl_sched_base_slice propagation,
and to fix an avg_vruntime() corner-case.
- A cpufreq frequency scaling fix
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUidV8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1itsA//Yl5PmadMcHxX2UYzMho9bCMNdVZyb8im
C0xbvs1o7LO1RgzkqV4kpGjRy1BqHzrYeE5xCiI9K/HoqdmChtB7+oQBJ1Y7mjfy
miePmjGXRol+0H5eR94QqgL5M/SspmwEmsFm9QwfMmcnUOZOMkiWiElnCKMJbsbr
XE2Bj0sj+BuFu6PK6f0R+aoy/H6Za0g5DpujxGfRJHlep8vuJA4afIO9rL18EXa3
AI2YnvZh5IH9EMJXQ8c+dtqi0xPTWhSpQ28EDAMV89TnAJAv+uo/cMPoZj+ewjb3
PFN5ASI2f7IdCuK4vdixZM1E9vgM9UTI4Ju9IanUcXkUs8YNUXJVZXezaGZG/wZN
QXD827AjScTZJWIJGGMfaB8ubYVRqg6wG4NRcToFHxp5G5o7iZ0joTenSA6/nGy9
o5RpA8KbB1oyuSlWvHqNCYmc8QavujoiaDbyqlsY2E5mIqNHkbegK7kiAgeVNnL1
oqKgnzjAAER0gujqP/4jTHIlF23sh17/oIRgHb+y2wWMxwnZR/TKIxuMaYrmoq0I
FIPx7l20USl9n2VmSl29vzzUZaM07AeKl3HGtYGMdgAUG+meEH0dATn690WPnF12
MNQolPMMpp051LV4EUJVKySIEb/KCknTPtRHpbHNfBSnrptF6X9GsPl+FmeA+aBh
/XxuZEAKDWs=
=2T70
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2023-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc scheduler fixes from Ingo Molnar:
- Two EEVDF fixes: one to fix sysctl_sched_base_slice propagation, and
to fix an avg_vruntime() corner-case.
- A cpufreq frequency scaling fix
* tag 'sched-urgent-2023-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpufreq: schedutil: Update next_freq when cpufreq_limits change
sched/eevdf: Fix avg_vruntime()
sched/eevdf: Also update slice on placement
Multiple blocked tasks are printed when the system hangs. They may have
the same parent pid, but belong to different task groups.
Printing tgid lets users better know whether these tasks are from the same
task group or not.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230720080516.1515297-1-yajun.deng@linux.dev
The following commit:
9b3c4ab304 ("sched,rcu: Rework try_invoke_on_locked_down_task()")
... renamed try_invoke_on_locked_down_task() to task_call_func(),
but forgot to update the comment in try_to_wake_up().
But it turns out that the smp_rmb() doesn't live in task_call_func()
either, it was moved to __task_needs_rq_lock() in:
91dabf33ae ("sched: Fix race in task_call_func()")
Fix that now.
Also fix the s/smb/smp typo while at it.
Reported-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230731085759.11443-1-zhangqiao22@huawei.com
The recently added tcx attachment extended the BPF UAPI for attaching and
detaching by a couple of fields. Those fields are currently only supported
for tcx, other types like cgroups and flow dissector silently ignore the
new fields except for the new flags.
This is problematic once we extend bpf_mprog to older attachment types, since
it's hard to figure out whether the syscall really was successful if the
kernel silently ignores non-zero values.
Explicitly reject non-zero fields relevant to bpf_mprog for attachment types
which don't use the latter yet.
Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20231006220655.1653-3-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Improve consistency for bpf_mprog_query() API and let the latter also handle
a NULL entry as can be the case for tcx. Instead of returning -ENOENT, we
copy a count of 0 and revision of 1 to user space, so that this can be fed
into a subsequent bpf_mprog_attach() call as expected_revision. A BPF self-
test as part of this series has been added to assert this case.
Suggested-by: Lorenz Bauer <lmb@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20231006220655.1653-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
While working on the ebpf-go [0] library integration for bpf_mprog and tcx,
Lorenz noticed that two subsequent BPF_PROG_QUERY requests currently fail. A
typical workflow is to first gather the bpf_mprog count without passing program/
link arrays, followed by the second request which contains the actual array
pointers.
The initial call populates count and revision fields. The second call gets
rejected due to a BPF_PROG_QUERY_LAST_FIELD bug which should point to
query.revision instead of query.link_attach_flags since the former is really
the last member.
It was not noticed in libbpf as bpf_prog_query_opts() always calls bpf(2) with
an on-stack bpf_attr that is memset() each time (and therefore query.revision
was reset to zero).
[0] https://ebpf-go.dev
Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: Lorenz Bauer <lmb@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20231006220655.1653-1-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle [1], add __counted_by for struct bpf_stack_map.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1]
Link: https://lore.kernel.org/bpf/20231006201657.work.531-kees@kernel.org
This extends the current PR_SET_MDWE prctl arg with a bit to indicate that
the process doesn't want MDWE protection to propagate to children.
To implement this no-inherit mode, the tag in current->mm->flags must be
absent from MMF_INIT_MASK. This means that the encoding for "MDWE but
without inherit" is different in the prctl than in the mm flags. This
leads to a bit of bit-mangling in the prctl implementation.
Link: https://lkml.kernel.org/r/20230828150858.393570-6-revest@chromium.org
Signed-off-by: Florent Revest <revest@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexey Izbyshev <izbyshev@ispras.ru>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ayush Jain <ayush.jain3@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
Cc: Topi Miettinen <toiwoton@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The Energy Aware Scheduler (EAS) relies on the schedutil governor.
When moving to/from the schedutil governor, sched domains must be
rebuilt to allow re-evaluating the enablement conditions of EAS.
This is done through sched_cpufreq_governor_change().
Having a cpufreq governor assumes a cpufreq driver is running.
Inserting/removing a cpufreq driver should trigger a re-evaluation
of EAS enablement conditions, avoiding to see EAS enabled when
removing a running cpufreq driver.
Rebuild the sched domains in schedutil's sugov_init()/sugov_exit(),
allowing to check EAS's enablement condition whenever schedutil
governor is initialized/exited from.
Move relevant code up in schedutil.c to avoid a split and conditional
function declaration.
Rename sched_cpufreq_governor_change() to sugov_eas_rebuild_sd().
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The initialization code of the per-cpu sg_cpu struct is currently split
into two for-loop blocks. This can be simplified by merging the two
blocks into a single loop. This will make the code more maintainable.
Signed-off-by: Liao Chang <liaochang1@huawei.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When cpufreq's policy is 'single', there is a scenario that will
cause sg_policy's next_freq to be unable to update.
When the CPU's util is always max, the cpufreq will be max,
and then if we change the policy's scaling_max_freq to be a
lower freq, indeed, the sg_policy's next_freq need change to
be the lower freq, however, because the cpu_is_busy, the next_freq
would keep the max_freq.
For example:
The cpu7 is a single CPU:
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done& [1] 4737
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737
pid 4737's current affinity mask: ff
pid 4737's new affinity mask: 80
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
2301000
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq
2301000
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq
unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
2171000
At this time, the sg_policy's next_freq would stay at 2301000, which
is wrong.
To fix this, add a check for the ->need_freq_update flag.
[ mingo: Clarified the changelog. ]
Co-developed-by: Guohua Yan <guohua.yan@unisoc.com>
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Guohua Yan <guohua.yan@unisoc.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230719130527.8074-1-xuewen.yan@unisoc.com
I didn't collect precise data but feels like we've got a lot
of 6.5 fixes here. WiFi fixes are most user-awaited.
Current release - regressions:
- Bluetooth: fix hci_link_tx_to RCU lock usage
Current release - new code bugs:
- bpf: mprog: fix maximum program check on mprog attachment
- eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns()
Previous releases - regressions:
- ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling
- vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(),
it doesn't handle zero length like we expected
- wifi:
- cfg80211: fix cqm_config access race, fix crashes with brcmfmac
- iwlwifi: mvm: handle PS changes in vif_cfg_changed
- mac80211: fix mesh id corruption on 32 bit systems
- mt76: mt76x02: fix MT76x0 external LNA gain handling
- Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER
- l2tp: fix handling of transhdrlen in __ip{,6}_append_data()
- dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent
- eth: stmmac: fix the incorrect parameter after refactoring
Previous releases - always broken:
- net: replace calls to sock->ops->connect() with kernel_connect(),
prevent address rewrite in kernel_bind(); otherwise BPF hooks may
modify arguments, unexpectedly to the caller
- tcp: fix delayed ACKs when reads and writes align with MSS
- bpf:
- verifier: unconditionally reset backtrack_state masks on global
func exit
- s390: let arch_prepare_bpf_trampoline return program size,
fix struct_ops offsets
- sockmap: fix accounting of available bytes in presence of PEEKs
- sockmap: reject sk_msg egress redirects to non-TCP sockets
- ipv4/fib: send netlink notify when delete source address routes
- ethtool: plca: fix width of reads when parsing netlink commands
- netfilter: nft_payload: rebuild vlan header on h_proto access
- Bluetooth: hci_codec: fix leaking memory of local_codecs
- eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids
- eth: stmmac:
- dwmac-stm32: fix resume on STM32 MCU
- remove buggy and unneeded stmmac_poll_controller, depend on NAPI
- ibmveth: always recompute TCP pseudo-header checksum, fix use
of the driver with Open vSwitch
- wifi:
- rtw88: rtw8723d: fix MAC address offset in EEPROM
- mt76: fix lock dependency problem for wed_lock
- mwifiex: sanity check data reported by the device
- iwlwifi: ensure ack flag is properly cleared
- iwlwifi: mvm: fix a memory corruption due to bad pointer arithm
- iwlwifi: mvm: fix incorrect usage of scan API
Misc:
- wifi: mac80211: work around Cisco AP 9115 VHT MPDU length
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmUe/hkACgkQMUZtbf5S
Iru94hAAlkVsHgGy7T8Te23m5Q5v33ZRj+7hbjFBN+be2TJu8cWo9qL6jGvrxBP6
D0X32ZvoWtX95Ua023Ibs1WtEc6ebROctTuDZpoIW35MYamFz9fYecoY4i+t+m1+
6Y/UVRu4eHWXUtZclRQUEUdMe5HAJms9uNlIkTxLivvFhERgmGKtAsca8nuU9wMo
lJJUAFxf4wJKx/338AUaa0yfsg6cQcgRpbm1csAR9VSa3mU1PbrIYzf7fMbWjRET
6DiijheeClb7biF4ZKqqHgZcProwVFxoFCq1GH+PCzaw8K1eIkGwXlpVEW89lgsC
Ukc1L9JgLJIBSObvNWiO2mu5w+V89+XR2rL9KM3tW6x+k5tEncNL3WOphqH69NPS
gfizGlvXjP+2aCJgHovvlQEaBn/7xNYWTAbqh1jVDleUC95Ur8ap4X42YB/3QvPN
X9l8hp4Htu8SclqjXKtMz9qt6Ug5Uyi88o+1U53BNE6C6ICKW9i4uApT1DsLBAK8
QM5WPTj/ChIBbQu7HWNW+Ux3NX5R6fFzZ5BfKrjbuNEHQKRauj2300gVtU6xGS7T
IFWiu8i00T34aXF2Vnfykc0zNRylhw/DHqtJFUxmJQOBQgyKlkjYacf2cYru5lnR
BWA8Zsg5wpapT5CWSGlSRid4sRMtcDiMsI7fnIqB5CPhJGnR6eI=
=JAtc
-----END PGP SIGNATURE-----
Merge tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Bluetooth, netfilter, BPF and WiFi.
I didn't collect precise data but feels like we've got a lot of 6.5
fixes here. WiFi fixes are most user-awaited.
Current release - regressions:
- Bluetooth: fix hci_link_tx_to RCU lock usage
Current release - new code bugs:
- bpf: mprog: fix maximum program check on mprog attachment
- eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns()
Previous releases - regressions:
- ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling
- vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(), it
doesn't handle zero length like we expected
- wifi:
- cfg80211: fix cqm_config access race, fix crashes with brcmfmac
- iwlwifi: mvm: handle PS changes in vif_cfg_changed
- mac80211: fix mesh id corruption on 32 bit systems
- mt76: mt76x02: fix MT76x0 external LNA gain handling
- Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER
- l2tp: fix handling of transhdrlen in __ip{,6}_append_data()
- dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent
- eth: stmmac: fix the incorrect parameter after refactoring
Previous releases - always broken:
- net: replace calls to sock->ops->connect() with kernel_connect(),
prevent address rewrite in kernel_bind(); otherwise BPF hooks may
modify arguments, unexpectedly to the caller
- tcp: fix delayed ACKs when reads and writes align with MSS
- bpf:
- verifier: unconditionally reset backtrack_state masks on global
func exit
- s390: let arch_prepare_bpf_trampoline return program size, fix
struct_ops offsets
- sockmap: fix accounting of available bytes in presence of PEEKs
- sockmap: reject sk_msg egress redirects to non-TCP sockets
- ipv4/fib: send netlink notify when delete source address routes
- ethtool: plca: fix width of reads when parsing netlink commands
- netfilter: nft_payload: rebuild vlan header on h_proto access
- Bluetooth: hci_codec: fix leaking memory of local_codecs
- eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids
- eth: stmmac:
- dwmac-stm32: fix resume on STM32 MCU
- remove buggy and unneeded stmmac_poll_controller, depend on NAPI
- ibmveth: always recompute TCP pseudo-header checksum, fix use of
the driver with Open vSwitch
- wifi:
- rtw88: rtw8723d: fix MAC address offset in EEPROM
- mt76: fix lock dependency problem for wed_lock
- mwifiex: sanity check data reported by the device
- iwlwifi: ensure ack flag is properly cleared
- iwlwifi: mvm: fix a memory corruption due to bad pointer arithm
- iwlwifi: mvm: fix incorrect usage of scan API
Misc:
- wifi: mac80211: work around Cisco AP 9115 VHT MPDU length"
* tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
MAINTAINERS: update Matthieu's email address
mptcp: userspace pm allow creating id 0 subflow
mptcp: fix delegated action races
net: stmmac: remove unneeded stmmac_poll_controller
net: lan743x: also select PHYLIB
net: ethernet: mediatek: disable irq before schedule napi
net: mana: Fix oversized sge0 for GSO packets
net: mana: Fix the tso_bytes calculation
net: mana: Fix TX CQE error handling
netlink: annotate data-races around sk->sk_err
sctp: update hb timer immediately after users change hb_interval
sctp: update transport state when processing a dupcook packet
tcp: fix delayed ACKs for MSS boundary condition
tcp: fix quick-ack counting to count actual ACKs of new data
page_pool: fix documentation typos
tipc: fix a potential deadlock on &tx->lock
net: stmmac: dwmac-stm32: fix resume on STM32 MCU
ipv4: Set offload_failed flag in fibmatch results
netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
netfilter: nf_tables: Deduplicate nft_register_obj audit logs
...
The system_callback() function in trace_events.c is only used within that
file. The "static" annotation was missed.
Fixes: 5790b1fb3d ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310051743.y9EobbUr-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The update to removing the eventfs_file changed the way the events top
level directory was handled. Instead of returning a dentry, it now returns
the eventfs_inode. In this changed, the removing of the events top level
directory is not much different than removing any of the other
directories. Because of this, the removal just called eventfs_remove_dir()
instead of eventfs_remove_events_dir().
Although eventfs_remove_dir() does the clean up, it misses out on the
dget() of the ei->dentry done in eventfs_create_events_dir(). It makes
more sense to match eventfs_create_events_dir() with a specific function
eventfs_remove_events_dir() and this specific function can then perform
the dput() to the dentry that had the dget() when it was created.
Fixes: 5790b1fb3d ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310051743.y9EobbUr-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
We have the next_resource() is used once and no user for the
next_resource_skip_children() outside of the for_each_resource().
Unify them by adding skip_children parameter to the next_resource().
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20230912165312.402422-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We have a few places where for_each_resource() is open coded.
Replace that by the macro. This makes code easier to read and
understand.
With this, compile r_next() only for CONFIG_PROC_FS=y.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20230912165312.402422-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Instead of having a descriptor for every file represented in the eventfs
directory, only have the directory itself represented. Change the API to
send in a list of entries that represent all the files in the directory
(but not other directories). The entry list contains a name and a callback
function that will be used to create the files when they are accessed.
struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent,
const struct eventfs_entry *entries,
int size, void *data);
is used for the top level eventfs directory, and returns an eventfs_inode
that will be used by:
struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode *parent,
const struct eventfs_entry *entries,
int size, void *data);
where both of the above take an array of struct eventfs_entry entries for
every file that is in the directory.
The entries are defined by:
typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
const struct file_operations **fops);
struct eventfs_entry {
const char *name;
eventfs_callback callback;
};
Where the name is the name of the file and the callback gets called when
the file is being created. The callback passes in the name (in case the
same callback is used for multiple files), a pointer to the mode, data and
fops. The data will be pointing to the data that was passed in
eventfs_create_dir() or eventfs_create_events_dir() but may be overridden
to point to something else, as it will be used to point to the
inode->i_private that is created. The information passed back from the
callback is used to create the dentry/inode.
If the callback fills the data and the file should be created, it must
return a positive number. On zero or negative, the file is ignored.
This logic may also be used as a prototype to convert entire pseudo file
systems into just-in-time allocation.
The "show_events_dentry" file has been updated to show the directories,
and any files they have.
With just the eventfs_file allocations:
Before after deltas for meminfo (in kB):
MemFree: -14360
MemAvailable: -14260
Buffers: 40
Cached: 24
Active: 44
Inactive: 48
Inactive(anon): 28
Active(file): 44
Inactive(file): 20
Dirty: -4
AnonPages: 28
Mapped: 4
KReclaimable: 132
Slab: 1604
SReclaimable: 132
SUnreclaim: 1472
Committed_AS: 12
Before after deltas for slabinfo:
<slab>: <objects> [ * <size> = <total>]
ext4_inode_cache 27 [* 1184 = 31968 ]
extent_status 102 [* 40 = 4080 ]
tracefs_inode_cache 144 [* 656 = 94464 ]
buffer_head 39 [* 104 = 4056 ]
shmem_inode_cache 49 [* 800 = 39200 ]
filp -53 [* 256 = -13568 ]
dentry 251 [* 192 = 48192 ]
lsm_file_cache 277 [* 32 = 8864 ]
vm_area_struct -14 [* 184 = -2576 ]
trace_event_file 1748 [* 88 = 153824 ]
kmalloc-1k 35 [* 1024 = 35840 ]
kmalloc-256 49 [* 256 = 12544 ]
kmalloc-192 -28 [* 192 = -5376 ]
kmalloc-128 -30 [* 128 = -3840 ]
kmalloc-96 10581 [* 96 = 1015776 ]
kmalloc-64 3056 [* 64 = 195584 ]
kmalloc-32 1291 [* 32 = 41312 ]
kmalloc-16 2310 [* 16 = 36960 ]
kmalloc-8 9216 [* 8 = 73728 ]
Free memory dropped by 14,360 kB
Available memory dropped by 14,260 kB
Total slab additions in size: 1,771,032 bytes
With this change:
Before after deltas for meminfo (in kB):
MemFree: -12084
MemAvailable: -11976
Buffers: 32
Cached: 32
Active: 72
Inactive: 168
Inactive(anon): 176
Active(file): 72
Inactive(file): -8
Dirty: 24
AnonPages: 196
Mapped: 8
KReclaimable: 148
Slab: 836
SReclaimable: 148
SUnreclaim: 688
Committed_AS: 324
Before after deltas for slabinfo:
<slab>: <objects> [ * <size> = <total>]
tracefs_inode_cache 144 [* 656 = 94464 ]
shmem_inode_cache -23 [* 800 = -18400 ]
filp -92 [* 256 = -23552 ]
dentry 179 [* 192 = 34368 ]
lsm_file_cache -3 [* 32 = -96 ]
vm_area_struct -13 [* 184 = -2392 ]
trace_event_file 1748 [* 88 = 153824 ]
kmalloc-1k -49 [* 1024 = -50176 ]
kmalloc-256 -27 [* 256 = -6912 ]
kmalloc-128 1864 [* 128 = 238592 ]
kmalloc-64 4685 [* 64 = 299840 ]
kmalloc-32 -72 [* 32 = -2304 ]
kmalloc-16 256 [* 16 = 4096 ]
total = 721352
Free memory dropped by 12,084 kB
Available memory dropped by 11,976 kB
Total slab additions in size: 721,352 bytes
That's over 2 MB in savings per instance for free and available memory,
and over 1 MB in savings per instance of slab memory.
Link: https://lore.kernel.org/linux-trace-kernel/20231003184059.4924468e@gandalf.local.home
Link: https://lore.kernel.org/linux-trace-kernel/20231004165007.43d79161@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ajay Kaher <akaher@vmware.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The callbacks migration is performed through an explicit call from
the hotplug control CPU right after the death of the target CPU and
before proceeding with the CPUHP_ teardown functions.
This is unusual but necessary and yet uncommented. Summarize the reason
as explained in the changelog of:
a58163d8ca (rcu: Migrate callbacks earlier in the CPU-offline timeline)
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
rcu_report_dead() and rcutree_migrate_callbacks() have their headers in
rcupdate.h while those are pure rcutree calls, like the other CPU-hotplug
functions.
Also rcu_cpu_starting() and rcu_report_dead() have different naming
conventions while they mirror each other's effects.
Fix the headers and propose a naming that relates both functions and
aligns with the prefix of other rcutree CPU-hotplug functions.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Among the three CPU-hotplug teardown RCU callbacks, two of them early
exit if CONFIG_HOTPLUG_CPU=n, and one is left unchanged. In any case
all of them have an implementation when CONFIG_HOTPLUG_CPU=n.
Align instead with the common way to deal with CPU-hotplug teardown
callbacks and provide a proper stub when they are not supported.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Smatch complains about returning negative error codes from a type
bool function.
kernel/cgroup/cpuset.c:705 cpu_exclusive_check() warn:
signedness bug returning '(-22)'
The code works correctly, but it is confusing. The current behavior is
that cpu_exclusive_check() returns true if it's *NOT* exclusive. Rename
it to cpusets_are_exclusive() and reverse the returns so it returns true
if it is exclusive and false if it's not. Update both callers as well.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/r/202309201706.2LhKdM6o-lkp@intel.com/
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When a local partition becomes invalid, it won't transition back to
valid partition automatically if a proper "cpuset.cpus.exclusive" or
"cpuset.cpus" change is made. Instead, system administrators have to
explicitly echo "root" or "isolated" into the "cpuset.cpus.partition"
file at the partition root.
This patch now enables the automatic transition of an invalid local
partition back to valid when there is a proper "cpuset.cpus.exclusive"
or "cpuset.cpus" change.
Automatic transition of an invalid remote partition to a valid one,
however, is not covered by this patch. They still need an explicit
write to "cpuset.cpus.partition" to become valid again.
The test_cpuset_prs.sh test script is updated to add new test cases to
test this automatic state transition.
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Link: https://lore.kernel.org/lkml/9777f0d2-2fdf-41cb-bd01-19c52939ef42@arm.com
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We have a need of using favordynmods with cgroup v1, which doesn't support
changing mount flags during remount. Enabling CONFIG_CGROUP_FAVOR_DYNMODS at
build-time is not an option because we want to be able to selectively
enable it for certain systems.
This commit addresses this by introducing the cgroup_favordynmods=
command-line option. This option works for both cgroup v1 and v2 and also
allows for disabling favorynmods when the kernel built with
CONFIG_CGROUP_FAVOR_DYNMODS=y.
Also, note that when cgroup_favordynmods=true favordynmods is never
disabled in cgroup_destroy_root().
Signed-off-by: Luiz Capitulino <luizcap@amazon.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The following crash is observed 100% of the time during resume from
the hibernation on a x86 QEMU system.
[ 12.931887] ? __die_body+0x1a/0x60
[ 12.932324] ? page_fault_oops+0x156/0x420
[ 12.932824] ? search_exception_tables+0x37/0x50
[ 12.933389] ? fixup_exception+0x21/0x300
[ 12.933889] ? exc_page_fault+0x69/0x150
[ 12.934371] ? asm_exc_page_fault+0x26/0x30
[ 12.934869] ? get_buffer.constprop.0+0xac/0x100
[ 12.935428] snapshot_write_next+0x7c/0x9f0
[ 12.935929] ? submit_bio_noacct_nocheck+0x2c2/0x370
[ 12.936530] ? submit_bio_noacct+0x44/0x2c0
[ 12.937035] ? hib_submit_io+0xa5/0x110
[ 12.937501] load_image+0x83/0x1a0
[ 12.937919] swsusp_read+0x17f/0x1d0
[ 12.938355] ? create_basic_memory_bitmaps+0x1b7/0x240
[ 12.938967] load_image_and_restore+0x45/0xc0
[ 12.939494] software_resume+0x13c/0x180
[ 12.939994] resume_store+0xa3/0x1d0
The commit being fixed introduced a bug in copying the zero bitmap
to safe pages. A temporary bitmap is allocated with PG_ANY flag in
prepare_image() to make a copy of zero bitmap after the unsafe pages
are marked. Freeing this temporary bitmap with PG_UNSAFE_KEEP later
results in an inconsistent state of unsafe pages. Since free bit is
left as is for this temporary bitmap after free, these pages are
treated as unsafe pages when they are allocated again. This results
in incorrect calculation of the number of pages pre-allocated for the
image.
nr_pages = (nr_zero_pages + nr_copy_pages) - nr_highmem - allocated_unsafe_pages;
The allocate_unsafe_pages is estimated to be higher than the actual
which results in running short of pages in safe_pages_list. Hence the
crash is observed in get_buffer() due to NULL pointer access of
safe_pages_list.
Fix this issue by creating the temporary zero bitmap from safe pages
(free bit not set) so that the corresponding free bits can be cleared
while freeing this bitmap.
Fixes: 005e8dddd4 ("PM: hibernate: don't store zero pages in the image file")
Suggested-by:: Brian Geffon <bgeffon@google.com>
Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
So far, nobody calls functions parse_crashkernel_high() and
parse_crashkernel_low(), remove both of them.
Link: https://lkml.kernel.org/r/20230914033142.676708-10-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Jiahao <chenjiahao16@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Both crashk_res and crashk_low_res are used to mark the reserved
crashkernel regions in iomem_resource tree. And later the generic
crashkernel resrvation will be added into crash_core.c. So move
crashk_res and crashk_low_res definition into crash_core.c to avoid
compiling error if CONFIG_CRASH_CORE=on while CONFIG_KEXEC_CORE is unset.
Meanwhile include <asm/crash_core.h> in <linux/crash_core.h> if generic
reservation is needed. In that case, <asm/crash_core.h> need be added by
ARCH. In asm/crash_core.h, ARCH can provide its own macro definitions to
override macros in <linux/crash_core.h> if needed. Wrap the including
into CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION ifdeffery scope to
avoid compiling error in other ARCH-es which don't take the generic
reservation way yet.
Link: https://lkml.kernel.org/r/20230914033142.676708-6-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Jiahao <chenjiahao16@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In architecture like x86_64, arm64 and riscv, they have vast virtual
address space and usually have huge physical memory RAM. Their
crashkernel reservation doesn't have to be limited under 4G RAM, but can
be extended to the whole physical memory via crashkernel=,high support.
Now add function reserve_crashkernel_generic() to reserve crashkernel
memory if users specify any case of kernel pamameters, like
crashkernel=xM[@offset] or crashkernel=,high|low.
This is preparation to simplify code of crashkernel=,high support in
architecutures.
Link: https://lkml.kernel.org/r/20230914033142.676708-5-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Jiahao <chenjiahao16@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now parse_crashkernel() is a real entry point for all kinds of crahskernel
parsing on any architecture.
And wrap the crahskernel=,high|low handling inside
CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION ifdeffery scope.
Link: https://lkml.kernel.org/r/20230914033142.676708-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Jiahao <chenjiahao16@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add two parameters 'low_size' and 'high' to function parse_crashkernel(),
later crashkernel=,high|low parsing will be added. Make adjustments in
all call sites of parse_crashkernel() in arch.
Link: https://lkml.kernel.org/r/20230914033142.676708-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Jiahao <chenjiahao16@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kdump: use generic functions to simplify crashkernel
reservation in arch", v3.
In the current arm64, crashkernel=,high support has been finished after
several rounds of posting and careful reviewing. The code in arm64 which
parses crashkernel kernel parameters firstly, then reserve memory can be a
good example for other ARCH to refer to.
Whereas in x86_64, the code mixing crashkernel parameter parsing and
memory reserving is twisted, and looks messy. Refactoring the code to
make it more readable maintainable is necessary.
Here, firstly abstract the crashkernel parameter parsing code into
parse_crashkernel() to make it be able to parse crashkernel=,high|low.
Then abstract the crashkernel memory reserving code into a generic
function reserve_crashkernel_generic(). Finally, in ARCH which
crashkernel=,high support is needed, a simple arch_reserve_crashkernel()
can be added to call above two functions. This can remove the duplicated
implmentation code in each ARCH, like arm64, x86_64 and riscv.
crashkernel=512M,high
crashkernel=512M,high crashkernel=256M,low
crashkernel=512M,high crashkernel=0M,low
crashkernel=0M,high crashkernel=256M,low
crashkernel=512M
crashkernel=512M@0x4f000000
crashkernel=1G-4G:256M,4G-64G:320M,64G-:576M
crashkernel=0M
This patch (of 9):
In all call sites of __parse_crashkernel(), the parameter 'name' is
hardcoded as "crashkernel=". So remove the unnecessary parameter 'name',
add local varibale 'name' inside __parse_crashkernel() instead.
Link: https://lkml.kernel.org/r/20230914033142.676708-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20230914033142.676708-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Jiahao <chenjiahao16@huawei.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
commit 95846ecf9dac("pid: replace pid bitmap implementation with IDR API")
removes 'last_pid' element, and use the idr_get_cursor-idr_set_cursor pair
to set the value of idr, so useless comments should be removed.
Link: https://lkml.kernel.org/r/tencent_157A2A1CAF19A3F5885F0687426159A19708@qq.com
Signed-off-by: Rong Tao <rongtao@cestc.cn>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a kthread_stop_put() helper that stops a thread and puts its task
struct. Use it to replace the various instances of kthread_stop()
followed by put_task_struct().
Remove the kthread_stop_put() macro in usbip that is similar but doesn't
return the result of kthread_stop().
[agruenba@redhat.com: fix kerneldoc comment]
Link: https://lkml.kernel.org/r/20230911111730.2565537-1-agruenba@redhat.com
[akpm@linux-foundation.org: document kthread_stop_put()'s argument]
Link: https://lkml.kernel.org/r/20230907234048.2499820-1-agruenba@redhat.com
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
do/while_each_thread should be avoided when possible.
Plus I _think_ this change allows to avoid lock_task_sighand() but I am
not sure, I forgot everything about taskstats. In any case, this code
does not look right in that the same thread can be accounted twice:
taskstats_exit() can account the exiting thread in signal->stats and drop
->siglock but this thread is still on the thread-group list, so
lock_task_sighand() can't help.
Link: https://lkml.kernel.org/r/20230909214951.GA24274@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
do/while_each_thread should be avoided when possible.
Plus this change allows to avoid lock_task_sighand(), we can use rcu
and/or sig->stats_lock instead.
Link: https://lkml.kernel.org/r/20230909172629.GA20454@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
No functional changes, cleanup/preparation.
Link: https://lkml.kernel.org/r/20230909172554.GA20441@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
do/while_each_thread should be avoided when possible.
Link: https://lkml.kernel.org/r/20230909164537.GA11633@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use atomic_try_cmpxchg instead of atomic_cmpxchg (*ptr, old, new) == old
in panic() and nmi_panic(). x86 CMPXCHG instruction returns success in ZF
flag, so this change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).
Also, rename cpu variable to this_cpu in nmi_panic() and try to unify
logic flow between panic() and nmi_panic().
No functional change intended.
[ubizjak@gmail.com: clean up if/else block]
Link: https://lkml.kernel.org/r/20230906191200.68707-1-ubizjak@gmail.com
Link: https://lkml.kernel.org/r/20230904152230.9227-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
No need to calculate/check the "success" variable, we can kill it and update
retval in the main loop unless it is zero.
Link: https://lkml.kernel.org/r/20230823171455.GA12188@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: David Laight <David.Laight@ACULAB.COM>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The last user was removed by the previous patch.
Link: https://lkml.kernel.org/r/20230826111409.GA23243@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After move of Documentation/s390 to Documentation/arch/s390
Link: https://lkml.kernel.org/r/20230825013102.1487979-1-costa.shul@redhat.com
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Eric DeVolder <eric.devolder@oracle.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The feature got retired in f1a7941243 ("mm: convert mm's rss stats into
percpu_counter"), but the patch failed to fully clean it up.
Link: https://lkml.kernel.org/r/20230823170556.2281747-1-mjguzik@gmail.com
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
'top_waiter' is assigned unconditionally before first use,
so it does not need an initialization.
[ mingo: Created legible changelog. ]
Signed-off-by: Li zeming <zeming@nfschina.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230725195047.3106-1-zeming@nfschina.com
rcu_report_dead() has to be called locally by the CPU that is going to
exit the RCU state machine. Passing a cpu argument here is error-prone
and leaves the possibility for a racy remote call.
Use local access instead.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
rcu_report_dead() is the last RCU word from the CPU down through the
hotplug path. It is called in the idle loop right before the CPU shuts
down for good. Because it removes the CPU from the grace period state
machine and reports an ultimate quiescent state if necessary, no further
use of RCU is allowed. Therefore it is expected that IRQs are disabled
upon calling this function and are not to be re-enabled again until the
CPU shuts down.
Remove the IRQs disablement from that function and verify instead that
it is actually called with IRQs disabled as it is expected at that
special point in the idle path.
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This makes the code more readable.
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Since the actual slab freeing is deferred when calling kvfree_rcu(), so
is the kmemleak_free() callback informing kmemleak of the object
deletion. From the perspective of the kvfree_rcu() caller, the object is
freed and it may remove any references to it. Since kmemleak does not
scan RCU internal data storing the pointer, it will report such objects
as leaks during the grace period.
Tell kmemleak to ignore such objects on the kvfree_call_rcu() path. Note
that the tiny RCU implementation does not have such issue since the
objects can be tracked from the rcu_ctrlblk structure.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://lore.kernel.org/all/F903A825-F05F-4B77-A2B5-7356282FBA2C@apple.com/
Cc: <stable@vger.kernel.org>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZRqk1wAKCRDbK58LschI
g8GRAQC4E0bw6BTFRl0b3MxvpZES6lU0BUtX2gKVK4tLZdXw/wEAmTlBXQqNzF3b
BkCQknVbFTSw/8l8pzUW123Fb46wUAQ=
=E3hd
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-10-02
We've added 11 non-merge commits during the last 12 day(s) which contain
a total of 12 files changed, 176 insertions(+), 41 deletions(-).
The main changes are:
1) Fix BPF verifier to reset backtrack_state masks on global function
exit as otherwise subsequent precision tracking would reuse them,
from Andrii Nakryiko.
2) Several sockmap fixes for available bytes accounting,
from John Fastabend.
3) Reject sk_msg egress redirects to non-TCP sockets given this
is only supported for TCP sockets today, from Jakub Sitnicki.
4) Fix a syzkaller splat in bpf_mprog when hitting maximum program
limits with BPF_F_BEFORE directive, from Daniel Borkmann
and Nikolay Aleksandrov.
5) Fix BPF memory allocator to use kmalloc_size_roundup() to adjust
size_index for selecting a bpf_mem_cache, from Hou Tao.
6) Fix arch_prepare_bpf_trampoline return code for s390 JIT,
from Song Liu.
7) Fix bpf_trampoline_get when CONFIG_BPF_JIT is turned off,
from Leon Hwang.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Use kmalloc_size_roundup() to adjust size_index
selftest/bpf: Add various selftests for program limits
bpf, mprog: Fix maximum program check on mprog attachment
bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets
bpf, sockmap: Add tests for MSG_F_PEEK
bpf, sockmap: Do not inc copied_seq when PEEK flag set
bpf: tcp_read_skb needs to pop skb regardless of seq
bpf: unconditionally reset backtrack_state masks on global func exit
bpf: Fix tr dereferencing
selftests/bpf: Check bpf_cubic_acked() is called via struct_ops
s390/bpf: Let arch_prepare_bpf_trampoline return program size
====================
Link: https://lore.kernel.org/r/20231002113417.2309-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are several scenarios that have come up where having a user_event
persist even if the process that registered it exits. The main one is
having a daemon create events on bootup that shouldn't get deleted if
the daemon has to exit or reload. Another is within OpenTelemetry
exporters, they wish to potentially check if a user_event exists on the
system to determine if exporting the data out should occur. The
user_event in this case must exist even in the absence of the owning
process running (such as the above daemon case).
Expose the previously internal flag USER_EVENT_REG_PERSIST to user
processes. Upon register or delete of events with this flag, ensure the
user is perfmon_capable to prevent random user processes with access to
tracefs from creating events that persist after exit.
Link: https://lkml.kernel.org/r/20230912180704.1284-2-beaub@linux.microsoft.com
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
rb_insert_pages. x86 CMPXCHG instruction returns success in ZF flag,
so this change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).
No functional change intended.
Link: https://lore.kernel.org/linux-trace-kernel/20230914163420.12923-1-ubizjak@gmail.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The ring buffer of global_trace is set to the minimum size in
order to save memory on boot up and then it will be expand when
some trace feature enabled.
However currently operations under an instance can also cause
global_trace ring buffer being expanded, and the expanded memory
would be wasted if global_trace then not being used.
See following case, we enable 'sched_switch' event in instance 'A', then
ring buffer of global_trace is unexpectedly expanded to be 1410KB, also
the '(expanded: 1408)' from 'buffer_size_kb' of instance is confusing.
# cd /sys/kernel/tracing
# mkdir instances/A
# cat buffer_size_kb
7 (expanded: 1408)
# cat instances/A/buffer_size_kb
1410 (expanded: 1408)
# echo sched:sched_switch > instances/A/set_event
# cat buffer_size_kb
1410
# cat instances/A/buffer_size_kb
1410
To fix it, we can:
- Make 'ring_buffer_expanded' as a member of 'struct trace_array';
- Make 'ring_buffer_expanded' of instance is defaultly true,
global_trace is defaultly false;
- In order not to expose 'global_trace' outside of file
'kernel/trace/trace.c', introduce trace_set_ring_buffer_expanded()
to set 'ring_buffer_expanded' as 'true';
- Pass the expected trace_array to tracing_update_buffers().
Link: https://lore.kernel.org/linux-trace-kernel/20230906091837.3998020-1-zhengyejian1@huawei.com
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
commit 'be65de6b03aa ("fs: Remove dcookies support")' removed the
syscall definition for lookup_dcookie. However, syscall tables still
point to the old sys_lookup_dcookie() definition. Update syscall tables
of all architectures to directly point to sys_ni_syscall() instead.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org> # for perf
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The expectation is that placing a task at avg_vruntime() makes it
eligible. Turns out there is a corner case where this is not the case.
Specifically, avg_vruntime() relies on the fact that integer division
is a flooring function (eg. it discards the remainder). By this
property the value returned is slightly left of the true average.
However! when the average is a negative (relative to min_vruntime) the
effect is flipped and it becomes a ceil, with the result that the
returned value is just right of the average and thus not eligible.
Fixes: af4cf40470 ("sched/fair: Add cfs_rq::avg_vruntime")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tasks that never consume their full slice would not update their slice value.
This means that tasks that are spawned before the sysctl scaling keep their
original (UP) slice length.
Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230915124822.847197830@noisy.programming.kicks-ass.net
Update the checking of return values from debugfs_create_file()
and debugfs_create_dir() to use IS_ERR().
Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20230807121834.7438-1-atulpant.linux@gmail.com
The validation of the value written to sched_rt_period_us was broken
because:
- the sysclt_sched_rt_period is declared as unsigned int
- parsed by proc_do_intvec()
- the range is asserted after the value parsed by proc_do_intvec()
Because of this negative values written to the file were written into a
unsigned integer that were later on interpreted as large positive
integers which did passed the check:
if (sysclt_sched_rt_period <= 0)
return EINVAL;
This commit fixes the parsing by setting explicit range for both
perid_us and runtime_us into the sched_rt_sysctls table and processes
the values with proc_dointvec_minmax() instead.
Alternatively if we wanted to use full range of unsigned int for the
period value we would have to split the proc_handler and use
proc_douintvec() for it however even the
Documentation/scheduller/sched-rt-group.rst describes the range as 1 to
INT_MAX.
As far as I can tell the only problem this causes is that the sysctl
file allows writing negative values which when read back may confuse
userspace.
There is also a LTP test being submitted for these sysctl files at:
http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz
to issues which were introduced after 6.5.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZRmSDAAKCRDdBJ7gKXxA
jlSaAQCe3SnBdjRmuzbp5iIfNJOY7GXLN4NwMsArRUxRGY27IwD+KWhXZP/ydVnt
ZgS4x9rmarHuh5Pxds+6SRGhihRz/Ak=
=sf/5
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Fourteen hotfixes, eleven of which are cc:stable. The remainder
pertain to issues which were introduced after 6.5"
* tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Crash: add lock to serialize crash hotplug handling
selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error
mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified
mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions()
mm, memcg: reconsider kmem.limit_in_bytes deprecation
mm: zswap: fix potential memory corruption on duplicate store
arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries
mm: hugetlb: add huge page size param to set_huge_pte_at()
maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states
maple_tree: add mas_is_active() to detect in-tree walks
nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()
mm: abstract moving to the next PFN
mm: report success more often from filemap_map_folio_range()
fs: binfmt_elf_efpic: fix personality for ELF-FDPIC
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUZMIoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iCuw/+Mc2ScQK+Y2gQWzOsACMIm863CqnwYxAK
rzvny0wEiESHDcRGFC46Bv6Ru6BZr8tPrBbsWHWUJTx9dO4RVSUlT/DwoaehQEXb
hqFTmio3YA+yUVbz3oh3BiELkUBQ/Q3M33Z5DiMrB7fH9/e4Disuw3aZu1zZ/CqX
AGHvoL4TILNOQhDMAmHjDXDgp3HZUqCZIhteNbHVa5HJ7Bpal8xh3j73EZ8cYEoj
GWKlFvwSxQQmmCexBTmLxZN7O0guLv27qL3LtlhfCtCJH8Hb6yC14QsPs8zJShqN
TZ5su+meeJKkyE4y5fyhNxCKSmB/8x0fFf8+juQFNo+V73XcfMg/Ymz4mVNuhlxj
bGKTOqkZEGLwAWpopJFvJK+hnLU8PvqOrTkJJ14JisiXBEV5YRccUARVz9r+jv8V
iGu4xI1SQjuK+Jq7/yQscii4VpQdZ6cYrCBQJY84cYFc+jSevqu0QKs5fl4LV8Q5
s9TRnlMk+Eo/oIbRTNJCc2a4IVofRnIiAy5AbKy8UvSlul7UXkMKnXZCJKo5Pl16
jyQ59R4wXnGCi2SDwLUdJnghOgsL29MFSdMqcvNS1VI+bIR9l9mWVStjtyIY/dKE
JWNGNQPvDCxevjp28rEtIzPx1AKQszsAVUKzAANZb+kHtcC+YP1zQHVWJ0ZknG7N
B41vXGbRlmM=
=ByhJ
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix a RT tasks related lockup/live-lock during CPU offlining"
* tag 'sched-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Fix live lock between select_fallback_rq() and RT push
- Make sure 32 bit applications using user events have aligned access when
running on a 64 bit kernel.
- Add cond_resched in the loop that handles converting enums in print_fmt
string is trace events.
- Fix premature wake ups of polling processes in the tracing ring buffer. When
a task polls waiting for a percentage of the ring buffer to be filled, the
writer still will wake it up at every event. Add the polling's percentage to
the "shortest_full" list to tell the writer when to wake it up.
- For eventfs dir lookups on dynamic events, an event system's only event could
be removed, leaving its dentry with no children. This is totally legitimate.
But in eventfs_release() it must not access the children array, as it is only
allocated when the dentry has children.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZRiI2xQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlvoAQDKbevbqA0C8lEV1rbVh4Q9Rnq580rz
EAyEO/RrSOwE9AEA2z+Q597mDjEiqQBvqTjBkS+0xZ7AUQYZRWgTHRIbegg=
=tqOM
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Make sure 32-bit applications using user events have aligned access
when running on a 64-bit kernel.
- Add cond_resched in the loop that handles converting enums in
print_fmt string is trace events.
- Fix premature wake ups of polling processes in the tracing ring
buffer. When a task polls waiting for a percentage of the ring buffer
to be filled, the writer still will wake it up at every event. Add
the polling's percentage to the "shortest_full" list to tell the
writer when to wake it up.
- For eventfs dir lookups on dynamic events, an event system's only
event could be removed, leaving its dentry with no children. This is
totally legitimate. But in eventfs_release() it must not access the
children array, as it is only allocated when the dentry has children.
* tag 'trace-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Test for dentries array allocated in eventfs_release()
tracing/user_events: Align set_bit() address for all archs
tracing: relax trace_event_eval_update() execution with cond_resched()
ring-buffer: Update "shortest_full" in polling
All architectures should use a long aligned address passed to set_bit().
User processes can pass either a 32-bit or 64-bit sized value to be
updated when tracing is enabled when on a 64-bit kernel. Both cases are
ensured to be naturally aligned, however, that is not enough. The
address must be long aligned without affecting checks on the value
within the user process which require different adjustments for the bit
for little and big endian CPUs.
Add a compat flag to user_event_enabler that indicates when a 32-bit
value is being used on a 64-bit kernel. Long align addresses and correct
the bit to be used by set_bit() to account for this alignment. Ensure
compat flags are copied during forks and used during deletion clears.
Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com
Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/
Cc: stable@vger.kernel.org
Fixes: 7235759084 ("tracing/user_events: Use remote writes for event enablement")
Reported-by: Clément Léger <cleger@rivosinc.com>
Suggested-by: Clément Léger <cleger@rivosinc.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When kernel is compiled without preemption, the eval_map_work_func()
(which calls trace_event_eval_update()) will not be preempted up to its
complete execution. This can actually cause a problem since if another
CPU call stop_machine(), the call will have to wait for the
eval_map_work_func() function to finish executing in the workqueue
before being able to be scheduled. This problem was observe on a SMP
system at boot time, when the CPU calling the initcalls executed
clocksource_done_booting() which in the end calls stop_machine(). We
observed a 1 second delay because one CPU was executing
eval_map_work_func() and was not preempted by the stop_machine() task.
Adding a call to cond_resched() in trace_event_eval_update() allows
other tasks to be executed and thus continue working asynchronously
like before without blocking any pending task at boot time.
Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Clément Léger <cleger@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
It was discovered that the ring buffer polling was incorrectly stating
that read would not block, but that's because polling did not take into
account that reads will block if the "buffer-percent" was set. Instead,
the ring buffer polling would say reads would not block if there was any
data in the ring buffer. This was incorrect behavior from a user space
point of view. This was fixed by commit 42fb0a1e84 by having the polling
code check if the ring buffer had more data than what the user specified
"buffer percent" had.
The problem now is that the polling code did not register itself to the
writer that it wanted to wait for a specific "full" value of the ring
buffer. The result was that the writer would wake the polling waiter
whenever there was a new event. The polling waiter would then wake up, see
that there's not enough data in the ring buffer to notify user space and
then go back to sleep. The next event would wake it up again.
Before the polling fix was added, the code would wake up around 100 times
for a hackbench 30 benchmark. After the "fix", due to the constant waking
of the writer, it would wake up over 11,0000 times! It would never leave
the kernel, so the user space behavior was still "correct", but this
definitely is not the desired effect.
To fix this, have the polling code add what it's waiting for to the
"shortest_full" variable, to tell the writer not to wake it up if the
buffer is not as full as it expects to be.
Note, after this fix, it appears that the waiter is now woken up around 2x
the times it was before (~200). This is a tremendous improvement from the
11,000 times, but I will need to spend some time to see why polling is
more aggressive in its wakeups than the read blocking code.
Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: 42fb0a1e84 ("tracing/ring-buffer: Have polling block on watermark")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Tested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- fix the narea calculation in swiotlb initialization (Ross Lagerwall)
- fix the check whether a device has used swiotlb (Petr Tesarik)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmUYWTULHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMilA/8DomLRCrDy792MoRvBCThWaY6auW4bCUoY7S6VZe7
LVECKAJRwFH7b+dk1mNjPhWTyq8/wB7kr/OLuU9HDcVIeiP9zxks4BMQ4RGc/ZuK
rSZ5ZHPVVCC4EOI3ncjQXrODwgGkGUvtdriByCtX2r4NLBjO1T0vUQB4bLyBTZf+
GnTCLxCkSrIxaqniRvM0K34yO/0rq0ci5840MNneR7MKQkVqPUDY83sHwL1KcQPf
s16lwclQdjZdOVpFMPxFin5NpvPIrjdrvhoaxdnz+8ZuwSACqRUZDQuNlZ3+Zep6
iaynNR04o0c2p0PTT5l3ZRD5vsyCjvc+/3kB3KlM33XbBArWi6XV+694QQn59JnZ
5MmHoIulwZGLsIlTG188QreZBlLrmxylUX311Kot5ood/HW8DsYbTo/krbiiUgEk
MXKWq9k6cQOdhgriS4zxvUl+xkjby12jvSFxv9tN3HHvFsFB8+veVrTuLZzEDXpX
a5PrmI/dcQmlVpCZllzVzeTgL2KeE1Jo0uRZ1vXhuoX8IBys4/TstIXOB4jnyVb4
kzrHbLIoVqLSN42eVMRKBrqGXGlZSWETBpkdSQ41St6t/3MurhKAWZ/1SFPXlI06
SnatIdOU7nSZRofK8/Xe1CnWia5NUpyUQpb+tLUHTgo4kZzGV330bf34iJ+BvC1h
aks=
=bhgc
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.6-2023-09-30' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- fix the narea calculation in swiotlb initialization (Ross Lagerwall)
- fix the check whether a device has used swiotlb (Petr Tesarik)
* tag 'dma-mapping-6.6-2023-09-30' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: fix the check whether a device has used software IO TLB
swiotlb: use the calculated number of areas
Commit d52b59315b ("bpf: Adjust size_index according to the value of
KMALLOC_MIN_SIZE") uses KMALLOC_MIN_SIZE to adjust size_index, but as
reported by Nathan, the adjustment is not enough, because
__kmalloc_minalign() also decides the minimal alignment of slab object
as shown in new_kmalloc_cache() and its value may be greater than
KMALLOC_MIN_SIZE (e.g., 64 bytes vs 8 bytes under a riscv QEMU VM).
Instead of invoking __kmalloc_minalign() in bpf subsystem to find the
maximal alignment, just using kmalloc_size_roundup() directly to get the
corresponding slab object size for each allocation size. If these two
sizes are unmatched, adjust size_index to select a bpf_mem_cache with
unit_size equal to the object_size of the underlying slab cache for the
allocation size.
Fixes: 822fb26bdb ("bpf: Add a hint to allocated objects.")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/bpf/20230914181407.GA1000274@dev-arch.thelio-3990X/
Signed-off-by: Hou Tao <houtao1@huawei.com>
Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
Link: https://lore.kernel.org/r/20230928101558.2594068-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eric reported that handling corresponding crash hotplug event can be
failed easily when many memory hotplug event are notified in a short
period. They failed because failing to take __kexec_lock.
=======
[ 78.714569] Fallback order for Node 0: 0
[ 78.714575] Built 1 zonelists, mobility grouping on. Total pages: 1817886
[ 78.717133] Policy zone: Normal
[ 78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[ 78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[ 80.056643] PEFILE: Unsigned PE binary
=======
The memory hotplug events are notified very quickly and very many, while
the handling of crash hotplug is much slower relatively. So the atomic
variable __kexec_lock and kexec_trylock() can't guarantee the
serialization of crash hotplug handling.
Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug
handling specifically. This doesn't impact the usage of __kexec_lock.
Link: https://lkml.kernel.org/r/20230926120905.392903-1-bhe@redhat.com
Fixes: 2472627561 ("crash: add generic infrastructure for crash hotplug support")
Signed-off-by: Baoquan He <bhe@redhat.com>
Tested-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Inspired by recent enhancements to comments in kernel/time/tick-sched.c,
go through the entire file and fix/unify its comments:
- Fix over a dozen typos, spelling mistakes & cases of bad grammar.
- Re-phrase sentences that I needed to read three times to understand.
[ I used the following arbitrary rule-of-thumb:
- if I had to read a comment twice, it was usually my fault,
- if I had to read it a third time, it was the comment's fault. ]
- Comma updates:
- Add commas where needed
- Remove commas where not needed
- In cases where a comma is optional, choose one variant and try to
standardize it over similar sentences in the file.
- Standardize on standalone 'NOHZ' spelling in free-flowing comments:
s/nohz/NOHZ
s/no idle tick/NOHZ
Still keep 'dynticks' as a popular synonym.
- Standardize on referring to variable names within free-flowing
comments with the "'var'" nomenclature, and function names as
"function_name()".
- Standardize on '64-bit' and '32-bit':
s/32bit/32-bit
s/64bit/64-bit
- Standardize on 'IRQ work':
s/irq work/IRQ work
- A few other tidyups I probably missed to list.
No change in functionality intended - other than one small change to
a syslog output string.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/ZRVCNeMcSQcXS36N@gmail.com
atomic_t variables are currently used to implement reference counters
with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows and
underflows. This is important since overflows and underflows can lead
to use-after-free situation and be exploitable.
The variable group_info.usage is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
**Important note for maintainers:
Some functions from refcount_t API defined in refcount.h have different
memory ordering guarantees than their atomic counterparts. Please check
Documentation/core-api/refcount-vs-atomic.rst for more information.
Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in some
rare cases it might matter. Please double check that you don't have
some undocumented memory guarantees for this variable usage.
For the group_info.usage it might make a difference in following places:
- put_group_info(): decrement in refcount_dec_and_test() only
provides RELEASE ordering and ACQUIRE ordering on success vs. fully
ordered atomic counterpart
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Link: https://lore.kernel.org/r/20230818041456.gonna.009-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Rename unqueue_multiple() as futex_unqueue_multiple(), and make both
that and futex_wait_multiple_setup() available for external users. This
is in preparation for wiring up vectored waits in io_uring.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To make it more generically useful, augment it with allowing the caller
to pass in the wake handler and wake data. Convert the futex_waitv()
syscall, passing in the default handlers.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With handling multiple futex_q for waitv, we cannot easily go from the
futex_q to data related to that request or queue. Add a wake_data
argument that belongs to the wake handler assigned.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the unqueue and lock_ptr clear into a helper that futex_wake_mark()
calls. Add it to the public functions as well, in preparation for using
it outside the core futex code.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for having another waker that isn't futex_wake_mark(),
add a wake handler in futex_q. No extra data is associated with the
handler outside of struct futex_q itself. futex_wake_mark() is defined as
the standard wakeup helper, now set through futex_q_init like other
defaults.
Normal sync futex waiting relies on wake_q holding tasks that should
be woken up. This is what futex_wake_mark() does, it'll unqueue the
futex and add the associated task to the wake queue. For async usage of
futex waiting, rather than having tasks sleeping on the futex, we'll
need to deal with a futex wake differently. For the planned io_uring
case, that means posting a completion event for the task in question.
Having a definable wake handler can help support that use case.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need this for validating the futex2 flags outside of the normal
futex syscalls.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It was useful to track feec() placement decision and debug the spare
capacity and optimization issues vs uclamp_max.
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230916232955.2099394-4-qyousef@layalina.io