inet6_sk(sk)->ipv6_ac_list is protected by lock_sock().
In ipv6_sock_ac_join(), only __dev_get_by_index(), __dev_get_by_flags(),
and __in6_dev_get() require RTNL.
__dev_get_by_flags() is only used by ipv6_sock_ac_join() and can be
converted to RCU version.
Let's replace RCU version helper and drop RTNL from IPV6_JOIN_ANYCAST.
setsockopt_needs_rtnl() will be removed in the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-15-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The next patch will replace __dev_get_by_index() and __dev_get_by_flags()
to RCU + refcount version.
Then, we will need to call dev_put() in some error paths.
Let's unify two error paths to make the next patch cleaner.
Note that we add READ_ONCE() for net->ipv6.devconf_all->forwarding
and idev->conf.forwarding as we will drop RTNL that protects them.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-14-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet6_sk(sk)->ipv6_ac_list is protected by lock_sock().
In ipv6_sock_ac_drop() and ipv6_sock_ac_close(),
only __dev_get_by_index() and __in6_dev_get() requrie RTNL.
Let's replace them with dev_get_by_index() and in6_dev_get()
and drop RTNL from IPV6_LEAVE_ANYCAST and IPV6_ADDRFORM.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-13-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet6_dev->ac_list is protected by inet6_dev->lock, so rtnl_dereference()
is a bit rough annotation.
As done in mcast.c, we can use ac_dereference() that checks if
inet6_dev->lock is held.
Let's replace rtnl_dereference() with a new helper ac_dereference().
Note that now addrconf_join_solict() / addrconf_leave_solict() in
__ipv6_dev_ac_inc() / __ipv6_dev_ac_dec() does not need RTNL, so we
can remove ASSERT_RTNL() there.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-12-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now, RTNL is not needed for mcast code, and what's commented in
ip6_mc_msfget() is apparent by for_each_pmc_socklock(), which has
lockdep annotation for lock_sock().
Let's remove the comment and ASSERT_RTNL() in ipv6_mc_rejoin_groups().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-11-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In ip6_mc_source() and ip6_mc_msfilter(), per-socket mld data is
protected by lock_sock() and inet6_dev->mc_lock is also held for
some per-interface functions.
ip6_mc_find_dev_rtnl() only depends on RTNL. If we want to remove
it, we need to check inet6_dev->dead under mc_lock to close the race
with addrconf_ifdown(), as mentioned earlier.
Let's do that and drop RTNL for the rest of MCAST_ socket options.
Note that ip6_mc_msfilter() has unnecessary lock dances and they
are integrated into one to avoid the last-minute error and simplify
the error handling.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-10-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In __ipv6_sock_mc_close(), per-socket mld data is protected by lock_sock(),
and only __dev_get_by_index() and __in6_dev_get() require RTNL.
Let's call __ipv6_sock_mc_drop() and drop RTNL in ipv6_sock_mc_close().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-9-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In __ipv6_sock_mc_drop(), per-socket mld data is protected by lock_sock(),
and only __dev_get_by_index() and __in6_dev_get() require RTNL.
Let's use dev_get_by_index() and in6_dev_get() and drop RTNL for
IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP.
Note that __ipv6_sock_mc_drop() is factorised to reuse in the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-8-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In __ipv6_sock_mc_join(), per-socket mld data is protected by lock_sock(),
and only __dev_get_by_index() requires RTNL.
Let's use dev_get_by_index() and drop RTNL for IPV6_ADD_MEMBERSHIP and
MCAST_JOIN_GROUP.
Note that we must call rt6_lookup() and dev_hold() under RCU.
If rt6_lookup() returns an entry from the exception table, dst_dev_put()
could change rt->dev.dst to loopback concurrently, and the original device
could lose the refcount before dev_hold() and unblock device registration.
dst_dev_put() is called from NETDEV_UNREGISTER and synchronize_net() follows
it, so as long as rt6_lookup() and dev_hold() are called within the same
RCU critical section, the dev is alive.
Even if the race happens, they are synchronised by idev->dead and mcast
addresses are cleaned up.
For the racy access to rt->dst.dev, we use dst_dev().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-7-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As well as __ipv6_dev_mc_inc(), all code in __ipv6_dev_mc_dec() are
protected by inet6_dev->mc_lock, and RTNL is not needed.
Let's use in6_dev_get() in ipv6_dev_mc_dec() and remove ASSERT_RTNL()
in __ipv6_dev_mc_dec().
Now, we can remove the RTNL comment above addrconf_leave_solict() too.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-6-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 63ed8de4be ("mld: add mc_lock for protecting per-interface
mld data"), the newly allocated struct ifmcaddr6 cannot be removed until
inet6_dev->mc_lock is released, so mca_get() and mc_put() are unnecessary.
Let's remove the extra refcounting.
Note that mca_get() was only used in __ipv6_dev_mc_inc().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-5-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 63ed8de4be ("mld: add mc_lock for protecting
per-interface mld data"), every multicast resource is protected
by inet6_dev->mc_lock.
RTNL is unnecessary in terms of protection but still needed for
synchronisation between addrconf_ifdown() and __ipv6_dev_mc_inc().
Once we removed RTNL, there would be a race below, where we could
add a multicast address to a dead inet6_dev.
CPU1 CPU2
==== ====
addrconf_ifdown() __ipv6_dev_mc_inc()
if (idev->dead) <-- false
dead = true return -ENODEV;
ipv6_mc_destroy_dev() / ipv6_mc_down()
mutex_lock(&idev->mc_lock)
...
mutex_unlock(&idev->mc_lock)
mutex_lock(&idev->mc_lock)
...
mutex_unlock(&idev->mc_lock)
The race window can be easily closed by checking inet6_dev->dead
under inet6_dev->mc_lock in __ipv6_dev_mc_inc() as addrconf_ifdown()
will acquire it after marking inet6_dev dead.
Let's check inet6_dev->dead under mc_lock in __ipv6_dev_mc_inc().
Note that now __ipv6_dev_mc_inc() no longer depends on RTNL and
we can remove ASSERT_RTNL() there and the RTNL comment above
addrconf_join_solict().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-4-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 63ed8de4be ("mld: add mc_lock for protecting per-interface
mld data") added the same comments regarding locking to many functions.
Let's replace the comments with lockdep annotation, which is more helpful.
Note that we just remove the comment for mld_clear_zeros() and
mld_send_cr(), where mc_dereference() is used in the entry of the
function.
While at it, a comment for __ipv6_sock_mc_join() is moved back to the
correct place.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-3-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv6_dev_mc_{inc,dec}() has the same check.
Let's remove __in6_dev_get() from pndisc_constructor() and
pndisc_destructor().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-2-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For afxdp, the return value of sendto() syscall doesn't reflect how many
descs handled in the kernel. One of use cases is that when user-space
application tries to know the number of transmitted skbs and then decides
if it continues to send, say, is it stopped due to max tx budget?
The following formular can be used after sending to learn how many
skbs/descs the kernel takes care of:
tx_queue.consumers_before - tx_queue.consumers_after
Prior to the current patch, in non-zc mode, the consumer of tx queue is
not immediately updated at the end of each sendto syscall when error
occurs, which leads to the consumer value out-of-dated from the perspective
of user space. So this patch requires store operation to pass the cached
value to the shared value to handle the problem.
More than those explicit errors appearing in the while() loop in
__xsk_generic_xmit(), there are a few possible error cases that might
be neglected in the following call trace:
__xsk_generic_xmit()
xskq_cons_peek_desc()
xskq_cons_read_desc()
xskq_cons_is_valid_desc()
It will also cause the premature exit in the while() loop even if not
all the descs are consumed.
Based on the above analysis, using @sent_frame could cover all the possible
cases where it might lead to out-of-dated global state of consumer after
finishing __xsk_generic_xmit().
The patch also adds a common helper __xsk_tx_release() to keep align
with the zc mode usage in xsk_tx_release().
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250703141712.33190-2-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have an application that uses almost the same code for TCP and
AF_UNIX (SOCK_STREAM).
TCP can use TCP_INQ, but AF_UNIX doesn't have it and requires an
extra syscall, ioctl(SIOCINQ) or getsockopt(SO_MEMINFO) as an
alternative.
Let's introduce the generic version of TCP_INQ.
If SO_INQ is enabled, recvmsg() will put a cmsg of SCM_INQ that
contains the exact value of ioctl(SIOCINQ). The cmsg is also
included when msg->msg_get_inq is non-zero to make sockets
io_uring-friendly.
Note that SOCK_CUSTOM_SOCKOPT is flagged only for SOCK_STREAM to
override setsockopt() for SOL_SOCKET.
By having the flag in struct unix_sock, instead of struct sock, we
can later add SO_INQ support for TCP and reuse tcp_sk(sk)->recvmsg_inq.
Note also that supporting custom getsockopt() for SOL_SOCKET will need
preparation for other SOCK_CUSTOM_SOCKOPT users (UDP, vsock, MPTCP).
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In unix_stream_read_generic(), state->msg is fetched multiple times.
Let's cache it in a local variable.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Compared to TCP, ioctl(SIOCINQ) for AF_UNIX SOCK_STREAM socket is more
expensive, as unix_inq_len() requires iterating through the receive queue
and accumulating skb->len.
Let's cache the value for SOCK_STREAM to a new field during sendmsg()
and recvmsg().
The field is protected by the receive queue lock.
Note that ioctl(SIOCINQ) for SOCK_DGRAM returns the length of the first
skb in the queue.
SOCK_SEQPACKET still requires iterating through the queue because we do
not touch functions shared with unix_dgram_ops. But, if really needed,
we can support it by switching __skb_try_recv_datagram() to a custom
version.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
unix_stream_read_skb() calls skb_recv_datagram() with MSG_DONTWAIT,
which is mostly equivalent to sock_error(sk) + skb_dequeue().
In the following patch, we will add a new field to cache the number
of bytes in the receive queue. Then, we want to avoid introducing
atomic ops in the fast path, so we will reuse the receive queue lock.
As a preparation for the change, let's not use skb_recv_datagram()
in unix_stream_read_skb().
Note that sock_error() is now moved out of the u->iolock mutex as
the mutex does not synchronise the peer's close() at all.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
unix_stream_read_skb() checks SOCK_DEAD only when the dequeued skb is
OOB skb.
unix_stream_read_skb() is called for a SOCK_STREAM socket in SOCKMAP
when data is sent to it.
The function is invoked via sk_psock_verdict_data_ready(), which is
set to sk->sk_data_ready().
During sendmsg(), we check if the receiver has SOCK_DEAD, so there
is no point in checking it again later in ->read_skb().
Also, unix_read_skb() for SOCK_DGRAM does not have the test either.
Let's remove the SOCK_DEAD test in unix_stream_read_skb().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When __skb_try_recv_datagram() returns NULL in __unix_dgram_recvmsg(),
we hold unix_state_lock() unconditionally.
This is because SOCK_SEQPACKET sk needs to return EOF in case its peer
has been close()d concurrently.
This behaviour totally depends on the timing of the peer's close() and
reading sk->sk_shutdown, and taking the lock does not play a role.
Let's drop the lock from __unix_dgram_recvmsg() and use READ_ONCE().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Under some circumstances, the compiler will emit the following warning for
rxrpc_send_response():
net/rxrpc/output.c: In function 'rxrpc_send_response':
net/rxrpc/output.c:974:1: warning: the frame size of 1160 bytes is larger than 1024 bytes
This occurs because the local variables include a 16-element scatterlist
array and a 16-element bio_vec array. It's probably not actually a problem
as this function is only called by the rxrpc I/O thread function in a
kernel thread and there won't be much on the stack before it.
Fix this by overlaying the bio_vec array over the kvec array in the
rxrpc_local struct. There is one of these per I/O thread and the kvec
array is intended for pointing at bits of a packet to be transmitted,
typically a DATA or an ACK packet. As packets for a local endpoint are
only transmitted by its specific I/O thread, there can be no race, and so
overlaying this bit of memory should be no problem.
Fixes: 5800b1cf3f ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506240423.E942yKJP-lkp@intel.com/
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250707102435.2381045-2-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that we don't have the compat code we can reduce the indent
a little. No functional changes.
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20250707184115.2285277-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All drivers are now converted to dedicated _rxfh_context ops.
Remove the use of >set_rxfh() to manage additional contexts.
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20250707184115.2285277-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As RACK-TLP was published as a standards-track RFC8985,
so the outdated ref draft-ietf-tcpm-rack need to be updated.
Signed-off-by: Xin Guo <guoxin0309@gmail.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250705163647.301231-1-guoxin0309@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Refine qdisc_pkt_len_init to include headers up through
the inner transport header when computing header size
for encapsulations. Also refine net/sched/sch_cake.c
borrowed from qdisc_pkt_len_init().
Signed-off-by: Fengyuan Gong <gfengyuan@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250702160741.1204919-1-gfengyuan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Support returning VMADDR_CID_LOCAL in case no other vsock transport is
available.
Fixes: 0e12190578 ("vsock: add local transport support in the vsock core")
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250703-vsock-transports-toctou-v4-3-98f0eb530747@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Transport assignment may race with module unload. Protect new_transport
from becoming a stale pointer.
This also takes care of an insecure call in vsock_use_local_transport();
add a lockdep assert.
BUG: unable to handle page fault for address: fffffbfff8056000
Oops: Oops: 0000 [#1] SMP KASAN
RIP: 0010:vsock_assign_transport+0x366/0x600
Call Trace:
vsock_connect+0x59c/0xc40
__sys_connect+0xe8/0x100
__x64_sys_connect+0x6e/0xc0
do_syscall_64+0x92/0x1c0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Fixes: c0cfa2d8a7 ("vsock: add multi-transports support")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250703-vsock-transports-toctou-v4-2-98f0eb530747@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
vsock_find_cid() and vsock_dev_do_ioctl() may race with module unload.
transport_{g2h,h2g} may become NULL after the NULL check.
Introduce vsock_transport_local_cid() to protect from a potential
null-ptr-deref.
KASAN: null-ptr-deref in range [0x0000000000000118-0x000000000000011f]
RIP: 0010:vsock_find_cid+0x47/0x90
Call Trace:
__vsock_bind+0x4b2/0x720
vsock_bind+0x90/0xe0
__sys_bind+0x14d/0x1e0
__x64_sys_bind+0x6e/0xc0
do_syscall_64+0x92/0x1c0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
KASAN: null-ptr-deref in range [0x0000000000000118-0x000000000000011f]
RIP: 0010:vsock_dev_do_ioctl.isra.0+0x58/0xf0
Call Trace:
__x64_sys_ioctl+0x12d/0x190
do_syscall_64+0x92/0x1c0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Fixes: c0cfa2d8a7 ("vsock: add multi-transports support")
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250703-vsock-transports-toctou-v4-1-98f0eb530747@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since its introduction in commit 6fa01ccd88 ("skbuff: Add pskb_extract()
helper function"), pskb_carve_frag_list() never used the argument @skb.
Drop it and adapt the only caller.
No functional change intended.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since its introduction in commit ce098da149 ("skbuff: Introduce
slab_build_skb()"), __slab_build_skb() never used the @skb argument. Remove
it and adapt both callers.
No functional change intended.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since its introduction in commit 2e910b9532 ("net: Add a function to
splice pages into an skbuff for MSG_SPLICE_PAGES"), skb_splice_from_iter()
never used the @gfp argument. Remove it and adapt callers.
No functional change intended.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250702-splice-drop-unused-v3-2-55f68b60d2b7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 41c73a0d44 ("net: speedup skb_splice_bits()"),
__splice_segment() and spd_fill_page() do not use the @pipe argument. Drop
it.
While adapting the callers, move one line to enforce reverse xmas tree
order.
No functional change intended.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250702-splice-drop-unused-v3-1-55f68b60d2b7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller reported a bug [1] where sk->sk_forward_alloc can overflow.
When we send data, if an skb exists at the tail of the write queue, the
kernel will attempt to append the new data to that skb. However, the code
that checks for available space in the skb is flawed:
'''
copy = size_goal - skb->len
'''
The types of the variables involved are:
'''
copy: ssize_t (s64 on 64-bit systems)
size_goal: int
skb->len: unsigned int
'''
Due to C's type promotion rules, the signed size_goal is converted to an
unsigned int to match skb->len before the subtraction. The result is an
unsigned int.
When this unsigned int result is then assigned to the s64 copy variable,
it is zero-extended, preserving its non-negative value. Consequently, copy
is always >= 0.
Assume we are sending 2GB of data and size_goal has been adjusted to a
value smaller than skb->len. The subtraction will result in copy holding a
very large positive integer. In the subsequent logic, this large value is
used to update sk->sk_forward_alloc, which can easily cause it to overflow.
The syzkaller reproducer uses TCP_REPAIR to reliably create this
condition. However, this can also occur in real-world scenarios. The
tcp_bound_to_half_wnd() function can also reduce size_goal to a small
value. This would cause the subsequent tcp_wmem_schedule() to set
sk->sk_forward_alloc to a value close to INT_MAX. Further memory
allocation requests would then cause sk_forward_alloc to wrap around and
become negative.
[1]: https://syzkaller.appspot.com/bug?extid=de6565462ab540f50e47
Reported-by: syzbot+de6565462ab540f50e47@syzkaller.appspotmail.com
Fixes: 270a1c3de4 ("tcp: Support MSG_SPLICE_PAGES")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20250707054112.101081-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new netlink parameter 'HANDSHAKE_A_ACCEPT_KEYRING' to provide
the serial number of the keyring to use.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250701144657.104401-1-hare@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ADDRLABEL only works when it was set in compilation phase. Replace it with
net_dbg_ratelimited().
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702104417.1526138-1-wangliang74@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This reverts commit f75a2804da.
With all states (whether user or kern) removed from the hashtables
during deletion, there's no need for synchronous destruction of
states. xfrm6_tunnel states still need to have been destroyed (which
will be the case when its last user is deleted (not destroyed)) so
that xfrm6_tunnel_free_spi removes it from the per-netns hashtable
before the netns is destroyed.
This has the benefit of skipping one synchronize_rcu per state (in
__xfrm_state_destroy(sync=true)) when we exit a netns.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The ipcomp fallback tunnels currently get deleted (from the various
lists and hashtables) as the last user state that needed that fallback
is destroyed (not deleted). If a reference to that user state still
exists, the fallback state will remain on the hashtables/lists,
triggering the WARN in xfrm_state_fini. Because of those remaining
references, the fix in commit f75a2804da ("xfrm: destroy xfrm_state
synchronously on net exit path") is not complete.
We recently fixed one such situation in TCP due to defered freeing of
skbs (commit 9b6412e697 ("tcp: drop secpath at the same time as we
currently drop dst")). This can also happen due to IP reassembly: skbs
with a secpath remain on the reassembly queue until netns
destruction. If we can't guarantee that the queues are flushed by the
time xfrm_state_fini runs, there may still be references to a (user)
xfrm_state, preventing the timely deletion of the corresponding
fallback state.
Instead of chasing each instance of skbs holding a secpath one by one,
this patch fixes the issue directly within xfrm, by deleting the
fallback state as soon as the last user state depending on it has been
deleted. Destruction will still happen when the final reference is
dropped.
A separate lockdep class for the fallback state is required since
we're going to lock x->tunnel while x is locked.
Fixes: 9d4139c769 ("netns xfrm: per-netns xfrm_state_all list")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
tc_action_net_exit() got an rtnl exclusion in commit
a159d3c4b8 ("net_sched: acquire RTNL in tc_action_net_exit()")
Since then, commit 16af606739 ("net: sched: implement reference
counted action release") made this RTNL exclusion obsolete for
most cases.
Only tcf_action_offload_del() might still require it.
Move the rtnl locking into tcf_idrinfo_destroy() when
an offload action is found.
Most netns do not have actions, yet deleting them is adding a lot
of pressure on RTNL, which is for many the most contended mutex
in the kernel.
We are moving to a per-netns 'rtnl', so tc_action_net_exit()
will not be able to grab 'rtnl' a single time for a batch of netns.
Before the patch:
perf probe -a rtnl_lock
perf record -e probe:rtnl_lock -a /bin/bash -c 'unshare -n "/bin/true"; sleep 1'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.305 MB perf.data (25 samples) ]
After the patch:
perf record -e probe:rtnl_lock -a /bin/bash -c 'unshare -n "/bin/true"; sleep 1'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.304 MB perf.data (9 samples) ]
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Buslov <vladbu@nvidia.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://patch.msgid.link/20250702071230.1892674-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add a few kunit tests for the gateway routing. Because we have multiple
route types now (direct and gateway), rename mctp_test_create_route to
mctp_test_create_route_direct, and add a _gateway variant too.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-14-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This change allows for gateway routing, where a route table entry
may reference a routable endpoint (by network and EID), instead of
routing directly to a netdevice.
We add support for a RTM_GATEWAY attribute for netlink route updates,
with an attribute format of:
struct mctp_fq_addr {
unsigned int net;
mctp_eid_t eid;
}
- we need the net here to uniquely identify the target EID, as we no
longer have the device reference directly (which would provide the net
id in the case of direct routes).
This makes route lookups recursive, as a route lookup that returns a
gateway route must be resolved into a direct route (ie, to a device)
eventually. We provide a limit to the route lookups, to prevent infinite
loop routing.
The route lookup populates a new 'nexthop' field in the dst structure,
which now specifies the key for the neighbour table lookup on device
output, rather than using the packet destination address directly.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-13-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The netlink route parsing functions end up setting a bunch of output
variables from the rt attributes. This will get messy when the routes
become more complex.
So, split the rt parsing into two types: a lookup (returning route
target data suitable for a route lookup, like when deleting a route) and
a populate (setting fields of a struct mctp_route).
In doing this, we need to separate the route allocation from
mctp_route_add, so add some comments on the lifetime semantics for the
latter.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-12-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In upcoming changes, a route may not have a device associated. Since the
route is matched on the (network, eid) tuple, pass the netid itself into
mctp_route_remove.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-11-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Recent changes have modified the extaddr path a little, so add a couple
of kunit tests to af-mctp.c. These check that we're correctly passing
lladdr data between sendmsg/recvmsg and the routing layer.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-9-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Upcoming tests will check semantics of hardware addressing, which
require a dev with ->addr_len != 0. Add a constructor to create a
MCTP interface using a physically-addressed bus type.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-5-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Now that we have the dst->haddr populated by sendmsg (when extended
addressing is in use), we no longer need to stash the link-layer address
in the skb->cb.
Instead, only use skb->cb for incoming lladdr data.
While we're at it: remove cb->src, as was never used.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-4-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This change adds a struct mctp_dst, representing the result of a routing
lookup. This decouples the struct mctp_route from the actual
implementation of a routing operation.
This will allow for future routing changes which may require more
involved lookup logic, such as gateway routing - which may require
multiple traversals of the routing table.
Since we only use the struct mctp_route at lookup time, we no longer
hold routes over a routing operation, as we only need it to populate the
dst. However, we do hold the dev while the dst is active.
This requires some changes to the route test infrastructure, as we no
longer have a mock route to handle the route output operation, and
transient dsts are created by the routing code, so we can't override
them as easily.
Instead, we use kunit->priv to stash a packet queue, and a custom
dst_output function queues into that packet queue, which we can use for
later expectations.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-3-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In our input_cloned_frag test, we currently allocate our test buffers
arbitrarily-sized at 100 bytes.
We only expect to receive a max of 15 bytes from the socket, so reduce
to a more appropriate size. There are some upcoming changes to the
routing code which hit a frame-size limit on s390, so reduce the usage
before that lands.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-2-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In the output path, only check the skb->cb data when we know it's from
a local socket; input packets will have source address information there
instead.
In order to detect when we're forwarding, set skb->pkt_type on
input/output.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250702-dev-forwarding-v5-1-1468191da8a4@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently, ieee80211_rx_data_set_sta() does not correctly handle the
case where the interface supports multiple links (MLO), but the station
does not (non-MLO). This can lead to incorrect link assignment or
unexpected warnings when accessing link information.
Hence, add a fix to check if the station lacks valid link support and
use its default link ID for rx->link assignment. If the station
unexpectedly has valid links, fall back to the default link.
This ensures correct link association and prevents potential issues
in mixed MLO/non-MLO environments.
Signed-off-by: Hari Chandrakanthan <quic_haric@quicinc.com>
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250630084119.3583593-1-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Downloading regulatory "firmware" needs a device to hang off of, and so
a platform device seemed like the simplest way to do this. Now that we
have a faux device interface, use that instead as this "regulatory
device" is not anything resembling a platform device at all.
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: <linux-wireless@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/2025070116-growing-skeptic-494c@gregkh
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
- hci_sync: Fix not disabling advertising instance
- hci_core: Remove check of BDADDR_ANY in hci_conn_hash_lookup_big_state
- hci_sync: Fix attempting to send HCI_Disconnect to BIS handle
- hci_event: Fix not marking Broadcast Sink BIS as connected
-----BEGIN PGP SIGNATURE-----
iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmhmpiMZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKTZOD/9ggH1wJ52hD6NeRow2dZHt
BAppV0qNg7p9PkmYPDTOKNkdASmp5VfdLyCXH0vBUjC7+iutrkA/oq5UmIzAM7Uz
1HSELyvsLBL3Gj2AxWyElBtuwo3F7cA70veesZKDJJ92VHlCMMXPJugMCrcy0iJn
sVYENIJKtMnLJMTnnpmBCpRQszQfHGYnzQerbeUlCPoFgn7vsRy+onZ9XiIj6R2K
HGZVk6MhUnNWrXthwcgwAZ0SRq+5nhxhB4jOSSZ6nNnT9Wt8oloDM7KtYSy976QA
Ub2LzvR2E8/XtnTeWus5oLL2kAvz1vClS6gpGrvziElnIryq8aYhwKOO7yhRFmDK
kllpOIPaiDrl1H1nVR4z7t/IK5z/0A0wkTcXXm4WNZkLOj8YmY+BdjPKO39PRxK4
PpcHzsmI6PQjnLeGRi3a4PYfMgVYRvdO/OKSO9/xphqgE+dQdonH7/Dsvw+QkEw8
TgFvfKj8bMNjd11Mynd46P45tnQ3HIrTSYGTbdvoU/n8ZFEjATuHtIxRscw0bRmj
iy4u55JV+U3BAdm2CsRouiiTHnvXt9S2HMyMnemwwxsGpim/JG+IsccYUZd8mvLf
d+N2LWsq71CpVhRSJ50WJDrCEJdKT71KwHbSYLJl8vGhObV3VpTBZ57eVF98SNjm
Zqg8JqLJFvCLOhl4RWj/xA==
=o7T9
-----END PGP SIGNATURE-----
Merge tag 'for-net-2025-07-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci_sync: Fix not disabling advertising instance
- hci_core: Remove check of BDADDR_ANY in hci_conn_hash_lookup_big_state
- hci_sync: Fix attempting to send HCI_Disconnect to BIS handle
- hci_event: Fix not marking Broadcast Sink BIS as connected
* tag 'for-net-2025-07-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: hci_event: Fix not marking Broadcast Sink BIS as connected
Bluetooth: hci_sync: Fix attempting to send HCI_Disconnect to BIS handle
Bluetooth: hci_core: Remove check of BDADDR_ANY in hci_conn_hash_lookup_big_state
Bluetooth: hci_sync: Fix not disabling advertising instance
====================
Link: https://patch.msgid.link/20250703160409.1791514-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Refactor Ethernet header population into dedicated function, completing
the layered abstraction with:
- push_eth() for link layer
- push_udp() for transport
- push_ipv4()/push_ipv6() for network
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-6-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move UDP header construction from netpoll_send_udp() into a new
static helper function push_udp(). This completes the protocol
layer refactoring by:
1. Creating a dedicated helper for UDP header assembly
2. Removing UDP-specific logic from the main send function
3. Establishing a consistent pattern with existing IPv4/IPv6 helpers:
- push_udp()
- push_ipv4()
- push_ipv6()
The change improves code organization and maintains the encapsulation
pattern established in previous refactorings.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-5-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move IPv4 header construction from netpoll_send_udp() into a new
static helper function push_ipv4(). This completes the refactoring
started with IPv6 header handling, creating symmetric helper functions
for both IP versions.
Changes include:
1. Extracting IPv4 header setup logic into push_ipv4()
2. Replacing inline IPv4 code with helper call
3. Moving eth assignment after helper calls for consistency
The refactoring reduces code duplication and improves maintainability
by isolating IP version-specific logic.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-4-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move IPv6 header construction from netpoll_send_udp() into a new
static helper function, push_ipv6(). This refactoring reduces code
duplication and improves readability in netpoll_send_udp().
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-3-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extract UDP checksum calculation logic from netpoll_send_udp()
into a new static helper function netpoll_udp_checksum(). This
reduces code duplication and improves readability for both IPv4
and IPv6 cases.
No functional change intended.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-2-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace pointer-dereference sizeof() operations with explicit struct names
for improved readability and maintainability. This change:
1. Replaces `sizeof(*udph)` with `sizeof(struct udphdr)`
2. Replaces `sizeof(*ip6h)` with `sizeof(struct ipv6hdr)`
3. Replaces `sizeof(*iph)` with `sizeof(struct iphdr)`
This will make it easy to move code in the upcoming patches.
No functional changes are introduced by this patch.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-1-13cf3db24e2b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that __page_pool_alloc_pages_slow() is for allocating netmem, not
struct page, rename it to __page_pool_alloc_netmems_slow() to reflect
what it does.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250702053256.4594-4-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that __page_pool_release_page_dma() is for releasing netmem, not
struct page, rename it to __page_pool_release_netmem_dma() to reflect
what it does.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://patch.msgid.link/20250702053256.4594-3-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that page_pool_return_page() is for returning netmem, not struct
page, rename it to page_pool_return_netmem() to reflect what it does.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://patch.msgid.link/20250702053256.4594-2-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot reported a null-ptr-deref in tipc_conn_close() during netns
dismantle. [0]
tipc_topsrv_stop() iterates tipc_net(net)->topsrv->conn_idr and calls
tipc_conn_close() for each tipc_conn.
The problem is that tipc_conn_close() is called after releasing the
IDR lock.
At the same time, there might be tipc_conn_recv_work() running and it
could call tipc_conn_close() for the same tipc_conn and release its
last ->kref.
Once we release the IDR lock in tipc_topsrv_stop(), there is no
guarantee that the tipc_conn is alive.
Let's hold the ref before releasing the lock and put the ref after
tipc_conn_close() in tipc_topsrv_stop().
[0]:
BUG: KASAN: use-after-free in tipc_conn_close+0x122/0x140 net/tipc/topsrv.c:165
Read of size 8 at addr ffff888099305a08 by task kworker/u4:3/435
CPU: 0 PID: 435 Comm: kworker/u4:3 Not tainted 4.19.204-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: netns cleanup_net
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1fc/0x2ef lib/dump_stack.c:118
print_address_description.cold+0x54/0x219 mm/kasan/report.c:256
kasan_report_error.cold+0x8a/0x1b9 mm/kasan/report.c:354
kasan_report mm/kasan/report.c:412 [inline]
__asan_report_load8_noabort+0x88/0x90 mm/kasan/report.c:433
tipc_conn_close+0x122/0x140 net/tipc/topsrv.c:165
tipc_topsrv_stop net/tipc/topsrv.c:701 [inline]
tipc_topsrv_exit_net+0x27b/0x5c0 net/tipc/topsrv.c:722
ops_exit_list+0xa5/0x150 net/core/net_namespace.c:153
cleanup_net+0x3b4/0x8b0 net/core/net_namespace.c:553
process_one_work+0x864/0x1570 kernel/workqueue.c:2153
worker_thread+0x64c/0x1130 kernel/workqueue.c:2296
kthread+0x33f/0x460 kernel/kthread.c:259
ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:415
Allocated by task 23:
kmem_cache_alloc_trace+0x12f/0x380 mm/slab.c:3625
kmalloc include/linux/slab.h:515 [inline]
kzalloc include/linux/slab.h:709 [inline]
tipc_conn_alloc+0x43/0x4f0 net/tipc/topsrv.c:192
tipc_topsrv_accept+0x1b5/0x280 net/tipc/topsrv.c:470
process_one_work+0x864/0x1570 kernel/workqueue.c:2153
worker_thread+0x64c/0x1130 kernel/workqueue.c:2296
kthread+0x33f/0x460 kernel/kthread.c:259
ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:415
Freed by task 23:
__cache_free mm/slab.c:3503 [inline]
kfree+0xcc/0x210 mm/slab.c:3822
tipc_conn_kref_release net/tipc/topsrv.c:150 [inline]
kref_put include/linux/kref.h:70 [inline]
conn_put+0x2cd/0x3a0 net/tipc/topsrv.c:155
process_one_work+0x864/0x1570 kernel/workqueue.c:2153
worker_thread+0x64c/0x1130 kernel/workqueue.c:2296
kthread+0x33f/0x460 kernel/kthread.c:259
ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:415
The buggy address belongs to the object at ffff888099305a00
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 8 bytes inside of
512-byte region [ffff888099305a00, ffff888099305c00)
The buggy address belongs to the page:
page:ffffea000264c140 count:1 mapcount:0 mapping:ffff88813bff0940 index:0x0
flags: 0xfff00000000100(slab)
raw: 00fff00000000100 ffffea00028b6b88 ffffea0002cd2b08 ffff88813bff0940
raw: 0000000000000000 ffff888099305000 0000000100000006 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888099305900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888099305980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888099305a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888099305a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888099305b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: c5fa7b3cf3 ("tipc: introduce new TIPC server infrastructure")
Reported-by: syzbot+d333febcf8f4bc5f6110@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=27169a847a70550d17be
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20250702014350.692213-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Netlink has this pattern in some places
if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
atomic_add(skb->truesize, &sk->sk_rmem_alloc);
, which has the same problem fixed by commit 5a465a0da1 ("udp:
Fix multiple wraparounds of sk->sk_rmem_alloc.").
For example, if we set INT_MAX to SO_RCVBUFFORCE, the condition
is always false as the two operands are of int.
Then, a single socket can eat as many skb as possible until OOM
happens, and we can see multiple wraparounds of sk->sk_rmem_alloc.
Let's fix it by using atomic_add_return() and comparing the two
variables as unsigned int.
Before:
[root@fedora ~]# ss -f netlink
Recv-Q Send-Q Local Address:Port Peer Address:Port
-1668710080 0 rtnl:nl_wraparound/293 *
After:
[root@fedora ~]# ss -f netlink
Recv-Q Send-Q Local Address:Port Peer Address:Port
2147483072 0 rtnl:nl_wraparound/290 *
^
`--- INT_MAX - 576
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Jason Baron <jbaron@akamai.com>
Closes: https://lore.kernel.org/netdev/cover.1750285100.git.jbaron@akamai.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250704054824.1580222-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a packet enters OVS datapath and there is no flow to handle it,
packet goes to userspace through a MISS upcall. With per-CPU upcall
dispatch mechanism, we're using the current CPU id to select the
Netlink PID on which to send this packet. This allows us to send
packets from the same traffic flow through the same handler.
The handler will process the packet, install required flow into the
kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.
While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
recirculation action that will pass the (likely modified) packet
through the flow lookup again. And if the flow is not found, the
packet will be sent to userspace again through another MISS upcall.
However, the handler thread in userspace is likely running on a
different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
in the syscall context of that thread. So, when the time comes to
send the packet through another upcall, the per-CPU dispatch will
choose a different Netlink PID, and this packet will end up processed
by a different handler thread on a different CPU.
The process continues as long as there are new recirculations, each
time the packet goes to a different handler thread before it is sent
out of the OVS datapath to the destination port. In real setups the
number of recirculations can go up to 4 or 5, sometimes more.
There is always a chance to re-order packets while processing upcalls,
because userspace will first install the flow and then re-inject the
original packet. So, there is a race window when the flow is already
installed and the second packet can match it and be forwarded to the
destination before the first packet is re-injected. But the fact that
packets are going through multiple upcalls handled by different
userspace threads makes the reordering noticeably more likely, because
we not only have a race between the kernel and a userspace handler
(which is hard to avoid), but also between multiple userspace handlers.
For example, let's assume that 10 packets got enqueued through a MISS
upcall for handler-1, it will start processing them, will install the
flow into the kernel and start re-injecting packets back, from where
they will go through another MISS to handler-2. Handler-2 will install
the flow into the kernel and start re-injecting the packets, while
handler-1 continues to re-inject the last of the 10 packets, they will
hit the flow installed by handler-2 and be forwarded without going to
the handler-2, while handler-2 still re-injects the first of these 10
packets. Given multiple recirculations and misses, these 10 packets
may end up completely mixed up on the output from the datapath.
Let's allow userspace to specify on which Netlink PID the packets
should be upcalled while processing OVS_PACKET_CMD_EXECUTE.
This makes it possible to ensure that all the packets are processed
by the same handler thread in the userspace even with them being
upcalled multiple times in the process. Packets will remain in order
since they will be enqueued to the same socket and re-injected in the
same order. This doesn't eliminate re-ordering as stated above, since
we still have a race between kernel and the userspace thread, but it
allows to eliminate races between multiple userspace threads.
Userspace knows the PID of the socket on which the original upcall is
received, so there is no need to send it up from the kernel.
Solution requires storing the value somewhere for the duration of the
packet processing. There are two potential places for this: our skb
extension or the per-CPU storage. It's not clear which is better,
so just following currently used scheme of storing this kind of things
along the skb. We still have a decent amount of space in the cb.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250702155043.2331772-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch is a mitigation to prevent the A-MSDU spoofing vulnerability
for mesh networks. The initial update to the IEEE 802.11 standard, in
response to the FragAttacks, missed this case (CVE-2025-27558). It can
be considered a variant of CVE-2020-24588 but for mesh networks.
This patch tries to detect if a standard MSDU was turned into an A-MSDU
by an adversary. This is done by parsing a received A-MSDU as a standard
MSDU, calculating the length of the Mesh Control header, and seeing if
the 6 bytes after this header equal the start of an rfc1042 header. If
equal, this is a strong indication of an ongoing attack attempt.
This defense was tested with mac80211_hwsim against a mesh network that
uses an empty Mesh Address Extension field, i.e., when four addresses
are used, and when using a 12-byte Mesh Address Extension field, i.e.,
when six addresses are used. Functionality of normal MSDUs and A-MSDUs
was also tested, and confirmed working, when using both an empty and
12-byte Mesh Address Extension field.
It was also tested with mac80211_hwsim that A-MSDU attacks in non-mesh
networks keep being detected and prevented.
Note that the vulnerability being patched, and the defense being
implemented, was also discussed in the following paper and in the
following IEEE 802.11 presentation:
https://papers.mathyvanhoef.com/wisec2025.pdfhttps://mentor.ieee.org/802.11/dcn/25/11-25-0949-00-000m-a-msdu-mesh-spoof-protection.docx
Cc: stable@vger.kernel.org
Signed-off-by: Mathy Vanhoef <Mathy.Vanhoef@kuleuven.be>
Link: https://patch.msgid.link/20250616004635.224344-1-Mathy.Vanhoef@kuleuven.be
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
VHT operating mode notifications are not defined for channel widths
below 20 MHz. In particular, 5 MHz and 10 MHz are not valid under the
VHT specification and must be rejected.
Without this check, malformed notifications using these widths may
reach ieee80211_chan_width_to_rx_bw(), leading to a WARN_ON due to
invalid input. This issue was reported by syzbot.
Reject these unsupported widths early in sta_link_apply_parameters()
when opmode_notif is used. The accepted set includes 20, 40, 80, 160,
and 80+80 MHz, which are valid for VHT. While 320 MHz is not defined
for VHT, it is allowed to avoid rejecting HE or EHT clients that may
still send a VHT opmode notification.
Reported-by: syzbot+ededba317ddeca8b3f08@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ededba317ddeca8b3f08
Fixes: 751e7489c1 ("wifi: mac80211: expose ieee80211_chan_width_to_rx_bw() to drivers")
Tested-by: syzbot+ededba317ddeca8b3f08@syzkaller.appspotmail.com
Signed-off-by: Moon Hee Lee <moonhee.lee.ca@gmail.com>
Link: https://patch.msgid.link/20250703193756.46622-2-moonhee.lee.ca@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When the non-transmitted BSSID profile is found, immediately return
from the search to not return the wrong profile_len when the profile
is found in a multiple BSSID element that isn't the last one in the
frame.
Fixes: 5023b14cf4 ("mac80211: support profile split between elements")
Reported-by: Michael-CY Lee <michael-cy.lee@mediatek.com>
Link: https://patch.msgid.link/20250630154501.f26cd45a0ecd.I28e0525d06e8a99e555707301bca29265cf20dc8@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In disconnect paths paths, local frame buffers are used
to build deauthentication frames to send them over the
air and as notifications to userspace. Some internal
error paths (that, given no other bugs, cannot happen)
don't always initialize the buffers before sending them
to userspace, so in the presence of other bugs they can
leak stack content. Initialize the buffers to avoid the
possibility of this happening.
Suggested-by: Zhongqiu Han <quic_zhonhan@quicinc.com>
Link: https://patch.msgid.link/20250701072213.13004-2-johannes@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
mac80211 identifies a short beacon by the presence of the next
TBTT field, however the standard actually doesn't explicitly state that
the next TBTT can't be in a long beacon or even that it is required in
a short beacon - and as a result this validation does not work for all
vendor implementations.
The standard explicitly states that an S1G long beacon shall contain
the S1G beacon compatibility element as the first element in a beacon
transmitted at a TBTT that is not a TSBTT (Target Short Beacon
Transmission Time) as per IEEE80211-2024 11.1.3.10.1. This is validated
by 9.3.4.3 Table 9-76 which states that the S1G beacon compatibility
element is only allowed in the full set and is not allowed in the
minimum set of elements permitted for use within short beacons.
Correctly identify short beacons by the lack of an S1G beacon
compatibility element as the first element in an S1G beacon frame.
Fixes: 9eaffe5078 ("cfg80211: convert S1G beacon to scan results")
Signed-off-by: Simon Wadsworth <simon@morsemicro.com>
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250701075541.162619-1-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Rename hmac_sha256() to ceph_hmac_sha256(), to avoid a naming conflict
with the upcoming hmac_sha256() library function.
This code will be able to use the HMAC-SHA256 library, but that's left
for a later commit.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250630160645.3198-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Now everything is ready to pass PIDFD_STALE to pidfd_prepare().
This will allow opening pidfd for reaped tasks.
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: David Rheinsberg <david@readahead.eu>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Link: https://lore.kernel.org/20250703222314.309967-7-aleksandr.mikhalitsyn@canonical.com
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
We need to ensure that pidfs dentry is allocated when we meet any
struct pid for the first time. This will allows us to open pidfd
even after the task it corresponds to is reaped.
Basically, we need to identify all places where we fill skb/scm_cookie
with struct pid reference for the first time and call pidfs_register_pid().
Tricky thing here is that we have a few places where this happends
depending on what userspace is doing:
- [__scm_replace_pid()] explicitly sending an SCM_CREDENTIALS message
and specified pid in a numeric format
- [unix_maybe_add_creds()] enabled SO_PASSCRED/SO_PASSPIDFD but
didn't send SCM_CREDENTIALS explicitly
- [scm_send()] force_creds is true. Netlink case, we don't need to touch it.
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: David Rheinsberg <david@readahead.eu>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Link: https://lore.kernel.org/20250703222314.309967-6-aleksandr.mikhalitsyn@canonical.com
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Existing logic in __scm_send() related to filling an struct scm_cookie
with a proper struct pid reference is already pretty tricky. Let's
simplify it a bit by introducing a new helper. This helper will be
extended in one of the next patches.
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: David Rheinsberg <david@readahead.eu>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Link: https://lore.kernel.org/20250703222314.309967-4-aleksandr.mikhalitsyn@canonical.com
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Instead of open-coding let's consolidate this logic in a separate
helper. This will simplify further changes.
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: David Rheinsberg <david@readahead.eu>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Link: https://lore.kernel.org/20250703222314.309967-3-aleksandr.mikhalitsyn@canonical.com
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
As a preparation for the next patches we need to allow sleeping
in unix_maybe_add_creds() and also return err. Currently, we can't do
that as unix_maybe_add_creds() is being called under unix_state_lock().
There is no need for this, really. So let's move call sites of
this helper a bit and do necessary function signature changes.
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@google.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: David Rheinsberg <david@readahead.eu>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Link: https://lore.kernel.org/20250703222314.309967-2-aleksandr.mikhalitsyn@canonical.com
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In the crypto offload path, every packet is still processed by the
software stack. The state's statistics required for the expiration
check are being updated in software.
However, the code also calls xfrm_dev_state_update_stats(), which
triggers a query to the hardware device to fetch statistics. This
hardware query is redundant and introduces unnecessary performance
overhead.
Skip this call when it's crypto offload (not packet offload) to avoid
the unnecessary hardware access, thereby improving performance.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
collect_md property on xfrm interfaces can only be set on device creation,
thus xfrmi_changelink() should fail when called on such interfaces.
The check to enforce this was done only in the case where the xi was
returned from xfrmi_locate() which doesn't look for the collect_md
interface, and thus the validation was never reached.
Calling changelink would thus errornously place the special interface xi
in the xfrmi_net->xfrmi hash, but since it also exists in the
xfrmi_net->collect_md_xfrmi pointer it would lead to a double free when
the net namespace was taken down [1].
Change the check to use the xi from netdev_priv which is available earlier
in the function to prevent changes in xfrm collect_md interfaces.
[1] resulting oops:
[ 8.516540] kernel BUG at net/core/dev.c:12029!
[ 8.516552] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[ 8.516559] CPU: 0 UID: 0 PID: 12 Comm: kworker/u80:0 Not tainted 6.15.0-virtme #5 PREEMPT(voluntary)
[ 8.516565] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 8.516569] Workqueue: netns cleanup_net
[ 8.516579] RIP: 0010:unregister_netdevice_many_notify+0x101/0xab0
[ 8.516590] Code: 90 0f 0b 90 48 8b b0 78 01 00 00 48 8b 90 80 01 00 00 48 89 56 08 48 89 32 4c 89 80 78 01 00 00 48 89 b8 80 01 00 00 eb ac 90 <0f> 0b 48 8b 45 00 4c 8d a0 88 fe ff ff 48 39 c5 74 5c 41 80 bc 24
[ 8.516593] RSP: 0018:ffffa93b8006bd30 EFLAGS: 00010206
[ 8.516598] RAX: ffff98fe4226e000 RBX: ffffa93b8006bd58 RCX: ffffa93b8006bc60
[ 8.516601] RDX: 0000000000000004 RSI: 0000000000000000 RDI: dead000000000122
[ 8.516603] RBP: ffffa93b8006bdd8 R08: dead000000000100 R09: ffff98fe4133c100
[ 8.516605] R10: 0000000000000000 R11: 00000000000003d2 R12: ffffa93b8006be00
[ 8.516608] R13: ffffffff96c1a510 R14: ffffffff96c1a510 R15: ffffa93b8006be00
[ 8.516615] FS: 0000000000000000(0000) GS:ffff98fee73b7000(0000) knlGS:0000000000000000
[ 8.516619] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8.516622] CR2: 00007fcd2abd0700 CR3: 000000003aa40000 CR4: 0000000000752ef0
[ 8.516625] PKRU: 55555554
[ 8.516627] Call Trace:
[ 8.516632] <TASK>
[ 8.516635] ? rtnl_is_locked+0x15/0x20
[ 8.516641] ? unregister_netdevice_queue+0x29/0xf0
[ 8.516650] ops_undo_list+0x1f2/0x220
[ 8.516659] cleanup_net+0x1ad/0x2e0
[ 8.516664] process_one_work+0x160/0x380
[ 8.516673] worker_thread+0x2aa/0x3c0
[ 8.516679] ? __pfx_worker_thread+0x10/0x10
[ 8.516686] kthread+0xfb/0x200
[ 8.516690] ? __pfx_kthread+0x10/0x10
[ 8.516693] ? __pfx_kthread+0x10/0x10
[ 8.516697] ret_from_fork+0x82/0xf0
[ 8.516705] ? __pfx_kthread+0x10/0x10
[ 8.516709] ret_from_fork_asm+0x1a/0x30
[ 8.516718] </TASK>
Fixes: abc340b38b ("xfrm: interface: support collect metadata mode")
Reported-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Current release - new code bugs:
- eth: txgbe: fix the issue of TX failure
- eth: ngbe: specify IRQ vector when the number of VFs is 7
Previous releases - regressions:
- sched: always pass notifications when child class becomes empty
- ipv4: fix stat increase when udp early demux drops the packet
- bluetooth: prevent unintended pause by checking if advertising is active
- virtio: fix error reporting in virtqueue_resize
- eth: virtio-net:
- ensure the received length does not exceed allocated size
- fix the xsk frame's length check
- eth: lan78xx: fix WARN in __netif_napi_del_locked on disconnect
Previous releases - always broken:
- bluetooth: mesh: check instances prior disabling advertising
- eth: idpf: convert control queue mutex to a spinlock
- eth: dpaa2: fix xdp_rxq_info leak
- eth: amd-xgbe: align CL37 AN sequence as per databook
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmhmfzQSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOk/GMP/ixlapKjTP/ggGIFO0nEDTm1tAFnhQl3
bBuwBDoGPjalb46WBO24SFSFYqvZwV6ZIYxCxCeBfmkPyEun0FBX6xjqUIZqohTZ
u5ZSmKFkODMoxQWAG0hXBGvfeKg/GBMWJT761o5IB2XvknRlqHq6uufUBcalvlJK
t58ykSYp2wjfowXSRQ4jEZnr4HZzVuvarhbCB9hJWv206fdk4LiC07teHB1VhW4w
LYmBQChp8SXDFCCYZajum0cNCzx78q90lGzz+MEErVXdXXnRVeqRAUY+k4Vd/Fz+
0OY1vZJ7xgFpy2ns3Z6TH8D41P9whBI8jUYXZ5nA45J8N5wdEQo8oVHlRe9a6Y/E
0oC+DPahhSQAq8BKGFtYSyyURGJvd4+TpQP/LV4e83myReW8i0ZKtyXVgH0Cibwb
529l6wIXBAcLK03tyYwmoCI2VjJbRoMV3nMCeiACCtDExK1YCa3dhjQ82fa8voLc
MIn7zXAGf12IKca39ZapRrdaooaqvSG4htxTn94vEqScNu0wi1cymvG47h9bDrES
cPyS4/MIUH0sduSDVL5PpFYfIDhqS3mpc0e8Nc3pOy7VLQ9kvtBX37OaO/tX5aeh
SWU+8q8y1Cnq0+mcUUHpENFMOgZEC5UO6rdeaJB3Nu0vlHlDEZoEkUXSkHEfsf2F
aodwE/oPyQCg
=O7OS
-----END PGP SIGNATURE-----
Merge tag 'net-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from Bluetooth.
Current release - new code bugs:
- eth:
- txgbe: fix the issue of TX failure
- ngbe: specify IRQ vector when the number of VFs is 7
Previous releases - regressions:
- sched: always pass notifications when child class becomes empty
- ipv4: fix stat increase when udp early demux drops the packet
- bluetooth: prevent unintended pause by checking if advertising is active
- virtio: fix error reporting in virtqueue_resize
- eth:
- virtio-net:
- ensure the received length does not exceed allocated size
- fix the xsk frame's length check
- lan78xx: fix WARN in __netif_napi_del_locked on disconnect
Previous releases - always broken:
- bluetooth: mesh: check instances prior disabling advertising
- eth:
- idpf: convert control queue mutex to a spinlock
- dpaa2: fix xdp_rxq_info leak
- amd-xgbe: align CL37 AN sequence as per databook"
* tag 'net-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
vsock/vmci: Clear the vmci transport packet properly when initializing it
dt-bindings: net: sophgo,sg2044-dwmac: Drop status from the example
net: ngbe: specify IRQ vector when the number of VFs is 7
net: wangxun: revert the adjustment of the IRQ vector sequence
net: txgbe: request MISC IRQ in ndo_open
virtio_net: Enforce minimum TX ring size for reliability
virtio_net: Cleanup '2+MAX_SKB_FRAGS'
virtio_ring: Fix error reporting in virtqueue_resize
virtio-net: xsk: rx: fix the frame's length check
virtio-net: use the check_mergeable_len helper
virtio-net: remove redundant truesize check with PAGE_SIZE
virtio-net: ensure the received length does not exceed allocated size
net: ipv4: fix stat increase when udp early demux drops the packet
net: libwx: fix the incorrect display of the queue number
amd-xgbe: do not double read link status
net/sched: Always pass notifications when child class becomes empty
nui: Fix dma_mapping_error() check
rose: fix dangling neighbour pointers in rose_rt_device_down()
enic: fix incorrect MTU comparison in enic_change_mtu()
amd-xgbe: align CL37 AN sequence as per databook
...
Upon receiving HCI_EVT_LE_BIG_SYNC_ESTABLISHED with status 0x00
(success) the corresponding BIS hci_conn state shall be set to
BT_CONNECTED otherwise they will be left with BT_OPEN which is invalid
at that point, also create the debugfs and sysfs entries following the
same logic as the likes of Broadcast Source BIS and CIS connections.
Fixes: f777d88278 ("Bluetooth: ISO: Notify user space about failed bis connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
BIS/PA connections do have their own cleanup proceedure which are
performed by hci_conn_cleanup/bis_cleanup.
Fixes: 23205562ff ("Bluetooth: separate CIS_LINK and BIS_LINK link types")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
As the code comments on hci_setup_ext_adv_instance_sync suggests the
advertising instance needs to be disabled in order to update its
parameters, but it was wrongly checking that !adv->pending.
Fixes: cba6b75871 ("Bluetooth: hci_sync: Make use of hci_cmd_sync_queue set 2")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Since commit 0e23387491 ("ipv6: fix races in ip6_dst_destroy()"),
'table' is unused in __fib6_drop_pcpu_from(), no need pass it from
fib6_drop_pcpu_from().
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250701041235.1333687-1-yuehaibing@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The DCCP socket family has now been removed from this tree, see:
8bb3212be4 ("Merge branch 'net-retire-dccp-socket'")
Remove connection tracking and NAT support for this protocol, this
should not pose a problem because no DCCP traffic is expected to be seen
on the wire.
As for the code for matching on dccp header for iptables and nftables,
mark it as deprecated and keep it in place. Ruleset restoration is an
atomic operation. Without dccp matching support, an astray match on dccp
could break this operation leaving your computer with no policy in
place, so let's follow a more conservative approach for matches.
Add CONFIG_NFT_EXTHDR_DCCP which is set to 'n' by default to deprecate
dccp extension support. Similarly, label CONFIG_NETFILTER_XT_MATCH_DCCP
as deprecated too and also set it to 'n' by default.
Code to match on DCCP protocol from ebtables also remains in place, this
is just a few checks on IPPROTO_DCCP from _check() path which is
exercised when ruleset is loaded. There is another use of IPPROTO_DCCP
from the _check() path in the iptables multiport match. Another check
for IPPROTO_DCCP from the packet in the reject target is also removed.
So let's schedule removal of the dccp matching for a second stage, this
should not interfer with the dccp retirement since this is only matching
on the dccp header.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In vmci_transport_packet_init memset the vmci_transport_packet before
populating the fields to avoid any uninitialised data being left in the
structure.
Cc: Bryan Tan <bryan-bt.tan@broadcom.com>
Cc: Vishnu Dasa <vishnu.dasa@broadcom.com>
Cc: Broadcom internal kernel review list
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: virtualization@lists.linux.dev
Cc: netdev@vger.kernel.org
Cc: stable <stable@kernel.org>
Signed-off-by: HarshaVardhana S A <harshavardhana.sa@broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: d021c34405 ("VSOCK: Introduce VM Sockets")
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250701122254.2397440-1-gregkh@linuxfoundation.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Callers couldn't care less which dentry did we get - anything
valid is treated as success.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All callers except the one in rpc_populate() pass explicit NULL
there; rpc_populate() passes its last argument ('private') instead,
but in the only call of rpc_populate() that creates any subdirectories
(when creating fixed subdirectories of root) private itself is NULL.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just have it create gssd (in root), clntXX in gssd, then info and gssd in clntXX
- all with explicit rpc_new_dir()/rpc_new_file()/rpc_mkpipe_dentry().
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and make sure we set the fs-private part of inode up before
attaching it to dentry.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
rpc_new_file(); similar to rpc_new_dir(), except that here we pass
file_operations as well. Callers don't care about dentry, just
return 0 or -E...
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All users of __rpc_mkdir() have the same form - start_creating(),
followed, in case of success, by __rpc_mkdir() and unlocking parent.
Combine that into a single helper, expanding __rpc_mkdir() into it,
along with the call of __rpc_create_common() in it.
Don't mess with d_drop() + d_add() - just d_instantiate() and be
done with that. The reason __rpc_create_common() goes for that
dance is that dentry it gets might or might not be hashed; here
we know it's hashed.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Don't try to hold directories locked more than VFS requires;
lock just before getting a child to be made positive (using
simple_start_creating()) and unlock as soon as the child is
created. There's no benefit in keeping the parent locked
while populating the child - it won't stop dcache lookups anyway.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of returning a dentry or ERR_PTR(-E...), return 0 and store
dentry into pipe->dentry on success and return -E... on failure.
Callers are happier that way...
NOTE: dummy rpc_pipe is getting ->dentry set; we never access that,
since we
1) never call rpc_unlink() for it (dentry is taken out by
->kill_sb())
2) never call rpc_queue_upcall() for it (writing to that
sucker fails; no downcalls are ever submitted, so no replies are
going to arrive)
IOW, having that ->dentry set (and left dangling) is harmless,
if ugly; cleaner solution will take more massage.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1) pass it pipe instead of pipe->dentry
2) zero pipe->dentry afterwards
3) it always returns 0; why bother?
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
rpc_populate() is called either from fill_super (where we don't
need to remove any files on failure - rpc_kill_sb() will take
them all out anyway) or from rpc_mkdir_populate(), where we need
to remove the directory we'd been trying to populate along with
whatever we'd put into it before we failed. Simpler to combine
that into simple_recursive_removal() there.
Note that rpc_pipe is overlocking directories quite a bit -
locked parent is no obstacle to finding a child in dcache, so
keeping it locked won't prevent userland observing a partially
built subtree.
All we need is to follow minimal VFS requirements; it's not
as if clients used directory locking for exclusion - tree
changes are serialized, but that's done on ->pipefs_sb_lock.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
note that the callback of simple_recursive_removal() is called with
the parent locked; the victim isn't locked by the caller.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
no need to give an exact list of objects to be remove when it's
simply every child of the victim directory.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
->kill_sb() will be called immediately after a failure return
anyway, so we don't need to bother with notifier chain and
rpc_gssd_dummy_depopulate(). What's more, rpc_gssd_dummy_populate()
failure exits do not need to bother with __rpc_depopulate() -
anything added to the tree will be taken out by ->kill_sb().
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
change 'Maximium' to 'Maximum'
Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://patch.msgid.link/20250702055820.112190-1-zhaochenguang@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce support for specifying relative bandwidth shares between
traffic classes (TC) in the devlink-rate API. This new option allows
users to allocate bandwidth across multiple traffic classes in a
single command.
This feature provides a more granular control over traffic management,
especially for scenarios requiring Enhanced Transmission Selection.
Users can now define a relative bandwidth share for each traffic class.
For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5
(RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the
total bandwidth. The actual percentage each class receives depends on
the ratio of its share value to the sum of all shares.
Example:
DEV=pci/0000:08:00.0
$ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \
tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0
$ devlink port function rate set $DEV/vfs_group \
tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0
Example usage with ynl:
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
--do rate-set --json '{
"bus-name": "pci",
"dev-name": "0000:08:00.0",
"port-index": 1,
"rate-tc-bws": [
{"rate-tc-index": 0, "rate-tc-bw": 50},
{"rate-tc-index": 1, "rate-tc-bw": 50},
{"rate-tc-index": 2, "rate-tc-bw": 0},
{"rate-tc-index": 3, "rate-tc-bw": 0},
{"rate-tc-index": 4, "rate-tc-bw": 0},
{"rate-tc-index": 5, "rate-tc-bw": 0},
{"rate-tc-index": 6, "rate-tc-bw": 0},
{"rate-tc-index": 7, "rate-tc-bw": 0}
]
}'
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
--do rate-get --json '{
"bus-name": "pci",
"dev-name": "0000:08:00.0",
"port-index": 1
}'
output for rate-get:
{'bus-name': 'pci',
'dev-name': '0000:08:00.0',
'port-index': 1,
'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0},
{'rate-tc-bw': 50, 'rate-tc-index': 1},
{'rate-tc-bw': 0, 'rate-tc-index': 2},
{'rate-tc-bw': 0, 'rate-tc-index': 3},
{'rate-tc-bw': 0, 'rate-tc-index': 4},
{'rate-tc-bw': 0, 'rate-tc-index': 5},
{'rate-tc-bw': 0, 'rate-tc-index': 6},
{'rate-tc-bw': 0, 'rate-tc-index': 7}],
'rate-tx-max': 0,
'rate-tx-priority': 0,
'rate-tx-share': 0,
'rate-tx-weight': 0,
'rate-type': 'leaf'}
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
MSG_ZEROCOPY data must be copied before data is queued to local
sockets, to avoid indefinite timeout until memory release.
This test is performed by skb_orphan_frags_rx, which is called when
looping an egress skb to packet sockets, error queue or ingress path.
To preserve zerocopy for skbs that are looped to ingress but are then
forwarded to an egress device rather than delivered locally, defer
this last check until an skb enters the local IP receive path.
This is analogous to existing behavior of skb_clear_delivery_time.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
udp_v4_early_demux now returns drop reasons as it either returns 0 or
ip_mc_validate_source, which returns itself a drop reason. However its
use was not converted in ip_rcv_finish_core and the drop reason is
ignored, leading to potentially skipping increasing LINUX_MIB_IPRPFILTER
if the drop reason is SKB_DROP_REASON_IP_RPFILTER.
This is a fix and we're not converting udp_v4_early_demux to explicitly
return a drop reason to ease backports; this can be done as a follow-up.
Fixes: d46f827016 ("net: ip: make ip_mc_validate_source() return drop reason")
Cc: Menglong Dong <menglong8.dong@gmail.com>
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250701074935.144134-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Certain classful qdiscs may invoke their classes' dequeue handler on an
enqueue operation. This may unexpectedly empty the child qdisc and thus
make an in-flight class passive via qlen_notify(). Most qdiscs do not
expect such behaviour at this point in time and may re-activate the
class eventually anyways which will lead to a use-after-free.
The referenced fix commit attempted to fix this behavior for the HFSC
case by moving the backlog accounting around, though this turned out to
be incomplete since the parent's parent may run into the issue too.
The following reproducer demonstrates this use-after-free:
tc qdisc add dev lo root handle 1: drr
tc filter add dev lo parent 1: basic classid 1:1
tc class add dev lo parent 1: classid 1:1 drr
tc qdisc add dev lo parent 1:1 handle 2: hfsc def 1
tc class add dev lo parent 2: classid 2:1 hfsc rt m1 8 d 1 m2 0
tc qdisc add dev lo parent 2:1 handle 3: netem
tc qdisc add dev lo parent 3:1 handle 4: blackhole
echo 1 | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888
tc class delete dev lo classid 1:1
echo 1 | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888
Since backlog accounting issues leading to a use-after-frees on stale
class pointers is a recurring pattern at this point, this patch takes
a different approach. Instead of trying to fix the accounting, the patch
ensures that qdisc_tree_reduce_backlog always calls qlen_notify when
the child qdisc is empty. This solves the problem because deletion of
qdiscs always involves a call to qdisc_reset() and / or
qdisc_purge_queue() which ultimately resets its qlen to 0 thus causing
the following qdisc_tree_reduce_backlog() to report to the parent. Note
that this may call qlen_notify on passive classes multiple times. This
is not a problem after the recent patch series that made all the
classful qdiscs qlen_notify() handlers idempotent.
Fixes: 3f98113810 ("sch_hfsc: Fix qlen accounting bug when using peek in hfsc_enqueue()")
Signed-off-by: Lion Ackermann <nnamrec@gmail.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/d912cbd7-193b-4269-9857-525bee8bbb6a@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both functions are always called under RCU.
We remove the extra rcu_read_lock()/rcu_read_unlock().
We use skb_dst_dev_net_rcu() instead of skb_dst_dev_net().
We use dev_net_rcu() instead of dev_net().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new helpers as a step to deal with potential dst->dev races.
v2: fix typo in ipv6_rthdr_rcv() (kernel test robot <lkp@intel.com>)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new helper as a step to deal with potential dst->dev races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new helpers as a first step to deal with
potential dst->dev races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dst->dev is read locklessly in many contexts,
and written in dst_dev_put().
Fixing all the races is going to need many changes.
We probably will have to add full RCU protection.
Add three helpers to ease this painful process.
static inline struct net_device *dst_dev(const struct dst_entry *dst)
{
return READ_ONCE(dst->dev);
}
static inline struct net_device *skb_dst_dev(const struct sk_buff *skb)
{
return dst_dev(skb_dst(skb));
}
static inline struct net *skb_dst_dev_net(const struct sk_buff *skb)
{
return dev_net(skb_dst_dev(skb));
}
static inline struct net *skb_dst_dev_net_rcu(const struct sk_buff *skb)
{
return dev_net_rcu(skb_dst_dev(skb));
}
Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dst_dev_put() can overwrite dst->output while other
cpus might read this field (for instance from dst_output())
Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.
We will likely need RCU protection in the future.
Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dst_dev_put() can overwrite dst->input while other
cpus might read this field (for instance from dst_input())
Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.
We will likely need full RCU protection later.
Fixes: 4a6ce2b6f2 ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(dst_entry)->lastuse is read and written locklessly,
add corresponding annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(dst_entry)->expires is read and written locklessly,
add corresponding annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.
Move tcp_memory_allocated into a dedicated cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using per-cpu data for net->net_cookie generation is overkill,
because even busy hosts do not create hundreds of netns per second.
Make sure to put net_cookie in a private cache line to avoid
potential false sharing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This structure will hold networking data that must
consume a full cache line to avoid accidental false sharing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The issue originates when Strongswan initiates an XFRM_MSG_ALLOCSPI
Netlink message, which triggers the kernel function xfrm_alloc_spi().
This function is expected to ensure uniqueness of the Security Parameter
Index (SPI) for inbound Security Associations (SAs). However, it can
return success even when the requested SPI is already in use, leading
to duplicate SPIs assigned to multiple inbound SAs, differentiated
only by their destination addresses.
This behavior causes inconsistencies during SPI lookups for inbound packets.
Since the lookup may return an arbitrary SA among those with the same SPI,
packet processing can fail, resulting in packet drops.
According to RFC 4301 section 4.4.2 , for inbound processing a unicast SA
is uniquely identified by the SPI and optionally protocol.
Reproducing the Issue Reliably:
To consistently reproduce the problem, restrict the available SPI range in
charon.conf : spi_min = 0x10000000 spi_max = 0x10000002
This limits the system to only 2 usable SPI values.
Next, create more than 2 Child SA. each using unique pair of src/dst address.
As soon as the 3rd Child SA is initiated, it will be assigned a duplicate
SPI, since the SPI pool is already exhausted.
With a narrow SPI range, the issue is consistently reproducible.
With a broader/default range, it becomes rare and unpredictable.
Current implementation:
xfrm_spi_hash() lookup function computes hash using daddr, proto, and family.
So if two SAs have the same SPI but different destination addresses, then
they will:
a. Hash into different buckets
b. Be stored in different linked lists (byspi + h)
c. Not be seen in the same hlist_for_each_entry_rcu() iteration.
As a result, the lookup will result in NULL and kernel allows that Duplicate SPI
Proposed Change:
xfrm_state_lookup_spi_proto() does a truly global search - across all states,
regardless of hash bucket and matches SPI and proto.
Signed-off-by: Aakash Kumar S <saakashkumar@marvell.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The skb transport header pointer needs to be adjusted by network header
pointer plus the size of the ipcomp header.
This shows up when running traffic over ipcomp using transport mode.
After being reinjected, packets are dropped because the header isn't
adjusted properly and some checks can be triggered. E.g the skb is
mistakenly considered as IP fragmented packet and later dropped.
kworker/30:1-mm 443 [030] 102.055250: skb:kfree_skb:skbaddr=0xffff8f104aa3ce00 rx_sk=(
ffffffff8419f1f4 sk_skb_reason_drop+0x94 ([kernel.kallsyms])
ffffffff8419f1f4 sk_skb_reason_drop+0x94 ([kernel.kallsyms])
ffffffff84281420 ip_defrag+0x4b0 ([kernel.kallsyms])
ffffffff8428006e ip_local_deliver+0x4e ([kernel.kallsyms])
ffffffff8432afb1 xfrm_trans_reinject+0xe1 ([kernel.kallsyms])
ffffffff83758230 process_one_work+0x190 ([kernel.kallsyms])
ffffffff83758f37 worker_thread+0x2d7 ([kernel.kallsyms])
ffffffff83761cc9 kthread+0xf9 ([kernel.kallsyms])
ffffffff836c3437 ret_from_fork+0x197 ([kernel.kallsyms])
ffffffff836718da ret_from_fork_asm+0x1a ([kernel.kallsyms])
Fixes: eb2953d269 ("xfrm: ipcomp: Use crypto_acomp interface")
Link: https://bugzilla.suse.com/1244532
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The referenced commit replaced a call to __xfrm4|6_udp_encap_rcv() with
a custom check for non-ESP markers. But what the called function also
did was setting the transport header to the ESP header. The function
that follows, esp4|6_gro_receive(), relies on that being set when it calls
xfrm_parse_spi(). We have to set the full offset as the skb's head was
not moved yet so adding just the UDP header length won't work.
Fixes: e3fd057776 ("xfrm: Fix UDP GRO handling for some corner cases")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Fix a typo:
lenghts -> length
The typo has been identified using codespell, and the tool currently
does not report any additional issues in comments.
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250629171226.4988-2-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are two bugs in rose_rt_device_down() that can cause
use-after-free:
1. The loop bound `t->count` is modified within the loop, which can
cause the loop to terminate early and miss some entries.
2. When removing an entry from the neighbour array, the subsequent entries
are moved up to fill the gap, but the loop index `i` is still
incremented, causing the next entry to be skipped.
For example, if a node has three neighbours (A, A, B) with count=3 and A
is being removed, the second A is not checked.
i=0: (A, A, B) -> (A, B) with count=2
^ checked
i=1: (A, B) -> (A, B) with count=2
^ checked (B, not A!)
i=2: (doesn't occur because i < count is false)
This leaves the second A in the array with count=2, but the rose_neigh
structure has been freed. Code that accesses these entries assumes that
the first `count` entries are valid pointers, causing a use-after-free
when it accesses the dangling pointer.
Fix both issues by iterating over the array in reverse order with a fixed
loop bound. This ensures that all entries are examined and that the removal
of an entry doesn't affect subsequent iterations.
Reported-by: syzbot+e04e2c007ba2c80476cb@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=e04e2c007ba2c80476cb
Tested-by: syzbot+e04e2c007ba2c80476cb@syzkaller.appspotmail.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kohei Enju <enjuk@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250629030833.6680-1-enjuk@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is possible via the ioctl API:
> ip -6 tunnel change ip6tnl0 mode any
Let's align the netlink API:
> ip link set ip6tnl0 type ip6tnl mode any
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://patch.msgid.link/20250630145602.1027220-1-nicolas.dichtel@6wind.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido spotted that I made a mistake in commit under Fixes,
ethnl_default_parse() may acquire a dev reference even when it returns
an error. This may have been driven by the code structure in dumps
(which unconditionally release dev before handling errors), but it's
too much of a trap. Functions should undo what they did before returning
an error, rather than expecting caller to clean up.
Rather than fixing ethnl_default_set_doit() directly make
ethnl_default_parse() clean up errors.
Reported-by: Ido Schimmel <idosch@idosch.org>
Link: https://lore.kernel.org/aGEPszpq9eojNF4Y@shredder
Fixes: 963781bdfe ("net: ethtool: call .parse_request for SET handlers")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250630154053.1074664-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Fix loop in GSS sequence number cache
* Clean up /proc/net/rpc/nfs if nfs_fs_proc_net_init() fails
* Fix a race to wake on NFS_LAYOUT_DRAIN
* Fix handling of NFS level errors in I/O
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmhkRaUACgkQ18tUv7Cl
QOskFxAAiyf6OIZ6M1mBfmbDDb+O5Gl6zofn0+OW2V9puJT/0U7pgNzepK9gFtO8
o3Bq0/GU3I2oxO7wEFrWFQXl3hkPvMqCN7ai1Vb2DjRGWhu97E0Mk3DWltSmuDFQ
IaofuURjJdhgvjLb03mI6ReQNxONbMU3qD0JgK4/WIfvm44574Fah6jTnod32G23
EHj8cBw+iIvGh8MmPb4g01XivMGM36bA08NP4qkU/wgeLnkJzFYb5XZf16v821T6
ZxwwruclX2fbpLtsQsHfJpOgW/TFRJTyjBcZw581H8fpkgh1PlJ96OFwrbOU7RCp
gVzDw3hvWoKFaMjVlkKk3wSWzwtMWLnB8a7TmgssuNU+DqmN3qMzkaRqrOxWSYMc
t7SycQ+PReaR2gQdlJNrN5/Q75OLpqplwPi6O5cqOMQXC2aMK+nhXVW9QiC1SPFI
ZcymKk4anzdgIgH+8TR3JpFVmPoEuuIeLV24+DQ0rlh7+4SI3TooTygfsl3/DErb
6Ic6nXgeSBWBPvuemnPbsq9DuAqGFbLrbdutVu4LUx/9XoGd8AfA9dVLMIb/0hgm
C3Lwt1xeata8dz1v2jHHS1Tzs8ZphXnUCU7gzcf4TDs3UQUGzKnnNfdfb1r2cvxU
LVz2guJ9xH4r3TsVNn2GQijbccxwPVFxszzPm0JobxiQYOna0Ss=
=F4+b
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.16-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
- Fix loop in GSS sequence number cache
- Clean up /proc/net/rpc/nfs if nfs_fs_proc_net_init() fails
- Fix a race to wake on NFS_LAYOUT_DRAIN
- Fix handling of NFS level errors in I/O
* tag 'nfs-for-6.16-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4/flexfiles: Fix handling of NFS level errors in I/O
NFSv4/pNFS: Fix a race to wake on NFS_LAYOUT_DRAIN
nfs: Clean up /proc/net/rpc/nfs when nfs_fs_proc_net_init() fails.
sunrpc: fix loop in gss seqno cache
_Static_assert(ARRAY_SIZE(map) != IEEE8021Q_TT_MAX - 1) rejects only a
length of 7 and allows any other mismatch. Replace it with a strict
equality test via a helper macro so that every mapping table must have
exactly IEEE8021Q_TT_MAX (8) entries.
Signed-off-by: RubenKelevra <rubenkelevra@gmail.com>
Link: https://patch.msgid.link/20250626205907.1566384-1-rubenkelevra@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
At the time of commit bc51dddf98 ("netns: avoid disabling irq
for netns id") peernet2id() was not yet using RCU.
Commit 2dce224f46 ("netns: protect netns
ID lookups with RCU") changed peernet2id() to no longer
acquire net->nsid_lock (potentially from BH context).
We do not need to block soft interrupts when acquiring
net->nsid_lock anymore.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Guillaume Nault <gnault@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250627163242.230866-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tl;dr
=====
Add a new neighbor flag ("extern_valid") that can be used to indicate to
the kernel that a neighbor entry was learned and determined to be valid
externally. The kernel will not try to remove or invalidate such an
entry, leaving these decisions to the user space control plane. This is
needed for EVPN multi-homing where a neighbor entry for a multi-homed
host needs to be synced across all the VTEPs among which the host is
multi-homed.
Background
==========
In a typical EVPN multi-homing setup each host is multi-homed using a
set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
switches (VTEPs). VTEPs that are connected to the same ES are called ES
peers.
When a neighbor entry is learned on a VTEP, it is distributed to both ES
peers and remote VTEPs using EVPN MAC/IP advertisement routes. ES peers
use the neighbor entry when routing traffic towards the multi-homed host
and remote VTEPs use it for ARP/NS suppression.
Motivation
==========
If the ES link between a host and the VTEP on which the neighbor entry
was locally learned goes down, the EVPN MAC/IP advertisement route will
be withdrawn and the neighbor entries will be removed from both ES peers
and remote VTEPs. Routing towards the multi-homed host and ARP/NS
suppression can fail until another ES peer locally learns the neighbor
entry and distributes it via an EVPN MAC/IP advertisement route.
"draft-rbickhart-evpn-ip-mac-proxy-adv-03" [1] suggests avoiding these
intermittent failures by having the ES peers install the neighbor
entries as before, but also injecting EVPN MAC/IP advertisement routes
with a proxy indication. When the previously mentioned ES link goes down
and the original EVPN MAC/IP advertisement route is withdrawn, the ES
peers will not withdraw their neighbor entries, but instead start aging
timers for the proxy indication.
If an ES peer locally learns the neighbor entry (i.e., it becomes
"reachable"), it will restart its aging timer for the entry and emit an
EVPN MAC/IP advertisement route without a proxy indication. An ES peer
will stop its aging timer for the proxy indication if it observes the
removal of the proxy indication from at least one of the ES peers
advertising the entry.
In the event that the aging timer for the proxy indication expired, an
ES peer will withdraw its EVPN MAC/IP advertisement route. If the timer
expired on all ES peers and they all withdrew their proxy
advertisements, the neighbor entry will be completely removed from the
EVPN fabric.
Implementation
==============
In the above scheme, when the control plane (e.g., FRR) advertises a
neighbor entry with a proxy indication, it expects the corresponding
entry in the data plane (i.e., the kernel) to remain valid and not be
removed due to garbage collection or loss of carrier. The control plane
also expects the kernel to notify it if the entry was learned locally
(i.e., became "reachable") so that it will remove the proxy indication
from the EVPN MAC/IP advertisement route. That is why these entries
cannot be programmed with dummy states such as "permanent" or "noarp".
Instead, add a new neighbor flag ("extern_valid") which indicates that
the entry was learned and determined to be valid externally and should
not be removed or invalidated by the kernel. The kernel can probe the
entry and notify user space when it becomes "reachable" (it is initially
installed as "stale"). However, if the kernel does not receive a
confirmation, have it return the entry to the "stale" state instead of
the "failed" state.
In other words, an entry marked with the "extern_valid" flag behaves
like any other dynamically learned entry other than the fact that the
kernel cannot remove or invalidate it.
One can argue that the "extern_valid" flag should not prevent garbage
collection and that instead a neighbor entry should be programmed with
both the "extern_valid" and "extern_learn" flags. There are two reasons
for not doing that:
1. Unclear why a control plane would like to program an entry that the
kernel cannot invalidate but can completely remove.
2. The "extern_learn" flag is used by FRR for neighbor entries learned
on remote VTEPs (for ARP/NS suppression) whereas here we are
concerned with local entries. This distinction is currently irrelevant
for the kernel, but might be relevant in the future.
Given that the flag only makes sense when the neighbor has a valid
state, reject attempts to add a neighbor with an invalid state and with
this flag set. For example:
# ip neigh add 192.0.2.1 nud none dev br0.10 extern_valid
Error: Cannot create externally validated neighbor with an invalid state.
# ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
# ip neigh replace 192.0.2.1 nud failed dev br0.10 extern_valid
Error: Cannot mark neighbor as externally validated with an invalid state.
The above means that a neighbor cannot be created with the
"extern_valid" flag and flags such as "use" or "managed" as they result
in a neighbor being created with an invalid state ("none") and
immediately getting probed:
# ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
Error: Cannot create externally validated neighbor with an invalid state.
However, these flags can be used together with "extern_valid" after the
neighbor was created with a valid state:
# ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
# ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
One consequence of preventing the kernel from invalidating a neighbor
entry is that by default it will only try to determine reachability
using unicast probes. This can be changed using the "mcast_resolicit"
sysctl:
# sysctl net.ipv4.neigh.br0/10.mcast_resolicit
0
# tcpdump -nn -e -i br0.10 -Q out arp &
# ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
# sysctl -wq net.ipv4.neigh.br0/10.mcast_resolicit=3
# ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28
iproute2 patches can be found here [2].
[1] https://datatracker.ietf.org/doc/html/draft-rbickhart-evpn-ip-mac-proxy-adv-03
[2] https://github.com/idosch/iproute2/tree/submit/extern_valid_v1
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20250626073111.244534-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We already call get_rxfh under the rss_lock when we read back
context state after changes. Let's be consistent and always
hold the lock. The existing callers are all under rtnl_lock
so this should make no difference in practice, but it makes
the locking rules far less confusing IMHO. Any RSS callback
and any access to the RSS XArray should hold the lock.
Link: https://patch.msgid.link/20250626202848.104457-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Netlink code will want to perform the RSS_SET operation atomically
under the rss_lock. sfc wants to hold the rss_lock in rxfh_fields_get,
which makes that difficult. Lets move the locking up to the core
so that for all driver-facing callbacks rss_lock is taken consistently
by the core.
Link: https://patch.msgid.link/20250626202848.104457-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Always take the rss_lock in ethtool_set_rxfh(). We will want to
make a similar change in ethtool_set_rxfh_fields() and some
drivers lock that callback regardless of rss context ID being set.
Having some callbacks locked unconditionally and some only if
context ID is set would be very confusing.
ethtool handling is under rtnl_lock, so rss_lock is very unlikely
to ever be congested.
Link: https://patch.msgid.link/20250626202848.104457-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We now reuse .parse_request() from GET on SET, so we need to make sure
that the policies for both cover the attributes used for .parse_request().
genetlink will only allocate space in info->attrs for ARRAY_SIZE(policy).
Reported-by: syzbot+430f9f76633641a62217@syzkaller.appspotmail.com
Fixes: 963781bdfe ("net: ethtool: call .parse_request for SET handlers")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250626233926.199801-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
S1G beacons contain fixed length optional fields that precede the
variable length elements, ensure we take this into account when
validating the beacon. This particular case was missed in
1e1f706fc2 ("wifi: cfg80211/mac80211: correctly parse S1G
beacon optional elements").
Fixes: 1d47f1198d ("nl80211: correctly validate S1G beacon head")
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250626115118.68660-1-lachlan.hodges@morsemicro.com
[shorten/reword subject]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the new coming segment covers more than one skbs in the ofo queue,
and which seq is equal to rcv_nxt, then the sequence range
that is duplicated will be sent as DUP SACK, the detail as below,
in step6, the {501,2001} range is clearly including too much
DUP SACK range, in violation of RFC 2883 rules.
1. client > server: Flags [.], seq 501:1001, ack 1325288529, win 20000, length 500
2. server > client: Flags [.], ack 1, [nop,nop,sack 1 {501:1001}], length 0
3. client > server: Flags [.], seq 1501:2001, ack 1325288529, win 20000, length 500
4. server > client: Flags [.], ack 1, [nop,nop,sack 2 {1501:2001} {501:1001}], length 0
5. client > server: Flags [.], seq 1:2001, ack 1325288529, win 20000, length 2000
6. server > client: Flags [.], ack 2001, [nop,nop,sack 1 {501:2001}], length 0
After this fix, the final ACK is as below:
6. server > client: Flags [.], ack 2001, options [nop,nop,sack 1 {501:1001}], length 0
[edumazet] added a new packetdrill test in the following patch.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: xin.guo <guoxin0309@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250626123420.1933835-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet_rtx_syn_ack() is a simple wrapper around tcp_rtx_synack(),
if we move req->num_retrans update.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250626153017.2156274-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now inet_rtx_syn_ack() is only used by TCP, it can directly
call tcp_rtx_synack() instead of using an indirect call
to req->rsk_ops->rtx_syn_ack().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250626153017.2156274-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, for controllers with extended advertising, the advertising
data is set in the asynchronous response handler for extended
adverstising params. As most advertising settings are performed in a
synchronous context, the (asynchronous) setting of the advertising data
is done too late (after enabling the advertising).
Move setting of adverstising data from asynchronous response handler
into synchronous context to fix ordering of HCI commands.
Signed-off-by: Christian Eggers <ceggers@arri.de>
Fixes: a0fb3726ba ("Bluetooth: Use Set ext adv/scan rsp data if controller supports")
Cc: stable@vger.kernel.org
v2: https://lore.kernel.org/linux-bluetooth/20250626115209.17839-1-ceggers@arri.de/
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The unconditional call of hci_disable_advertising_sync() in
mesh_send_done_sync() also disables other LE advertisings (non mesh
related).
I am not sure whether this call is required at all, but checking the
adv_instances list (like done at other places) seems to solve the
problem.
Fixes: b338d91703 ("Bluetooth: Implement support for Mesh")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
According to the message of commit b338d91703 ("Bluetooth: Implement
support for Mesh"), MGMT_OP_SET_MESH_RECEIVER should set the passive scan
parameters. Currently the scan interval and window parameters are
silently ignored, although user space (bluetooth-meshd) expects that
they can be used [1]
[1] https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/mesh/mesh-io-mgmt.c#n344
Fixes: b338d91703 ("Bluetooth: Implement support for Mesh")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This reverts minor parts of the changes made in commit b338d91703
("Bluetooth: Implement support for Mesh"). It looks like these changes
were only made for development purposes but shouldn't have been part of
the commit.
Fixes: b338d91703 ("Bluetooth: Implement support for Mesh")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
lwtunnel_build_state() has check validity of encap_type,
so no need to do this before call it.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250625022059.3958215-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCaF3LFhUcZGFuaWVsQGlv
Z2VhcmJveC5uZXQACgkQ2yufC7HISINtRgD+JagJmBokoPnsk7DfauJnVhaP95aV
tsnna+fU1kGwS7MBAMINCoLyeISiD/XG0O+Om38czhhglWbl4+TgrthegPkE
=opKf
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2025-06-27
We've added 6 non-merge commits during the last 8 day(s) which contain
a total of 6 files changed, 120 insertions(+), 20 deletions(-).
The main changes are:
1) Fix RCU usage in task_cls_state() for BPF programs using helpers like
bpf_get_cgroup_classid_curr() outside of networking, from Charalampos
Mitrodimas.
2) Fix a sockmap race between map_update and a pending workqueue from
an earlier map_delete freeing the old psock where both pointed to the
same psock->sk, from Jiayuan Chen.
3) Fix a data corruption issue when using bpf_msg_pop_data() in kTLS which
failed to recalculate the ciphertext length, also from Jiayuan Chen.
4) Remove xdp_redirect_map{,_err} trace events since they are unused and
also hide XDP trace events under CONFIG_BPF_SYSCALL, from Steven Rostedt.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
xdp: tracing: Hide some xdp events under CONFIG_BPF_SYSCALL
xdp: Remove unused events xdp_redirect_map and xdp_redirect_map_err
net, bpf: Fix RCU usage in task_cls_state() for BPF programs
selftests/bpf: Add test to cover ktls with bpf_msg_pop_data
bpf, ktls: Fix data corruption when using bpf_msg_pop_data() in ktls
bpf, sockmap: Fix psock incorrectly pointing to sk
====================
Link: https://patch.msgid.link/20250626230111.24772-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge BPF, perf and other fixes after downstream PRs.
It restores BPF CI to green after critical fix
commit bc4394e5e7 ("perf: Fix the throttle error of some clock events")
No conflicts.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The length in the pseudo header should be the length of the L3 payload
AKA the L4 header+payload. The selftest code builds the packet from
the lower layers up, so all the headers are pushed already when it
constructs L4. We need to subtract the lower layer headers from skb->len.
Fixes: 3e1e58d64c ("net: add generic selftest support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reported-by: Oleksij Rempel <o.rempel@pengutronix.de>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250624183258.3377740-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
array_index_nospec() clamp the rxq_idx within the range of
[0, dev->real_num_rx_queues), move the check before it.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250624140159.3929503-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
lwtunnel_fill_encap() has NULL check and return 0, so no need
to check before call it.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250624140015.3929241-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for RSS_SET handling in ethnl introduce Netlink
notifications for RSS. Only cover modifications, not creation
and not removal of a context, because the latter may deserve
a different notification type. We should cross that bridge
when we add the support for context add / remove via Netlink.
Link: https://patch.msgid.link/20250623231720.3124717-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Copy information parsed for SET with .req_parse to NTF handling
and therefore the GET-equivalent that it ends up executing.
This way if the SET was on a sub-object (like RSS context)
the notification will also be appropriately scoped.
Also copy the phy_index, Maxime suggests this will help PLCA
commands generate accurate notifications as well.
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250623231720.3124717-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ethtool_notify() takes a const void *data argument, which presumably
was intended to pass information from the call site to the subcommand
handler. This argument currently has no users.
Expecting the data to be subcommand-specific has two complications.
Complication #1 is that its not plumbed thru any of the standardized
callbacks. It gets propagated to ethnl_default_notify() where it
remains unused. Coming from the ethnl_default_set_doit() side we pass
in NULL, because how could we have a command specific attribute in
a generic handler.
Complication #2 is that we expect the ethtool_notify() callers to
know what attribute type to pass in. Again, the data pointer is
untyped.
RSS will need to pass the context ID to the notifications.
I think it's a better design if the "subcommand" exports its own
typed interface and constructs the appropriate argument struct
(which will be req_info). Remove the unused data argument from
ethtool_notify() but retain it in a new internal helper which
subcommands can use to build a typed interface.
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250623231720.3124717-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for using req_info to carry parameters between SET
and NTF - call .parse_request during ethnl_default_set_doit().
The main question here is whether .parse_request is intended to be
GET-specific. Originally the SET handling was delegated to each subcommand
directly - ethnl_default_set_doit() and .set callbacks in ethnl_request_ops
did not exist. Looking at existing users does not shed much light, all
of the following subcommands use .parse_request but have no SET handler
(and no NTF):
net/ethtool/eeprom.c
net/ethtool/rss.c
net/ethtool/stats.c
net/ethtool/strset.c
net/ethtool/tsinfo.c
There's only one which does have a SET:
net/ethtool/pause.c
where .parse_request handling is used to select which statistics to query.
Not relevant for SET but also harmless.
Going back to RSS (which doesn't have SET today) .parse_request parses
the rss_context ID. Using the req_info struct to pass the context ID
from SET to NTF will be very useful.
Switch to ethnl_default_parse(), effectively adding the .parse_request
for SET handlers.
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250623231720.3124717-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for using req_info to carry parameters between
SET and NTF allocate a full request info struct. Since the size
depends on the subcommand we need to allocate it on the heap.
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250623231720.3124717-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- iwlegacy: work around large stack with clang/kasan
- mac80211: fix integer overflow
- mac80211: fix link struct init vs. RCU publish
- iwlwifi: fix warning on IFF_UP
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmhb4zAACgkQ10qiO8sP
aAAJMA/+KBq3/uvXAv7Mxtg2YTvmzKMs/Zsk3Uh1yCijsm6K9/yMkcuvyRDs9n+M
yobTGwlNKsg+xZmwKM6vT2TwFN4cJhFsz9pL/QQ7M6k4ZARv6MkIgXxhJ5504trv
ufsea2iCN2xHQ7Y81SNFjtZwckMPhS1Cgs7OdjOkP8GJPDsTXrdTNSZZh69l3XX1
RG0Fp98RhPnONmzj+ewj3leVMwJlCyAdzqx1B3Hk1tJQojUtkmwg1F+bpAifYBLs
arORBtkL4cVYXcYkoBJ+WFLMztooAaqo4Oal8s4zl5/VubjCb30+mBFRQytD9ldD
gp2gqY6Y1BAiCZKBxjAwRvG743CBPEBD7tHUDtS4cbPPChjMKWXrakWkLGjTDv50
wYnb9EsPKt03L353vaQ/BHW2m88ebxIYiaSVUY+animRvadpFihzT2fF9P+0/eHi
zU2AFlmF0bS74OtjyOVXSSin0pTmBEHDIRfqBdw/szGhMqBdHDq+qPb2H2TJnfZy
eec1MS0td8MYN9SV9xPac+4EqUbOHDlQ4fVD3rCn74+wESQp7HUnSJOgobB6cOJg
N3tao+n/K8kYxdaCplqAeWMD3bFSYs9oz404K3wOeO85lHz+lB581j6as8zJ86cL
nJVLEflnDCQHnxFCEMHeV/1q/GjIU7Xv5+rwDpfMEY3hx0jH1d8=
=6koP
-----END PGP SIGNATURE-----
Merge tag 'wireless-2025-06-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Just a few fixes:
- iwlegacy: work around large stack with clang/kasan
- mac80211: fix integer overflow
- mac80211: fix link struct init vs. RCU publish
- iwlwifi: fix warning on IFF_UP
* tag 'wireless-2025-06-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: finish link init before RCU publish
wifi: iwlwifi: mvm: assume '1' as the default mac_config_cmd version
wifi: mac80211: fix beacon interval calculation overflow
wifi: iwlegacy: work around excessive stack usage on clang/kasan
====================
Link: https://patch.msgid.link/20250625115433.41381-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Allow an S1G station to use aggregation.
Signed-off-by: Sophronia Koilpillai <sophronia.koilpillai@morsemicro.com>
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250617080610.756048-5-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for updating the stations S1G capabilities when
an S1G association occurs.
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250617080610.756048-3-lachlan.hodges@morsemicro.com
[remove unused S1G_CAP3_MAX_MPDU_LEN_3895/_7791]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently there is no support for initialising a peers S1G capabilities,
this patch adds support for configuring an S1G station.
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250617080610.756048-2-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support to get the radio for which RTS threshold needs to be changed
from userspace. Pass on this radio index to underlying drivers as an
additional argument.
A value of -1 indicates radio index is not mentioned and that the
configuration applies to all radio(s) of the wiphy.
Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20250615082312.619639-5-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In case of multi-radio wiphys, with per-radio RTS threshold brought
into use, RTS threshold for each radio in a wiphy can be recorded in
wiphy parameter - wiphy_radio_cfg, as an array. Add a new attribute -
NL80211_WIPHY_RADIO_ATTR_RTS_THRESHOLD in nested parameter -
NL80211_ATTR_WIPHY_RADIOS. When a request for getting RTS threshold
for a particular radio is received, parse the radio id and get the
required data. Add this data to the newly added nested attribute
NL80211_WIPHY_RADIO_ATTR_RTS_THRESHOLD. Add support to report this
data to userspace.
Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20250615082312.619639-4-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, setting RTS threshold is based on per-phy basis, i.e., all the
radios present in a wiphy will take RTS threshold value to be the one sent
from userspace. But each radio in a multi-radio wiphy can have different
RTS threshold requirements.
To extend support to set RTS threshold for each radio, get the radio for
which RTS threshold needs to be changed from the user. Use the attribute
in NL - NL80211_ATTR_WIPHY_RADIO_INDEX, to identify the radio of interest.
Create a new structure - wiphy_radio_cfg and add rts_threshold in it as a
u32 value to store RTS threshold of each radio in a wiphy and allocate
memory for it during wiphy register based on the wiphy.n_radio updated by
drivers. Pass radio id received from the user to mac80211 drivers along
with its corresponding RTS threshold.
Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20250615082312.619639-3-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, per-radio attributes are set on per-phy basis, i.e., all the
radios present in a wiphy will take attributes values sent from user. But
each radio in a wiphy can get different values from userspace based on
its requirement.
To extend support to set per-radio attributes, add support to get radio
index from userspace. Add an NL attribute - NL80211_ATTR_WIPHY_RADIO_INDEX,
to get user specified radio index for which attributes should be changed.
Pass this to individual drivers, so that the drivers can use this radio
index to change per-radio attributes when necessary. Currently, per-radio
attributes identified are:
NL80211_ATTR_WIPHY_TX_POWER_LEVEL
NL80211_ATTR_WIPHY_ANTENNA_TX
NL80211_ATTR_WIPHY_ANTENNA_RX
NL80211_ATTR_WIPHY_RETRY_SHORT
NL80211_ATTR_WIPHY_RETRY_LONG
NL80211_ATTR_WIPHY_FRAG_THRESHOLD
NL80211_ATTR_WIPHY_RTS_THRESHOLD
NL80211_ATTR_WIPHY_COVERAGE_CLASS
NL80211_ATTR_TXQ_LIMIT
NL80211_ATTR_TXQ_MEMORY_LIMIT
NL80211_ATTR_TXQ_QUANTUM
By default, the radio index is set to -1. This means the attribute should
be treated as a global configuration. If the user has not specified any
index, then the radio index passed to individual drivers would be -1. This
would indicate that the attribute applies to all radios in that wiphy.
Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20250615082312.619639-2-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, link station statistics for MLO are filled by mac80211.
But there are some statistics that kept by mac80211 might not be
accurate, so let the driver pre-fill the link statistics. The driver
can fill the values (indicating which field is filled, by setting the
filled bitmapin in link_station structure).
Statistics that driver don't fill are filled by mac80211.
Hence, add link_sta_statistics callback to fill link station statistics
for MLO in sta_set_link_sinfo() by drivers.
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-11-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, RX stats packets are incremented for deflink member for
non-ML and multi-link(ML) station case. However, for ML station,
packets should be incremented based on the specific link.
Therefore, if a valid link_id is present, fetch the corresponding
link station information and increment the RX packets for that link.
For non-MLO stations, the deflink will still be used.
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-10-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, sinfo structure is supported to fill information at
deflink( or one of the links) level for station. This has problems
when applied to fetch multi-link(ML) station information.
Hence, if valid_links are present, support filling link_station
structure for each link.
This will be helpful to check the link related statistics during MLO.
Additionally, TXQ stats for pertid are applicable at station level
not at link level. Therefore check link_id is less then 0, before
filling TXQ stats in pertid stats.
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-9-quic_sarishar@quicinc.com
[fix some indentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, sinfo->filled is for set in sta_set_sinfo() after filling
the corresponding fields in station_info structure for station statistics.
For non-ML stations, the fields are correctly filled from sta->deflink
and corresponding sinfo->filled bit are set, but for MLO any one of
link's data is filled and corresponding sinfo->filled bit is set.
For MLO before embed NL message, fields of sinfo structure like
bytes, packets, signal are updated with accumulated, best, least of all
links data. But some of fields like rssi, pertid don't make much sense
at MLO level.
Hence, to prevent misinterpretation, clear sinfo->filled for fields
which don't make much sense at MLO level. This will prevent filling
misleading values in NL message.
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-8-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, if a link gets removed in between for a station then
directly accumulated data will fall down to sum of other active links.
This will bring inconsistency in station dump statistics.
For instance, let's take Tx packets
- at t=0-> link-0:2 link-1:3 Tx packets => accumulated = 5
- at t=1-> link-0:4 link-1:6 Tx packets => accumulated = 10
let say at t=2, link-0 went down => link-0:0 link-1:7 => accumulated = 7
Here, suddenly accumulated Tx packets will come down to 7 from 10.
This is showing inconsistency.
Therefore, store link-0 data when it went down and add to accumulated
Tx packet = 11.
Hence, store the removed link statistics data in sta structure and
add it in accumulated statistics for consistency.
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-7-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, station_info structure is passed to fill station statistics
from mac80211/drivers. After NL message send to user space for requested
station statistics, memory for station statistics is freed in cfg80211.
Therefore, memory allocation/free for link station statistics should
also happen in cfg80211 only.
Hence, allocate the memory for link_station structure for all
possible links and free in cfg80211_sinfo_release_content().
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-6-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently statistics are handled at link level for multi-link
operation(MLO). There is no provision to check accumulated statistics
for a multi-link(ML) station. Other statistics, such as signal, rates,
are also managed at the link level only.
Statistics such as packets, bytes, signal, rates, etc are useful to
provide overall overview for the ML stations.
Statistics such as packets, bytes are accumulated statistics at MLO level.
However, MLO statistics for rates and signal can not be accumulated since
it won't make much sense. Hence, handle other statistics such as signal,
rates, etc bit differently at MLO level.
The signal could be the best of all links-
e.g. if Link 1 has a signal strength of -70 dBm and Link 2 has -65 dBm,
the signal for MLO will be -65 dBm.
The rate could be determined based on the most recently updated link-
e.g. if link 1 has a rate of 300 Mbps and link 2 has a rate of 450 Mbps,
the MLO rate can be calculated based on the inactivity of each link.
If the inactive time for link 1 is 20 seconds and for link 2 is 10 seconds,
the MLO rate will be the most recently updated rate, which is link 2's
rate of 450 Mbps.
The inactive time, dtim_period and beacon_interval can be taken as the
least value of field from link level.
Similarly, other MLO level applicable fields are handled and the fields
which don't make much sense at MLO level, a subsequent change will handle
to embed NL message.
Hence, add accumulated and other statistics for MLO station if valid links
are present to represent comprehensive overview for the ML stations.
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-5-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, statistics is supported at deflink( or one of the links)
level for station. This has problems when applied to multi-link(ML)
connections.
Hence, add changes to support link level statistics to embed NL message
with link related information if valid links are present.
This will be helpful to check the link related statistics during MLO.
The statistics will be embedded into NL message as below:
For non-ML:
cmd->
NL80211_ATTR_IFINDEX
NL80211_ATTR_MAC
NL80211_ATTR_GENERATION
....etc
NL80211_ATTR_STA_INFO | nested
NL80211_STA_INFO_CONNECTED_TIME,
NL80211_STA_INFO_STA_FLAGS,
NL80211_STA_INFO_RX_BYTES,
NL80211_STA_INFO_TX_BYTES,
.........etc
For MLO:
cmd ->
NL80211_ATTR_IFINDEX
NL80211_ATTR_MAC
NL80211_ATTR_GENERATION
.......etc
NL80211_ATTR_STA_INFO | nested
NL80211_STA_INFO_CONNECTED_TIME,
NL80211_STA_INFO_STA_FLAGS,
........etc
NL80211_ATTR_MLO_LINK_ID,
NL80211_ATTR_MLD_ADDR,
NL80211_ATTR_MLO_LINKS | nested
link_id-1 | nested
NL80211_ATTR_MLO_LINK_ID,
NL80211_ATTR_MAC,
NL80211_ATTR_STA_INFO | nested
NL80211_STA_INFO_RX_BYTES,
NL80211_STA_INFO_TX_BYTES,
NL80211_STA_INFO_CONNECTED_TIME,
..........etc.
link_id-2 | nested
NL80211_ATTR_MLO_LINK_ID,
NL80211_ATTR_MAC,
NL80211_ATTR_STA_INFO | nested
NL80211_STA_INFO_RX_BYTES,
NL80211_STA_INFO_TX_BYTES,
NL80211_STA_INFO_CONNECTED_TIME,
.........etc
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-4-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, in supporting API's to fill sinfo structure from sta
structure, is mapped to fill the fields from sta->deflink. However,
for multi-link (ML) station, sinfo structure should be filled from
corresponding link_id.
Therefore, add link_id as an additional argument in supporting API's
for filling sinfo structure correctly. Link_id is set to -1 for non-ML
station and corresponding link_id for ML stations. In supporting API's
for filling sinfo structure, check for link_id, if link_id < 0, fill
the sinfo structure from sta->deflink, otherwise fill from
sta->link[link_id].
Current, changes are done at the deflink level i.e, pass -1 as link_id.
Actual link_id will be added in subsequent patches to support
station statistics for MLO.
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-2-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, when a non-DFS channel is brought up and the bandwidth is
expanded from 80 MHz to 160 MHz, where the primary 80 MHz is non-DFS
and the secondary 80 MHz consists of DFS channels, radar detection
fails if radar occurs in the secondary 80 MHz.
When the channel is switched from 80 MHz to 160 MHz, with the primary
80 MHz being non-DFS and the secondary 80 MHz consisting of DFS
channels, the radar required flag in the channel switch parameters
is set to true. However, when using a reserved channel context,
it is not updated in sdata, which disables radar detection in the
secondary 80 MHz DFS channels.
Update the radar required flag in sdata to fix this issue when using
a reserved channel context.
Signed-off-by: Ramya Gnanasekar <ramya.gnanasekar@oss.qualcomm.com>
Signed-off-by: Ramasamy Kaliappan <ramasamy.kaliappan@oss.qualcomm.com>
Link: https://patch.msgid.link/20250608140324.1687117-1-ramasamy.kaliappan@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Christian Brauner reported that even after MSG_OOB data is consumed,
calling close() on the receiver socket causes the peer's recv() to
return -ECONNRESET:
1. send() and recv() an OOB data.
>>> from socket import *
>>> s1, s2 = socketpair(AF_UNIX, SOCK_STREAM)
>>> s1.send(b'x', MSG_OOB)
1
>>> s2.recv(1, MSG_OOB)
b'x'
2. close() for s2 sets ECONNRESET to s1->sk_err even though
s2 consumed the OOB data
>>> s2.close()
>>> s1.recv(10, MSG_DONTWAIT)
...
ConnectionResetError: [Errno 104] Connection reset by peer
Even after being consumed, the skb holding the OOB 1-byte data stays in
the recv queue to mark the OOB boundary and break recv() at that point.
This must be considered while close()ing a socket.
Let's skip the leading consumed OOB skb while checking the -ECONNRESET
condition in unix_release_sock().
Fixes: 314001f0bf ("af_unix: Add OOB support")
Reported-by: Christian Brauner <brauner@kernel.org>
Closes: https://lore.kernel.org/netdev/20250529-sinkt-abfeuern-e7b08200c6b0@brauner/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://patch.msgid.link/20250619041457.1132791-4-kuni1840@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jann Horn reported a use-after-free in unix_stream_read_generic().
The following sequences reproduce the issue:
$ python3
from socket import *
s1, s2 = socketpair(AF_UNIX, SOCK_STREAM)
s1.send(b'x', MSG_OOB)
s2.recv(1, MSG_OOB) # leave a consumed OOB skb
s1.send(b'y', MSG_OOB)
s2.recv(1, MSG_OOB) # leave a consumed OOB skb
s1.send(b'z', MSG_OOB)
s2.recv(1) # recv 'z' illegally
s2.recv(1, MSG_OOB) # access 'z' skb (use-after-free)
Even though a user reads OOB data, the skb holding the data stays on
the recv queue to mark the OOB boundary and break the next recv().
After the last send() in the scenario above, the sk2's recv queue has
2 leading consumed OOB skbs and 1 real OOB skb.
Then, the following happens during the next recv() without MSG_OOB
1. unix_stream_read_generic() peeks the first consumed OOB skb
2. manage_oob() returns the next consumed OOB skb
3. unix_stream_read_generic() fetches the next not-yet-consumed OOB skb
4. unix_stream_read_generic() reads and frees the OOB skb
, and the last recv(MSG_OOB) triggers KASAN splat.
The 3. above occurs because of the SO_PEEK_OFF code, which does not
expect unix_skb_len(skb) to be 0, but this is true for such consumed
OOB skbs.
while (skip >= unix_skb_len(skb)) {
skip -= unix_skb_len(skb);
skb = skb_peek_next(skb, &sk->sk_receive_queue);
...
}
In addition to this use-after-free, there is another issue that
ioctl(SIOCATMARK) does not function properly with consecutive consumed
OOB skbs.
So, nothing good comes out of such a situation.
Instead of complicating manage_oob(), ioctl() handling, and the next
ECONNRESET fix by introducing a loop for consecutive consumed OOB skbs,
let's not leave such consecutive OOB unnecessarily.
Now, while receiving an OOB skb in unix_stream_recv_urg(), if its
previous skb is a consumed OOB skb, it is freed.
[0]:
BUG: KASAN: slab-use-after-free in unix_stream_read_actor (net/unix/af_unix.c:3027)
Read of size 4 at addr ffff888106ef2904 by task python3/315
CPU: 2 UID: 0 PID: 315 Comm: python3 Not tainted 6.16.0-rc1-00407-gec315832f6f9 #8 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-4.fc42 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:122)
print_report (mm/kasan/report.c:409 mm/kasan/report.c:521)
kasan_report (mm/kasan/report.c:636)
unix_stream_read_actor (net/unix/af_unix.c:3027)
unix_stream_read_generic (net/unix/af_unix.c:2708 net/unix/af_unix.c:2847)
unix_stream_recvmsg (net/unix/af_unix.c:3048)
sock_recvmsg (net/socket.c:1063 (discriminator 20) net/socket.c:1085 (discriminator 20))
__sys_recvfrom (net/socket.c:2278)
__x64_sys_recvfrom (net/socket.c:2291 (discriminator 1) net/socket.c:2287 (discriminator 1) net/socket.c:2287 (discriminator 1))
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f8911fcea06
Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
RSP: 002b:00007fffdb0dccb0 EFLAGS: 00000202 ORIG_RAX: 000000000000002d
RAX: ffffffffffffffda RBX: 00007fffdb0dcdc8 RCX: 00007f8911fcea06
RDX: 0000000000000001 RSI: 00007f8911a5e060 RDI: 0000000000000006
RBP: 00007fffdb0dccd0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000202 R12: 00007f89119a7d20
R13: ffffffffc4653600 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Allocated by task 315:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
__kasan_slab_alloc (mm/kasan/common.c:348)
kmem_cache_alloc_node_noprof (./include/linux/kasan.h:250 mm/slub.c:4148 mm/slub.c:4197 mm/slub.c:4249)
__alloc_skb (net/core/skbuff.c:660 (discriminator 4))
alloc_skb_with_frags (./include/linux/skbuff.h:1336 net/core/skbuff.c:6668)
sock_alloc_send_pskb (net/core/sock.c:2993)
unix_stream_sendmsg (./include/net/sock.h:1847 net/unix/af_unix.c:2256 net/unix/af_unix.c:2418)
__sys_sendto (net/socket.c:712 (discriminator 20) net/socket.c:727 (discriminator 20) net/socket.c:2226 (discriminator 20))
__x64_sys_sendto (net/socket.c:2233 (discriminator 1) net/socket.c:2229 (discriminator 1) net/socket.c:2229 (discriminator 1))
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
Freed by task 315:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
kasan_save_free_info (mm/kasan/generic.c:579 (discriminator 1))
__kasan_slab_free (mm/kasan/common.c:271)
kmem_cache_free (mm/slub.c:4643 (discriminator 3) mm/slub.c:4745 (discriminator 3))
unix_stream_read_generic (net/unix/af_unix.c:3010)
unix_stream_recvmsg (net/unix/af_unix.c:3048)
sock_recvmsg (net/socket.c:1063 (discriminator 20) net/socket.c:1085 (discriminator 20))
__sys_recvfrom (net/socket.c:2278)
__x64_sys_recvfrom (net/socket.c:2291 (discriminator 1) net/socket.c:2287 (discriminator 1) net/socket.c:2287 (discriminator 1))
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
The buggy address belongs to the object at ffff888106ef28c0
which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 68 bytes inside of
freed 224-byte region [ffff888106ef28c0, ffff888106ef29a0)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff888106ef3cc0 pfn:0x106ef2
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x200000000000040(head|node=0|zone=2)
page_type: f5(slab)
raw: 0200000000000040 ffff8881001d28c0 ffffea000422fe00 0000000000000004
raw: ffff888106ef3cc0 0000000080190010 00000000f5000000 0000000000000000
head: 0200000000000040 ffff8881001d28c0 ffffea000422fe00 0000000000000004
head: ffff888106ef3cc0 0000000080190010 00000000f5000000 0000000000000000
head: 0200000000000001 ffffea00041bbc81 00000000ffffffff 00000000ffffffff
head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888106ef2800: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
ffff888106ef2880: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
>ffff888106ef2900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888106ef2980: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
ffff888106ef2a00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 314001f0bf ("af_unix: Add OOB support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Link: https://patch.msgid.link/20250619041457.1132791-2-kuni1840@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As we are converting from TU to usecs, a beacon interval of
100*1024 usecs will lead to integer wrapping. To fix change
to use a u32.
Fixes: 057d5f4ba1 ("mac80211: sync dtim_count to TSF")
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250621123209.511796-1-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The bridge maintains a global list of ports behind which a multicast
router resides. The list is consulted during forwarding to ensure
multicast packets are forwarded to these ports even if the ports are not
member in the matching MDB entry.
When per-VLAN multicast snooping is enabled, the per-port multicast
context is disabled on each port and the port is removed from the global
router port list:
# ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1
# ip link add name dummy1 up master br1 type dummy
# ip link set dev dummy1 type bridge_slave mcast_router 2
$ bridge -d mdb show | grep router
router ports on br1: dummy1
# ip link set dev br1 type bridge mcast_vlan_snooping 1
$ bridge -d mdb show | grep router
However, the port can be re-added to the global list even when per-VLAN
multicast snooping is enabled:
# ip link set dev dummy1 type bridge_slave mcast_router 0
# ip link set dev dummy1 type bridge_slave mcast_router 2
$ bridge -d mdb show | grep router
router ports on br1: dummy1
Since commit 4b30ae9adb ("net: bridge: mcast: re-implement
br_multicast_{enable, disable}_port functions"), when per-VLAN multicast
snooping is enabled, multicast disablement on a port will disable the
per-{port, VLAN} multicast contexts and not the per-port one. As a
result, a port will remain in the global router port list even after it
is deleted. This will lead to a use-after-free [1] when the list is
traversed (when adding a new port to the list, for example):
# ip link del dev dummy1
# ip link add name dummy2 up master br1 type dummy
# ip link set dev dummy2 type bridge_slave mcast_router 2
Similarly, stale entries can also be found in the per-VLAN router port
list. When per-VLAN multicast snooping is disabled, the per-{port, VLAN}
contexts are disabled on each port and the port is removed from the
per-VLAN router port list:
# ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1
# ip link add name dummy1 up master br1 type dummy
# bridge vlan add vid 2 dev dummy1
# bridge vlan global set vid 2 dev br1 mcast_snooping 1
# bridge vlan set vid 2 dev dummy1 mcast_router 2
$ bridge vlan global show dev br1 vid 2 | grep router
router ports: dummy1
# ip link set dev br1 type bridge mcast_vlan_snooping 0
$ bridge vlan global show dev br1 vid 2 | grep router
However, the port can be re-added to the per-VLAN list even when
per-VLAN multicast snooping is disabled:
# bridge vlan set vid 2 dev dummy1 mcast_router 0
# bridge vlan set vid 2 dev dummy1 mcast_router 2
$ bridge vlan global show dev br1 vid 2 | grep router
router ports: dummy1
When the VLAN is deleted from the port, the per-{port, VLAN} multicast
context will not be disabled since multicast snooping is not enabled
on the VLAN. As a result, the port will remain in the per-VLAN router
port list even after it is no longer member in the VLAN. This will lead
to a use-after-free [2] when the list is traversed (when adding a new
port to the list, for example):
# ip link add name dummy2 up master br1 type dummy
# bridge vlan add vid 2 dev dummy2
# bridge vlan del vid 2 dev dummy1
# bridge vlan set vid 2 dev dummy2 mcast_router 2
Fix these issues by removing the port from the relevant (global or
per-VLAN) router port list in br_multicast_port_ctx_deinit(). The
function is invoked during port deletion with the per-port multicast
context and during VLAN deletion with the per-{port, VLAN} multicast
context.
Note that deleting the multicast router timer is not enough as it only
takes care of the temporary multicast router states (1 or 3) and not the
permanent one (2).
[1]
BUG: KASAN: slab-out-of-bounds in br_multicast_add_router.part.0+0x3f1/0x560
Write of size 8 at addr ffff888004a67328 by task ip/384
[...]
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xa0
print_address_description.constprop.0+0x6f/0x350
print_report+0x108/0x205
kasan_report+0xdf/0x110
br_multicast_add_router.part.0+0x3f1/0x560
br_multicast_set_port_router+0x74e/0xac0
br_setport+0xa55/0x1870
br_port_slave_changelink+0x95/0x120
__rtnl_newlink+0x5e8/0xa40
rtnl_newlink+0x627/0xb00
rtnetlink_rcv_msg+0x6fb/0xb70
netlink_rcv_skb+0x11f/0x350
netlink_unicast+0x426/0x710
netlink_sendmsg+0x75a/0xc20
__sock_sendmsg+0xc1/0x150
____sys_sendmsg+0x5aa/0x7b0
___sys_sendmsg+0xfc/0x180
__sys_sendmsg+0x124/0x1c0
do_syscall_64+0xbb/0x360
entry_SYSCALL_64_after_hwframe+0x4b/0x53
[2]
BUG: KASAN: slab-use-after-free in br_multicast_add_router.part.0+0x378/0x560
Read of size 8 at addr ffff888009f00840 by task bridge/391
[...]
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xa0
print_address_description.constprop.0+0x6f/0x350
print_report+0x108/0x205
kasan_report+0xdf/0x110
br_multicast_add_router.part.0+0x378/0x560
br_multicast_set_port_router+0x6f9/0xac0
br_vlan_process_options+0x8b6/0x1430
br_vlan_rtm_process_one+0x605/0xa30
br_vlan_rtm_process+0x396/0x4c0
rtnetlink_rcv_msg+0x2f7/0xb70
netlink_rcv_skb+0x11f/0x350
netlink_unicast+0x426/0x710
netlink_sendmsg+0x75a/0xc20
__sock_sendmsg+0xc1/0x150
____sys_sendmsg+0x5aa/0x7b0
___sys_sendmsg+0xfc/0x180
__sys_sendmsg+0x124/0x1c0
do_syscall_64+0xbb/0x360
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Fixes: 2796d846d7 ("net: bridge: vlan: convert mcast router global option to per-vlan entry")
Fixes: 4b30ae9adb ("net: bridge: mcast: re-implement br_multicast_{enable, disable}_port functions")
Reported-by: syzbot+7bfa4b72c6a5da128d32@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/684c18bd.a00a0220.279073.000b.GAE@google.com/T/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250619182228.1656906-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Followup of commit 285975dd67 ("net: annotate data-races around
sk->sk_{rcv|snd}timeo").
Remove lock_sock()/release_sock() from ksmbd_tcp_rcv_timeout()
and add READ_ONCE()/WRITE_ONCE() where it is needed.
Also SO_RCVTIMEO_OLD and SO_RCVTIMEO_NEW can call sock_set_timeout()
without holding the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250620155536.335520-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Followup of commit 285975dd67 ("net: annotate data-races around
sk->sk_{rcv|snd}timeo").
Remove lock_sock()/release_sock() from sock_set_sndtimeo(),
and add READ_ONCE()/WRITE_ONCE() where it is needed.
Also SO_SNDTIMEO_OLD and SO_SNDTIMEO_NEW can call sock_set_timeout()
without holding the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250620155536.335520-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Difference between sock_i_uid() and sk_uid() is that
after sock_orphan(), sock_i_uid() returns GLOBAL_ROOT_UID
while sk_uid() returns the last cached sk->sk_uid value.
None of sock_i_uid() callers care about this.
Use sk_uid() which is much faster and inlined.
Note that diag/dump users are calling sock_i_ino() and
can not see the full benefit yet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://patch.msgid.link/20250620133001.4090592-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk->sk_uid can be read while another thread changes its
value in sockfs_setattr().
Add sk_uid(const struct sock *sk) helper to factorize the needed
READ_ONCE() annotations, and add corresponding WRITE_ONCE()
where needed.
Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://patch.msgid.link/20250620133001.4090592-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I tried to fix the stack usage in this function a couple of years ago,
but there is still a problem with the latest gcc versions in some
configurations:
net/caif/cfctrl.c:553:1: error: the frame size of 1296 bytes is larger than 1280 bytes [-Werror=frame-larger-than=]
Reduce this once again, with a separate cfctrl_link_setup() function that
holds the bulk of all the local variables. It also turns out that the
param[] array that takes up a large portion of the stack is write-only
and can be left out here.
Fixes: ce6289661b ("caif: reduce stack size with KASAN")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20250620112244.3425554-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace the deprecated strncpy() with the two-argument version of
strscpy() as the destination is an array
and buffer should be NUL-terminated.
Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250620103653.6957-1-pranav.tyagi03@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace the deprecated strncpy() with two-argument version of
strscpy() as the destination is an array
and should be NUL-terminated.
Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250620102559.6365-1-pranav.tyagi03@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
commit f1fce08e63 ("netpoll: Eliminate redundant assignment") removed
the initialization of the UDP checksum, which was wrong and broke
netpoll IPv6 transmission due to bad checksumming.
udph->check needs to be set before calling csum_ipv6_magic().
Fixes: f1fce08e63 ("netpoll: Eliminate redundant assignment")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250620-netpoll_fix-v1-1-f9f0b82bc059@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There was a silly bug in the initial implementation where a loop
variable was not incremented. This commit increments the loop variable.
This bug is somewhat tricky to catch because it can only happen on loops
of two or more. If it is hit, it locks up a kernel thread in an infinite
loop.
Signed-off-by: Nikhil Jha <njha@janestreet.com>
Tested-by: Nikhil Jha <njha@janestreet.com>
Fixes: 08d6ee6d8a ("sunrpc: implement rfc2203 rpcsec_gss seqnum cache")
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
syzbot reported use-after-free in vhci_flush() without repro. [0]
From the splat, a thread close()d a vhci file descriptor while
its device was being used by iotcl() on another thread.
Once the last fd refcnt is released, vhci_release() calls
hci_unregister_dev(), hci_free_dev(), and kfree() for struct
vhci_data, which is set to hci_dev->dev->driver_data.
The problem is that there is no synchronisation after unlinking
hdev from hci_dev_list in hci_unregister_dev(). There might be
another thread still accessing the hdev which was fetched before
the unlink operation.
We can use SRCU for such synchronisation.
Let's run hci_dev_reset() under SRCU and wait for its completion
in hci_unregister_dev().
Another option would be to restore hci_dev->destruct(), which was
removed in commit 587ae086f6 ("Bluetooth: Remove unused
hci-destruct cb"). However, this would not be a good solution, as
we should not run hci_unregister_dev() while there are in-flight
ioctl() requests, which could lead to another data-race KCSAN splat.
Note that other drivers seem to have the same problem, for exmaple,
virtbt_remove().
[0]:
BUG: KASAN: slab-use-after-free in skb_queue_empty_lockless include/linux/skbuff.h:1891 [inline]
BUG: KASAN: slab-use-after-free in skb_queue_purge_reason+0x99/0x360 net/core/skbuff.c:3937
Read of size 8 at addr ffff88807cb8d858 by task syz.1.219/6718
CPU: 1 UID: 0 PID: 6718 Comm: syz.1.219 Not tainted 6.16.0-rc1-syzkaller-00196-g08207f42d3ff #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:408 [inline]
print_report+0xd2/0x2b0 mm/kasan/report.c:521
kasan_report+0x118/0x150 mm/kasan/report.c:634
skb_queue_empty_lockless include/linux/skbuff.h:1891 [inline]
skb_queue_purge_reason+0x99/0x360 net/core/skbuff.c:3937
skb_queue_purge include/linux/skbuff.h:3368 [inline]
vhci_flush+0x44/0x50 drivers/bluetooth/hci_vhci.c:69
hci_dev_do_reset net/bluetooth/hci_core.c:552 [inline]
hci_dev_reset+0x420/0x5c0 net/bluetooth/hci_core.c:592
sock_do_ioctl+0xd9/0x300 net/socket.c:1190
sock_ioctl+0x576/0x790 net/socket.c:1311
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:907 [inline]
__se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fcf5b98e929
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fcf5c7b9038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fcf5bbb6160 RCX: 00007fcf5b98e929
RDX: 0000000000000000 RSI: 00000000400448cb RDI: 0000000000000009
RBP: 00007fcf5ba10b39 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007fcf5bbb6160 R15: 00007ffd6353d528
</TASK>
Allocated by task 6535:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
__kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394
kasan_kmalloc include/linux/kasan.h:260 [inline]
__kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4359
kmalloc_noprof include/linux/slab.h:905 [inline]
kzalloc_noprof include/linux/slab.h:1039 [inline]
vhci_open+0x57/0x360 drivers/bluetooth/hci_vhci.c:635
misc_open+0x2bc/0x330 drivers/char/misc.c:161
chrdev_open+0x4c9/0x5e0 fs/char_dev.c:414
do_dentry_open+0xdf0/0x1970 fs/open.c:964
vfs_open+0x3b/0x340 fs/open.c:1094
do_open fs/namei.c:3887 [inline]
path_openat+0x2ee5/0x3830 fs/namei.c:4046
do_filp_open+0x1fa/0x410 fs/namei.c:4073
do_sys_openat2+0x121/0x1c0 fs/open.c:1437
do_sys_open fs/open.c:1452 [inline]
__do_sys_openat fs/open.c:1468 [inline]
__se_sys_openat fs/open.c:1463 [inline]
__x64_sys_openat+0x138/0x170 fs/open.c:1463
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 6535:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576
poison_slab_object mm/kasan/common.c:247 [inline]
__kasan_slab_free+0x62/0x70 mm/kasan/common.c:264
kasan_slab_free include/linux/kasan.h:233 [inline]
slab_free_hook mm/slub.c:2381 [inline]
slab_free mm/slub.c:4643 [inline]
kfree+0x18e/0x440 mm/slub.c:4842
vhci_release+0xbc/0xd0 drivers/bluetooth/hci_vhci.c:671
__fput+0x44c/0xa70 fs/file_table.c:465
task_work_run+0x1d1/0x260 kernel/task_work.c:227
exit_task_work include/linux/task_work.h:40 [inline]
do_exit+0x6ad/0x22e0 kernel/exit.c:955
do_group_exit+0x21c/0x2d0 kernel/exit.c:1104
__do_sys_exit_group kernel/exit.c:1115 [inline]
__se_sys_exit_group kernel/exit.c:1113 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1113
x64_sys_call+0x21ba/0x21c0 arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff88807cb8d800
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 88 bytes inside of
freed 1024-byte region [ffff88807cb8d800, ffff88807cb8dc00)
Fixes: bf18c7118c ("Bluetooth: vhci: Free driver_data on file release")
Reported-by: syzbot+2faa4825e556199361f9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f62d64848fc4c7c30cd6
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
We need the driver-core fixes that are in 6.16-rc3 into here as well
to build on top of.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Blamed commit missed that vcc_destroy_socket() calls
clip_push() with a NULL skb.
If clip_devs is NULL, clip_push() then crashes when reading
skb->truesize.
Fixes: 93a2014afb ("atm: fix a UAF in lec_arp_clear_vccs()")
Reported-by: syzbot+1316233c4c6803382a8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/68556f59.a00a0220.137b3.004e.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Gengming Liu <l.dmxcsnsbh@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Two fixes for commits in the nfsd-6.16 merge
- One fix for the recently-added NFSD netlink facility
- One fix for a remote SunRPC crasher
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmhWykQACgkQM2qzM29m
f5cfIA/+MtqpF8iHCNWBk2mmgQHtfS78qhQexEsQ69F1k90+bzCJ9Mu7AZT67xRZ
2E6Fm3gyJuFPaJ9YOGewyD6OV/eE0SioPQteZIIwUVclKcoyVtqJpc57A/aQfMJm
Te1bPWYrc5jfxoF/QzkOjzd4KWXOej2nIqsybFIDK5uTe7gn9QpPiFL8ePRuf7bJ
5oWBTZPpmuWpDcebQrz56OT0bvvrCNH9EV9TZNYwVa/B7yl3UNHwM2YB5O9m2Rl7
GUeWytc2Y7WgdDJe9poPwNrYyXFb8JX4pCbS54q5wCRLYLDgo8PsgsFCAFfgjx0l
ArATKDhzfYK0azASDtJma2eCJkrm4/nXRNFON7xucN79pYRh9GChcmhhfubb8Yo3
HopUaCSLOgy7PNsbbz3lNREn5uBpomeK1Mb88/UkJ/d8fmKGIO1ITTlarazYv2w5
GgmzZSBS43VdUmipqNyddIMg2uy7uaeAQZpr4jHVFVctPypjXCn/zIrhG2mmjRhW
M7PCJqtFj0a7DM48LA6r9K8N4Yi7EoSQKNAKfwduE++XXYWgBk9blaF/7om/fIRJ
VJm/RfiuFuyGaxCLd7WpPpASeS/nq410BMV5/kxol+UgiCxk8xjsvU7y8OWLOl7S
rUend5Yei7/h0NGoSmQPnBoYnb1YuTeMMzjfaRWSsV5gbvVDejQ=
=1C8u
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Two fixes for commits in the nfsd-6.16 merge
- One fix for the recently-added NFSD netlink facility
- One fix for a remote SunRPC crasher
* tag 'nfsd-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
sunrpc: handle SVC_GARBAGE during svc auth processing as auth error
nfsd: use threads array as-is in netlink interface
SUNRPC: Cleanup/fix initial rq_pages allocation
NFSD: Avoid corruption of a referring call list
The smcd_ops->supports_v2 is only called in smcd_register_dev(), which
calls function smcd_supports_v2 for ism. For loopback-ism, function
smc_lo_supports_v2 is unused, remove it.
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250619030854.1536676-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All drivers have been converted. Stop using the rxnfc fallbacks
for Rx Flow Hashing configuration.
Joe pointed out in earlier review that in ethtool_set_rxfh()
we need both .get_rxnfc and .get_rxfh_fields, because we need
both the ring count and flow hashing (because we call
ethtool_check_flow_types()). IOW the existing check added
for transitioning drivers was buggy.
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250618203823.1336156-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Correct spelling as flagged by codespell.
With these changes in place codespell only flags false positives
in net/rds.
Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20250619-rds-minor-v1-2-86d2ee3a98b9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Correct the endianness annotation of port assignments:
A host byte order value (RDS_TCP_PORT) is correctly converted to
network byte order (big endian) using htons. But it is then cast back to
host byte order before assigning to a variable that expects a big endian
value. Address this by dropping the cast.
This is not a bug because, while the endian annotation is changed by
this patch, the assigned value is unchanged.
Also correct the endianness of address assignment.
A host byte order value (INADDR_ANY) is incorrectly assigned as-is to
a variable that expects a big endian value. Address this by converting
the value to network byte order (big endian).
This is not a bug because INADDR_ANY is 0, which is isomorphic
with regards to endian conversions. IOW, while the endian annotation
is changed by this patch, the assigned value is unchanged.
Incorrect endian annotations appear to date back to IPv4-only code added
by commit 70041088e3 ("RDS: Add TCP transport to RDS").
Flagged by Sparse.
Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20250619-rds-minor-v1-1-86d2ee3a98b9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
OBEX download from iPhone is currently slow due to small packet size
used to transfer data which doesn't follow the MTU negotiated during
L2CAP connection, i.e. 672 bytes instead of 32767:
< ACL Data TX: Handle 11 flags 0x00 dlen 12
L2CAP: Connection Request (0x02) ident 18 len 4
PSM: 4103 (0x1007)
Source CID: 72
> ACL Data RX: Handle 11 flags 0x02 dlen 16
L2CAP: Connection Response (0x03) ident 18 len 8
Destination CID: 14608
Source CID: 72
Result: Connection successful (0x0000)
Status: No further information available (0x0000)
< ACL Data TX: Handle 11 flags 0x00 dlen 27
L2CAP: Configure Request (0x04) ident 20 len 19
Destination CID: 14608
Flags: 0x0000
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 32767
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 63
Max transmit: 3
Retransmission timeout: 2000
Monitor timeout: 12000
Maximum PDU size: 1009
> ACL Data RX: Handle 11 flags 0x02 dlen 26
L2CAP: Configure Request (0x04) ident 72 len 18
Destination CID: 72
Flags: 0x0000
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 32
Max transmit: 255
Retransmission timeout: 0
Monitor timeout: 0
Maximum PDU size: 65527
Option: Frame Check Sequence (0x05) [mandatory]
FCS: 16-bit FCS (0x01)
< ACL Data TX: Handle 11 flags 0x00 dlen 29
L2CAP: Configure Response (0x05) ident 72 len 21
Source CID: 14608
Flags: 0x0000
Result: Success (0x0000)
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 672
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 32
Max transmit: 255
Retransmission timeout: 2000
Monitor timeout: 12000
Maximum PDU size: 1009
> ACL Data RX: Handle 11 flags 0x02 dlen 32
L2CAP: Configure Response (0x05) ident 20 len 24
Source CID: 72
Flags: 0x0000
Result: Success (0x0000)
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 32767
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 63
Max transmit: 3
Retransmission timeout: 2000
Monitor timeout: 12000
Maximum PDU size: 1009
Option: Frame Check Sequence (0x05) [mandatory]
FCS: 16-bit FCS (0x01)
...
> ACL Data RX: Handle 11 flags 0x02 dlen 680
Channel: 72 len 676 ctrl 0x0202 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
I-frame: Unsegmented TxSeq 1 ReqSeq 2
< ACL Data TX: Handle 11 flags 0x00 dlen 13
Channel: 14608 len 9 ctrl 0x0204 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
I-frame: Unsegmented TxSeq 2 ReqSeq 2
> ACL Data RX: Handle 11 flags 0x02 dlen 680
Channel: 72 len 676 ctrl 0x0304 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
I-frame: Unsegmented TxSeq 2 ReqSeq 3
The MTUs are negotiated for each direction. In this traces 32767 for
iPhone->localhost and no MTU for localhost->iPhone, which based on
'4.4 L2CAP_CONFIGURATION_REQ' (Core specification v5.4, Vol. 3, Part
A):
The only parameters that should be included in the
L2CAP_CONFIGURATION_REQ packet are those that require different
values than the default or previously agreed values.
...
Any missing configuration parameters are assumed to have their
most recently explicitly or implicitly accepted values.
and '5.1 Maximum transmission unit (MTU)':
If the remote device sends a positive L2CAP_CONFIGURATION_RSP
packet it should include the actual MTU to be used on this channel
for traffic flowing into the local device.
...
The default value is 672 octets.
is set by BlueZ to 672 bytes.
It seems that the iPhone used the lowest negotiated value to transfer
data to the localhost instead of the negotiated one for the incoming
direction.
This could be fixed by using the MTU negotiated for the other
direction, if exists, in the L2CAP_CONFIGURATION_RSP.
This allows to use segmented packets as in the following traces:
< ACL Data TX: Handle 11 flags 0x00 dlen 12
L2CAP: Connection Request (0x02) ident 22 len 4
PSM: 4103 (0x1007)
Source CID: 72
< ACL Data TX: Handle 11 flags 0x00 dlen 27
L2CAP: Configure Request (0x04) ident 24 len 19
Destination CID: 2832
Flags: 0x0000
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 32767
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 63
Max transmit: 3
Retransmission timeout: 2000
Monitor timeout: 12000
Maximum PDU size: 1009
> ACL Data RX: Handle 11 flags 0x02 dlen 26
L2CAP: Configure Request (0x04) ident 15 len 18
Destination CID: 72
Flags: 0x0000
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 32
Max transmit: 255
Retransmission timeout: 0
Monitor timeout: 0
Maximum PDU size: 65527
Option: Frame Check Sequence (0x05) [mandatory]
FCS: 16-bit FCS (0x01)
< ACL Data TX: Handle 11 flags 0x00 dlen 29
L2CAP: Configure Response (0x05) ident 15 len 21
Source CID: 2832
Flags: 0x0000
Result: Success (0x0000)
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 32767
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 32
Max transmit: 255
Retransmission timeout: 2000
Monitor timeout: 12000
Maximum PDU size: 1009
> ACL Data RX: Handle 11 flags 0x02 dlen 32
L2CAP: Configure Response (0x05) ident 24 len 24
Source CID: 72
Flags: 0x0000
Result: Success (0x0000)
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 32767
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Enhanced Retransmission (0x03)
TX window size: 63
Max transmit: 3
Retransmission timeout: 2000
Monitor timeout: 12000
Maximum PDU size: 1009
Option: Frame Check Sequence (0x05) [mandatory]
FCS: 16-bit FCS (0x01)
...
> ACL Data RX: Handle 11 flags 0x02 dlen 1009
Channel: 72 len 1005 ctrl 0x4202 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
I-frame: Start (len 21884) TxSeq 1 ReqSeq 2
> ACL Data RX: Handle 11 flags 0x02 dlen 1009
Channel: 72 len 1005 ctrl 0xc204 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
I-frame: Continuation TxSeq 2 ReqSeq 2
This has been tested with kernel 5.4 and BlueZ 5.77.
Cc: stable@vger.kernel.org
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
In the case of SME-in-driver, the driver can internally choose to
update the links based on the AP MLD recommendation and do link
reconfiguration negotiation with AP MLD.
(e.g., After the driver processing the BSS Transition Management request
frame received from the AP MLD with Neighbor Report containing
Multi-Link element with recommended links information chooses to do link
reconfiguration negotiation with AP MLD).
To support this, extend cfg80211_mlo_reconf_add_done() and
NL80211_CMD_ASSOC_MLO_RECONF to indicate added links information for
driver-initiated link reconfiguration requests. For removed links,
the driver indicates links information using the
NL80211_CMD_LINKS_REMOVED event for driver-initiated cases, the same as
supplicant initiated cases.
For the driver-initiated case, cfg80211 will receive link
reconfiguration result asynchronously from driver so holding BSSes of
the accepted add links is needed in the event path. Also, no need of
unhold call for the rejected add link BSSes since there was no hold call
happened previously.
Once the supplicant receives the NL80211_CMD_ASSOC_MLO_RECONF event,
it needs to process the information about newly added links and install
per-link group keys (e.g., GTK/IGTK/BIGTK etc.).
In case of the SME-in-driver, using a vendor interface etc. to notify
the supplicant to initiate a link reconfiguration request and then
supplicant sending command to the cfg80211 can lead to race conditions.
The correct design to avoid this is that the driver indicates the
cfg80211 directly with the results of the link reconfiguration
negotiation.
Signed-off-by: Kavita Kavita <quic_kkavita@quicinc.com>
Link: https://patch.msgid.link/20250604105757.2542-3-quic_kkavita@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, in ieee80211_assign_beacon() mbssid count is updated as link's
bssid_indicator. mbssid count is the total number of MBSSID elements in
the beacon instead of Max BSSID indicator of the Multiple BSS set.
This will result in drivers obtaining an invalid bssid_indicator for BSSes
in a Multiple BSS set.
Fix this by updating link's bssid_indicator from MBSSID element for
Transmitting BSS and update the same for all of its Non-Transmitting BSSes.
Fixes: dde78aa520 ("mac80211: update bssid_indicator in ieee80211_assign_beacon")
Signed-off-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com>
Link: https://patch.msgid.link/20250530040940.3188537-1-rameshkumar.sundaram@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, in multi-radio wiphy cases, if one radio is operating on a DFS
channel, -EBUSY is returned even when a scan is requested on a different
radio. Because of this, an MLD AP with one radio (link) on a DFS channel
and Automatic Channel Selection (ACS) on another radio (link) cannot be
brought up.
In multi-radio wiphy cases, multiple radios are grouped under a single
wiphy. Hence, if a radio is operating on a DFS channel and a scan is
requested on a different radio of the same wiphy, the scan can be allowed
simultaneously without impacting the DFS operations.
Add logic to check the underlying radio used for the requested scan. If the
radio on which DFS is already running is not being used, allow the scan
operation; otherwise, return -EBUSY.
Signed-off-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Link: https://patch.msgid.link/20250527-mlo-dfs-acs-v2-3-92c2f37c81d9@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, in multi-radio wiphy cases, if a scan is ongoing on one radio,
-EBUSY is returned when DFS or a channel switch is initiated on another
radio. Because of this, an MLD AP with one radio (link) in an ongoing scan
cannot initiate DFS or a channel switch on another radio (link).
In multi-radio wiphy cases, multiple radios are grouped under a single
wiphy. Hence, if a scan is ongoing on one underlying radio and DFS or a
channel switch is requested on a different underlying radio of the same
wiphy, these operations can be allowed simultaneously.
Add logic to check the underlying radio used for the ongoing scan. If the
radio on which DFS or a channel switch is requested is not being used for
the scan, allow the operation; otherwise, return -EBUSY.
Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>
Co-developed-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Signed-off-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Link: https://patch.msgid.link/20250527-mlo-dfs-acs-v2-2-92c2f37c81d9@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add utility API cfg80211_get_radio_idx_by_chan() to retrieve the radio
index corresponding to a given channel in a multi-radio wiphy.
This utility function can be used when we want to check the radio-specific
data for a channel in a multi-radio wiphy. For example, it can help
determine the radio index required to handle a scan request. This index
can then be used to decide whether the scan can proceed without
interfering with ongoing DFS operations on another radio.
Signed-off-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
Co-developed-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Signed-off-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Link: https://patch.msgid.link/20250527-mlo-dfs-acs-v2-1-92c2f37c81d9@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The dev_hold() on skb->dev during packet reception was originally
added to prevent the device from being released prematurely during
asynchronous decryption operations.
As current hardware can offload decryption, this asynchronous path is
not always utilized. This often results in a pattern of dev_hold()
immediately followed by dev_put() for each packet, creating
unnecessary reference counting overhead detrimental to performance.
This patch optimizes this by skipping the dev_hold() and subsequent
dev_put() when asynchronous decryption is not being performed.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Now that we have dentries and the ability to create meaningful symlinks
to them, don't keep a name string in each tracker. Switch the output
format to print "class@address", and drop the name field.
Also, add a kerneldoc header for ref_tracker_dir_init().
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-9-24fc37ead144@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After assigning the inode number to the namespace, use it to create a
unique name for each netns refcount tracker with the ns.inum and
net_cookie values in it, and register a symlink to the debugfs file for
it.
init_net is registered before the ref_tracker dir is created, so add a
late_initcall() to register its files and symlinks.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-8-24fc37ead144@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A later patch in the series will be adding debugfs files for each
ref_tracker that get created in ref_tracker_dir_init(). The format will
be "class@%px". The current "name" string can vary between
ref_tracker_dir objects of the same type, so it's not suitable for this
purpose.
Add a new "class" string to the ref_tracker dir that describes the
the type of object (sans any individual info for that object).
Also, in the i915 driver, gate the creation of debugfs files on whether
the dentry pointer is still set to NULL. CI has shown that the
ref_tracker_dir can be initialized more than once.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-4-24fc37ead144@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extract the IPv6 address retrieval logic from netpoll_setup() into
a dedicated helper function netpoll_take_ipv6() to improve code
organization and readability.
The function handles obtaining the local IPv6 address from the
network device, including proper address type matching between
local and remote addresses (link-local vs global), and includes
appropriate error handling when IPv6 is not supported or no
suitable address is available.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250618-netpoll_ip_ref-v1-3-c2ac00fe558f@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the IPv4 address retrieval logic from netpoll_setup() into a
separate netpoll_take_ipv4() function to improve code organization
and readability. This change consolidates the IPv4-specific logic
and error handling into a dedicated function while maintaining
the same functionality.
No functional changes.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250618-netpoll_ip_ref-v1-2-c2ac00fe558f@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extract the carrier waiting logic into a dedicated helper function
netpoll_wait_carrier() to improve code readability and reduce
duplication in netpoll_setup().
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250618-netpoll_ip_ref-v1-1-c2ac00fe558f@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The input parameter "compressed_bufsize" of smc_buf_get_slot is unused,
remove it.
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250618103342.1423913-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_time_to_recover() does not need the @flag argument.
Its first parameter can be marked const, and of tcp_sock type.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250618091246.1260322-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As discussesd before in [0] proxy entries (which are more configuration
than runtime data) should stay when the link (carrier) goes does down.
This is what happens for regular neighbour entries.
So lets fix this by:
- storing in proxy entries the fact that it was added as NUD_PERMANENT
- not removing NUD_PERMANENT proxy entries when the carrier goes down
(same as how it's done in neigh_flush_dev() for regular neigh entries)
[0]: https://lore.kernel.org/netdev/c584ef7e-6897-01f3-5b80-12b53f7b4bf4@kernel.org/
Signed-off-by: Nicolas Escande <nico.escande@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250617141334.3724863-1-nico.escande@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- ath12k
- avoid busy-waiting
- activate correct number of links
- iwlwifi
- iwldvm regression (lots of warnings)
- iwlmld merge damage regression (crash)
- fix build with some old gcc versions
- carl9170: don't talk to device w/o FW [syzbot]
- ath6kl: remove bad FW WARN [syzbot]
- ieee80211: use variable-length arrays [syzbot]
- mac80211
- remove WARN on delayed beacon update [syzbot]
- drop OCB frames with invalid source [syzbot]
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmhSg10ACgkQ10qiO8sP
aAD10w//fdel2PXhAT6weaNRXmpUStR/TWCcKeUy6pko1BanxVbJNK9DPsCmg9Ym
ApDewdsmTF600VAW+/JpoHCckzIGhx8Q/i5DJ0Xi/sbLweFMeDcmAsHkqVu11wTn
gatTXlwP0/iXCL/EEtuS8fS1vs7HF29Q4QVNHPtOEaoNSm7LpE20Ukve4fBB8eXr
oG3l7bNxW4STIFJL1+EWIXXv+tuyfORsIoD65N+NephJvTYvdKocWjjjDgnsLrPR
HOW7Yli9G7xebXvDhdUVzdDjG0KEcarEHbyIosiPCv/xSnFZlrq4pg2zz8psHG4H
MxeHlDgKR2gSdAeX207vB2yZ884yrRtA2XuQ7t8JrVATvXI6LE9puB+P0ssqLsvz
jO9LjU9cfg8EoaBnK1w6vjL+OH3z2D1D8CvrkRjAUxZUKcBfk/NIQu4zsei5GuL5
fGDSowLiMUMMyecvKbLggxMEwp3qTs9zTb4w5Vn+QtbVFjj8+HgEb4AHrQM9tr/y
GavBZvCnKqEnv/VgkTh2Z9KfVznFfXKi3N0XR6gotVDueMeG6fkXvwFTSg6dW0UX
c1vHJTZ0H0AHelWoSz8/1FO6NXmqU76hnW7EwcYkwkGJ3/tji1Bzyt6ihknIttKl
AVYno+nmAC1WJNzKxJDFE+HX/F1RwowKlek/nLUMU1Ti4kzgRWo=
=Q6JR
-----END PGP SIGNATURE-----
Merge tag 'wireless-2025-06-18' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
More fixes:
- ath12k
- avoid busy-waiting
- activate correct number of links
- iwlwifi
- iwldvm regression (lots of warnings)
- iwlmld merge damage regression (crash)
- fix build with some old gcc versions
- carl9170: don't talk to device w/o FW [syzbot]
- ath6kl: remove bad FW WARN [syzbot]
- ieee80211: use variable-length arrays [syzbot]
- mac80211
- remove WARN on delayed beacon update [syzbot]
- drop OCB frames with invalid source [syzbot]
* tag 'wireless-2025-06-18' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: iwlwifi: Fix incorrect logic on cmd_ver range checking
wifi: iwlwifi: dvm: restore n_no_reclaim_cmds setting
wifi: iwlwifi: cfg: Limit cb_size to valid range
wifi: iwlwifi: restore missing initialization of async_handlers_list (again)
wifi: ath6kl: remove WARN on bad firmware input
wifi: carl9170: do not ping device which has failed to load firmware
wifi: ath12k: don't wait when there is no vdev started
wifi: ath12k: don't use static variables in ath12k_wmi_fw_stats_process()
wifi: ath12k: avoid burning CPU while waiting for firmware stats
wifi: ath12k: fix documentation on firmware stats
wifi: ath12k: don't activate more links than firmware supports
wifi: ath12k: update link active in case two links fall on the same MAC
wifi: ath12k: support WMI_MLO_LINK_SET_ACTIVE_CMDID command
wifi: ath12k: update freq range for each hardware mode
wifi: ath12k: parse and save sbs_lower_band_end_freq from WMI_SERVICE_READY_EXT2_EVENTID event
wifi: ath12k: parse and save hardware mode info from WMI_SERVICE_READY_EXT_EVENTID event for later use
wifi: ath12k: Avoid CPU busy-wait by handling VDEV_STAT and BCN_STAT
wifi: mac80211: don't WARN for late channel/color switch
wifi: mac80211: drop invalid source address OCB frames
wifi: remove zero-length arrays
====================
Link: https://patch.msgid.link/20250618210642.35805-6-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
/proc/net/atm/lec must ensure safety against dev_lec[] changes.
It appears it had dev_put() calls without prior dev_hold(),
leading to imbalance and UAF.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Francois Romieu <romieu@fr.zoreil.com> # Minor atm contributor
Link: https://patch.msgid.link/20250618140844.1686882-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Setting tty->disc_data before opening the NCI device means we need to
clean it up on error paths. This also opens some short window if device
starts sending data, even before NCIUARTSETDRIVER IOCTL succeeded
(broken hardware?). Close the window by exposing tty->disc_data only on
the success path, when opening of the NCI device and try_module_get()
succeeds.
The code differs in error path in one aspect: tty->disc_data won't be
ever assigned thus NULL-ified. This however should not be relevant
difference, because of "tty->disc_data=NULL" in nci_uart_tty_open().
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Fixes: 9961127d4b ("NFC: nci: add generic uart support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20250618073649.25049-2-krzysztof.kozlowski@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tianshuo han reported a remotely-triggerable crash if the client sends a
kernel RPC server a specially crafted packet. If decoding the RPC reply
fails in such a way that SVC_GARBAGE is returned without setting the
rq_accept_statp pointer, then that pointer can be dereferenced and a
value stored there.
If it's the first time the thread has processed an RPC, then that
pointer will be set to NULL and the kernel will crash. In other cases,
it could create a memory scribble.
The server sunrpc code treats a SVC_GARBAGE return from svc_authenticate
or pg_authenticate as if it should send a GARBAGE_ARGS reply. RFC 5531
says that if authentication fails that the RPC should be rejected
instead with a status of AUTH_ERR.
Handle a SVC_GARBAGE return as an AUTH_ERROR, with a reason of
AUTH_BADCRED instead of returning GARBAGE_ARGS in that case. This
sidesteps the whole problem of touching the rpc_accept_statp pointer in
this situation and avoids the crash.
Cc: stable@kernel.org
Fixes: 29cd2927fb ("SUNRPC: Fix encoding of accepted but unsuccessful RPC replies")
Reported-by: tianshuo han <hantianshuo233@gmail.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that we stash persistent information in struct pid there's no need
to play volatile games with pinning struct pid via dentries in pidfs.
Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-8-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Since commit 3e6a0243ff ("gre: Fix again IPv6 link-local address
generation."), addrconf_gre_config() has stopped handling IP6GRE
devices specially and just calls the regular addrconf_addr_gen()
function to create their link-local IPv6 addresses.
We can thus avoid using addrconf_gre_config() for IP6GRE devices and
use the normal IPv6 initialisation path instead (that is, jump directly
to addrconf_dev_config() in addrconf_init_auto_addrs()).
See commit 3e6a0243ff ("gre: Fix again IPv6 link-local address
generation.") for a deeper explanation on how and why GRE devices
started handling their IPv6 link-local address generation specially,
why it was a problem, and why this is not even necessary in most cases
(especially for GRE over IPv6).
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/a9144be9c7ec3cf09f25becae5e8fdf141fde9f6.1750075076.git.gnault@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This patch expands the status information provided by ethtool for PSE c33
with current port priority and max port priority. It also adds a call to
pse_ethtool_set_prio() to configure the PSE port priority.
Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-8-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Report the index of the newly introduced PSE power domain to the user,
enabling improved management of the power budget for PSE devices.
Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-5-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for devm_pse_irq_helper() to register PSE interrupts and report
events such as over-current or over-temperature conditions. This follows a
similar approach to the regulator API but also sends notifications using a
dedicated PSE ethtool netlink socket.
Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-2-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new device generic parameter to enable/disable the
PHC (PTP Hardware Clock) functionality in the device associated
with the devlink instance.
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250617110545.5659-6-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The existing netdev_ops_assert_locked() already asserts that either
the RTNL lock or the per-device lock is held, making the explicit
ASSERT_RTNL() redundant.
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-5-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Drivers that are using ops lock and don't depend on RTNL lock
still need to manage it because udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).
Cover more places in v4:
- netlink
- udp_tunnel_notify_add_rx_port (ndo_open)
- triggers udp_tunnel_nic_device_sync_work
- udp_tunnel_notify_del_rx_port (ndo_stop)
- triggers udp_tunnel_nic_device_sync_work
- udp_tunnel_get_rx_info (__netdev_update_features)
- triggers NETDEV_UDP_TUNNEL_PUSH_INFO
- udp_tunnel_drop_rx_info (__netdev_update_features)
- triggers NETDEV_UDP_TUNNEL_DROP_INFO
- udp_tunnel_nic_reset_ntf (ndo_open)
- notifiers
- udp_tunnel_nic_netdevice_event, depending on the event:
- triggers NETDEV_UDP_TUNNEL_PUSH_INFO
- triggers NETDEV_UDP_TUNNEL_DROP_INFO
- ethnl_tunnel_info_reply_size
- udp_tunnel_nic_set_port_priv (two intel drivers)
Cc: Michael Chan <michael.chan@broadcom.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-4-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a bug with passive TFO sockets returning an invalid NAPI ID 0
from SO_INCOMING_NAPI_ID. Normally this is not an issue, but zero copy
receive relies on a correct NAPI ID to process sockets on the right
queue.
Fix by adding a sk_mark_napi_id_set().
Fixes: e5907459ce ("tcp: Record Rx hash and NAPI ID in tcp_child_process")
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250617212102.175711-5-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The reproduction steps:
1. create a tun interface
2. enable l2 bearer
3. TIPC_NL_UDP_GET_REMOTEIP with media name set to tun
tipc: Started in network mode
tipc: Node identity 8af312d38a21, cluster identity 4711
tipc: Enabled bearer <eth:syz_tun>, priority 1
Oops: general protection fault
KASAN: null-ptr-deref in range
CPU: 1 UID: 1000 PID: 559 Comm: poc Not tainted 6.16.0-rc1+ #117 PREEMPT
Hardware name: QEMU Ubuntu 24.04 PC
RIP: 0010:tipc_udp_nl_dump_remoteip+0x4a4/0x8f0
the ub was in fact a struct dev.
when bid != 0 && skip_cnt != 0, bearer_list[bid] may be NULL or
other media when other thread changes it.
fix this by checking media_id.
Fixes: 832629ca5c ("tipc: add UDP remoteip dump to netlink API")
Signed-off-by: Haixia Qu <hxqu@hillstonenet.com>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20250617055624.2680-1-hxqu@hillstonenet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The implementation of nla_data is as follows:
static inline void *nla_data(const struct nlattr *nla)
{
return (char *) nla + NLA_HDRLEN;
}
Excluding the case where nla is exactly -NLA_HDRLEN, it will not return
NULL. And it seems misleading to assume that it can, other than in this
corner case. So drop checks for this condition.
Flagged by Smatch.
Compile tested only.
Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250617-nfc-null-data-v1-1-c7525ead2e95@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After the following commit from 2024:
commit e37ab73736 ("tcp: fix to allow timestamp undo if no retransmits were sent")
...there was buggy behavior where TCP connections without SACK support
could easily see erroneous undo events at the end of fast recovery or
RTO recovery episodes. The erroneous undo events could cause those
connections to suffer repeated loss recovery episodes and high
retransmit rates.
The problem was an interaction between the non-SACK behavior on these
connections and the undo logic. The problem is that, for non-SACK
connections at the end of a loss recovery episode, if snd_una ==
high_seq, then tcp_is_non_sack_preventing_reopen() holds steady in
CA_Recovery or CA_Loss, but clears tp->retrans_stamp to 0. Then upon
the next ACK the "tcp: fix to allow timestamp undo if no retransmits
were sent" logic saw the tp->retrans_stamp at 0 and erroneously
concluded that no data was retransmitted, and erroneously performed an
undo of the cwnd reduction, restoring cwnd immediately to the value it
had before loss recovery. This caused an immediate burst of traffic
and build-up of queues and likely another immediate loss recovery
episode.
This commit fixes tcp_packet_delayed() to ignore zero retrans_stamp
values for non-SACK connections when snd_una is at or above high_seq,
because tcp_is_non_sack_preventing_reopen() clears retrans_stamp in
this case, so it's not a valid signal that we can undo.
Note that the commit named in the Fixes footer restored long-present
behavior from roughly 2005-2019, so apparently this bug was present
for a while during that era, and this was simply not caught.
Fixes: e37ab73736 ("tcp: fix to allow timestamp undo if no retransmits were sent")
Reported-by: Eric Wheeler <netdev@lists.ewheeler.net>
Closes: https://lore.kernel.org/netdev/64ea9333-e7f9-0df-b0f2-8d566143acab@ewheeler.net/
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In vcc_sendmsg(), we account skb->truesize to sk->sk_wmem_alloc by
atm_account_tx().
It is expected to be reverted by atm_pop_raw() later called by
vcc->dev->ops->send(vcc, skb).
However, vcc_sendmsg() misses the same revert when copy_from_iter_full()
fails, and then we will leak a socket.
Let's factorise the revert part as atm_return_tx() and call it in
the failure path.
Note that the corresponding sk_wmem_alloc operation can be found in
alloc_tx() as of the blamed commit.
$ git blame -L:alloc_tx net/atm/common.c c55fa3cccbc2c~
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Simon Horman <horms@kernel.org>
Closes: https://lore.kernel.org/netdev/20250614161959.GR414686@horms.kernel.org/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250616182147.963333-3-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.
This patch converts TCP Small Queues implementation from tasklet to BH
workqueue.
Semantically, this is an equivalent conversion and there shouldn't be any
user-visible behavior changes. While workqueue's queueing and execution
paths are a bit heavier than tasklet's, unless the work item is being queued
every packet, the difference hopefully shouldn't matter.
My experience with the networking stack is very limited and this patch
definitely needs attention from someone who actually understands networking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/aFBeJ38AS1ZF3Dq5@slm.duckdns.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code today. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. Thus MC
routing configuration needs to be kept in sync with the VXLAN FDB and MDB.
Ideally, the VXLAN packets would be routed by the MC routing code instead.
To that end, this patch adds support to route locally generated multicast
packets. The newly-added routines do largely what ip6_mr_input() and
ip6_mr_forward() do: make an MR cache lookup to find where to send the
packets, and use ip6_output() to send each of them. When no cache entry is
found, the packet is punted to the daemon for resolution.
Similarly to the IPv4 case in a previous patch, the new logic is contingent
on a newly-added IP6CB flag being set.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/3bcc034a3ab4d3c291072fff38f78d7fbbeef4e6.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some of the work of ip6mr_forward2() is specific to IPMR forwarding, and
should not take place on the output path. In order to allow reuse of the
common parts, extract out of the function a helper,
ip6mr_prepare_forward().
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/8932bd5c0fbe3f662b158803b8509604fa7bc113.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The netfilter hook is invoked with skb->dev for input netdevice, and
vif_dev for output netdevice. However at the point of invocation, skb->dev
is already set to vif_dev, and MR-forwarded packets are reported with
in=out:
# ip6tables -A FORWARD -j LOG --log-prefix '[forw]'
# cd tools/testing/selftests/net/forwarding
# ./router_multicast.sh
# dmesg | fgrep '[forw]'
[ 1670.248245] [forw]IN=v5 OUT=v5 [...]
For reference, IPv4 MR code shows in and out as appropriate.
Fix by caching skb->dev and using the updated value for output netdev.
Fixes: 7bc570c8b4 ("[IPV6] MROUTE: Support multicast forwarding.")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/3141ae8386fbe13fef4b793faa75e6bae58d798a.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ip6tunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IP6CB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel6_xmit_skb() as well.
In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IPv6 multicast routing.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/acb4f9f3e40c3a931236c3af08a720b017fbfbfb.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The function always returns zero, thus the return value does not carry any
signal. Just make it void.
Most callers already ignore the return value. However:
- Refold arguments of the call from sctp_v6_xmit() so that they fit into
the 80-column limit.
- tipc_udp_xmit() initializes err from the return value, but that should
already be always zero at that point. So there's no practical change, but
elision of the assignment prompts a couple more tweaks to clean up the
function.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/7facacf9d8ca3ca9391a4aee88160913671b868d.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code today. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. Thus MC
routing configuration needs to be kept in sync with the VXLAN FDB and MDB.
Ideally, the VXLAN packets would be routed by the MC routing code instead.
To that end, this patch adds support to route locally generated multicast
packets. The newly-added routines do largely what ip_mr_input() and
ip_mr_forward() do: make an MR cache lookup to find where to send the
packets, and use ip_mc_output() to send each of them. When no cache entry
is found, the packet is punted to the daemon for resolution.
However, an installation that uses a VXLAN underlay netdevice for which it
also has matching MC routes, would get a different routing with this patch.
Previously, the MC packets would be delivered directly to the underlay
port, whereas now they would be MC-routed. In order to avoid this change in
behavior, introduce an IPCB flag. Only if the flag is set will
ip_mr_output() actually engage, otherwise it reverts to ip_mc_output().
This code is based on work by Roopa Prabhu and Nikolay Aleksandrov.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/0aadbd49330471c0f758d54afb05eb3b6e3a6b65.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some of the work of ipmr_queue_xmit() is specific to IPMR forwarding, and
should not take place on the output path. In order to allow reuse of the
common parts, split the function into two: the ipmr_prepare_xmit() helper
that takes care of the common bits, and the ipmr_queue_fwd_xmit(), which
invokes the former and encapsulates the whole forwarding algorithm.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/4e8db165572a4f8bd29a723a801e854e9d20df4d.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The variable is used for caching of rt->dst.dev. The netdevice referenced
therein does not change during the scope of validity of that local. At the
same time, the local is only used twice, and each of these uses will end up
in a different function in the following patches, further eliminating any
use the local could have had.
Drop the local altogether and inline the uses.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/c80600a4b51679fe78f429ccb6d60892c2f9e4de.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
iptunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IPCB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel_xmit_skb() as well.
In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IP multicast routing.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/89c9daf9f2dc088b6b92ccebcc929f51742de91f.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for legacy Broadcom FCS tags, which are similar to
DSA_TAG_PROTO_BRCM_LEGACY.
BCM5325 and BCM5365 switches require including the original FCS value and
length, as opposed to BCM63xx switches.
Adding the original FCS value and length to DSA_TAG_PROTO_BRCM_LEGACY would
impact performance of BCM63xx switches, so it's better to create a new tag.
Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250614080000.1884236-3-noltari@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move brcm_leg_tag_rcv() definition to top.
This function is going to be shared between two different tags.
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com>
Link: https://patch.msgid.link/20250614080000.1884236-2-noltari@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that we have removed the RFC3517/RFC6675 hints,
tcp_clear_retrans_hints_partial() is empty, and can be removed.
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250615001435.2390793-4-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that obsolete RFC3517/RFC6675 TCP loss detection has been removed,
we can remove the somewhat complex and intrusive code to maintain its
hint state: lost_skb_hint and lost_cnt_hint.
This commit makes tcp_clear_retrans_hints_partial() empty. We will
remove tcp_clear_retrans_hints_partial() and its call sites in the
next commit.
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250615001435.2390793-3-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RACK-TLP loss detection has been enabled as the default loss detection
algorithm for Linux TCP since 2018, in:
commit b38a51fec1 ("tcp: disable RFC6675 loss detection")
In case users ran into unexpected bugs or performance regressions,
that commit allowed Linux system administrators to revert to using
RFC3517/RFC6675 loss recovery by setting net.ipv4.tcp_recovery to 0.
In the seven years since 2018, our team has not heard reports of
anyone reverting Linux TCP to use RFC3517/RFC6675 loss recovery, and
we can't find any record in web searches of such a revert.
RACK-TLP was published as a standards-track RFC, RFC8985, in February
2021.
Several other major TCP implementations have default-enabled RACK-TLP
at this point as well.
RACK-TLP offers several significant performance advantages over
RFC3517/RFC6675 loss recovery, including much better performance in
the common cases of tail drops, lost retransmissions, and reordering.
It is now time to remove the obsolete and unused RFC3517/RFC6675 loss
recovery code. This will allow a substantial simplification of the
Linux TCP code base, and removes 12 bytes of state in every tcp_sock
for 64-bit machines (8 bytes on 32-bit machines).
To arrange the commits in reasonable sizes, this patch series is split
into 3 commits. The following 2 commits remove bookkeeping state and
code that is no longer needed after this removal of RFC3517/RFC6675
loss recovery.
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250615001435.2390793-2-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since taprio’s taprio_dev_notifier() isn’t protected by an
RCU read-side critical section, a race with advance_sched()
can lead to a use-after-free.
Adding rcu_read_lock() inside taprio_dev_notifier() prevents this.
Fixes: fed87cc671 ("net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations")
Cc: stable@vger.kernel.org
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/aEzIYYxt0is9upYG@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEn/sM2K9nqF/8FWzzDHRl3/mQkZwFAmhRji0THG1rbEBwZW5n
dXRyb25peC5kZQAKCRAMdGXf+ZCRnIhnB/4sPyASpIMeCZh20cL8WHFKhz1F2daS
znxvsL3sQZfol4a6pEc4/lYHEcpODitUhF4sPbMgU7qlSEysZbEsfuYZw1RLy8j9
bsIS+LqRqSslwS2/XuLwOjXly4+bgPwc4wj2RwH9+D9TlaLkZ02NZAqDNkidPxm/
8udZxMqorVIUhqsckfsmIqVonllz7l3InSZ3gYeXVLXAMTgsvperjVtN/YivdbPm
H1T6b6qopUEmIqDdsKelb1D9URMoiW9exDX05OpSzxZWWeTPZyJclOraKu6Otf4W
Jd7IVgOv03prcOcerlXZoD6tEC0Amsl1RdkPiX6csAa6m63XvxjHIJyH
=hk/l
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-6.16-20250617' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2025-06-17
The patch is by Brett Werling, and fixes the power regulator retrieval
during probe of the tcan4x5x glue code for the m_can driver.
* tag 'linux-can-fixes-for-6.16-20250617' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: tcan4x5x: fix power regulator retrieval during probe
openvswitch: Allocate struct ovs_pcpu_storage dynamically
====================
Link: https://patch.msgid.link/20250617155123.2141584-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
skb_ensure_writable should succeed when it's trying to write to the
header of the unreadable skbs, so it doesn't need an unconditional
skb_frags_readable check. The preceding pskb_may_pull() call will
succeed if write_len is within the head and fail if we're trying to
write to the unreadable payload, so we don't need an additional check.
Removing this check restores DSCP functionality with unreadable skbs as
it's called from dscp_tg.
Cc: willemb@google.com
Cc: asml.silence@gmail.com
Fixes: 65249feb6b ("net: add support for skbs with unreadable frags")
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250615200733.520113-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov says:
====================
io_uring cmd for tx timestamps (part)
Apply the networking helpers for the io_uring timestamp API.
====================
Link: https://patch.msgid.link/cover.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Explicitly include <linux/export.h> in files which contain an
EXPORT_SYMBOL().
See commit a934a57a42 ("scripts/misc-check: check missing #include
<linux/export.h> when W=1") for more details.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
PERCPU_MODULE_RESERVE defines the maximum size that can by used for the
per-CPU data size used by modules. This is 8KiB.
Commit 035fcdc4d2 ("openvswitch: Merge three per-CPU structures into
one") restructured the per-CPU memory allocation for the module and
moved the separate alloc_percpu() invocations at module init time to a
static per-CPU variable which is allocated by the module loader.
The size of the per-CPU data section for openvswitch is 6488 bytes which
is ~80% of the available per-CPU memory. Together with a few other
modules it is easy to exhaust the available 8KiB of memory.
Allocate ovs_pcpu_storage dynamically at module init time.
Reported-by: Gal Pressman <gal@nvidia.com>
Closes: https://lore.kernel.org/all/c401e017-f8db-4f57-a1cd-89beb979a277@nvidia.com
Fixes: 035fcdc4d2 ("openvswitch: Merge three per-CPU structures into one")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250613123629.-XSoQTCu@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This code (tty_get -> vhangup -> tty_put) is repeated on few places.
Introduce a helper similar to tty_port_tty_hangup() (asynchronous) to
handle even vhangup (synchronous).
And use it on those places.
In fact, reuse the tty_port_tty_hangup()'s code and call tty_vhangup()
depending on a new bool parameter.
Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: David Lin <dtwlin@gmail.com>
Cc: Johan Hovold <johan@kernel.org>
Cc: Alex Elder <elder@kernel.org>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20250611100319.186924-2-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Extend the End.X behavior to accept an output interface as an optional
attribute and make use of it when resolving a route. This is needed when
user space wants to use a link-local address as the nexthop address.
Before:
# ip route add 2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 oif eth0 dev sr6
# ip route add 2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6
$ ip -6 route show
2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 dev sr6 metric 1024 pref medium
2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6 metric 1024 pref medium
After:
# ip route add 2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 oif eth0 dev sr6
# ip route add 2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6
$ ip -6 route show
2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 oif eth0 dev sr6 metric 1024 pref medium
2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6 metric 1024 pref medium
Note that the oif attribute is not dumped to user space when it was not
specified (as an oif of 0) since each entry keeps track of the optional
attributes that it parsed during configuration (see struct
seg6_local_lwt::parsed_optattrs).
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250612122323.584113-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
seg6_lookup_nexthop() is a wrapper around seg6_lookup_any_nexthop().
Change End.X behavior to invoke seg6_lookup_any_nexthop() directly so
that we would not need to expose the new output interface argument
outside of the seg6local module.
No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250612122323.584113-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
seg6_lookup_any_nexthop() is called by the different endpoint behaviors
(e.g., End, End.X) to resolve an IPv6 route. Extend the function with an
output interface argument so that it could be used to resolve a route
with a certain output interface. This will be used by subsequent patches
that will extend the End.X behavior with an output interface as an
optional argument.
ip6_route_input_lookup() cannot be used when an output interface is
specified as it ignores this parameter. Similarly, calling
ip6_pol_route() when a table ID was not specified (e.g., End.X behavior)
is wrong.
Therefore, when an output interface is specified without a table ID,
resolve the route using ip6_route_output() which will take the output
interface into account.
Note that no endpoint behavior currently passes both a table ID and an
output interface, so the oif argument passed to ip6_pol_route() is
always zero and there are no functional changes in this regard.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250612122323.584113-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move netpoll_print_options() from net/core/netpoll.c to
drivers/net/netconsole.c and make it static. This function is only used
by netconsole, so there's no need to export it or keep it in the public
netpoll API.
This reduces the netpoll API surface and improves code locality
by keeping netconsole-specific functionality within the netconsole
driver.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250613-rework-v3-4-0752bf2e6912@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move netpoll_parse_ip_addr() and netpoll_parse_options() from the generic
netpoll module to the netconsole module where they are actually used.
These functions were originally placed in netpoll but are only consumed by
netconsole. This refactoring improves code organization by:
- Removing unnecessary exported symbols from netpoll
- Making netpoll_parse_options() static (no longer needs global visibility)
- Reducing coupling between netpoll and netconsole modules
The functions remain functionally identical - this is purely a code
reorganization to better reflect their actual usage patterns. Here are
the changes:
1) Move both functions from netpoll to netconsole
2) Add static to netpoll_parse_options()
3) Removed the EXPORT_SYMBOL()
PS: This diff does not change the function format, so, it is easy to
review, but, checkpatch will not be happy. A follow-up patch will
address the current issues reported by checkpatch.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250613-rework-v3-3-0752bf2e6912@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move np_info(), np_err(), and np_notice() macros from internal
implementation to the public netpoll header file to make them
available for use by netpoll consumers.
These logging macros provide consistent formatting for netpoll-related
messages by automatically prefixing log output with the netpoll instance
name.
The goal is to use the exact same format that is being displayed today,
instead of creating something netconsole-specific.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250613-rework-v3-2-0752bf2e6912@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 97714695ef ("net: netconsole: Defer netpoll cleanup to
avoid lock release during list traversal"), netconsole no longer uses
__netpoll_cleanup(). With no remaining users, remove this function
from the exported netpoll API.
The function remains available internally within netpoll for use by
netpoll_cleanup().
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250613-rework-v3-1-0752bf2e6912@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
phys_port_id_show, phys_port_name_show and phys_switch_id_show would
return -EOPNOTSUPP if the netdev didn't implement the corresponding
method.
There is no point in creating these files if they are unsupported.
Put these attributes in netdev_phys_group and implement the is_visible
method. make phys_(port_id, port_name, switch_id) invisible if the netdev
dosen't implement the corresponding method.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250612142707.4644-1-yajun.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The bpf selftest xdp_adjust_tail/xdp_adjust_frags_tail_grow failed on
arm64 with 64KB page:
xdp_adjust_tail/xdp_adjust_frags_tail_grow:FAIL
In bpf_prog_test_run_xdp(), the xdp->frame_sz is set to 4K, but later on
when constructing frags, with 64K page size, the frag data_len could
be more than 4K. This will cause problems in bpf_xdp_frags_increase_tail().
To fix the failure, the xdp->frame_sz is set to be PAGE_SIZE so kernel
can test different page size properly. With the kernel change, the user
space and bpf prog needs adjustment. Currently, the MAX_SKB_FRAGS default
value is 17, so for 4K page, the maximum packet size will be less than 68K.
To test 64K page, a bigger maximum packet size than 68K is desired. So two
different functions are implemented for subtest xdp_adjust_frags_tail_grow.
Depending on different page size, different data input/output sizes are used
to adapt with different page size.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250612035032.2207498-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In NC-SI spec v1.2 section 8.4.44.2, the firmware name doesn't
need to be null terminated while its size occupies the full size
of the field. Fix the buffer overflow issue by adding one
additional byte for null terminator.
Signed-off-by: Hari Kalavakunta <kalavakunta.hari.prasad@gmail.com>
Reviewed-by: Paul Fertser <fercerpav@gmail.com>
Link: https://patch.msgid.link/20250610193338.1368-1-kalavakunta.hari.prasad@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While investigating some reports of memory-constrained NUMA machines
failing to mount v3 and v4.0 nfs mounts, we found that svc_init_buffer()
was not attempting to retry allocations from the bulk page allocator.
Typically, this results in a single page allocation being returned and
the mount attempt fails with -ENOMEM. A retry would have allowed the mount
to succeed.
Additionally, it seems that the bulk allocation in svc_init_buffer() is
redundant because svc_alloc_arg() will perform the required allocation and
does the correct thing to retry the allocations.
The call to allocate memory in svc_alloc_arg() drops the preferred node
argument, but I expect we'll still allocate on the preferred node because
the allocation call happens within the svc thread context, which chooses
the node with memory closest to the current thread's execution.
This patch cleans out the bulk allocation in svc_init_buffer() to allow
svc_alloc_arg() to handle the allocation/retry logic for rq_pages.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: ed603bcf4f ("sunrpc: Replace the rq_pages array with dynamically-allocated memory")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We mux multiple calls to the drivers via the .get_nfc and .set_nfc
callbacks. This is slightly inconvenient to the drivers as they
have to de-mux them back. It will also be awkward for netlink code
to construct struct ethtool_rxnfc when it wants to get info about
RX Flow Hash, from the RSS module.
Add dedicated driver callbacks. Create struct ethtool_rxfh_fields
which contains only data relevant to RXFH. Maintain the names of
the fields to avoid having to heavily modify the drivers.
For now support both callbacks, once all drivers are converted
ethtool_*et_rxfh_fields() will stop using the rxnfc callbacks.
Link: https://patch.msgid.link/20250611145949.2674086-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RX Flow Hashing supports using different configuration for different
RSS contexts. Only two drivers seem to support it. Make sure we
uniformly error out for drivers which don't.
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250611145949.2674086-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that the handles have been separated - remove the RX Flow Hash
handling from rxnfc functions and vice versa.
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250611145949.2674086-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RX Flow Hash configuration uses the same argument structure
as flow filters. This is probably why ethtool IOCTL handles
them together. The more checks we add the more convoluted
this code is getting (as some of the checks apply only
to flow filters and others only to the hashing).
Copy the code to separate the handling. This is an exact
copy, the next change will remove unnecessary handling.
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250611145949.2674086-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current release - regressions:
- af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD
Current release - new code bugs:
- eth: airoha: correct enable mask for RX queues 16-31
- veth: prevent NULL pointer dereference in veth_xdp_rcv when peer
disappears under traffic
- ipv6: move fib6_config_validate() to ip6_route_add(), prevent invalid
routes
Previous releases - regressions:
- phy: phy_caps: don't skip better duplex match on non-exact match
- dsa: b53: fix untagged traffic sent via cpu tagged with VID 0
- Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused transient
packet loss, exact reason not fully understood, yet
Previous releases - always broken:
- net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6)
- sched: sfq: fix a potential crash on gso_skb handling
- Bluetooth: intel: improve rx buffer posting to avoid causing issues
in the firmware
- eth: intel: i40e: make reset handling robust against multiple requests
- eth: mlx5: ensure FW pages are always allocated on the local NUMA
node, even when device is configure to 'serve' another node
- wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850,
prevent kernel crashes
- wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
for 3 sec if fw_stats_done is not set
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmhK/3IACgkQMUZtbf5S
IruE5A//RdwiBW/pqoMIiRKLA3HZeUA/beYOl4DwVf8WFQNUIqdboeAi6k4yFrS+
SykKN0s1z8fW45lA46iFv3sR0QKYGln/v/cANsqojYqKBD3PF42dRifFlEAIz2M5
fnXK1VHPJOFK/OBOyKiiW3R6mFv+v9epZM8BKED77vFy7osDV2zkObePeE8/34B7
yVAr6JNTpB5Ex4ziG+e/6tFF6IX9RJLBl4fkRRynLDSsb1NFuy39LxPsxRQPxnzo
tlfHfxEFl5qDNGondUoSxmp38HoO6MRofWp1d1GZoBbTXi0gXV26I5WaaBHBqPkm
jZ7AtIMQq2+JuEg0y4dFFRehZLwLEMuhvlbacbIOKNBngVIsploBzvbG3ntWuUa4
Z5VFayQXumsHB5g7+vEFK6vCPaIpatKt419JsFXogNvVmmQzghALFlSymm/WbyGL
Bj3R448xGDJw+2zDAXSH/nMMHkRaQd2Ptj2czvJ0Y7Fj8bxJgH0okaHOBrk9RQTQ
bdUGCiMY84p6WI7rKDkFyyohMxppdYsY8A9hSdGgpqvu7dZi5yGmzz1Sp9+uSfSF
Lj61am4LSvRsIuTP5cdqmTBn3mZS5R49hvJsFddgXRhF+Y9gB7LSm0sypZbuOEKD
m9ijKcNETglzer0iMCwAVrIbDHGjqqHS74DkRzsuPsQ8kaCjsno=
=0mtm
-----END PGP SIGNATURE-----
Merge tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth and wireless.
Current release - regressions:
- af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD
Current release - new code bugs:
- eth: airoha: correct enable mask for RX queues 16-31
- veth: prevent NULL pointer dereference in veth_xdp_rcv when peer
disappears under traffic
- ipv6: move fib6_config_validate() to ip6_route_add(), prevent
invalid routes
Previous releases - regressions:
- phy: phy_caps: don't skip better duplex match on non-exact match
- dsa: b53: fix untagged traffic sent via cpu tagged with VID 0
- Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused
transient packet loss, exact reason not fully understood, yet
Previous releases - always broken:
- net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6)
- sched: sfq: fix a potential crash on gso_skb handling
- Bluetooth: intel: improve rx buffer posting to avoid causing issues
in the firmware
- eth: intel: i40e: make reset handling robust against multiple
requests
- eth: mlx5: ensure FW pages are always allocated on the local NUMA
node, even when device is configure to 'serve' another node
- wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850,
prevent kernel crashes
- wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
for 3 sec if fw_stats_done is not set"
* tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context
net: ethtool: Don't check if RSS context exists in case of context 0
af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD.
ipv6: Move fib6_config_validate() to ip6_route_add().
net: drv: netdevsim: don't napi_complete() from netpoll
net/mlx5: HWS, Add error checking to hws_bwc_rule_complex_hash_node_get()
veth: prevent NULL pointer dereference in veth_xdp_rcv
net_sched: remove qdisc_tree_flush_backlog()
net_sched: ets: fix a race in ets_qdisc_change()
net_sched: tbf: fix a race in tbf_change()
net_sched: red: fix a race in __red_change()
net_sched: prio: fix a race in prio_tune()
net_sched: sch_sfq: reject invalid perturb period
net: phy: phy_caps: Don't skip better duplex macth on non-exact match
MAINTAINERS: Update Kuniyuki Iwashima's email address.
selftests: net: add test case for NAT46 looping back dst
net: clear the dst when changing skb protocol
net/mlx5e: Fix number of lanes to UNKNOWN when using data_rate_oper
net/mlx5e: Fix leak of Geneve TLV option object
net/mlx5: HWS, make sure the uplink is the last destination
...
- revert mwifiex HT40 that was causing issues
- many ath10k/ath11k/ath12k fixes
- re-add some iwlwifi code I lost in a merge
- use kfree_sensitive() on an error path in cfg80211
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmhKjoYACgkQ10qiO8sP
aAADgQ//fAOMAGzuuyxy3KwxtytpYWq/k0jb3HmHct135qxteoOSv/ah0/+nvYFD
4BNAkDa44hqAP5ynWYgGQIqssJ0WkkZFooCzMpb3mzsN5sONy7XfkqG0M8RIC3xC
d28nt5zDufKt+0QtWUq9pUHamm6f+4kG+LQa9kGSlUNJ3wHUMSsONTgC7T8Rpb3u
CxW5vyeIp0OJDKN65qsN1iGqzzA5hF7j4jX2BH+NF/8eoztY3t5C/o0mpRaHqY/d
RWB9Sm5TmIXKnEHvy8CxIwm4+5goEdRi1ua/xJAC/SWmLm3NEEQPotJASnP3+xky
1Ft2EEGkYJHExYnGZaAHjykVY1JGNZos5gitp13325iFGLy9CyeCQ87Uml/k0uw9
k3xZKLzbCwIr1gPy6gTn0ai2V2P3CLmDuuvKiulIvOkfVsXbtB6zCxgansmVW8Xx
WklAX7ZMUxSvI628FFAbGW2Gt5OSPQOXRIsk4LdMO6JQs85mMRux69rIM69F4aEV
8Kean7BjT/7RZGyd5VKjkDwYvOlh2z6+UUzShudqW7otsdA2P+W3+6yu2zPq8oC/
KUQTHtb9wrYux1CdSOovmmIVnc2NALegKLP6rf3VlVF+51TIAPMz8QhDA3N29ArZ
/lhImuPhDpva08pgmFyT6WkHb3lj3KAQsp/O5gLHpLKhjkSdhi8=
=wXKD
-----END PGP SIGNATURE-----
Merge tag 'wireless-2025-06-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Another quick round of updates:
- revert mwifiex HT40 that was causing issues
- many ath10k/ath11k/ath12k fixes
- re-add some iwlwifi code I lost in a merge
- use kfree_sensitive() on an error path in cfg80211
* tag 'wireless-2025-06-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: cfg80211: use kfree_sensitive() for connkeys cleanup
wifi: iwlwifi: fix merge damage related to iwl_pci_resume
Revert "wifi: mwifiex: Fix HT40 bandwidth issue."
wifi: ath12k: fix uaf in ath12k_core_init()
wifi: ath12k: Fix hal_reo_cmd_status kernel-doc
wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850
wifi: ath11k: validate ath11k_crypto_mode on top of ath11k_core_qmi_firmware_ready
wifi: ath11k: consistently use ath11k_mac_get_fw_stats()
wifi: ath11k: move locking outside of ath11k_mac_get_fw_stats()
wifi: ath11k: adjust unlock sequence in ath11k_update_stats_event()
wifi: ath11k: move some firmware stats related functions outside of debugfs
wifi: ath11k: don't wait when there is no vdev started
wifi: ath11k: don't use static variables in ath11k_debugfs_fw_stats_process()
wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
wil6210: fix support for sparrow chipsets
wifi: ath10k: Avoid vdev delete timeout when firmware is already down
ath10k: snoc: fix unbalanced IRQ enable in crash recovery
====================
Link: https://patch.msgid.link/20250612082519.11447-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Context 0 (default context) always exists, there is no need to check
whether it exists or not when adding a flow steering rule.
The existing check fails when creating a flow steering rule for context
0 as it is not stored in the rss_ctx xarray.
For example:
$ ethtool --config-ntuple eth2 flow-type tcp4 dst-ip 194.237.147.23 dst-port 19983 context 0 loc 618
rmgr: Cannot insert RX class rule: Invalid argument
Cannot insert classification rule
An example usecase for this could be:
- A high-priority rule (loc 0) directing specific port traffic to
context 0.
- A low-priority rule (loc 1) directing all other TCP traffic to context
1.
This is a user-visible regression that was caught in our testing
environment, it was not reported by a user yet.
Fixes: de7f7582df ("net: ethtool: prevent flow steering to RSS contexts which don't exist")
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20250612071958.1696361-2-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- eir: Fix NULL pointer deference on eir_get_service_data
- eir: Fix possible crashes on eir_create_adv_data
- hci_sync: Fix broadcast/PA when using an existing instance
- ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
- ISO: Fix not using bc_sid as advertisement SID
- MGMT: Fix sparse errors
-----BEGIN PGP SIGNATURE-----
iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmhJ66MZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKfp/D/0VTEMF4PiA2eLHIPSwyIHr
pvpz3nY1WE84lAVL0VKNJalA15dk6TVs3Vxgns62BHLdajBOmYPpuJGXaSERBfLB
t5eb4nU9rx9F7+SW8zVLNwtnn5bTENNYKQIjfLmslDQQGfOjeaUP5sO/rIcLEiO3
0rEi55pE4nM6S2wUcmQlhWPC6tr3vIptg4lAz3MWlATDuUnkLjJ3rzEZdkg2kt39
2VJGNxXEG7sBrwv+coO3ROe54YSOrb+gvd9HOL0vq3MVBcvncCRqc7TuBlYi7/5C
p+WdEyG26FgS/TzdgMJKuVISQp6kNKulbuRhsnD2XZA3Gik+t+79Ex9haYW+HLDS
AWQNBm1FgYdCc4LsAxKfwGdvp8wAx1ci1vLNniYVTelyUAc5LosEZ/15DCCyTKdK
9zXEAfxwn72dLVtryVIRKqDR39QVqsxDSuV9ydgXzPJWwjisHX3AB01EqN5PGjYH
aspNgMGfYL9zSw6N1LQ+99M+/JLbvLs7b4jui4CbD3EI7nxN0YqOcKlHw7vEje5s
auU/UEL7DgWOzHTxCcidwATuV79pfx0CRSwsXaPLV1yA9lhS5AYdpBlsRB+wRFbN
vhpw8dwj/WCM0GVYnG87BU3mriyfNgaERTVA2nLKZXvn+cRkVBUkLwBV3Jpi7vQZ
cJ22gcrRj7uYotfvyCHv9g==
=dulg
-----END PGP SIGNATURE-----
Merge tag 'for-net-2025-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- eir: Fix NULL pointer deference on eir_get_service_data
- eir: Fix possible crashes on eir_create_adv_data
- hci_sync: Fix broadcast/PA when using an existing instance
- ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
- ISO: Fix not using bc_sid as advertisement SID
- MGMT: Fix sparse errors
* tag 'for-net-2025-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: MGMT: Fix sparse errors
Bluetooth: ISO: Fix not using bc_sid as advertisement SID
Bluetooth: ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
Bluetooth: eir: Fix possible crashes on eir_create_adv_data
Bluetooth: hci_sync: Fix broadcast/PA when using an existing instance
Bluetooth: Fix NULL pointer deference on eir_get_service_data
====================
Link: https://patch.msgid.link/20250611204944.1559356-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before the cited commit, the kernel unconditionally embedded SCM
credentials to skb for embryo sockets even when both the sender
and listener disabled SO_PASSCRED and SO_PASSPIDFD.
Now, the credentials are added to skb only when configured by the
sender or the listener.
However, as reported in the link below, it caused a regression for
some programs that assume credentials are included in every skb,
but sometimes not now.
The only problematic scenario would be that a socket starts listening
before setting the option. Then, there will be 2 types of non-small
race window, where a client can send skb without credentials, which
the peer receives as an "invalid" message (and aborts the connection
it seems ?):
Client Server
------ ------
s1.listen() <-- No SO_PASS{CRED,PIDFD}
s2.connect()
s2.send() <-- w/o cred
s1.setsockopt(SO_PASS{CRED,PIDFD})
s2.send() <-- w/ cred
or
Client Server
------ ------
s1.listen() <-- No SO_PASS{CRED,PIDFD}
s2.connect()
s2.send() <-- w/o cred
s3, _ = s1.accept() <-- Inherit cred options
s2.send() <-- w/o cred but not set yet
s3.setsockopt(SO_PASS{CRED,PIDFD})
s2.send() <-- w/ cred
It's unfortunate that buggy programs depend on the behaviour,
but let's restore the previous behaviour.
Fixes: 3f84d577b7 ("af_unix: Inherit sk_flags at connect().")
Reported-by: Jacek Łuczak <difrost.kernel@gmail.com>
Closes: https://lore.kernel.org/all/68d38b0b-1666-4974-85d4-15575789c8d4@gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Tested-by: Christian Heusel <christian@heusel.eu>
Tested-by: André Almeida <andrealmeid@igalia.com>
Tested-by: Jacek Łuczak <difrost.kernel@gmail.com>
Link: https://patch.msgid.link/20250611202758.3075858-1-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gerrard Tai reported a race condition in ETS, whenever SFQ perturb timer
fires at the wrong time.
The race is as follows:
CPU 0 CPU 1
[1]: lock root
[2]: qdisc_tree_flush_backlog()
[3]: unlock root
|
| [5]: lock root
| [6]: rehash
| [7]: qdisc_tree_reduce_backlog()
|
[4]: qdisc_put()
This can be abused to underflow a parent's qlen.
Calling qdisc_purge_queue() instead of qdisc_tree_flush_backlog()
should fix the race, because all packets will be purged from the qdisc
before releasing the lock.
Fixes: b05972f01e ("net: sched: tbf: don't call qdisc_put() while holding tree lock")
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Suggested-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250611111515.1983366-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gerrard Tai reported a race condition in TBF, whenever SFQ perturb timer
fires at the wrong time.
The race is as follows:
CPU 0 CPU 1
[1]: lock root
[2]: qdisc_tree_flush_backlog()
[3]: unlock root
|
| [5]: lock root
| [6]: rehash
| [7]: qdisc_tree_reduce_backlog()
|
[4]: qdisc_put()
This can be abused to underflow a parent's qlen.
Calling qdisc_purge_queue() instead of qdisc_tree_flush_backlog()
should fix the race, because all packets will be purged from the qdisc
before releasing the lock.
Fixes: b05972f01e ("net: sched: tbf: don't call qdisc_put() while holding tree lock")
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Suggested-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://patch.msgid.link/20250611111515.1983366-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gerrard Tai reported a race condition in RED, whenever SFQ perturb timer
fires at the wrong time.
The race is as follows:
CPU 0 CPU 1
[1]: lock root
[2]: qdisc_tree_flush_backlog()
[3]: unlock root
|
| [5]: lock root
| [6]: rehash
| [7]: qdisc_tree_reduce_backlog()
|
[4]: qdisc_put()
This can be abused to underflow a parent's qlen.
Calling qdisc_purge_queue() instead of qdisc_tree_flush_backlog()
should fix the race, because all packets will be purged from the qdisc
before releasing the lock.
Fixes: 0c8d13ac96 ("net: sched: red: delay destroying child qdisc on replace")
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Suggested-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250611111515.1983366-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gerrard Tai reported a race condition in PRIO, whenever SFQ perturb timer
fires at the wrong time.
The race is as follows:
CPU 0 CPU 1
[1]: lock root
[2]: qdisc_tree_flush_backlog()
[3]: unlock root
|
| [5]: lock root
| [6]: rehash
| [7]: qdisc_tree_reduce_backlog()
|
[4]: qdisc_put()
This can be abused to underflow a parent's qlen.
Calling qdisc_purge_queue() instead of qdisc_tree_flush_backlog()
should fix the race, because all packets will be purged from the qdisc
before releasing the lock.
Fixes: 7b8e0b6e65 ("net: sched: prio: delay destroying child qdiscs on change")
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Suggested-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250611111515.1983366-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gerrard Tai reported that SFQ perturb_period has no range check yet,
and this can be used to trigger a race condition fixed in a separate patch.
We want to make sure ctl->perturb_period * HZ will not overflow
and is positive.
Tested:
tc qd add dev lo root sfq perturb -10 # negative value : error
Error: sch_sfq: invalid perturb period.
tc qd add dev lo root sfq perturb 1000000000 # too big : error
Error: sch_sfq: invalid perturb period.
tc qd add dev lo root sfq perturb 2000000 # acceptable value
tc -s -d qd sh dev lo
qdisc sfq 8005: root refcnt 2 limit 127p quantum 64Kb depth 127 flows 128 divisor 1024 perturb 2000000sec
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250611083501.1810459-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Offload path is used for GRO with SW IPsec, and not just for HW
offload. So initialize it anyway.
Fixes: 585b64f5a6 ("xfrm: delay initialization of offload path till its actually requested")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Closes: https://lore.kernel.org/all/aEGW_5HfPqU1rFjl@krikkit
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
A not-so-careful NAT46 BPF program can crash the kernel
if it indiscriminately flips ingress packets from v4 to v6:
BUG: kernel NULL pointer dereference, address: 0000000000000000
ip6_rcv_core (net/ipv6/ip6_input.c:190:20)
ipv6_rcv (net/ipv6/ip6_input.c:306:8)
process_backlog (net/core/dev.c:6186:4)
napi_poll (net/core/dev.c:6906:9)
net_rx_action (net/core/dev.c:7028:13)
do_softirq (kernel/softirq.c:462:3)
netif_rx (net/core/dev.c:5326:3)
dev_loopback_xmit (net/core/dev.c:4015:2)
ip_mc_finish_output (net/ipv4/ip_output.c:363:8)
NF_HOOK (./include/linux/netfilter.h:314:9)
ip_mc_output (net/ipv4/ip_output.c:400:5)
dst_output (./include/net/dst.h:459:9)
ip_local_out (net/ipv4/ip_output.c:130:9)
ip_send_skb (net/ipv4/ip_output.c:1496:8)
udp_send_skb (net/ipv4/udp.c:1040:8)
udp_sendmsg (net/ipv4/udp.c:1328:10)
The output interface has a 4->6 program attached at ingress.
We try to loop the multicast skb back to the sending socket.
Ingress BPF runs as part of netif_rx(), pushes a valid v6 hdr
and changes skb->protocol to v6. We enter ip6_rcv_core which
tries to use skb_dst(). But the dst is still an IPv4 one left
after IPv4 mcast output.
Clear the dst in all BPF helpers which change the protocol.
Try to preserve metadata dsts, those may carry non-routing
metadata.
Cc: stable@vger.kernel.org
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Fixes: d219df60a7 ("bpf: Add ipip6 and ip6ip decap support for bpf_skb_adjust_room()")
Fixes: 1b00e0dfe7 ("bpf: update skb->protocol in bpf_skb_net_grow")
Fixes: 6578171a7f ("bpf: add bpf_skb_change_proto helper")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250610001245.1981782-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently bc_sid is being ignore when acting as Broadcast Source role,
so this fix it by passing the bc_sid and then use it when programming
the PA:
< HCI Command: LE Set Exte.. (0x08|0x0036) plen 25
Handle: 0x01
Properties: 0x0000
Min advertising interval: 140.000 msec (0x00e0)
Max advertising interval: 140.000 msec (0x00e0)
Channel map: 37, 38, 39 (0x07)
Own address type: Random (0x01)
Peer address type: Public (0x00)
Peer address: 00:00:00:00:00:00 (OUI 00-00-00)
Filter policy: Allow Scan Request from Any, Allow Connect Request from Any (0x00)
TX power: Host has no preference (0x7f)
Primary PHY: LE 1M (0x01)
Secondary max skip: 0x00
Secondary PHY: LE 2M (0x02)
SID: 0x01
Scan request notifications: Disabled (0x00)
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
BT_SK_PA_SYNC is only valid for Broadcast Sinks which means socket used
for Broadcast Sources wouldn't be able to use the likes of getpeername
to read out the sockaddr_iso_bc fields which may have been update (e.g.
bc_sid).
Fixes: 0a766a0aff ("Bluetooth: ISO: Fix getpeername not returning sockaddr_iso_bc fields")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
eir_create_adv_data may attempt to add EIR_FLAGS and EIR_TX_POWER
without checking if that would fit.
Link: https://github.com/bluez/bluez/issues/1117#issuecomment-2958244066
Fixes: 01ce70b0a2 ("Bluetooth: eir: Move EIR/Adv Data functions to its own file")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When using and existing adv_info instance for broadcast source it
needs to be updated to periodic first before it can be reused, also in
case the existing instance already have data hci_set_adv_instance_data
cannot be used directly since it would overwrite the existing data so
this reappend the original data after the Broadcast ID, if one was
generated.
Example:
bluetoothctl># Add PBP to EA so it can be later referenced as the BIS ID
bluetoothctl> advertise.service 0x1856 0x00 0x00
bluetoothctl> advertise on
...
< HCI Command: LE Set Extended Advertising Data (0x08|0x0037) plen 13
Handle: 0x01
Operation: Complete extended advertising data (0x03)
Fragment preference: Minimize fragmentation (0x01)
Data length: 0x09
Service Data: Public Broadcast Announcement (0x1856)
Data[2]: 0000
Flags: 0x06
LE General Discoverable Mode
BR/EDR Not Supported
...
bluetoothctl># Attempt to acquire Broadcast Source transport
bluetoothctl>transport.acquire /org/bluez/hci0/pac_bcast0/fd0
...
< HCI Command: LE Set Extended Advertising Data (0x08|0x0037) plen 255
Handle: 0x01
Operation: Complete extended advertising data (0x03)
Fragment preference: Minimize fragmentation (0x01)
Data length: 0x0e
Service Data: Broadcast Audio Announcement (0x1852)
Broadcast ID: 11371620 (0xad8464)
Service Data: Public Broadcast Announcement (0x1856)
Data[2]: 0000
Flags: 0x06
LE General Discoverable Mode
BR/EDR Not Supported
Link: https://github.com/bluez/bluez/issues/1117
Fixes: eca0ae4aea ("Bluetooth: Add initial implementation of BIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The len parameter is considered optional so it can be NULL so it cannot
be used for skipping to next entry of EIR_SERVICE_DATA.
Fixes: 8f9ae5b3ae ("Bluetooth: eir: Add helpers for managing service data")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The commit ee971630f2 ("bpf: Allow some trace helpers for all prog
types") made bpf_get_cgroup_classid_curr helper available to all BPF
program types, not just networking programs.
This helper calls __task_get_classid() which internally calls
task_cls_state() requiring rcu_read_lock_bh_held(). This works
in networking/tc context where RCU BH is held, but triggers an RCU
warning when called from other contexts like BPF syscall programs
that run under rcu_read_lock_trace():
WARNING: suspicious RCU usage
6.15.0-rc4-syzkaller-g079e5c56a5c4 #0 Not tainted
-----------------------------
net/core/netclassid_cgroup.c:24 suspicious rcu_dereference_check() usage!
Fix this by also accepting rcu_read_lock_held() and
rcu_read_lock_trace_held() as valid RCU contexts in the
task_cls_state() function. This ensures the helper works correctly
in all needed RCU contexts where it might be called, regular RCU,
RCU BH (for networking), and RCU trace (for BPF syscall programs).
Fixes: ee971630f2 ("bpf: Allow some trace helpers for all prog types")
Reported-by: syzbot+b4169a1cfb945d2ed0ec@syzkaller.appspotmail.com
Signed-off-by: Charalampos Mitrodimas <charmitro@posteo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250611-rcu-fix-task_cls_state-v3-1-3d30e1de753f@posteo.net
Closes: https://syzkaller.appspot.com/bug?extid=b4169a1cfb945d2ed0ec
Default ->d_op being simple_dentry_operations is equivalent to leaving
it NULL and putting DCACHE_DONTCACHE into ->s_d_flags.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When sending plaintext data, we initially calculated the corresponding
ciphertext length. However, if we later reduced the plaintext data length
via socket policy, we failed to recalculate the ciphertext length.
This results in transmitting buffers containing uninitialized data during
ciphertext transmission.
This causes uninitialized bytes to be appended after a complete
"Application Data" packet, leading to errors on the receiving end when
parsing TLS record.
Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20250609020910.397930-2-jiayuan.chen@linux.dev
Apart from the network and mount namespace all other namespaces expose a
stable inode number and userspace has been relying on that for a very
long time now. It's very much heavily used API. Align the network
namespace and use a stable inode number from the reserved procfs inode
number space so this is consistent across all namespaces.
Link: https://lore.kernel.org/20250606-work-nsfs-v1-2-b8749c9a8844@kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The nl80211_parse_connkeys() function currently uses kfree() to release
the 'result' structure in error handling paths. However, if an error
occurs due to result->def being less than 0, the 'result' structure may
contain sensitive information.
To prevent potential leakage of sensitive data, replace kfree() with
kfree_sensitive() when freeing 'result'. This change aligns with the
approach used in its caller, nl80211_join_ibss(), enhancing the overall
security of the wireless subsystem.
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Link: https://patch.msgid.link/20250523110156.4017111-1-zilin@seu.edu.cn
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
... to be used instead of manually assigning to ->s_d_op.
All in-tree filesystem converted (and field itself is renamed,
so any out-of-tree ones in need of conversion will be caught
by compiler).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Once the THREADED napi is disabled, the napi kthread should also be
stopped. Keeping the kthread intact after disabling THREADED napi makes
the PID of this kthread show up in the output of netlink 'napi-get' and
ps -ef output.
The is discussed in the patch below:
https://lore.kernel.org/all/20250502191548.559cc416@kernel.org
NAPI kthread should stop only if,
- There are no pending napi poll scheduled for this thread.
- There are no new napi poll scheduled for this thread while it has
stopped.
- The ____napi_schedule can correctly fallback to the softirq for napi
polling.
Since napi_schedule_prep provides mutual exclusion over STATE_SCHED bit,
it is safe to unset the STATE_THREADED when SCHED_THREADED is set or the
SCHED bit is not set. SCHED_THREADED being set means that SCHED is
already set and the kthread owns this napi.
To disable threaded napi, unset STATE_THREADED bit safely if
SCHED_THREADED is set or SCHED is unset. Once STATE_THREADED is unset
safely then wait for the kthread to unset the SCHED_THREADED bit so it
safe to stop the kthread.
Add a new test in nl_netdev to verify this behaviour.
Tested:
./tools/testing/selftests/net/nl_netdev.py
TAP version 13
1..6
ok 1 nl_netdev.empty_check
ok 2 nl_netdev.lo_check
ok 3 nl_netdev.page_pool_check
ok 4 nl_netdev.napi_list_check
ok 5 nl_netdev.dev_set_threaded
ok 6 nl_netdev.nsim_rxq_reset_down
# Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0
Ran neper for 300 seconds and did enable/disable of thread napi in a
loop continuously.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Link: https://patch.msgid.link/20250609173015.3851695-1-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEn/sM2K9nqF/8FWzzDHRl3/mQkZwFAmhH/swTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRAMdGXf+ZCRnONCCACa16bTW53gBzmiTxdEgUJ/h+gQuR8G
Fj+yOYIWNZY/YOExa40ldApu3iB9UAB0D+FOly4Wv5zYDct6yNBxqtZjbkTFMaoi
3i+SSrRLNtIxgGs1KgJKVPis8mhCqiBL0aGoJDGyRiye6hotECDyQWvlGM3lMGUr
wdMDQW2xyKOWvm++jXijkUMyKThmI7czlSH8al+JU9KcAO9hiUlGzejdI56KUIMW
TRlg2QSK9CfIzgUP4RQughbF59/8Xbq3LOidu50xMad2wiOJj0IUHB0h6LoAshnS
jFy4Ox4Gw5hcmdaEKazjYEtq3nQeZ6wct7jThw02e4D9h0ac2MCVhphk
=Pt9d
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-6.17-20250610' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2025-06-10
The first 4 patches are by Vincent Mailhol and prepare the CAN netlink
interface for the introduction of CAN XL configuration.
Geert Uytterhoeven's patch updates the CAN networking documentation.
The last 2 patched are by Davide Caratti and introduce skb drop
reasons in the receive path of several CAN protocols.
* tag 'linux-can-next-for-6.17-20250610' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
can: add drop reasons in CAN protocols receive path
can: add drop reasons in the receive path of AF_CAN
documentation: networking: can: Document alloc_candev_mqs()
can: netlink: can_changelink(): rename tdc_mask into fd_tdc_flag_provided
can: bittiming: rename can_tdc_is_enabled() into can_fd_tdc_is_enabled()
can: bittiming: rename CAN_CTRLMODE_TDC_MASK into CAN_CTRLMODE_FD_TDC_MASK
can: netlink: replace tabulation by space in assignment
====================
Link: https://patch.msgid.link/20250610094933.1593081-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This function was introduced in commit 783da70e83 ("net: add
sock_enable_timestamps"), with one caller in rxrpc.
That only caller was removed in commit 7903d4438b ("rxrpc: Don't use
received skbuff timestamps").
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250609153254.3504909-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We observed an issue from the latest selftest: sockmap_redir where
sk_psock(psock->sk) != psock in the backlog. The root cause is the special
behavior in sockmap_redir - it frequently performs map_update() and
map_delete() on the same socket. During map_update(), we create a new
psock and during map_delete(), we eventually free the psock via rcu_work
in sk_psock_drop(). However, pending workqueues might still exist and not
be processed yet. If users immediately perform another map_update(), a new
psock will be allocated for the same sk, resulting in two psocks pointing
to the same sk.
When the pending workqueue is later triggered, it uses the old psock to
access sk for I/O operations, which is incorrect.
Timing Diagram:
cpu0 cpu1
map_update(sk):
sk->psock = psock1
psock1->sk = sk
map_delete(sk):
rcu_work_free(psock1)
map_update(sk):
sk->psock = psock2
psock2->sk = sk
workqueue:
wakeup with psock1, but the sk of psock1
doesn't belong to psock1
rcu_handler:
clean psock1
free(psock1)
Previously, we used reference counting to address the concurrency issue
between backlog and sock_map_close(). This logic remains necessary as it
prevents the sk from being freed while processing the backlog. But this
patch prevents pending backlogs from using a psock after it has been
stopped.
Note: We cannot call cancel_delayed_work_sync() in map_delete() since this
might be invoked in BPF context by BPF helper, and the function may sleep.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20250609025908.79331-1-jiayuan.chen@linux.dev
sock_queue_rcv_skb() can fail because of lack of memory resources: use
drop reasons and pass the receiving socket to the tracepoint, so that
it's possible to better locate/debug such events.
Tested with:
| # modprobe vcan echo=1
| # ip link add name vcan2 type vcan
| # ip link set dev vcan2 up
| # ./netlayer/tst-proc 1 &
| # bg
| # while true ; do perf record -e skb:kfree_skb -aR -- \
| > ./raw/tst-raw-sendto vcan2 ; perf script ; done
| [...]
| tst-raw-sendto 10942 [000] 506428.431856: skb:kfree_skb: skbaddr=0xffff97cec38b4200 rx_sk=0xffff97cf0f75a800 protocol=12 location=raw_rcv+0x20e reason: SOCKET_RCVBUF
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/20250604160605.1005704-3-dcaratti@redhat.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Besides the existing pr_warn_once(), use skb drop reasons in case AF_CAN
layer drops non-conformant CAN{,FD,XL} frames, or conformant frames
received by "wrong" devices, so that it's possible to debug (and count)
such events using existing tracepoints:
| # perf record -e skb:kfree_skb -aR -- ./drv/canfdtest -v -g -l 1 vcan0
| # perf script
| [...]
| canfdtest 1123 [000] 3893.271264: skb:kfree_skb: skbaddr=0xffff975703c9f700 rx_sk=(nil) protocol=12 location=can_rcv+0x4b reason: CAN_RX_INVALID_FRAME
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/20250604160605.1005704-2-dcaratti@redhat.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
- MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
- MGMT: Protect mgmt_pending list with its own lock
- hci_core: fix list_for_each_entry_rcu usage
- btintel_pcie: Increase the tx and rx descriptor count
- btintel_pcie: Reduce driver buffer posting to prevent race condition
- btintel_pcie: Fix driver not posting maximum rx buffers
-----BEGIN PGP SIGNATURE-----
iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmhB65gZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKXuaEACPXWNUOViPFPE85M1Y/VGA
Hw4uDO9x25XySBk740NRT3qkYS8pWZa8SujQZa0ijqklrggosnz3q7QdwiRow5Cv
CLqCZiuQDtekXV8K9xa66K8rt2iUxMDnQRzNW32Pe0OW6Xy2RFiYqC7ZVpFomXBj
2vMj+aNRwbdzvKStEQTxWCISdCkP7XSuOdWS/wnAFyiSThgr4R8PByLQZ9P2J5xj
KfLBs+QzwHCc1hGbO7odTVqyv+UN3v82aN2fmyusdgBYBJ9ymLMV1gpBm/B4oGI7
/zXbU9bZWL+uis+pB3k9MQnaytc32v1ODFyqY8Ua1slE4Qzwz7OKB/8TP9MeOO1s
MzzIYuAK2KJ6C5mxyIBRVMcbdX2GgiwVIXJBWesuqoZc0H1En+eSpoKNzfoX16Ul
hMc8pCfvpKXaqo9KOJMldr5Yg4iKV83Am7zNUB1ka6TymM8NUx56gbF50tYDlOXY
TGYpli8OQF4x5/tWRh9AE+DxgYa4sVrDiQncvnSMlmlyBGf/wCczCjaFwRlGM9Wu
MZPi2zm0lwa1F6T358uOyJRbcFawaV39AGHo37SrCFOvPIKC+c6iTYqLHWLeq6V6
mXlUn4BrTrt7TUqFpBIUcN0LOOLKgxr7Oa8UAhhCfn8LLsFvryuTEbNtxOqvFLQP
4ZUyJFMjUnVAr5PMPjyJ3w==
=VZN1
-----END PGP SIGNATURE-----
Merge tag 'for-net-2025-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
- MGMT: Protect mgmt_pending list with its own lock
- hci_core: fix list_for_each_entry_rcu usage
- btintel_pcie: Increase the tx and rx descriptor count
- btintel_pcie: Reduce driver buffer posting to prevent race condition
- btintel_pcie: Fix driver not posting maximum rx buffers
* tag 'for-net-2025-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: MGMT: Protect mgmt_pending list with its own lock
Bluetooth: MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
Bluetooth: btintel_pcie: Reduce driver buffer posting to prevent race condition
Bluetooth: btintel_pcie: Increase the tx and rx descriptor count
Bluetooth: btintel_pcie: Fix driver not posting maximum rx buffers
Bluetooth: hci_core: fix list_for_each_entry_rcu usage
====================
Link: https://patch.msgid.link/20250605191136.904411-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
SFQ has an assumption of always being able to queue at least one packet.
However, after the blamed commit, sch->q.len can be inflated by packets
in sch->gso_skb, and an enqueue() on an empty SFQ qdisc can be followed
by an immediate drop.
Fix sfq_drop() to properly clear q->tail in this situation.
Tested:
ip netns add lb
ip link add dev to-lb type veth peer name in-lb netns lb
ethtool -K to-lb tso off # force qdisc to requeue gso_skb
ip netns exec lb ethtool -K in-lb gro on # enable NAPI
ip link set dev to-lb up
ip -netns lb link set dev in-lb up
ip addr add dev to-lb 192.168.20.1/24
ip -netns lb addr add dev in-lb 192.168.20.2/24
tc qdisc replace dev to-lb root sfq limit 100
ip netns exec lb netserver
netperf -H 192.168.20.2 -l 100 &
netperf -H 192.168.20.2 -l 100 &
netperf -H 192.168.20.2 -l 100 &
netperf -H 192.168.20.2 -l 100 &
Fixes: a53851e2c3 ("net: sched: explicit locking in gso_cpu fallback")
Reported-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Closes: https://lore.kernel.org/netdev/9da42688-bfaa-4364-8797-e9271f3bdaef@hetzner-cloud.de/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250606165127.3629486-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
Current release - regressions:
- Revert "kunit: configs: Enable CONFIG_INIT_STACK_ALL_PATTERN
in all_tests", makes kunit error out if compiler is old
- wifi: iwlwifi: mvm: fix assert on suspend
- rxrpc: fix return from none_validate_challenge()
Current release - new code bugs:
- ovpn: couple of fixes for socket cleanup and UDP-tunnel teardown
- can: kvaser_pciefd: refine error prone echo_skb_max handling logic
- fix net_devmem_bind_dmabuf() stub when DEVMEM not compiled
- eth: airoha: fixes for config / accel in bridge mode
Previous releases - regressions:
- Bluetooth: hci_qca: move the SoC type check to the right place,
fix GPIO integration
- prevent a NULL deref in rtnl_create_link() after locking changes
- fix udp gso skb_segment after pull from frag_list
- hv_netvsc: fix potential deadlock in netvsc_vf_setxdp()
Previous releases - always broken:
- netfilter:
- nf_nat: also check reverse tuple to obtain clashing entry
- nf_set_pipapo_avx2: fix initial map fill (zeroing)
- fix the helper for incremental update of packet checksums after
modifying the IP address, used by ILA and BPF
- eth: stmmac: prevent div by 0 when clock rate is misconfigured
- eth: ice: fix Tx scheduler handling of XDP and changing queue count
- eth: b53: fix support for the RGMII interface when delays configured
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmhBv5kACgkQMUZtbf5S
Irs/DA/+PIh7a33iVcsGIcmWtpnGp+18id1tSLnYGUGx1cW6zxutPD8rb6BsAN84
KR+XVsbMDUehIa10xPoF2L5mX5YujEiPSkjP8eE2KJKDLGpDtYNOyOWKT21yudnd
4EVF5JQoEbWHrkHMKF97tla84QLd5fFtgsvejVeZtQYSIDOteNGfra4Jly8iiR+J
i9k+HdB0CNEKVvvibQZjZ5CrkpmdNPmB9UoJ59bG15q2+vXdzOPm/CCNo//9ZQJB
I8O40nu16msRRVA9nc2V/Tp98fTk9dnDpTSyWiBlNCut9g9ftx456Ew+tjobMRIT
yeh+q9+1z3YHjGJB8P1FGmMZWK3tbrwyqjFGqpSjr7juucFok9kxAaRPqrQxga7H
Yxq3RegeNqukEAV39ZE14TL765Jy+XXF1uTHhNBkUADlNJVKnZygSk78/Ut2nDvQ
vkfoto+CfKny5qkSbTk8KKv1rZu3xwewoOjlcdkHlOBoouCjPOxTC7yxTZgUZB5c
yap0jQsedJct4OAA+O7IGLCmf3KrJ0H32HbWEY68mpTEd+4Df5vAWiIi7vmVJmk3
DX9JWmu5A5yjNMhOEsBQU98gkNw366aA/E8dr+lEfp3AoqDrmdbG3l8+qqhqYnb+
nnL1sNiQH1griZwQBUROAhrtXnYlYsAsZi+cv23Q0hQiGIvIC2Q=
=sRQt
-----END PGP SIGNATURE-----
Merge tag 'net-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from CAN, wireless, Bluetooth, and Netfilter.
Current release - regressions:
- Revert "kunit: configs: Enable CONFIG_INIT_STACK_ALL_PATTERN in
all_tests", makes kunit error out if compiler is old
- wifi: iwlwifi: mvm: fix assert on suspend
- rxrpc: fix return from none_validate_challenge()
Current release - new code bugs:
- ovpn: couple of fixes for socket cleanup and UDP-tunnel teardown
- can: kvaser_pciefd: refine error prone echo_skb_max handling logic
- fix net_devmem_bind_dmabuf() stub when DEVMEM not compiled
- eth: airoha: fixes for config / accel in bridge mode
Previous releases - regressions:
- Bluetooth: hci_qca: move the SoC type check to the right place, fix
GPIO integration
- prevent a NULL deref in rtnl_create_link() after locking changes
- fix udp gso skb_segment after pull from frag_list
- hv_netvsc: fix potential deadlock in netvsc_vf_setxdp()
Previous releases - always broken:
- netfilter:
- nf_nat: also check reverse tuple to obtain clashing entry
- nf_set_pipapo_avx2: fix initial map fill (zeroing)
- fix the helper for incremental update of packet checksums after
modifying the IP address, used by ILA and BPF
- eth:
- stmmac: prevent div by 0 when clock rate is misconfigured
- ice: fix Tx scheduler handling of XDP and changing queue count
- eth: fix support for the RGMII interface when delays configured"
* tag 'net-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
calipso: unlock rcu before returning -EAFNOSUPPORT
seg6: Fix validation of nexthop addresses
net: prevent a NULL deref in rtnl_create_link()
net: annotate data-races around cleanup_net_task
selftests: drv-net: tso: make bkg() wait for socat to quit
selftests: drv-net: tso: fix the GRE device name
selftests: drv-net: add configs for the TSO test
wireguard: device: enable threaded NAPI
netlink: specs: rt-link: decode ip6gre
netlink: specs: rt-link: add missing byte-order properties
net: wwan: mhi_wwan_mbim: use correct mux_id for multiplexing
wifi: cfg80211/mac80211: correctly parse S1G beacon optional elements
net: dsa: b53: do not touch DLL_IQQD on bcm53115
net: dsa: b53: allow RGMII for bcm63xx RGMII ports
net: dsa: b53: do not configure bcm63xx's IMP port interface
net: dsa: b53: do not enable RGMII delay on bcm63xx
net: dsa: b53: do not enable EEE on bcm63xx
net: ti: icssg-prueth: Fix swapped TX stats for MII interfaces.
selftests: netfilter: nft_nat.sh: add test for reverse clash with nat
netfilter: nf_nat: also check reverse tuple to obtain clashing entry
...
Releasing + re-acquiring RCU lock inside list_for_each_entry_rcu() loop
body is not correct.
Fix by taking the update-side hdev->lock instead.
Fixes: c7eaf80bfb ("Bluetooth: Fix hci_link_tx_to RCU lock usage")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
syzbot reported that a recent patch forgot to unlock rcu
in the error path.
Adopt the convention that netlbl_conn_setattr() is already using.
Fixes: 6e9f2df1c5 ("calipso: Don't call calipso functions for AF_INET sk.")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://patch.msgid.link/20250604133826.1667664-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The kernel currently validates that the length of the provided nexthop
address does not exceed the specified length. This can lead to the
kernel reading uninitialized memory if user space provided a shorter
length than the specified one.
Fix by validating that the provided length exactly matches the specified
one.
Fixes: d1df6fd8a1 ("ipv6: sr: define core operations for seg6local lightweight tunnel")
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250604113252.371528-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At the time rtnl_create_link() is running, dev->netdev_ops is NULL,
we must not use netdev_lock_ops() or risk a NULL deref if
CONFIG_NET_SHAPER is defined.
Use netif_set_group() instead of dev_set_group().
RIP: 0010:netdev_need_ops_lock include/net/netdev_lock.h:33 [inline]
RIP: 0010:netdev_lock_ops include/net/netdev_lock.h:41 [inline]
RIP: 0010:dev_set_group+0xc0/0x230 net/core/dev_api.c:82
Call Trace:
<TASK>
rtnl_create_link+0x748/0xd10 net/core/rtnetlink.c:3674
rtnl_newlink_create+0x25c/0xb00 net/core/rtnetlink.c:3813
__rtnl_newlink net/core/rtnetlink.c:3940 [inline]
rtnl_newlink+0x16d6/0x1c70 net/core/rtnetlink.c:4055
rtnetlink_rcv_msg+0x7cf/0xb70 net/core/rtnetlink.c:6944
netlink_rcv_skb+0x208/0x470 net/netlink/af_netlink.c:2534
netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
netlink_unicast+0x75b/0x8d0 net/netlink/af_netlink.c:1339
netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883
sock_sendmsg_nosec net/socket.c:712 [inline]
Reported-by: syzbot+9fc858ba0312b42b577e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6840265f.a00a0220.d4325.0009.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 7e4d784f58 ("net: hold netdev instance lock during rtnetlink operations")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250604105815.1516973-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
from_cleanup_net() reads cleanup_net_task locklessly.
Add READ_ONCE()/WRITE_ONCE() annotations to avoid
a potential KCSAN warning, even if the race is harmless.
Fixes: 0734d7c3d9 ("net: expedite synchronize_net() for cleanup_net()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250604093928.1323333-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
S1G beacons are not traditional beacons but a type of extension frame.
Extension frames contain the frame control and duration fields, followed
by zero or more optional fields before the frame body. These optional
fields are distinct from the variable length elements.
The presence of optional fields is indicated in the frame control field.
To correctly locate the elements offset, the frame control must be parsed
to identify which optional fields are present. Currently, mac80211 parses
S1G beacons based on fixed assumptions about the frame layout, without
inspecting the frame control field. This can result in incorrect offsets
to the "variable" portion of the frame.
Properly parse S1G beacon frames by using the field lengths defined in
IEEE 802.11-2024, section 9.3.4.3, ensuring that the elements offset is
calculated accurately.
Fixes: 9eaffe5078 ("cfg80211: convert S1G beacon to scan results")
Fixes: cd418ba63f ("mac80211: convert S1G beacon to scan results")
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250603053538.468562-1-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The logic added in the blamed commit was supposed to only omit nat source
port allocation if neither the existing nor the new entry are subject to
NAT.
However, its not enough to lookup the conntrack based on the proposed
tuple, we must also check the reverse direction.
Otherwise there are esoteric cases where the collision is in the reverse
direction because that colliding connection has a port rewrite, but the
new entry doesn't. In this case, we only check the new entry and then
erronously conclude that no clash exists anymore.
The existing (udp) tuple is:
a:p -> b:P, with nat translation to s:P, i.e. pure daddr rewrite,
reverse tuple in conntrack table is s:P -> a:p.
When another UDP packet is sent directly to s, i.e. a:p->s:P, this is
correctly detected as a colliding entry: tuple is taken by existing reply
tuple in reverse direction.
But the colliding conntrack is only searched for with unreversed
direction, and we can't find such entry matching a:p->s:P.
The incorrect conclusion is that the clashing entry has timed out and
that no port address translation is required.
Such conntrack will then be discarded at nf_confirm time because the
proposed reverse direction clashes with an existing mapping in the
conntrack table.
Search for the reverse tuple too, this will then check the NAT bits of
the colliding entry and triggers port reallocation.
Followp patch extends nft_nat.sh selftest to cover this scenario.
The IPS_SEQ_ADJUST change is also a bug fix:
Instead of checking for SEQ_ADJ this tested for SEEN_REPLY and ASSURED
by accident -- _BIT is only for use with the test_bit() API.
This bug has little consequence in practice, because the sequence number
adjustments are only useful for TCP which doesn't support clash resolution.
The existing test case (conntrack_reverse_clash.sh) exercise a race
condition path (parallel conntrack creation on different CPUs), so
the colliding entries have neither SEEN_REPLY nor ASSURED set.
Thanks to Yafang Shao and Shaun Brady for an initial investigation
of this bug.
Fixes: d8f84a9bc7 ("netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash")
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1795
Reported-by: Yafang Shao <laoar.shao@gmail.com>
Reported-by: Shaun Brady <brady.1345@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the first field doesn't cover the entire start map, then we must zero
out the remainder, else we leak those bits into the next match round map.
The early fix was incomplete and did only fix up the generic C
implementation.
A followup patch adds a test case to nft_concat_range.sh.
Fixes: 791a615b7a ("netfilter: nf_set_pipapo: fix initial map fill")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
New Features:
* Implement the Sunrpc rfc2203 rpcsec_gss sequence number cache
* Add support for FALLOC_FL_ZERO_RANGE on NFS v4.2
* Add a localio sysfs attribute
Stable Fixes:
* Fix double-unlock bug in nfs_return_empty_folio()
* Don't check for OPEN feature support in v4.1
* Always probe for LOCALIO support asynchronously
* Prevent hang on NFS mounts with xprtsec=[m]tls
Other Bugfixes:
* xattr handlers should check for absent nfs filehandles
* Fix setattr caching of TIME_[MODIFY|ACCESS]_SET when timestamps are delegated
* Fix listxattr to return selinux security labels
* Connect to NFSv3 DS using TLS if MDS connection uses TLS
* Clear SB_RDONLY before getting a superblock, and ignore when remounting
* Fix incorrect handling of NFS error codes in nfs4_do_mkdir()
* Various nfs_localio fixes from Neil Brown that include fixing an
rcu compilation error found by older gcc versions.
* Update stats on flexfiles pNFS DSes when receiving NFS4ERR_DELAY
Cleanups:
* Add a refcount tracker for struct net in the nfs_client
* Allow FREE_STATEID to clean up delegations
* Always set NLINK even if the server doesn't support it
* Cleanups to the NFS folio writeback code
* Remove dead code from xs_tcp_tls_setup_socket()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmg/YkAACgkQ18tUv7Cl
QOuGpQ/+OuG/xkVX6j7FerUcdbVhcZ+5jDUKC0cNe6EeFeFRjgqsdFB0uqH+AgJh
DlxEJuXTMq+9mcptl0rjrOn0tj7dlTpgZowp3kWdK3bX1zSI2jBEJjnz3xVzjBQx
3lbmF/UAIaHv5bPVc9aF8mioaj5DSRKWTBLTg7iOM1ol1DqgHK/M0q2D7d2n1yB4
WYGI7LlAWSBGV4PvEkhHW6PwVPDSqECPBvIxd1obq8TSNl+YZlmVxCoJ99+zVqWf
dvaDOwfs5x+YEQH/+N/XWdc38QiCGfu7H79qGHShWB8t/KT4axxmjVs2fT7xtUsv
yN3fb77rlFOCJaPLRF549/4EJqHYMWmFDKIMUZ7YC1vEBCG4B1kQUqarA5eCbsAi
s/rxBs1VNKeev/RecDaViAeH3XZoVU1rNyIBJjOuWgNlC5wnbF+An3zE0m8MAXxO
Vh7wQSH3GZEY+VCR6ljwLhIv6+tvSVQxEZKUUjfVQXp5UuNwN3wKa+sW6li+FBl6
uV6lJcmdUffrurNhvSghIiSQGDkerHUVhSltgtj5FnmRp/AM95Z850t5a7qqc7Cv
duks9siLLaeC4K5W+AOcKLWXho1dJMIPWUej3ErCiHWnA20QiNXsQN4QoimkDKqf
9SYdcl6UECqV5MzIa/L7cW96S3K0acrq+8ofJCjN3A8M0pcTGgU=
=5DFQ
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.16-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS clent updates from Anna Schumaker:
"New Features:
- Implement the Sunrpc rfc2203 rpcsec_gss sequence number cache
- Add support for FALLOC_FL_ZERO_RANGE on NFS v4.2
- Add a localio sysfs attribute
Stable Fixes:
- Fix double-unlock bug in nfs_return_empty_folio()
- Don't check for OPEN feature support in v4.1
- Always probe for LOCALIO support asynchronously
- Prevent hang on NFS mounts with xprtsec=[m]tls
Other Bugfixes:
- xattr handlers should check for absent nfs filehandles
- Fix setattr caching of TIME_[MODIFY|ACCESS]_SET when timestamps are
delegated
- Fix listxattr to return selinux security labels
- Connect to NFSv3 DS using TLS if MDS connection uses TLS
- Clear SB_RDONLY before getting a superblock, and ignore when
remounting
- Fix incorrect handling of NFS error codes in nfs4_do_mkdir()
- Various nfs_localio fixes from Neil Brown that include fixing an
rcu compilation error found by older gcc versions.
- Update stats on flexfiles pNFS DSes when receiving NFS4ERR_DELAY
Cleanups:
- Add a refcount tracker for struct net in the nfs_client
- Allow FREE_STATEID to clean up delegations
- Always set NLINK even if the server doesn't support it
- Cleanups to the NFS folio writeback code
- Remove dead code from xs_tcp_tls_setup_socket()"
* tag 'nfs-for-6.16-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (30 commits)
flexfiles/pNFS: update stats on NFS4ERR_DELAY for v4.1 DSes
nfs_localio: change nfsd_file_put_local() to take a pointer to __rcu pointer
nfs_localio: protect race between nfs_uuid_put() and nfs_close_local_fh()
nfs_localio: duplicate nfs_close_local_fh()
nfs_localio: simplify interface to nfsd for getting nfsd_file
nfs_localio: always hold nfsd net ref with nfsd_file ref
nfs_localio: use cmpxchg() to install new nfs_file_localio
SUNRPC: Remove dead code from xs_tcp_tls_setup_socket()
SUNRPC: Prevent hang on NFS mount with xprtsec=[m]tls
nfs: fix incorrect handling of large-number NFS errors in nfs4_do_mkdir()
nfs: ignore SB_RDONLY when remounting nfs
nfs: clear SB_RDONLY before getting superblock
NFS: always probe for LOCALIO support asynchronously
pnfs/flexfiles: connect to NFSv3 DS using TLS if MDS connection uses TLS
NFS: add localio to sysfs
nfs: use writeback_iter directly
nfs: refactor nfs_do_writepage
nfs: don't return AOP_WRITEPAGE_ACTIVATE from nfs_do_writepage
nfs: fold nfs_page_async_flush into nfs_do_writepage
NFSv4: Always set NLINK even if the server doesn't support it
...
If we get preempted during xfrm_state_find, we could run
xfrm_state_look_at using a different pcpu_id than the one
xfrm_state_find saw. This could lead to ignoring states that should
have matched, and triggering acquires on a CPU that already has a pcpu
state.
xfrm_state_find starts on CPU1
pcpu_id = 1
lookup starts
<preemption, we're now on CPU2>
xfrm_state_look_at pcpu_id = 2
finds a state
found:
best->pcpu_num != pcpu_id (2 != 1)
if (!x && !error && !acquire_in_progress) {
...
xfrm_state_alloc
xfrm_init_tempstate
...
This can be avoided by passing the original pcpu_id down to all
xfrm_state_look_at() calls.
Also switch to raw_smp_processor_id, disabling preempting just to
re-enable it immediately doesn't really make sense.
Fixes: 1ddf9916ac ("xfrm: Add support for per cpu xfrm state handling.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In case of preemption, xfrm_state_look_at will find a different
pcpu_id and look up states for that other CPU. If we matched a state
for CPU2 in the state_cache while the lookup started on CPU1, we will
jump to "found", but the "best" state that we got will be ignored and
we will enter the "acquire" block. This block uses state_ptrs, which
isn't initialized at this point.
Let's initialize state_ptrs just after taking rcu_read_lock. This will
also prevent a possible misuse in the future, if someone adjusts this
function.
Reported-by: syzbot+7ed9d47e15e88581dc5b@syzkaller.appspotmail.com
Fixes: e952837f3d ("xfrm: state: fix out-of-bounds read during lookup")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPUAAKCRCRxhvAZXjc
ouMEAQCrviYPG/WMtPTH7nBIbfVQTfNEXt/TvN7u7OjXb+RwRAEAwe9tLy4GrS/t
GuvUPWAthbhs77LTvxj6m3Gf49BOVgQ=
=6FqN
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
- The main API document has been extensively updated/rewritten
- Fix an oops in write-retry due to mis-resetting the I/O iterator
- Fix the recording of transferred bytes for short DIO reads
- Fix a request's work item to not require a reference, thereby
avoiding the need to get rid of it in BH/IRQ context
- Fix waiting and waking to be consistent about the waitqueue used
- Remove NETFS_SREQ_SEEK_DATA_READ, NETFS_INVALID_WRITE,
NETFS_ICTX_WRITETHROUGH, NETFS_READ_HOLE_CLEAR,
NETFS_RREQ_DONT_UNLOCK_FOLIOS, and NETFS_RREQ_BLOCKED
- Reorder structs to eliminate holes
- Remove netfs_io_request::ractl
- Only provide proc_link field if CONFIG_PROC_FS=y
- Remove folio_queue::marks3
- Fix undifferentiation of DIO reads from unbuffered reads
* tag 'vfs-6.16-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs: Fix undifferentiation of DIO reads from unbuffered reads
netfs: Fix wait/wake to be consistent about the waitqueue used
netfs: Fix the request's work item to not require a ref
netfs: Fix setting of transferred bytes with short DIO reads
netfs: Fix oops in write-retry from mis-resetting the subreq iterator
fs/netfs: remove unused flag NETFS_RREQ_BLOCKED
fs/netfs: remove unused flag NETFS_RREQ_DONT_UNLOCK_FOLIOS
folio_queue: remove unused field `marks3`
fs/netfs: declare field `proc_link` only if CONFIG_PROC_FS=y
fs/netfs: remove `netfs_io_request.ractl`
fs/netfs: reorder struct fields to eliminate holes
fs/netfs: remove unused enum choice NETFS_READ_HOLE_CLEAR
fs/netfs: remove unused flag NETFS_ICTX_WRITETHROUGH
fs/netfs: remove unused source NETFS_INVALID_WRITE
fs/netfs: remove unused flag NETFS_SREQ_SEEK_DATA_READ
Commit a1e40ac5b5 ("net: gso: fix udp gso fraglist segmentation after
pull from frag_list") detected invalid geometry in frag_list skbs and
redirects them from skb_segment_list to more robust skb_segment. But some
packets with modified geometry can also hit bugs in that code. We don't
know how many such cases exist. Addressing each one by one also requires
touching the complex skb_segment code, which risks introducing bugs for
other types of skbs. Instead, linearize all these packets that fail the
basic invariants on gso fraglist skbs. That is more robust.
If only part of the fraglist payload is pulled into head_skb, it will
always cause exception when splitting skbs by skb_segment. For detailed
call stack information, see below.
Valid SKB_GSO_FRAGLIST skbs
- consist of two or more segments
- the head_skb holds the protocol headers plus first gso_size
- one or more frag_list skbs hold exactly one segment
- all but the last must be gso_size
Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can
modify fraglist skbs, breaking these invariants.
In extreme cases they pull one part of data into skb linear. For UDP,
this causes three payloads with lengths of (11,11,10) bytes were
pulled tail to become (12,10,10) bytes.
The skbs no longer meets the above SKB_GSO_FRAGLIST conditions because
payload was pulled into head_skb, it needs to be linearized before pass
to regular skb_segment.
skb_segment+0xcd0/0xd14
__udp_gso_segment+0x334/0x5f4
udp4_ufo_fragment+0x118/0x15c
inet_gso_segment+0x164/0x338
skb_mac_gso_segment+0xc4/0x13c
__skb_gso_segment+0xc4/0x124
validate_xmit_skb+0x9c/0x2c0
validate_xmit_skb_list+0x4c/0x80
sch_direct_xmit+0x70/0x404
__dev_queue_xmit+0x64c/0xe5c
neigh_resolve_output+0x178/0x1c4
ip_finish_output2+0x37c/0x47c
__ip_finish_output+0x194/0x240
ip_finish_output+0x20/0xf4
ip_output+0x100/0x1a0
NF_HOOK+0xc4/0x16c
ip_forward+0x314/0x32c
ip_rcv+0x90/0x118
__netif_receive_skb+0x74/0x124
process_backlog+0xe8/0x1a4
__napi_poll+0x5c/0x1f8
net_rx_action+0x154/0x314
handle_softirqs+0x154/0x4b8
[118.376811] [C201134] rxq0_pus: [name:bug&]kernel BUG at net/core/skbuff.c:4278!
[118.376829] [C201134] rxq0_pus: [name:traps&]Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[118.470774] [C201134] rxq0_pus: [name:mrdump&]Kernel Offset: 0x178cc00000 from 0xffffffc008000000
[118.470810] [C201134] rxq0_pus: [name:mrdump&]PHYS_OFFSET: 0x40000000
[118.470827] [C201134] rxq0_pus: [name:mrdump&]pstate: 60400005 (nZCv daif +PAN -UAO)
[118.470848] [C201134] rxq0_pus: [name:mrdump&]pc : [0xffffffd79598aefc] skb_segment+0xcd0/0xd14
[118.470900] [C201134] rxq0_pus: [name:mrdump&]lr : [0xffffffd79598a5e8] skb_segment+0x3bc/0xd14
[118.470928] [C201134] rxq0_pus: [name:mrdump&]sp : ffffffc008013770
Fixes: a1e40ac5b5 ("gso: fix udp gso fraglist segmentation after pull from frag_list")
Signed-off-by: Shiming Cheng <shiming.cheng@mediatek.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By passing the hard_iface to netdev_master_upper_dev_link() as private
data, we can iterate over hardifs of a mesh interface more efficiently
using netdev_for_each_lower_private*() (instead of iterating over the
global hardif list). In addition, this will enable resolving a hardif
from its netdev using netdev_lower_dev_get_private() and getting rid of
the global list altogether in the following patches.
A similar approach can be seen in the bonding driver.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
This version will contain all the (major or even only minor) changes for
Linux 6.17.
The version number isn't a semantic version number with major and minor
information. It is just encoding the year of the expected publishing as
Linux -rc1 and the number of published versions this year (starting at 0).
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
In Cilium, we use bpf_csum_diff + bpf_l4_csum_replace to, among other
things, update the L4 checksum after reverse SNATing IPv6 packets. That
use case is however not currently supported and leads to invalid
skb->csum values in some cases. This patch adds support for IPv6 address
changes in bpf_l4_csum_update via a new flag.
When calling bpf_l4_csum_replace in Cilium, it ends up calling
inet_proto_csum_replace_by_diff:
1: void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb,
2: __wsum diff, bool pseudohdr)
3: {
4: if (skb->ip_summed != CHECKSUM_PARTIAL) {
5: csum_replace_by_diff(sum, diff);
6: if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr)
7: skb->csum = ~csum_sub(diff, skb->csum);
8: } else if (pseudohdr) {
9: *sum = ~csum_fold(csum_add(diff, csum_unfold(*sum)));
10: }
11: }
The bug happens when we're in the CHECKSUM_COMPLETE state. We've just
updated one of the IPv6 addresses. The helper now updates the L4 header
checksum on line 5. Next, it updates skb->csum on line 7. It shouldn't.
For an IPv6 packet, the updates of the IPv6 address and of the L4
checksum will cancel each other. The checksums are set such that
computing a checksum over the packet including its checksum will result
in a sum of 0. So the same is true here when we update the L4 checksum
on line 5. We'll update it as to cancel the previous IPv6 address
update. Hence skb->csum should remain untouched in this case.
The same bug doesn't affect IPv4 packets because, in that case, three
fields are updated: the IPv4 address, the IP checksum, and the L4
checksum. The change to the IPv4 address and one of the checksums still
cancel each other in skb->csum, but we're left with one checksum update
and should therefore update skb->csum accordingly. That's exactly what
inet_proto_csum_replace_by_diff does.
This special case for IPv6 L4 checksums is also described atop
inet_proto_csum_replace16, the function we should be using in this case.
This patch introduces a new bpf_l4_csum_replace flag, BPF_F_IPV6,
to indicate that we're updating the L4 checksum of an IPv6 packet. When
the flag is set, inet_proto_csum_replace_by_diff will skip the
skb->csum update.
Fixes: 7d672345ed ("bpf: add generic bpf_csum_diff helper")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/96a6bc3a443e6f0b21ff7b7834000e17fb549e05.1748509484.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During ILA address translations, the L4 checksums can be handled in
different ways. One of them, adj-transport, consist in parsing the
transport layer and updating any found checksum. This logic relies on
inet_proto_csum_replace_by_diff and produces an incorrect skb->csum when
in state CHECKSUM_COMPLETE.
This bug can be reproduced with a simple ILA to SIR mapping, assuming
packets are received with CHECKSUM_COMPLETE:
$ ip a show dev eth0
14: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 62:ae:35:9e:0f:8d brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 3333:0:0:1::c078/64 scope global
valid_lft forever preferred_lft forever
inet6 fd00:10:244:1::c078/128 scope global nodad
valid_lft forever preferred_lft forever
inet6 fe80::60ae:35ff:fe9e:f8d/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
$ ip ila add loc_match fd00:10:244:1 loc 3333:0:0:1 \
csum-mode adj-transport ident-type luid dev eth0
Then I hit [fd00:10:244:1::c078]:8000 with a server listening only on
[3333:0:0:1::c078]:8000. With the bug, the SYN packet is dropped with
SKB_DROP_REASON_TCP_CSUM after inet_proto_csum_replace_by_diff changed
skb->csum. The translation and drop are visible on pwru [1] traces:
IFACE TUPLE FUNC
eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) ipv6_rcv
eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) ip6_rcv_core
eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) nf_hook_slow
eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) inet_proto_csum_replace_by_diff
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) tcp_v6_early_demux
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_route_input
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_input
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_input_finish
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_protocol_deliver_rcu
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) raw6_local_deliver
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ipv6_raw_deliver
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) tcp_v6_rcv
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) __skb_checksum_complete
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) kfree_skb_reason(SKB_DROP_REASON_TCP_CSUM)
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) skb_release_head_state
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) skb_release_data
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) skb_free_head
eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) kfree_skbmem
This is happening because inet_proto_csum_replace_by_diff is updating
skb->csum when it shouldn't. The L4 checksum is updated such that it
"cancels" the IPv6 address change in terms of checksum computation, so
the impact on skb->csum is null.
Note this would be different for an IPv4 packet since three fields
would be updated: the IPv4 address, the IP checksum, and the L4
checksum. Two would cancel each other and skb->csum would still need
to be updated to take the L4 checksum change into account.
This patch fixes it by passing an ipv6 flag to
inet_proto_csum_replace_by_diff, to skip the skb->csum update if we're
in the IPv6 case. Note the behavior of the only other user of
inet_proto_csum_replace_by_diff, the BPF subsystem, is left as is in
this patch and fixed in the subsequent patch.
With the fix, using the reproduction from above, I can confirm
skb->csum is not touched by inet_proto_csum_replace_by_diff and the TCP
SYN proceeds to the application after the ILA translation.
Link: https://github.com/cilium/pwru [1]
Fixes: 65d7ab8de5 ("net: Identifier Locator Addressing module")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/b5539869e3550d46068504feb02d37653d939c0b.1748509484.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The MANA driver's probe registers netdevice via the following call chain:
mana_probe()
register_netdev()
register_netdevice()
register_netdevice() calls notifier callback for netvsc driver,
holding the netdev mutex via netdev_lock_ops().
Further this netvsc notifier callback end up attempting to acquire the
same lock again in dev_xdp_propagate() leading to deadlock.
netvsc_netdev_event()
netvsc_vf_setxdp()
dev_xdp_propagate()
This deadlock was not observed so far because net_shaper_ops was never set,
and thus the lock was effectively a no-op in this case. Fix this by using
netif_xdp_propagate() instead of dev_xdp_propagate() to avoid recursive
locking in this path.
And, since no deadlock is observed on the other path which is via
netvsc_probe, add the lock exclusivly for that path.
Also, clean up the unregistration path by removing the unnecessary call to
netvsc_vf_setxdp(), since unregister_netdevice_many_notify() already
performs this cleanup via dev_xdp_uninstall().
Fixes: 97246d6d21 ("net: hold netdev instance lock during ndo_bpf")
Cc: stable@vger.kernel.org
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Tested-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1748513910-23963-1-git-send-email-ssengar@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix the signature of the net_devmem_bind_dmabuf API for
CONFIG_NET_DEVMEM=n.
Fixes: bd61848900 ("net: devmem: Implement TX path")
Signed-off-by: Pranjal Shrivastava <praan@google.com>
Link: https://patch.msgid.link/20250528211058.1826608-1-praan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
BRCM_LEG_PORT_ID was incorrectly used for pskb_may_pull length.
The correct check is BRCM_LEG_TAG_LEN + VLAN_HLEN, or 10 bytes.
Fixes: 964dbf186e ("net: dsa: tag_brcm: add support for legacy tags")
Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250529124406.2513779-1-noltari@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Depending on the security set the response to L2CAP_LE_CONN_REQ shall be
just L2CAP_CR_LE_ENCRYPTION if only encryption when BT_SECURITY_MEDIUM
is selected since that means security mode 2 which doesn't require
authentication which is something that is covered in the qualification
test L2CAP/LE/CFC/BV-25-C.
Link: https://github.com/bluez/bluez/issues/1270
Fixes: 27e2d4c8d2 ("Bluetooth: Add basic LE L2CAP connect request receiving support")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
In 'mgmt_hci_cmd_sync()', check whether the size of parameters passed
in 'struct mgmt_cp_hci_cmd_sync' matches the total size of the data
(i.e. 'sizeof(struct mgmt_cp_hci_cmd_sync)' plus trailing bytes).
Otherwise, large invalid 'params_len' will cause 'hci_cmd_sync_alloc()'
to do 'skb_put_data()' from an area beyond the one actually passed to
'mgmt_hci_cmd_sync()'.
Reported-by: syzbot+5fe2d5bfbfbec0b675a0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5fe2d5bfbfbec0b675a0
Fixes: 827af4787e ("Bluetooth: MGMT: Add initial implementation of MGMT_OP_HCI_CMD_SYNC")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Fix the return value of none_validate_challenge() to be explicitly true
(which indicates the source packet should simply be discarded) rather than
implicitly true (because rxrpc_abort_conn() always returns -EPROTO which
gets converted to true).
Note that this change doesn't change the behaviour of the code (which is
correct by accident) and, in any case, we *shouldn't* get a CHALLENGE
packet to an rxnull connection (ie. no security).
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lists.infradead.org/pipermail/linux-afs/2025-April/009738.html
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/10720.1748358103@warthog.procyon.org.uk
Fixes: 5800b1cf3f ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
syzbot reported a uaf in page_pool_recycle_in_ring:
BUG: KASAN: slab-use-after-free in lock_release+0x151/0xa30 kernel/locking/lockdep.c:5862
Read of size 8 at addr ffff8880286045a0 by task syz.0.284/6943
CPU: 0 UID: 0 PID: 6943 Comm: syz.0.284 Not tainted 6.13.0-rc3-syzkaller-gdfa94ce54f41 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0x169/0x550 mm/kasan/report.c:489
kasan_report+0x143/0x180 mm/kasan/report.c:602
lock_release+0x151/0xa30 kernel/locking/lockdep.c:5862
__raw_spin_unlock_bh include/linux/spinlock_api_smp.h:165 [inline]
_raw_spin_unlock_bh+0x1b/0x40 kernel/locking/spinlock.c:210
spin_unlock_bh include/linux/spinlock.h:396 [inline]
ptr_ring_produce_bh include/linux/ptr_ring.h:164 [inline]
page_pool_recycle_in_ring net/core/page_pool.c:707 [inline]
page_pool_put_unrefed_netmem+0x748/0xb00 net/core/page_pool.c:826
page_pool_put_netmem include/net/page_pool/helpers.h:323 [inline]
page_pool_put_full_netmem include/net/page_pool/helpers.h:353 [inline]
napi_pp_put_page+0x149/0x2b0 net/core/skbuff.c:1036
skb_pp_recycle net/core/skbuff.c:1047 [inline]
skb_free_head net/core/skbuff.c:1094 [inline]
skb_release_data+0x6c4/0x8a0 net/core/skbuff.c:1125
skb_release_all net/core/skbuff.c:1190 [inline]
__kfree_skb net/core/skbuff.c:1204 [inline]
sk_skb_reason_drop+0x1c9/0x380 net/core/skbuff.c:1242
kfree_skb_reason include/linux/skbuff.h:1263 [inline]
__skb_queue_purge_reason include/linux/skbuff.h:3343 [inline]
root cause is:
page_pool_recycle_in_ring
ptr_ring_produce
spin_lock(&r->producer_lock);
WRITE_ONCE(r->queue[r->producer++], ptr)
//recycle last page to pool
page_pool_release
page_pool_scrub
page_pool_empty_ring
ptr_ring_consume
page_pool_return_page //release all page
__page_pool_destroy
free_percpu(pool->recycle_stats);
free(pool) //free
spin_unlock(&r->producer_lock); //pool->ring uaf read
recycle_stat_inc(pool, ring);
page_pool can be free while page pool recycle the last page in ring.
Add producer-lock barrier to page_pool_release to prevent the page
pool from being free before all pages have been recycled.
recycle_stat_inc() is empty when CONFIG_PAGE_POOL_STATS is not
enabled, which will trigger Wempty-body build warning. Add definition
for pool stat macro to fix warning.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/netdev/20250513083123.3514193-1-dongchenchen2@huawei.com
Fixes: ff7d6b27f8 ("page_pool: refurbish version of page_pool code")
Reported-by: syzbot+204a4382fcb3311f3858@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=204a4382fcb3311f3858
Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250527114152.3119109-1-dongchenchen2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a process under memory pressure is not part of any cgroup and
the charged flag is false, trace_sock_exceed_buf_limit was not called
as expected.
This regression was introduced by commit 2def8ff3fd ("sock:
Code cleanup on __sk_mem_raise_allocated()"). The fix changes the
default value of charged to true while preserving existing logic.
Fixes: 2def8ff3fd ("sock: Code cleanup on __sk_mem_raise_allocated()")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Tengteng Yang <yangtengteng@bytedance.com>
Link: https://patch.msgid.link/20250527030419.67693-1-yangtengteng@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmg3NqgACgkQ6rmadz2v
bTpNUQ/8DPeYtn3nskpsP2OwFy6O3hhfCe6gjOAmUVSk000xbG+AcI/h1DnGZWgk
xlVcEs93ekzUzHd7k1+RJ2c5yDLXieLJAtb66rbFU1enkxs2cWlcWSKE6K/gaoh3
G1BCARVlKwtrJhrVrsXtYP/eGZxKRSUZFK7xhtCk7lp7sRI3xkTLE+FJBcDkTJ6W
HwF14i3zO+BkqNGdFwwlASCCqRItSNBBiM3KjW1DbETOTfAKlvCTrcgdUiODqxhF
PNnULW+xmICABDFlKfDMlUAGNlSHKjiI3+g31LdblA5eyEhIqiCRgBGFYoCnsluk
qUauRSie61KqC7fxN3qVpC3bXJfD1td7uIvoqSkDLtTv8a5+HAoiohzi1qBzCayl
LAGkBYewAfDtdDDjNY38JLH2RCdyY6zG9DhqghPHdPlM7zj7L5zZgj34igEwesMM
mfj9TuFFF99yfX5UUeSxKpDGR1eO4Ew0p7tg8CRs8Fqh6AIQSmboREZrsncVRCTS
4SDHSI4KcO4LO2pEKzy+X4dewganN7aESnQG34iG0liyvDDwJOgUnDWLRwPLas7k
3b/zIfBLxOJpA5R+0hhAMtjMA4NgyKJf4yFZwEieuasQjvzwTApi24YhZ/b3HSEB
2Dp8kHEEbwezv0OFFz/fJ88dNQnrDmtJ+QByN/liA8kj4Yuh2+Q=
=j3t8
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Fix and improve BTF deduplication of identical BTF types (Alan
Maguire and Andrii Nakryiko)
- Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and
Alexis Lothoré)
- Support load-acquire and store-release instructions in BPF JIT on
riscv64 (Andrea Parri)
- Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton
Protopopov)
- Streamline allowed helpers across program types (Feng Yang)
- Support atomic update for hashtab of BPF maps (Hou Tao)
- Implement json output for BPF helpers (Ihor Solodrai)
- Several s390 JIT fixes (Ilya Leoshkevich)
- Various sockmap fixes (Jiayuan Chen)
- Support mmap of vmlinux BTF data (Lorenz Bauer)
- Support BPF rbtree traversal and list peeking (Martin KaFai Lau)
- Tests for sockmap/sockhash redirection (Michal Luczaj)
- Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko)
- Add support for dma-buf iterators in BPF (T.J. Mercier)
- The verifier support for __bpf_trap() (Yonghong Song)
* tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits)
bpf, arm64: Remove unused-but-set function and variable.
selftests/bpf: Add tests with stack ptr register in conditional jmp
bpf: Do not include stack ptr register in precision backtracking bookkeeping
selftests/bpf: enable many-args tests for arm64
bpf, arm64: Support up to 12 function arguments
bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
bpf: Avoid __bpf_prog_ret0_warn when jit fails
bpftool: Add support for custom BTF path in prog load/loadall
selftests/bpf: Add unit tests with __bpf_trap() kfunc
bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
bpf: Remove special_kfunc_set from verifier
selftests/bpf: Add test for open coded dmabuf_iter
selftests/bpf: Add test for dmabuf_iter
bpf: Add open coded dmabuf iterator
bpf: Add dmabuf iterator
dma-buf: Rename debugfs symbols
bpf: Fix error return value in bpf_copy_from_user_dynptr
libbpf: Use mmap to parse vmlinux BTF from sysfs
selftests: bpf: Add a test for mmapable vmlinux BTF
btf: Allow mmap of vmlinux btf
...
Core
----
- Implement the Device Memory TCP transmit path, allowing zero-copy
data transmission on top of TCP from e.g. GPU memory to the wire.
- Move all the IPv6 routing tables management outside the RTNL scope,
under its own lock and RCU. The route control path is now 3x times
faster.
- Convert queue related netlink ops to instance lock, reducing
again the scope of the RTNL lock. This improves the control plane
scalability.
- Refactor the software crc32c implementation, removing unneeded
abstraction layers and improving significantly the related
micro-benchmarks.
- Optimize the GRO engine for UDP-tunneled traffic, for a 10%
performance improvement in related stream tests.
- Cover more per-CPU storage with local nested BH locking; this is a
prep work to remove the current per-CPU lock in local_bh_disable()
on PREMPT_RT.
- Introduce and use nlmsg_payload helper, combining buffer bounds
verification with accessing payload carried by netlink messages.
Netfilter
---------
- Rewrite the procfs conntrack table implementation, improving
considerably the dump performance. A lot of user-space tools
still use this interface.
- Implement support for wildcard netdevice in netdev basechain
and flowtables.
- Integrate conntrack information into nft trace infrastructure.
- Export set count and backend name to userspace, for better
introspection.
BPF
---
- BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
programs and can be controlled in similar way to traditional qdiscs
using the "tc qdisc" command.
- Refactor the UDP socket iterator, addressing long standing issues
WRT duplicate hits or missed sockets.
Protocols
---------
- Improve TCP receive buffer auto-tuning and increase the default
upper bound for the receive buffer; overall this improves the single
flow maximum thoughput on 200Gbs link by over 60%.
- Add AFS GSSAPI security class to AF_RXRPC; it provides transport
security for connections to the AFS fileserver and VL server.
- Improve TCP multipath routing, so that the sources address always
matches the nexthop device.
- Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
and thus preventing DoS caused by passing around problematic FDs.
- Retire DCCP socket. DCCP only receives updates for bugs, and major
distros disable it by default. Its removal allows for better
organisation of TCP fields to reduce the number of cache lines hit
in the fast path.
- Extend TCP drop-reason support to cover PAWS checks.
Driver API
----------
- Reorganize PTP ioctl flag support to require an explicit opt-in for
the drivers, avoiding the problem of drivers not rejecting new
unsupported flags.
- Converted several device drivers to timestamping APIs.
- Introduce per-PHY ethtool dump helpers, improving the support for
dump operations targeting PHYs.
Tests and tooling
-----------------
- Add support for classic netlink in user space C codegen, so that
ynl-c can now read, create and modify links, routes addresses and
qdisc layer configuration.
- Add ynl sub-types for binary attributes, allowing ynl-c to output
known struct instead of raw binary data, clarifying the classic
netlink output.
- Extend MPTCP selftests to improve the code-coverage.
- Add tests for XDP tail adjustment in AF_XDP.
New hardware / drivers
----------------------
- OpenVPN virtual driver: offload OpenVPN data channels processing
to the kernel-space, increasing the data transfer throughput WRT
the user-space implementation.
- Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
- Broadcom asp-v3.0 ethernet driver.
- AMD Renoir ethernet device.
- ReakTek MT9888 2.5G ethernet PHY driver.
- Aeonsemi 10G C45 PHYs driver.
Drivers
-------
- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- refactor the stearing table handling to reduce significantly
the amount of memory used
- add support for complex matches in H/W flow steering
- improve flow streeing error handling
- convert to netdev instance locking
- Intel (100G, ice, igb, ixgbe, idpf):
- ice: add switchdev support for LLDP traffic over VF
- ixgbe: add firmware manipulation and regions devlink support
- igb: introduce support for frame transmission premption
- igb: adds persistent NAPI configuration
- idpf: introduce RDMA support
- idpf: add initial PTP support
- Meta (fbnic):
- extend hardware stats coverage
- add devlink dev flash support
- Broadcom (bnxt):
- add support for RX-side device memory TCP
- Wangxun (txgbe):
- implement support for udp tunnel offload
- complete PTP and SRIOV support for AML 25G/10G devices
- Ethernet NICs embedded and virtual:
- Google (gve):
- add device memory TCP TX support
- Amazon (ena):
- support persistent per-NAPI config
- Airoha:
- add H/W support for L2 traffic offload
- add per flow stats for flow offloading
- RealTek (rtl8211): add support for WoL magic packet
- Synopsys (stmmac):
- dwmac-socfpga 1000BaseX support
- add Loongson-2K3000 support
- introduce support for hardware-accelerated VLAN stripping
- Broadcom (bcmgenet):
- expose more H/W stats
- Freescale (enetc, dpaa2-eth):
- enetc: add MAC filter, VLAN filter RSS and loopback support
- dpaa2-eth: convert to H/W timestamping APIs
- vxlan: convert FDB table to rhashtable, for better scalabilty
- veth: apply qdisc backpressure on full ring to reduce TX drops
- Ethernet switches:
- Microchip (kzZ88x3): add ETS scheduler support
- Ethernet PHYs:
- RealTek (rtl8211):
- add support for WoL magic packet
- add support for PHY LEDs
- CAN:
- Adds RZ/G3E CANFD support to the rcar_canfd driver.
- Preparatory work for CAN-XL support.
- Add self-tests framework with support for CAN physical interfaces.
- WiFi:
- mac80211:
- scan improvements with multi-link operation (MLO)
- Qualcomm (ath12k):
- enable AHB support for IPQ5332
- add monitor interface support to QCN9274
- add multi-link operation support to WCN7850
- add 802.11d scan offload support to WCN7850
- monitor mode for WCN7850, better 6 GHz regulatory
- Qualcomm (ath11k):
- restore hibernation support
- MediaTek (mt76):
- WiFi-7 improvements
- implement support for mt7990
- Intel (iwlwifi):
- enhanced multi-link single-radio (EMLSR) support on 5 GHz links
- rework device configuration
- RealTek (rtw88):
- improve throughput for RTL8814AU
- RealTek (rtw89):
- add multi-link operation support
- STA/P2P concurrency improvements
- support different SAR configs by antenna
- Bluetooth:
- introduce HCI Driver protocol
- btintel_pcie: do not generate coredump for diagnostic events
- btusb: add HCI Drv commands for configuring altsetting
- btusb: add RTL8851BE device 0x0bda:0xb850
- btusb: add new VID/PID 13d3/3584 for MT7922
- btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
- btnxpuart: implement host-wakeup feature
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmg3D64SHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkcIsQAK2eEc+BxQer975wzvtMg6gF9eoex4a+
rZ7jxfDzDtNvTauoQsrpehDZp0FnySaVGCU36lHGB2OvDnhCpPc5hXzKDWQpOuqQ
SHrGG3/6FTbdTG/HfHUcbNyrUzIf53SADSObiQ3qg4gyEQ3sCpcOKtVtMcU8rvsY
/HqMnsJWFaROUMjMtCcnUSgjmeY9kBvha3sTXUqgeRugEOCvZD7z4rpqFIcQqHw7
e2Fi8dwIXEYNxqPp6MRq2qdyUTewCRruE8ZIMAFuhtfYeMElUZMPlqlMENX3AzTQ
cr0EgwcFOUxRA7oZRxhoBNBsVXavtSpQr4ZDoWplxP4aQ37n5tc1E9Q72axpB/Og
FbJRl6GvWYnCd8071BczgmfHlKaTAigPvt2Z4r6JjM5I/Bij/IZ3k+On1OTuOAj/
EqfFkdZ0a5cfKrwUMP+oSGtSAywkMVUtnIKJlZeRbjSj2432sCfe2jVAlS8ELM43
3LUgXYrAKtA87g171LlsRu5EEpI5QmqPb+i5LpPlEXe2TJEgPisyfecJ3NafF/2+
j575lm+TFNm9NTNhGGjDPEvw0djI5wSGGMe9J4gC74eWi6s5t6C4cuUf84TKWdwR
x+9H0IB7rfFncAwXHJuUUtzd+fPHaYzs5dDGbSgMQOXr1cr1wlubCK8mQ1r/Wt/a
3GjFIOQKW2Q5
=t/Tz
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Implement the Device Memory TCP transmit path, allowing zero-copy
data transmission on top of TCP from e.g. GPU memory to the wire.
- Move all the IPv6 routing tables management outside the RTNL scope,
under its own lock and RCU. The route control path is now 3x times
faster.
- Convert queue related netlink ops to instance lock, reducing again
the scope of the RTNL lock. This improves the control plane
scalability.
- Refactor the software crc32c implementation, removing unneeded
abstraction layers and improving significantly the related
micro-benchmarks.
- Optimize the GRO engine for UDP-tunneled traffic, for a 10%
performance improvement in related stream tests.
- Cover more per-CPU storage with local nested BH locking; this is a
prep work to remove the current per-CPU lock in local_bh_disable()
on PREMPT_RT.
- Introduce and use nlmsg_payload helper, combining buffer bounds
verification with accessing payload carried by netlink messages.
Netfilter:
- Rewrite the procfs conntrack table implementation, improving
considerably the dump performance. A lot of user-space tools still
use this interface.
- Implement support for wildcard netdevice in netdev basechain and
flowtables.
- Integrate conntrack information into nft trace infrastructure.
- Export set count and backend name to userspace, for better
introspection.
BPF:
- BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
programs and can be controlled in similar way to traditional qdiscs
using the "tc qdisc" command.
- Refactor the UDP socket iterator, addressing long standing issues
WRT duplicate hits or missed sockets.
Protocols:
- Improve TCP receive buffer auto-tuning and increase the default
upper bound for the receive buffer; overall this improves the
single flow maximum thoughput on 200Gbs link by over 60%.
- Add AFS GSSAPI security class to AF_RXRPC; it provides transport
security for connections to the AFS fileserver and VL server.
- Improve TCP multipath routing, so that the sources address always
matches the nexthop device.
- Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
and thus preventing DoS caused by passing around problematic FDs.
- Retire DCCP socket. DCCP only receives updates for bugs, and major
distros disable it by default. Its removal allows for better
organisation of TCP fields to reduce the number of cache lines hit
in the fast path.
- Extend TCP drop-reason support to cover PAWS checks.
Driver API:
- Reorganize PTP ioctl flag support to require an explicit opt-in for
the drivers, avoiding the problem of drivers not rejecting new
unsupported flags.
- Converted several device drivers to timestamping APIs.
- Introduce per-PHY ethtool dump helpers, improving the support for
dump operations targeting PHYs.
Tests and tooling:
- Add support for classic netlink in user space C codegen, so that
ynl-c can now read, create and modify links, routes addresses and
qdisc layer configuration.
- Add ynl sub-types for binary attributes, allowing ynl-c to output
known struct instead of raw binary data, clarifying the classic
netlink output.
- Extend MPTCP selftests to improve the code-coverage.
- Add tests for XDP tail adjustment in AF_XDP.
New hardware / drivers:
- OpenVPN virtual driver: offload OpenVPN data channels processing to
the kernel-space, increasing the data transfer throughput WRT the
user-space implementation.
- Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
- Broadcom asp-v3.0 ethernet driver.
- AMD Renoir ethernet device.
- ReakTek MT9888 2.5G ethernet PHY driver.
- Aeonsemi 10G C45 PHYs driver.
Drivers:
- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- refactor the steering table handling to significantly
reduce the amount of memory used
- add support for complex matches in H/W flow steering
- improve flow streeing error handling
- convert to netdev instance locking
- Intel (100G, ice, igb, ixgbe, idpf):
- ice: add switchdev support for LLDP traffic over VF
- ixgbe: add firmware manipulation and regions devlink support
- igb: introduce support for frame transmission premption
- igb: adds persistent NAPI configuration
- idpf: introduce RDMA support
- idpf: add initial PTP support
- Meta (fbnic):
- extend hardware stats coverage
- add devlink dev flash support
- Broadcom (bnxt):
- add support for RX-side device memory TCP
- Wangxun (txgbe):
- implement support for udp tunnel offload
- complete PTP and SRIOV support for AML 25G/10G devices
- Ethernet NICs embedded and virtual:
- Google (gve):
- add device memory TCP TX support
- Amazon (ena):
- support persistent per-NAPI config
- Airoha:
- add H/W support for L2 traffic offload
- add per flow stats for flow offloading
- RealTek (rtl8211): add support for WoL magic packet
- Synopsys (stmmac):
- dwmac-socfpga 1000BaseX support
- add Loongson-2K3000 support
- introduce support for hardware-accelerated VLAN stripping
- Broadcom (bcmgenet):
- expose more H/W stats
- Freescale (enetc, dpaa2-eth):
- enetc: add MAC filter, VLAN filter RSS and loopback support
- dpaa2-eth: convert to H/W timestamping APIs
- vxlan: convert FDB table to rhashtable, for better scalabilty
- veth: apply qdisc backpressure on full ring to reduce TX drops
- Ethernet switches:
- Microchip (kzZ88x3): add ETS scheduler support
- Ethernet PHYs:
- RealTek (rtl8211):
- add support for WoL magic packet
- add support for PHY LEDs
- CAN:
- Adds RZ/G3E CANFD support to the rcar_canfd driver.
- Preparatory work for CAN-XL support.
- Add self-tests framework with support for CAN physical interfaces.
- WiFi:
- mac80211:
- scan improvements with multi-link operation (MLO)
- Qualcomm (ath12k):
- enable AHB support for IPQ5332
- add monitor interface support to QCN9274
- add multi-link operation support to WCN7850
- add 802.11d scan offload support to WCN7850
- monitor mode for WCN7850, better 6 GHz regulatory
- Qualcomm (ath11k):
- restore hibernation support
- MediaTek (mt76):
- WiFi-7 improvements
- implement support for mt7990
- Intel (iwlwifi):
- enhanced multi-link single-radio (EMLSR) support on 5 GHz links
- rework device configuration
- RealTek (rtw88):
- improve throughput for RTL8814AU
- RealTek (rtw89):
- add multi-link operation support
- STA/P2P concurrency improvements
- support different SAR configs by antenna
- Bluetooth:
- introduce HCI Driver protocol
- btintel_pcie: do not generate coredump for diagnostic events
- btusb: add HCI Drv commands for configuring altsetting
- btusb: add RTL8851BE device 0x0bda:0xb850
- btusb: add new VID/PID 13d3/3584 for MT7922
- btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
- btnxpuart: implement host-wakeup feature"
* tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits)
selftests/bpf: Fix bpf selftest build warning
selftests: netfilter: Fix skip of wildcard interface test
net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames
net: openvswitch: Fix the dead loop of MPLS parse
calipso: Don't call calipso functions for AF_INET sk.
selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem
net_sched: hfsc: Address reentrant enqueue adding class to eltree twice
octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback
octeontx2-pf: QOS: Perform cache sync on send queue teardown
net: mana: Add support for Multi Vports on Bare metal
net: devmem: ncdevmem: remove unused variable
net: devmem: ksft: upgrade rx test to send 1K data
net: devmem: ksft: add 5 tuple FS support
net: devmem: ksft: add exit_wait to make rx test pass
net: devmem: ksft: add ipv4 support
net: devmem: preserve sockc_err
page_pool: fix ugly page_pool formatting
net: devmem: move list_add to net_devmem_bind_dmabuf.
selftests: netfilter: nft_queue.sh: include file transfer duration in log message
net: phy: mscc: Fix memory leak when using one step timestamping
...
xs_tcp_tls_finish_connecting() already marks the upper xprt
connected, so the same code in xs_tcp_tls_setup_socket() is
never executed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Engineers at Hammerspace noticed that sometimes mounting with
"xprtsec=tls" hangs for a minute or so, and then times out, even
when the NFS server is reachable and responsive.
kTLS shuts off data_ready callbacks if strp->msg_ready is set to
mitigate data_ready callbacks when a full TLS record is not yet
ready to be read from the socket.
Normally msg_ready is clear when the first TLS record arrives on
a socket. However, I observed that sometimes tls_setsockopt() sets
strp->msg_ready, and that prevents forward progress because
tls_data_ready() becomes a no-op.
Moreover, Jakub says: "If there's a full record queued at the time
when [tlshd] passes the socket back to the kernel, it's up to the
reader to read the already queued data out." So SunRPC cannot
expect a data_ready call when ingress data is already waiting.
Add an explicit poll after SunRPC's upper transport is set up to
pick up any data that arrived after the TLS handshake but before
transport set-up is complete.
Reported-by: Steve Sears <sjs@hammerspace.com>
Suggested-by: Jakub Kacinski <kuba@kernel.org>
Fixes: 75eb6af7ac ("SUNRPC: Add a TCP-with-TLS RPC transport class")
Tested-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
The marquee feature for this release is that the limit on the
maximum rsize and wsize has been raised to 4MB. The default remains
at 1MB, but risk-seeking administrators now have the ability to try
larger I/O sizes with NFS clients that support them. Eventually the
default setting will be increased when we have confidence that this
change will not have negative impact.
With v6.16, NFSD now has its own debugfs file system where we can
add experimental features and make them available outside of our
development community without impacting production deployments. The
first experimental setting added is one that makes all NFS READ
operations use vfs_iter_read() instead of the NFSD splice actor. The
plan is to eventually retire the splice actor, as that will enable a
number of new capabilities such as the use of struct bio_vec from the
top to the bottom of the NFSD stack.
Jeff Layton contributed a number of observability improvements. The
use of dprintk() in a number of high-traffic code paths has been
replaced with static trace points.
This release sees the continuation of efforts to harden the NFSv4.2
COPY operation. Soon, the restriction on async COPY operations can
be lifted.
Many thanks to the contributors, reviewers, testers, and bug
reporters who participated during the v6.16 development cycle.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmg1yGUACgkQM2qzM29m
f5frMA//TJTbSWiM7qBX1GhVMNr1lxQcjU4BPKo0qZfEtwV06F2BB9mWgDU+BIQh
AcGfMZUmNWAnhOTOYvwqyW6dnX+1yt8sBsCZ/1ctY30A4JH4AgG5sdZS7BUrlEEr
bGDMUCaPnvQ3maeDjMlefe7Xv/rUhj9TVhXmDkt4vf/jCde2JODTB/z8n7WeAxYJ
eOvmr/n5z6VI5Q67M7b5/xqofBEaEoq9P5UEgn61ThfeR0bMlrklm/avDCbbNIH8
6n7Z3tjzllK1CAjEmwHalq4LRbMX5FHWzNkyJw+wtviXS18J5vCAvRe+JDoykusu
L2bgXT8bBUqy46eO4WKEOJtEqVQhIsRFx/8ku1iTLrpDWlwrR4mHVyObEDkkdlMX
EyBQ4svg2OxCXSyy5O8oggzU0TWVJStIjbIEHbJYusWLU7HxxFveBwqwzYHXLtip
WKm6N2ANqQi1du+Pc6xmgXo9svA5Vk+DQjljm1Y5up9dhi2K9cvCIHjwFsZ+E0VL
XqXJ2YgIQb3oXK7FttzLOiDrpX1OX82sTIbgdcPcfT7lP+ej7uiHMBPmdPwgaZIU
EbIp0ThoTkh8/VRMDcWIt+B6SEhmb5vY3Zgz9Lcf2J0PM1fuYJ67L7xGTviFX7Ci
DpohiCgceb6PHYeIuarayF86tPJGF8Vb7XvQZej2Ybv8QdxLFg8=
=FbeG
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"The marquee feature for this release is that the limit on the maximum
rsize and wsize has been raised to 4MB. The default remains at 1MB,
but risk-seeking administrators now have the ability to try larger I/O
sizes with NFS clients that support them. Eventually the default
setting will be increased when we have confidence that this change
will not have negative impact.
With v6.16, NFSD now has its own debugfs file system where we can add
experimental features and make them available outside of our
development community without impacting production deployments. The
first experimental setting added is one that makes all NFS READ
operations use vfs_iter_read() instead of the NFSD splice actor. The
plan is to eventually retire the splice actor, as that will enable a
number of new capabilities such as the use of struct bio_vec from the
top to the bottom of the NFSD stack.
Jeff Layton contributed a number of observability improvements. The
use of dprintk() in a number of high-traffic code paths has been
replaced with static trace points.
This release sees the continuation of efforts to harden the NFSv4.2
COPY operation. Soon, the restriction on async COPY operations can be
lifted.
Many thanks to the contributors, reviewers, testers, and bug reporters
who participated during the v6.16 development cycle"
* tag 'nfsd-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (60 commits)
xdrgen: Fix code generated for counted arrays
SUNRPC: Bump the maximum payload size for the server
NFSD: Add a "default" block size
NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro
NFSD: Remove NFSD_BUFSIZE
sunrpc: Remove the RPCSVC_MAXPAGES macro
svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages
svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages
sunrpc: Adjust size of socket's receive page array dynamically
SUNRPC: Remove svc_rqst :: rq_vec
SUNRPC: Remove svc_fill_write_vector()
NFSD: Use rqstp->rq_bvec in nfsd_iter_write()
SUNRPC: Export xdr_buf_to_bvec()
NFSD: De-duplicate the svc_fill_write_vector() call sites
NFSD: Use rqstp->rq_bvec in nfsd_iter_read()
sunrpc: Replace the rq_bvec array with dynamically-allocated memory
sunrpc: Replace the rq_pages array with dynamically-allocated memory
sunrpc: Remove backchannel check in svc_init_buffer()
sunrpc: Add a helper to derive maxpages from sv_max_mesg
svcrdma: Reduce the number of rdma_rw contexts per-QP
...
The unexpected MPLS packet may not end with the bottom label stack.
When there are many stacks, The label count value has wrapped around.
A dead loop occurs, soft lockup/CPU stuck finally.
stack backtrace:
UBSAN: array-index-out-of-bounds in /build/linux-0Pa0xK/linux-5.15.0/net/openvswitch/flow.c:662:26
index -1 is out of range for type '__be32 [3]'
CPU: 34 PID: 0 Comm: swapper/34 Kdump: loaded Tainted: G OE 5.15.0-121-generic #131-Ubuntu
Hardware name: Dell Inc. PowerEdge C6420/0JP9TF, BIOS 2.12.2 07/14/2021
Call Trace:
<IRQ>
show_stack+0x52/0x5c
dump_stack_lvl+0x4a/0x63
dump_stack+0x10/0x16
ubsan_epilogue+0x9/0x36
__ubsan_handle_out_of_bounds.cold+0x44/0x49
key_extract_l3l4+0x82a/0x840 [openvswitch]
? kfree_skbmem+0x52/0xa0
key_extract+0x9c/0x2b0 [openvswitch]
ovs_flow_key_extract+0x124/0x350 [openvswitch]
ovs_vport_receive+0x61/0xd0 [openvswitch]
? kernel_init_free_pages.part.0+0x4a/0x70
? get_page_from_freelist+0x353/0x540
netdev_port_receive+0xc4/0x180 [openvswitch]
? netdev_port_receive+0x180/0x180 [openvswitch]
netdev_frame_hook+0x1f/0x40 [openvswitch]
__netif_receive_skb_core.constprop.0+0x23a/0xf00
__netif_receive_skb_list_core+0xfa/0x240
netif_receive_skb_list_internal+0x18e/0x2a0
napi_complete_done+0x7a/0x1c0
bnxt_poll+0x155/0x1c0 [bnxt_en]
__napi_poll+0x30/0x180
net_rx_action+0x126/0x280
? bnxt_msix+0x67/0x80 [bnxt_en]
handle_softirqs+0xda/0x2d0
irq_exit_rcu+0x96/0xc0
common_interrupt+0x8e/0xa0
</IRQ>
Fixes: fbdcdd78da ("Change in Openvswitch to support MPLS label depth of 3 in ingress direction")
Signed-off-by: Faicker Mo <faicker.mo@zenlayer.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/259D3404-575D-4A6D-B263-1DF59A67CF89@zenlayer.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Preserve the error code returned by sock_cmsg_send and return that on
err.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250523230524.1107879-4-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It's annoying for the list_add to be outside net_devmem_bind_dmabuf, but
the list_del is in net_devmem_unbind_dmabuf. Make it consistent by
having both the list_add/del be inside the net_devmem_[un]bind_dmabuf.
Cc: ap420073@gmail.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Tested-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250523230524.1107879-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sctp_do_peeloff is only used inside of net/sctp/socket.c,
so mark it static.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20250526054745.2329201-1-hch@lst.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
GENERIC_ALLOCATOR is a non-prompt kconfig, meaning users can't enable it
selectively. All kconfig users of GENERIC_ALLOCATOR select it, except of
NET_DEVMEM which only depends on it, there is no easy way to turn
GENERIC_ALLOCATOR on unless we select other unnecessary configs that
will select it.
Instead of depending on it, select it when NET_DEVMEM is enabled.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/1747950086-1246773-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The tipc_aead_free() function currently uses kfree() to release the aead
structure. However, this structure contains sensitive information, such
as key's SALT value, which should be securely erased from memory to
prevent potential leakage.
To enhance security, replace kfree() with kfree_sensitive() when freeing
the aead structure. This change ensures that sensitive data is explicitly
cleared before memory deallocation, aligning with the approach used in
tipc_aead_init() and adhering to best practices for handling confidential
information.
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20250523114717.4021518-1-zilin@seu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Convert init_timer*(), try_to_del_timer_sync() and
destroy_timer_on_stack() over to the canonical timer_*() namespace
convention.
There are is another large converstion pending, which has not been included
because it would have caused a gazillion of merge conflicts in next. The
conversion scripts will be run towards the end of the merge window and a
pull request sent once all conflict dependencies have been merged.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmgzgTkTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodwVD/97rF1Juqm1JZNIZPN/vMqwCxRoUkc6
tsK0+UC7UXusuJadxJ+Bsv25iPF+qejnThMU+SQ5yTVj/PNfxOe0WPdCEGGiL8Ye
2JCk6GqSOB/360SlLmtR1B1xHDwsuuUcQTz0w57CH66HRV5vpoWSMSwj/ypy+8nU
PlgjItaxdCKa9NJ+SUJZPWIxRkt/PsA1kwlV1OcxkgB++IiIHQEbPxECq9mlzWXF
b4Sq/Sdf2OmEePN+DYoey4fneRwJnkjkeX/o+CqosCPHRIiWUlSu5W/lU5IYojM3
s3XpMNNg/z8PMXR4JA2VaPYWLUZyBOs+3dM7Y6Am+z55EoxMxfzg6pGx2tfM4ftl
vF8wG3Z1c9MmpLk+P9LatNvfHeVLNve8KgOLa5phMDQ/El/a8KqLu6HmRDPONvKp
d6iXdPq1CP8P6jOtlFfzLmKPShgEcp+Zz9W3CaQR/0ZJEsEqrpKOLzdT86hJhBV0
mBCdzixmGtKAh0BdPdmg2FCLScqER3HKIJhZSdV8I+jSETIHCuMiIfbMXR7iwm/H
R1/ayvxrbc1mPseo28scqvo7m6cn5BFBxIUf4Sokp52ZCapz1v2aWzo4vHI0cTgT
ZOjlTrf+fgYLn1dqdD45TJiQPnmRrw4dU+WWSFRFJY2qjfyucj80vdqdkE5zkp5b
UPomlVimG4ccPg==
=FHGU
-----END PGP SIGNATURE-----
Merge tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"Another set of timer API cleanups:
- Convert init_timer*(), try_to_del_timer_sync() and
destroy_timer_on_stack() over to the canonical timer_*()
namespace convention.
There is another large conversion pending, which has not been included
because it would have caused a gazillion of merge conflicts in next.
The conversion scripts will be run towards the end of the merge window
and a pull request sent once all conflict dependencies have been
merged"
* tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack()
treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try()
timers: Rename init_timers() as timers_init()
timers: Rename NEXT_TIMER_MAX_DELTA as TIMER_NEXT_MAX_DELTA
timers: Rename __init_timer_on_stack() as __timer_init_on_stack()
timers: Rename __init_timer() as __timer_init()
timers: Rename init_timer_on_stack_key() as timer_init_key_on_stack()
timers: Rename init_timer_key() as timer_init_key()
In commit 7ead4405e0 ("xsk: convert xdp_copy_frags_from_zc() to use
page_pool_dev_alloc()"), when converting from netmem to page, I missed a
call to page_address() around skb_frag_page(frag) to get the virtual
address of the page. This commit uses skb_frag_address() helper to fix
the issue.
Fixes: 7ead4405e0 ("xsk: convert xdp_copy_frags_from_zc() to use page_pool_dev_alloc()")
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20250522040115.5057-1-minhquangbui99@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Syzkaller reports the following issue:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:578
__mutex_lock+0x106/0xe80 kernel/locking/mutex.c:746
team_change_rx_flags+0x38/0x220 drivers/net/team/team_core.c:1781
dev_change_rx_flags net/core/dev.c:9145 [inline]
__dev_set_promiscuity+0x3f8/0x590 net/core/dev.c:9189
netif_set_promiscuity+0x50/0xe0 net/core/dev.c:9201
dev_set_promiscuity+0x126/0x260 net/core/dev_api.c:286 packet_dev_mc net/packet/af_packet.c:3698 [inline]
packet_dev_mclist_delete net/packet/af_packet.c:3722 [inline]
packet_notifier+0x292/0xa60 net/packet/af_packet.c:4247
notifier_call_chain+0x1b3/0x3e0 kernel/notifier.c:85
call_netdevice_notifiers_extack net/core/dev.c:2214 [inline]
call_netdevice_notifiers net/core/dev.c:2228 [inline]
unregister_netdevice_many_notify+0x15d8/0x2330 net/core/dev.c:11972
rtnl_delete_link net/core/rtnetlink.c:3522 [inline]
rtnl_dellink+0x488/0x710 net/core/rtnetlink.c:3564
rtnetlink_rcv_msg+0x7cf/0xb70 net/core/rtnetlink.c:6955
netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534
Calling `PACKET_ADD_MEMBERSHIP` on an ops-locked device can trigger
the `NETDEV_UNREGISTER` notifier, which may require disabling promiscuous
and/or allmulti mode. Both of these operations require acquiring
the netdev instance lock.
Move the call to `packet_dev_mc` outside of the RCU critical section.
The `mclist` modifications (add, del, flush, unregister) are protected by
the RTNL, not the RCU. The RCU only protects the `sklist` and its
associated `sks`. The delayed operation on the `mclist` entry remains
within the RTNL.
Reported-by: syzbot+b191b5ccad8d7a986286@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b191b5ccad8d7a986286
Fixes: ad7c7b2172 ("net: hold netdev instance lock during sysfs operations")
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250522031129.3247266-1-stfomichev@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Lingering should be transport-independent in the long run. In preparation
for supporting other transports, as well as the linger on shutdown(), move
code to core.
Generalize by querying vsock_transport::unsent_bytes(), guard against the
callback being unimplemented. Do not pass sk_lingertime explicitly. Pull
SOCK_LINGER check into vsock_linger().
Flatten the function. Remove the nested block by inverting the condition:
return early on !timeout.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250522-vsock-linger-v6-2-2ad00b0e447e@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently vsock's lingering effectively boils down to waiting (or timing
out) until packets are consumed or dropped by the peer; be it by receiving
the data, closing or shutting down the connection.
To align with the semantics described in the SO_LINGER section of man
socket(7) and to mimic AF_INET's behaviour more closely, change the logic
of a lingering close(): instead of waiting for all data to be handled,
block until data is considered sent from the vsock's transport point of
view. That is until worker picks the packets for processing and decrements
virtio_vsock_sock::bytes_unsent down to 0.
Note that (some interpretation of) lingering was always limited to
transports that called virtio_transport_wait_close() on transport release.
This does not change, i.e. under Hyper-V and VMCI no lingering would be
observed.
The implementation does not adhere strictly to man page's interpretation of
SO_LINGER: shutdown() will not trigger the lingering. This follows AF_INET.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250522-vsock-linger-v6-1-2ad00b0e447e@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Convert callers of dev_set_mac_address_user() to use struct
sockaddr_storage. Add sanity checks on dev->addr_len usage.
Signed-off-by: Kees Cook <kees@kernel.org>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-8-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Instead of a heap allocating a variably sized struct sockaddr and lying
about the type in the call to netif_set_mac_address(), use a stack
allocated struct sockaddr_storage. This lets us drop the cast and avoid
the allocation.
Putting "ss" on the stack means it will get a reused stack slot since
it is the same size (128B) as other existing single-scope stack variables,
like the vfinfo array (128B), so no additional stack space is used by
this function.
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-7-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
All users of dev_set_mac_address() are now using a struct sockaddr_storage.
Convert the internal data type to struct sockaddr_storage, drop the casts,
and update pointer types.
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-6-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Switch to struct sockaddr_storage for calling dev_set_mac_address(). Add
a temporary cast to struct sockaddr, which will be removed in a
subsequent patch.
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-4-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
To avoid future casting with coming API type changes, switch struct
ncsi_dev_priv::pending_mac to a full struct sockaddr_storage.
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-3-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In order to avoid passing around struct sockaddr that has a size the
compiler cannot reason about (nor track at runtime), convert
netif_set_mac_address() to take struct sockaddr_storage. This is just a
cast conversion, so there is are no binary changes. Following patches
will make actual allocation changes.
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-2-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
All the callers of inet_addr_is_any() have a sockaddr_storage-backed
sockaddr. Avoid casts and switch prototype to the actual object being
used.
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-1-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The strncpy() function is actively dangerous to use since it may not
NULL-terminate the destination string, resulting in potential memory
content exposures, unbounded reads, or crashes.
Link: https://github.com/KSPP/linux/issues/90
In addition, strscpy_pad is more appropriate because it also zero-fills
any remaining space in the destination if the source is shorter than
the provided buffer size.
Signed-off-by: Baris Can Goral <goralbaris@gmail.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20250521161036.14489-1-goralbaris@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
oliqAQCVdrBn7D2+dB04hjefFq6W6LhyLGrtCCliflicN5SyxAD+PHHiB9nFKe6J
xQkaNArCJjPd2QEx73aGjHzi3UQq6Qs=
=Pk9c
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull coredump updates from Christian Brauner:
"This adds support for sending coredumps over an AF_UNIX socket. It
also makes (implicit) use of the new SO_PEERPIDFD ability to hand out
pidfds for reaped peer tasks
The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a saf way to
handle them instead of relying on super privileged coredumping helpers
This will also be significantly more lightweight since the kernel
doens't have to do a fork()+exec() for each crashing process to spawn
a usermodehelper. Instead the kernel just connects to the AF_UNIX
socket and userspace can process it concurrently however it sees fit.
Support for userspace is incoming starting with systemd-coredump
There's more work coming in that direction next cycle. The rest below
goes into some details and background
Coredumping currently supports two modes:
(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
spawned as a child of the system_unbound_wq or kthreadd
For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries
The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional
parameters pass information about the task that is generating the
coredump to the binary that processes the coredump
In the example the core_pattern shown causes the kernel to spawn
systemd-coredump as a usermode helper. There's various conceptual
consequences of this (non-exhaustive list):
- systemd-coredump is spawned with file descriptor number 0 (stdin)
connected to the read-end of the pipe. All other file descriptors
are closed. That specifically includes 1 (stdout) and 2 (stderr).
This has already caused bugs because userspace assumed that this
cannot happen (Whether or not this is a sane assumption is
irrelevant)
- systemd-coredump will be spawned as a child of system_unbound_wq.
So it is not a child of any userspace process and specifically not
a child of PID 1. It cannot be waited upon and is in a weird hybrid
upcall which are difficult for userspace to control correctly
- systemd-coredump is spawned with full kernel privileges. This
necessitates all kinds of weird privilege dropping excercises in
userspace to make this safe
- A new usermode helper has to be spawned for each crashing process
This adds a new mode:
(3) Dumping into an AF_UNIX socket
Userspace can set /proc/sys/kernel/core_pattern to:
@/path/to/coredump.socket
The "@" at the beginning indicates to the kernel that an AF_UNIX
coredump socket will be used to process coredumps
The coredump socket must be located in the initial mount namespace.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket:
- The coredump server uses SO_PEERPIDFD to get a stable handle on the
connected crashing task. The retrieved pidfd will provide a stable
reference even if the crashing task gets SIGKILLed while generating
the coredump. That is a huge attack vector right now
- By setting core_pipe_limit non-zero userspace can guarantee that
the crashing task cannot be reaped behind it's back and thus
process all necessary information in /proc/<pid>. The SO_PEERPIDFD
can be used to detect whether /proc/<pid> still refers to the same
process
The core_pipe_limit isn't used to rate-limit connections to the
socket. This can simply be done via AF_UNIX socket directly
- The pidfd for the crashing task will contain information how the
task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
PIDFD_INFO_COREDUMP which can be used to retreive the coredump
information
If the coredump gets a new coredump client connection the kernel
guarantees that PIDFD_INFO_COREDUMP information is available.
Currently the following information is provided in the new
@coredump_mask extension to struct pidfd_info:
* PIDFD_COREDUMPED is raised if the task did actually coredump
* PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping
(e.g., undumpable)
* PIDFD_COREDUMP_USER is raised if this is a regular coredump and
doesn't need special care by the coredump server
* PIDFD_COREDUMP_ROOT is raised if the generated coredump should
be treated as sensitive and the coredump server should restrict
access to the generated coredump to sufficiently privileged
users"
* tag 'vfs-6.16-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
mips, net: ensure that SOCK_COREDUMP is defined
selftests/coredump: add tests for AF_UNIX coredumps
selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
coredump: validate socket name as it is written
coredump: show supported coredump modes
pidfs, coredump: add PIDFD_INFO_COREDUMP
coredump: add coredump socket
coredump: reflow dump helpers a little
coredump: massage do_coredump()
coredump: massage format_corename()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
ov4zAP4yfqKBAz6eMt9CzDgHCdVQJ9Nuur1EiRdot3maPzHTcQEA2hVkJrvVo1Y/
jCVAf7gmGX1Uu6nCUF6Vjy35g8i20gA=
=nzYS
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
"Features:
- Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
socket option
SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for
the process that called connect() or listen(). This is heavily used
to safely authenticate clients in userspace avoiding security bugs
due to pid recycling races (dbus, polkit, systemd, etc.)
SO_PEERPIDFD currently doesn't support handing out pidfds if the
sk->sk_peer_pid thread-group leader has already been reaped. In
this case it currently returns EINVAL. Userspace still wants to get
a pidfd for a reaped process to have a stable handle it can pass
on. This is especially useful now that it is possible to retrieve
exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s
PIDFD_INFO_EXIT flag
Another summary has been provided by David Rheinsberg:
> A pidfd can outlive the task it refers to, and thus user-space
> must already be prepared that the task underlying a pidfd is
> gone at the time they get their hands on the pidfd. For
> instance, resolving the pidfd to a PID via the fdinfo must be
> prepared to read `-1`.
>
> Despite user-space knowing that a pidfd might be stale, several
> kernel APIs currently add another layer that checks for this. In
> particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was
> already reaped, but returns a stale pidfd if the task is reaped
> immediately after the respective alive-check.
>
> This has the unfortunate effect that user-space now has two ways
> to check for the exact same scenario: A syscall might return
> EINVAL/ESRCH/... *or* the pidfd might be stale, even though
> there is no particular reason to distinguish both cases. This
> also propagates through user-space APIs, which pass on pidfds.
> They must be prepared to pass on `-1` *or* the pidfd, because
> there is no guaranteed way to get a stale pidfd from the kernel.
>
> Userspace must already deal with a pidfd referring to a reaped
> task as the task may exit and get reaped at any time will there
> are still many pidfds referring to it
In order to allow handing out reaped pidfd SO_PEERPIDFD needs to
ensure that PIDFD_INFO_EXIT information is available whenever a
pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi
promises that reaped pidfds are only handed out if it is guaranteed
that the caller sees the exit information:
TEST_F(pidfd_info, success_reaped)
{
struct pidfd_info info = {
.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
};
/*
* Process has already been reaped and PIDFD_INFO_EXIT been set.
* Verify that we can retrieve the exit status of the process.
*/
ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
ASSERT_TRUE(WIFEXITED(info.exit_code));
ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
}
To hand out pidfds for reaped processes we thus allocate a pidfs
entry for the relevant sk->sk_peer_pid at the time the
sk->sk_peer_pid is stashed and drop it when the socket is
destroyed. This guarantees that exit information will always be
recorded for the sk->sk_peer_pid task and we can hand out pidfds
for reaped processes
- Hand a pidfd to the coredump usermode helper process
Give userspace a way to instruct the kernel to install a pidfd for
the crashing process into the process started as a usermode helper.
There's still tricky race-windows that cannot be easily or
sometimes not closed at all by userspace. There's various ways like
looking at the start time of a process to make sure that the
usermode helper process is started after the crashing process but
it's all very very brittle and fraught with peril
The crashed-but-not-reaped process can be killed by userspace
before coredump processing programs like systemd-coredump have had
time to manually open a PIDFD from the PID the kernel provides
them, which means they can be tricked into reading from an
arbitrary process, and they run with full privileges as they are
usermode helper processes
Even if that specific race-window wouldn't exist it's still the
safest and cleanest way to let the kernel provide the pidfd
directly instead of requiring userspace to do it manually. In
parallel with this commit we already have systemd adding support
for this in [1]
When the usermode helper process is forked we install a pidfd file
descriptor three into the usermode helper's file descriptor table
so it's available to the exec'd program
Since usermode helpers are either children of the system_unbound_wq
workqueue or kthreadd we know that the file descriptor table is
empty and can thus always use three as the file descriptor number
Note, that we'll install a pidfd for the thread-group leader even
if a subthread is calling do_coredump(). We know that task linkage
hasn't been removed yet and even if this @current isn't the actual
thread-group leader we know that the thread-group leader cannot be
reaped until
@current has exited
- Allow telling when a task has not been found from finding the wrong
task when creating a pidfd
We currently report EINVAL whenever a struct pid has no tasked
attached anymore thereby conflating two concepts:
(1) The task has already been reaped
(2) The caller requested a pidfd for a thread-group leader but the
pid actually references a struct pid that isn't used as a
thread-group leader
This is causing issues for non-threaded workloads as in where they
expect ESRCH to be reported, not EINVAL
So allow userspace to reliably distinguish between (1) and (2)
- Make it possible to detect when a pidfs entry would outlive the
struct pid it pinned
- Add a range of new selftests
Cleanups:
- Remove unneeded NULL check from pidfd_prepare() for passed struct
pid
- Avoid pointless reference count bump during release_task()
Fixes:
- Various fixes to the pidfd and coredump selftests
- Fix error handling for replace_fd() when spawning coredump usermode
helper"
* tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: detect refcount bugs
coredump: hand a pidfd to the usermode coredump helper
coredump: fix error handling for replace_fd()
pidfs: move O_RDWR into pidfs_alloc_file()
selftests: coredump: Raise timeout to 2 minutes
selftests: coredump: Fix test failure for slow machines
selftests: coredump: Properly initialize pointer
net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
pidfs: get rid of __pidfd_prepare()
net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
pidfs: register pid in pidfs
net, pidfd: report EINVAL for ESRCH
release_task: kill the no longer needed get/put_pid(thread_pid)
pidfs: ensure consistent ENOENT/ESRCH reporting
exit: move wake_up_all() pidfd waiters into __unhash_process()
selftest/pidfd: add test for thread-group leader pidfd open for thread
pidfd: improve uapi when task isn't found
pidfd: remove unneeded NULL check from pidfd_prepare()
selftests/pidfd: adapt to recent changes
In `struct virtio_vsock_sock`, we maintain two counters:
- `rx_bytes`: used internally to track how many bytes have been read.
This supports mechanisms like .stream_has_data() and sock_rcvlowat().
- `fwd_cnt`: used for the credit mechanism to inform available receive
buffer space to the remote peer.
These counters are updated via virtio_transport_inc_rx_pkt() and
virtio_transport_dec_rx_pkt().
Since the beginning with commit 06a8fc7836 ("VSOCK: Introduce
virtio_vsock_common.ko"), we call virtio_transport_dec_rx_pkt() in
virtio_transport_stream_do_dequeue() only when we consume the entire
packet, so partial reads, do not update `rx_bytes` and `fwd_cnt`.
This is fine for `fwd_cnt`, because we still have space used for the
entire packet, and we don't want to update the credit for the other
peer until we free the space of the entire packet. However, this
causes `rx_bytes` to be stale on partial reads.
Previously, this didn’t cause issues because `rx_bytes` was used only by
.stream_has_data(), and any unread portion of a packet implied data was
still available. However, since commit 93b8088766
("virtio/vsock: fix logic which reduces credit update messages"), we now
rely on `rx_bytes` to determine if a credit update should be sent when
the data in the RX queue drops below SO_RCVLOWAT value.
This patch fixes the accounting by updating `rx_bytes` with the number
of bytes actually read, even on partial reads, while leaving `fwd_cnt`
untouched until the packet is fully consumed. Also introduce a new
`buf_used` counter to check that the remote peer is honoring the given
credit; this was previously done via `rx_bytes`.
Fixes: 93b8088766 ("virtio/vsock: fix logic which reduces credit update messages")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250521121705.196379-1-sgarzare@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmgwd00ACgkQ1w0aZmrP
KyEfwA//RXQ3i8PCa7lKHxDRhVzG3rEvgXRmiXeNd+JjzsCnybBb7+wRf3dtBGWT
+1s44Utx1JqosWxCVBulqYC5bqSC66789l5X2jhYJmUZxRrbcsqPngwnIrjb/XeK
ZJM62wiRhkBQED7yZLGy+y4VHQiG8CEMt16AOQHk863aruWv1tT7up90CTtzA545
4GF/grU3FC0PsoTLwzWyvqsWK+9uk3Y4Tifp5hU3w6uRD9EjX5tHCZlXXSqOF5gu
KT26OYsePYXhJVZIwDf2oVLGi0EVTPB9IFxZSNgLqyXqu2ILAb9OwRNVTNfTP7Pg
1RWJWmgqvRNs9OM2ecifYgQf/AfvCL0Cja1BJOjmvtICuGegrYH7G5YYQsMl9CoE
7jBoTzpToSASat5+dwoz81Bvzh447dYxRE2VmbxmRTTWToQYS1KGBPc9e3u/n5Rr
ruh8tRZ3/R0Fy+YLDkrJst3grh5RLITbuyu4ElJMArPU50mLTVYxKd6nA3BqwB5G
1GmLfCzvQH3e6PKz6CNke1AytVDy/wLTXtcbLnze2Muaj4AqhtOe5Q8ypnOO0Vyk
PsJ6U3rm2asd3GE9+AIx8gZBv8yCu1w9CiwLK8ybT2NETb2dEnqPgWeDyT7rpcaD
sQOPsBE1q/TEp9gofbYCHBm5E2mX9UP7Q6EHCTekrI97xLq8Q2M=
=fBhd
-----END PGP SIGNATURE-----
Merge tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter updates for net-next,
specifically 26 patches: 5 patches adding/updating selftests,
4 fixes, 3 PREEMPT_RT fixes, and 14 patches to enhance nf_tables):
1) Improve selftest coverage for pipapo 4 bit group format, from
Florian Westphal.
2) Fix incorrect dependencies when compiling a kernel without
legacy ip{6}tables support, also from Florian.
3) Two patches to fix nft_fib vrf issues, including selftest updates
to improve coverage, also from Florian Westphal.
4) Fix incorrect nesting in nft_tunnel's GENEVE support, from
Fernando F. Mancera.
5) Three patches to fix PREEMPT_RT issues with nf_dup infrastructure
and nft_inner to match in inner headers, from Sebastian Andrzej Siewior.
6) Integrate conntrack information into nft trace infrastructure,
from Florian Westphal.
7) A series of 13 patches to allow to specify wildcard netdevice in
netdev basechain and flowtables, eg.
table netdev filter {
chain ingress {
type filter hook ingress devices = { eth0, eth1, vlan* } priority 0; policy accept;
}
}
This also allows for runtime hook registration on NETDEV_{UN}REGISTER
event, from Phil Sutter.
netfilter pull request 25-05-23
* tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: (26 commits)
selftests: netfilter: Torture nftables netdev hooks
netfilter: nf_tables: Add notifications for hook changes
netfilter: nf_tables: Support wildcard netdev hook specs
netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc()
netfilter: nf_tables: Handle NETDEV_CHANGENAME events
netfilter: nf_tables: Wrap netdev notifiers
netfilter: nf_tables: Respect NETDEV_REGISTER events
netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events
netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook
netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook()
netfilter: nf_tables: Introduce nft_register_flowtable_ops()
netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}()
netfilter: nf_tables: Introduce functions freeing nft_hook objects
netfilter: nf_tables: add packets conntrack state to debug trace info
netfilter: conntrack: make nf_conntrack_id callable without a module dependency
netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit
netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx
netfilter: nf_dup{4, 6}: Move duplication check to task_struct
netfilter: nft_tunnel: fix geneve_opt dump
selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs
...
====================
Link: https://patch.msgid.link/20250523132712.458507-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmgwJa4ACgkQrB3Eaf9P
W7d34A//V3NukN6UNAUKd+MbH80eXCEbNSNIuVUstfr0S71qTCxovLX58u+oQztb
43mx/NsnF38TzNFWVyVzF4vcr/n0DS/Da3P5pJEjoewIYSDrz/WfOum6VpVIUsZ/
kLCDZlIoX/fBPFZDPHMmsDXDemAdrtr8CuK72NUH10vKDuGKSUG0NElqDieDBEsA
y/fqgBsyxQXi9cMdRxf+DLDK/hzqyaJmVj8B8WEcFtYXJ4RE6+jfLgAaTE6J7V5W
fYACTu/IcdtgEEm2U7wlow66oIjqqGReuWUzV9zHGJNCB9+da6L4dbGtzlRmOPdn
kI1PIALFWT2HbKnJOJJbaThO6zES1rMOm3PsWt7iVewCT8HuhAa9kDV0xzdcLQE1
+REfo8dXW9f5hRUrSuqpJFUArkCHWHLhQEcmTHaF0b2RveC/hd9rOyKIfae+fgIP
5uLU2DpwafDgw5UCjsQTLyQ5M6icO8wFgM7vKAUJWyI1Pck1ktf7Ic6+KQRNjWiv
Q7ImwpSdLH2bZpIbIKDnIcyZg3CMBIQ88cdsYi0+ckgDQ0hMf6ZrXRseXKRe0P/M
gKgBOoXIJBF7niJQTDqHjsmnYGvvhZysIJNQLf4BZFYOeF5L9OduP6ywqMe5pFKt
QAsJSZw/+SibheLEYQAzvyLD6VdMXaxeOAHlPylRRpl9vEX0l04=
=GRVJ
-----END PGP SIGNATURE-----
Merge tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
1) Remove some unnecessary strscpy_pad() size arguments.
From Thorsten Blum.
2) Correct use of xso.real_dev on bonding offloads.
Patchset from Cosmin Ratiu.
3) Add hardware offload configuration to XFRM_MSG_MIGRATE.
From Chiachang Wang.
4) Refactor migration setup during cloning. This was
done after the clone was created. Now it is done
in the cloning function itself.
From Chiachang Wang.
5) Validate assignment of maximal possible SEQ number.
Prevent from setting to the maximum sequrnce number
as this would cause for traffic drop.
From Leon Romanovsky.
6) Prevent configuration of interface index when offload
is used. Hardware can't handle this case.i
From Leon Romanovsky.
7) Always use kfree_sensitive() for SA secret zeroization.
From Zilin Guan.
ipsec-next-2025-05-23
* tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: use kfree_sensitive() for SA secret zeroization
xfrm: prevent configuration of interface index when offload is used
xfrm: validate assignment of maximal possible SEQ number
xfrm: Refactor migration setup during the cloning process
xfrm: Migrate offload configuration
bonding: Fix multiple long standing offload races
bonding: Mark active offloaded xfrm_states
xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}
xfrm: Remove unneeded device check from validate_xmit_xfrm
xfrm: Use xdo.dev instead of xdo.real_dev
net/mlx5: Avoid using xso.real_dev unnecessarily
xfrm: Remove unnecessary strscpy_pad() size arguments
====================
Link: https://patch.msgid.link/20250523075611.3723340-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub suggests:
> I have a different request :) Matt, once this ends up in net-next
> (end of this week) could you refactor it to use nlmsg_payload() ?
> It doesn't exist in net but this is exactly why it was added.
This refactors the additions to both mctp_dump_addrinfo(), and
mctp_rtm_getneigh() - two cases where we're calling nlh_data() on an
an incoming netlink message, without a prior nlmsg_parse().
For the neigh.c case, we cannot hit the failure where the nlh does not
contain a full ndmsg at present, as the core handler
(net/core/neighbour.c, neigh_get()) has already validated the size
through neigh_valid_req_get(), and would have failed the get operation
before the MCTP hander is called.
However, relying on that is a bit fragile, so apply the nlmsg_payload
refector here too.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250521-mctp-nlmsg-payload-v2-1-e85df160c405@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBN6wAKCRCRxhvAZXjc
ok32AQD9DTiSCAoVg+7s+gSBuLTi8drPTN++mCaxdTqRh5WpRAD9GVyrGQT0s6LH
eo9bm8d1TAYjilEWM0c0K0TxyQ7KcAA=
=IW7H
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs directory lookup updates from Christian Brauner:
"This contains cleanups for the lookup_one*() family of helpers.
We expose a set of functions with names containing "lookup_one_len"
and others without the "_len". This difference has nothing to do with
"len". It's rater a historical accident that can be confusing.
The functions without "_len" take a "mnt_idmap" pointer. This is found
in the "vfsmount" and that is an important question when choosing
which to use: do you have a vfsmount, or are you "inside" the
filesystem. A related question is "is permission checking relevant
here?".
nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len
functions. They pass nop_mnt_idmap and refuse to work on filesystems
which have any other idmap.
This work changes nfsd and cachefile to use the lookup_one family of
functions and to explictily pass &nop_mnt_idmap which is consistent
with all other vfs interfaces used where &nop_mnt_idmap is explicitly
passed.
The remaining uses of the "_one" functions do not require permission
checks so these are renamed to be "_noperm" and the permission
checking is removed.
This series also changes these lookup function to take a qstr instead
of separate name and len. In many cases this simplifies the call"
* tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
VFS: change lookup_one_common and lookup_noperm_common to take a qstr
Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS
VFS: rename lookup_one_len family to lookup_noperm and remove permission check
cachefiles: Use lookup_one() rather than lookup_one_len()
nfsd: Use lookup_one() rather than lookup_one_len()
VFS: improve interface for lookup_one functions
Replace kfree_skb() used in neigh_resolve_output() and
neigh_connected_output() with kfree_skb_reason().
Following new skb drop reason is added:
/* failed to fill the device hard header */
SKB_DROP_REASON_NEIGH_HH_FILLFAIL
Signed-off-by: Qiu Yutan <qiu.yutan@zte.com.cn>
Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Xu Xin <xu.xin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
sendmsg() with a single iov becomes ITER_UBUF, sendmsg() with multiple
iovs becomes ITER_IOVEC. iter_iov_len does not return correct
value for UBUF, so teach to treat UBUF differently.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Mina Almasry <almasrymina@google.com>
Fixes: bd61848900 ("net: devmem: Implement TX path")
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Acked-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Notify user space if netdev hooks are updated due to netdev add/remove
events. Send minimal notification messages by introducing
NFT_MSG_NEWDEV/DELDEV message types describing a single device only.
Upon NETDEV_CHANGENAME, the callback has no information about the
interface's old name. To provide a clear message to user space, include
the hook's stored interface name in the notification.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
User space may pass non-nul-terminated NFTA_DEVICE_NAME attribute values
to indicate a suffix wildcard.
Expect for multiple devices to match the given prefix in
nft_netdev_hook_alloc() and populate 'ops_list' with them all.
When checking for duplicate hooks, compare the shortest prefix so a
device may never match more than a single hook spec.
Finally respect the stored prefix length when hooking into new devices
from event handlers.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No point in having err_hook_alloc, just call return directly. Also
rename err_hook_dev - it's not about the hook's device but freeing the
hook itself.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
For the sake of simplicity, treat them like consecutive NETDEV_REGISTER
and NETDEV_UNREGISTER events. If the new name matches a hook spec and
registration fails, escalate the error and keep things as they are.
To avoid unregistering the newly registered hook again during the
following fake NETDEV_UNREGISTER event, leave hooks alone if their
interface spec matches the new name.
Note how this patch also skips for NETDEV_REGISTER if the device is
already registered. This is not yet possible as the new name would have
to match the old one. This will change with wildcard interface specs,
though.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>