Commit cc34acd577 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-6-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit cc34acd577 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
__netif_set_mtu is used only by bond, so move it into
NETDEV_INTERNAL namespace.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-5-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit cc34acd577 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
netif_pre_changeaddr_notify is used only by ipvlan/bond, so move it into
NETDEV_INTERNAL namespace.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-4-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit cc34acd577 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
netif_get_mac_address is used only by tun/tap, so move it into
NETDEV_INTERNAL namespace.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit cc34acd577 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Detect NICs and drivers that either drop frames with a corrupted TCP
checksum or, worse, pass them up as valid. The test flips one bit in
the checksum, transmits the packet in internal loopback, and fails when
the driver reports CHECKSUM_UNNECESSARY.
Discussed at:
https://lore.kernel.org/all/20250625132117.1b3264e8@kernel.org/
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250717083524.1645069-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new SKB drop reason (SKB_DROP_REASON_PFMEMALLOC) to track packets
dropped due to memory pressure. In production environments, we've observed
memory exhaustion reported by memory layer stack traces, but these drops
were not properly tracked in the SKB drop reason infrastructure.
While most network code paths now properly report pfmemalloc drops, some
protocol-specific socket implementations still use sk_filter() without
drop reason tracking:
- Bluetooth L2CAP sockets
- CAIF sockets
- IUCV sockets
- Netlink sockets
- SCTP sockets
- Unix domain sockets
These remaining cases represent less common paths and could be converted
in a follow-up patch if needed. The current implementation provides
significantly improved observability into memory pressure events in the
network stack, especially for key protocols like TCP and UDP, helping to
diagnose problems in production environments.
Reported-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/175268316579.2407873.11634752355644843509.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a proper description for the sk_stream_write_space() function as
previously marked by a FIXME comment.
No functional changes.
Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com>
Link: https://patch.msgid.link/20250716153404.7385-1-suchitkarunakaran@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 0937cb5f34 ("Revert "wifi: mac80211: Update
skb's control block key in ieee80211_tx_dequeue()"").
This commit broke TX with 802.11 encapsulation HW offloading, now that
this is fixed, reapply it.
Fixes: bb42f2d13f ("mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue")
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Link: https://patch.msgid.link/66b8fc39fb0194fa06c9ca7eeb6ffe0118dcb3ec.1752765971.git.repk@triplefau.lt
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
skb_get_hash() can only be used when the skb is linked to a netdev
device.
Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de>
Fixes: 73bc9e0af5 ("mac80211: don't apply flow control on management frames")
Link: https://patch.msgid.link/20250717162547.94582-3-Alexander@wetzel-home.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Ignore TXQs with the flag IEEE80211_TXQ_STOP when scheduling a queue.
The flag is only set after all fragments have been dequeued and won't
allow dequeueing other frames as long as the flag is set.
For drivers using ieee80211_txq_schedule_start() this prevents an
loop trying to push the queued frames while IEEE80211_TXQ_STOP is set:
After setting IEEE80211_TXQ_STOP the driver will call
ieee80211_return_txq(). Which calls __ieee80211_schedule_txq(), detects
that there sill are frames in the queue and immediately restarts the
stopped TXQ. Which can't dequeue any frame and thus starts over the loop.
Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de>
Fixes: ba8c3d6f16 ("mac80211: add an intermediate software queue implementation")
Link: https://patch.msgid.link/20250717162547.94582-2-Alexander@wetzel-home.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When short beaconing is enabled, check the value of the sb_count
to determine whether we are to send a long beacon or short beacon.
sb_count represents the number of short beacons until the next
long beacon, where if its value is 0 we are to send a long beacon.
The value is then reset to the long beacon period, which represents
the number of beacon intervals between each long beacon. The decrement
process follows the same cadence as the decrement of the DTIM count value.
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250717074205.312577-5-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Introduce the sb_count variable which tracks the number of
beacon intervals until the next long beacon. To initialise this
value, we find the current short beacon index into this period
which represents the number of short beacons left to send before
the next long beacon. We use the same TSF value used to initialise
the DTIM count to ensure the short beacon count and DTIM count
are in sync as its common for the long beacon period and DTIM period
to be equivalent.
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250717074205.312577-4-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Introduce the ability to parse the short beacon data and long
beacon period. The long beacon period represents the number of beacon
intervals between each long beacon transmission. Additionally,
as a BSS cannot change its configuration such that short beaconing
is dynamically disabled/enabled without tearing down the interface
- we ensure we have an existing short beacon before performing
the update.
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250717074205.312577-3-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
S1G short beacons are an optional frame type used in an S1G BSS
that contain a limited set of elements. While they are optional,
they are a fundamental part of S1G that enables significant
power saving.
Expose 2 additional netlink attributes,
NL80211_ATTR_S1G_LONG_BEACON_PERIOD which denotes the number of beacon
intervals between each long beacon and NL80211_ATTR_S1G_SHORT_BEACON
which is a nested attribute containing the short beacon tail and
head. We split them as the long beacon period cannot be updated,
and is only used when initialisng the interface, whereas the short
beacon data can be used to both initialise and update the templates.
This follows how things such as the beacon interval and DTIM period
currently operate.
During the initialisation path, we ensure we have the long beacon
period if the short beacon data is being passed down, whereas
the update path will simply update the template if its sent down.
The short beacon data is validated using the same routines for regular
beacons as they support correctly parsing the short beacon format
while ensuring the frame is well-formed.
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250717074205.312577-2-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, reset connection monitor (ieee80211_sta_reset_conn_monitor())
timer is handled only for non-AP non-MLD STA and do not support non-AP MLD
STA. The current implementation checks for the CSA active and update the
monitor timer with the timeout value of deflink and reset the timer based
on the deflink's timeout value else schedule the connection loss work when
the deflink is timed out and it won't work for the non-AP MLD STA.
Handle the reset connection monitor timer for non-AP MLD STA by updating
the monitor timer with the timeout value which is determined based on the
link that will expire last among all the links in MLO. If at least one link
has not timed out, the timer is updated accordingly with the latest timeout
value else schedule the connection loss work when all links have timed out.
Remove the MLO-related WARN_ON() checks in the beacon and connection
monitoring logic code paths as they support MLO now.
Signed-off-by: Maharaja Kennadyrajan <maharaja.kennadyrajan@oss.qualcomm.com>
Link: https://patch.msgid.link/20250718060837.59371-5-maharaja.kennadyrajan@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, reset beacon monitor (ieee80211_sta_reset_beacon_monitor())
timer is handled only for non-AP non-MLD STA and do not support non-AP MLD
STA. When the beacon loss occurs in non-AP MLD STA with the current
implementation, it is treated as a single link and the timer will reset
based on the timeout of the deflink, without checking all the links.
Check the CSA flags for all the links in the MLO and decide whether to
schedule the work queue for beacon loss. If any of the links has CSA
active, then beacon loss work is not scheduled.
Also, call the functions ieee80211_sta_reset_beacon_monitor() and
ieee80211_sta_reset_conn_monitor() from ieee80211_csa_switch_work() only
when all the links are CSA active.
Signed-off-by: Maharaja Kennadyrajan <maharaja.kennadyrajan@oss.qualcomm.com>
Link: https://patch.msgid.link/20250718060837.59371-4-maharaja.kennadyrajan@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, the existing macro for_each_link_data() uses sdata_dereference()
which requires the wiphy lock. This lock cannot be used in atomic or RCU
read-side contexts, such as in the RX path.
Introduce a new macro, for_each_link_data_rcu(), that iterates over link of
sdata using rcu_dereference(), making it safe to use in RCU contexts. This
allows callers to access link data without requiring the wiphy lock.
The macro takes into account the vif.valid_links bitmap and ensures only
valid links are accessed safely. Callers are responsible for ensuring that
rcu_read_lock() is held when using this macro.
Signed-off-by: Maharaja Kennadyrajan <maharaja.kennadyrajan@oss.qualcomm.com>
Link: https://patch.msgid.link/20250718060837.59371-3-maharaja.kennadyrajan@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The for_each_link_data() macro currently declares a local variable
__sdata directly, which could lead to compiler warnings or errors when
reused in the same function or within switch-case blocks due to variable
redefinition or invalid scoping.
To address this, restructure the macro to use an outer for-loop that runs
only once, allowing safe declaration of __sdata without polluting the outer
scope. This ensures compatibility with static analyzers.
No functional changes; this is purely a cleanup to improve macro hygiene.
Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>
Signed-off-by: Maharaja Kennadyrajan <maharaja.kennadyrajan@oss.qualcomm.com>
Link: https://patch.msgid.link/20250718060837.59371-2-maharaja.kennadyrajan@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This (partially) reverts commits
- 838c7b8f1f ("wifi: nl80211: Avoid address calculations via out of bounds array indexing")
- f1d3334d60 ("wifi: cfg80211: sme: init n_channels before channels[] access")
- 82bbe02b25 ("wifi: mac80211: Set n_channels after allocating struct cfg80211_scan_request")
These commits all set the structure to be in an inconsistent
state, setting n_channels to some value before them actually
being filled in. That's fine for what the code does now, but
with the removal of __counted_by() in 444020f4bf ("wifi:
cfg80211: remove scan request n_channels counted_by") it's no
longer needed and it does leave a bit of a landmine there
since breaking out of some code to send the scan or something
would leave it wrong.
Remove the now superfluous n_channels settings.
Link: https://patch.msgid.link/20250718103237.59510b2384c5.Ied5ba9c5c49efc008f4491c8ca7a45858a83f064@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCaHlCFwAKCRAraS+Z+3y5
E6qQAP9jVyIq+bKkZhRkew07cDNbYB01rJkJEO0Y/N7hnTyfwgD+PhiXGv5FiPp9
8iM3d51QKCOLlR/h3zc2RqR72S17RQA=
=ZaJz
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2025-07-17
We've added 13 non-merge commits during the last 20 day(s) which contain
a total of 4 files changed, 712 insertions(+), 84 deletions(-).
The main changes are:
1) Avoid skipping or repeating a sk when using a TCP bpf_iter,
from Jordan Rife.
2) Clarify the driver requirement on using the XDP metadata,
from Song Yoong Siang
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
doc: xdp: Clarify driver implementation for XDP Rx metadata
selftests/bpf: Add tests for bucket resume logic in established sockets
selftests/bpf: Create iter_tcp_destroy test program
selftests/bpf: Create established sockets in socket iterator tests
selftests/bpf: Make ehash buckets configurable in socket iterator tests
selftests/bpf: Allow for iteration over multiple states
selftests/bpf: Allow for iteration over multiple ports
selftests/bpf: Add tests for bucket resume logic in listening sockets
bpf: tcp: Avoid socket skips and repeats during iteration
bpf: tcp: Use bpf_tcp_iter_batch_item for bpf_tcp_iter_state batch items
bpf: tcp: Get rid of st_bucket_done
bpf: tcp: Make sure iter->batch always contains a full bucket snapshot
bpf: tcp: Make mem flags configurable through bpf_iter_tcp_realloc_batch
====================
Link: https://patch.msgid.link/20250717191731.4142326-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
neigh_add() updates pneigh_entry() found or created by pneigh_create().
This update is serialised by RTNL, but we will remove it.
Let's move the update part to pneigh_create() and make it return errno
instead of a pointer of pneigh_entry.
Now, the pneigh code is RTNL free.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-16-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tbl->phash_buckets[] is only modified in the slow path by pneigh_create()
and pneigh_delete() under the table lock.
Both of them are called under RTNL, so no extra lock is needed, but we
will remove RTNL from the paths.
pneigh_create() looks up a pneigh_entry, and this part can be lockless,
but it would complicate the logic like
1. lookup
2. allocate pengih_entry for GFP_KERNEL
3. lookup again but under lock
4. if found, return it after freeing the allocated memory
5. else, return the new one
Instead, let's add a per-table mutex and run lookup and allocation
under it.
Note that updating pneigh_entry part in neigh_add() is still protected
by RTNL and will be moved to pneigh_create() in the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-15-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now, all callers of pneigh_lookup() are under RCU, and the read
lock there is no longer needed.
Let's drop the lock, inline __pneigh_lookup_1() to pneigh_lookup(),
and call it from pneigh_create().
The next patch will remove tbl->lock from pneigh_create().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-14-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__pneigh_lookup() is the lockless version of pneigh_lookup(),
but its only caller pndisc_is_router() holds the table lock and
reads pneigh_netry.flags.
This is because accessing pneigh_entry after pneigh_lookup() was
illegal unless the caller holds RTNL or the table lock.
Now, pneigh_entry is guaranteed to be alive during the RCU critical
section.
Let's call pneigh_lookup() and use READ_ONCE() for n->flags in
pndisc_is_router() and remove __pneigh_lookup().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now pneigh_entry is guaranteed to be alive during the
RCU critical section even without holding tbl->lock.
Let's use rcu_dereference() in pneigh_get_{first,next}().
Note that neigh_seq_start() still holds tbl->lock for the
normal neighbour entry.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-12-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now pneigh_entry is guaranteed to be alive during the
RCU critical section even without holding tbl->lock.
Let's drop read_lock_bh(&tbl->lock) and use rcu_dereference()
to iterate tbl->phash_buckets[] in pneigh_dump_table()
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Only __dev_get_by_index() is the RTNL dependant in neigh_get().
Let's replace it with dev_get_by_index_rcu() and convert RTM_GETNEIGH
to RCU.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will convert pneigh readers to RCU, and its flags and protocol
will be read locklessly.
Let's annotate the access to the two fields.
Note that all access to pn->permanent is under RTNL (neigh_add()
and pneigh_ifdown_and_unlock()), so WRITE_ONCE() and READ_ONCE()
are not needed.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will convert RTM_GETNEIGH to RCU.
neigh_get() looks up pneigh_entry by pneigh_lookup() and passes
it to pneigh_fill_info().
Then, we must ensure that the entry is alive till pneigh_fill_info()
completes, but read_lock_bh(&tbl->lock) in pneigh_lookup() does not
guarantee that.
Also, we will convert all readers of tbl->phash_buckets[] to RCU.
Let's use call_rcu() to free pneigh_entry and update phash_buckets[]
and ->next by rcu_assign_pointer().
pneigh_ifdown_and_unlock() uses list_head to avoid overwriting
->next and moving RCU iterators to another list.
pndisc_destructor() (only IPv6 ndisc uses this) uses a mutex, so it
is not delayed to call_rcu(), where we cannot sleep. This is fine
because the mcast code works with RCU and ipv6_dev_mc_dec() frees
mcast objects after RCU grace period.
While at it, we change the return type of pneigh_ifdown_and_unlock()
to void.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The next patch will free pneigh_entry with call_rcu().
Then, we need to annotate neigh_table.phash_buckets[] and
pneigh_entry.next with __rcu.
To make the next patch cleaner, let's annotate the fields in advance.
Currently, all accesses to the fields are under the neigh table lock,
so rcu_dereference_protected() is used with 1 for now, but most of them
(except in pneigh_delete() and pneigh_ifdown_and_unlock()) will be
replaced with rcu_dereference() and rcu_dereference_check().
Note that pneigh_ifdown_and_unlock() changes pneigh_entry.next to a
local list, which is illegal because the RCU iterator could be moved
to another list. This part will be fixed in the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which
is confusing.
When called with the last argument, creat, 0, pneigh_lookup() literally
looks up a proxy neighbour entry. This is the case of the reader path
as the fast path and RTM_GETNEIGH.
pneigh_lookup(), however, creates a pneigh_entry when called with creat 1
from RTM_NEWNEIGH and SIOCSARP, which require RTNL.
Let's split pneigh_lookup() into two functions.
We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock)
in the new pneigh_lookup() will be dropped.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
neigh_valid_get_req() calls neigh_find_table() to fetch neigh_tables[].
neigh_find_table() uses rcu_dereference_rtnl(), but RTNL actually does
not protect it at all; neigh_table_clear() can be called without RTNL
and only waits for RCU readers by synchronize_rcu().
Fortunately, there is no bug because IPv4 is built-in, IPv6 cannot be
unloaded, and DECNET was removed.
To fetch neigh_tables[] by rcu_dereference() later, let's move
neigh_find_table() from neigh_valid_get_req() to neigh_get().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will remove RTNL for neigh_get() and run it under RCU instead.
neigh_get_reply() and pneigh_get_reply() allocate skb with GFP_KERNEL.
Let's move the allocation before __dev_get_by_index() in neigh_get().
Now, neigh_get_reply() and pneigh_get_reply() are inlined and
rtnl_unicast() is factorised.
We will convert pneigh_lookup() to __pneigh_lookup() later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will remove RTNL for neigh_get() and run it under RCU instead.
neigh_get() returns -EINVAL in the following cases:
* NDA_DST is not specified
* Both ndm->ndm_ifindex and NTF_PROXY are not specified
These validations do not require RCU.
Let's move them to neigh_valid_get_req().
While at it, the extack string for the first case is replaced with
NL_SET_ERR_ATTR_MISS().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
neigh_get() passes 4 local variable pointers to neigh_valid_get_req().
If it returns a pointer of struct ndmsg, we do not need to pass two
of them.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for ETHTOOL_SRXFH (setting hashing fields) in RSS_SET.
The tricky part is dealing with symmetric hashing. In netlink user
can change the hashing fields and symmetric hash in one request,
in IOCTL the two used to be set via different uAPI requests.
Since fields and hash function config are still separate driver
callbacks - changes to the two are not atomic. Keep things simple
and validate the settings against both pre- and post- change ones.
Meaning that we will reject the config request if user tries
to correct the flow fields and set input_xfrm in one request,
or disables input_xfrm and makes flow fields non-symmetric.
We can adjust it later if there's a real need. Starting simple feels
right, and potentially partially applying the settings isn't nice,
either.
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250716000331.1378807-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Support configuring symmetric hashing via Netlink.
We have the flow field config prepared as part of SET handling,
so scan it for conflicts instead of querying the driver again.
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250716000331.1378807-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Support setting RSS hashing key via ethtool Netlink.
Use the Netlink policy to make sure user doesn't pass
an empty key, "resetting" the key is not a thing.
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250716000331.1378807-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Support setting RSS hash function / algo via ethtool Netlink.
Like IOCTL we don't validate that the function is within the
range known to the kernel. The drivers do a pretty good job
validating the inputs, and the IDs are technically "dynamically
queried" rather than part of uAPI.
Only change should be that in Netlink we don't support user
explicitly passing ETH_RSS_HASH_NO_CHANGE (0), if no change
is requested the attribute should be absent.
The ETH_RSS_HASH_NO_CHANGE is retained in driver-facing
API for consistency (not that I see a strong reason for it).
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250716000331.1378807-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add initial support for RSS_SET, for now only operations on
the indirection table are supported.
Unlike the ioctl don't check if at least one parameter is
being changed. This is how other ethtool-nl ops behave,
so pick the ethtool-nl consistency vs copying ioctl behavior.
There are two special cases here:
1) resetting the table to defaults;
2) support for tables of different size.
For (1) I use an empty Netlink attribute (array of size 0).
(2) may require some background. AFAICT a lot of modern devices
allow allocating RSS tables of different sizes. mlx5 can upsize
its tables, bnxt has some "table size calculation", and Intel
folks asked about RSS table sizing in context of resource allocation
in the past. The ethtool IOCTL API has a concept of table size,
but right now the user is expected to provide a table exactly
the size the device requests. Some drivers may change the table
size at runtime (in response to queue count changes) but the
user is not in control of this. What's not great is that all
RSS contexts share the same table size. For example a device
with 128 queues enabled, 16 RSS contexts 8 queues in each will
likely have 256 entry tables for each of the 16 contexts,
while 32 would be more than enough given each context only has
8 queues. To address this the Netlink API should avoid enforcing
table size at the uAPI level, and should allow the user to express
the min table size they expect.
To fully solve (2) we will need more driver plumbing but
at the uAPI level this patch allows the user to specify
a table size smaller than what the device advertises. The device
table size must be a multiple of the user requested table size.
We then replicate the user-provided table to fill the full device
size table. This addresses the "allow the user to express the min
table size" objective, while not enforcing any fixed size.
From Netlink perspective .get_rxfh_indir_size() is now de facto
the "max" table size supported by the device.
We may choose to support table replication in ethtool, too,
when we actually plumb this thru the device APIs.
Initially I was considering moving full pattern generation
to the kernel (which queues to use, at which frequency and
what min sequence length). I don't think this complexity
would buy us much and most if not all devices have pow-2
table sizes, which simplifies the replication a lot.
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250716000331.1378807-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- hci_sync: fix connectable extended advertising when using static random address
- hci_core: fix typos in macros
- hci_core: add missing braces when using macro parameters
- hci_core: replace 'quirks' integer by 'quirk_flags' bitmap
- SMP: If an unallowed command is received consider it a failure
- SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout
- L2CAP: Fix null-ptr-deref in l2cap_sock_resume_cb()
- L2CAP: Fix attempting to adjust outgoing MTU
- btintel: Check if controller is ISO capable on btintel_classify_pkt_type
- btusb: QCA: Fix downloading wrong NVM for WCN6855 GF variant without board ID
-----BEGIN PGP SIGNATURE-----
iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmh5CF4ZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKeeYD/9oqrpfAnF+ZYakvt+W+bJx
KPBWlXlgVSnPbK9qJPWm8AUaEOz6yyGd728S0QYZ+y5map5TVMWE0n1BYfauUgch
GmUS/Li44qRVi9ygxS3CqiXHoVFtRiMJd3kx5v3SH8LHUQZakcNsFg4DCfQAufZe
uJI+/+vccBx8rF+WR3mlhziE0bUosHOLAkqujnuKg/EpVO4xc4zeG6AKK5ihVHgQ
1SlPp/s6BYz1VcMj9HMsEk6z4iY8WF5bdN1YdzkvRziTYYuFMDpJwI83FtkmmrsG
v59GwlPMsGNlz25KbapzqGgflydeXSKbigTJQr7LHAaKv4jmqnnAeCOlhkvFuFBm
snb2Zkkw16w+s/DBQvriBy6D+yiaSwKkZUjNWwGTvyDqAna6Kx44jzT1QpgOSm2p
d+rxjrNXRjT59wiIo1JsOXpK5Mbbyz5QGXge/RbUO36glh/J2Vs44F1HueZHwSSw
GGt0jmRTjB8/icbcvnkMVgwnoQEul7bsV95fPOq6CGSuRxYIX7uFXWMM/Wb/1SN7
QWQyN/P7z5XpZMWFH3SDVx/FhN6G5Pi17OkvaLVSwfKs7jK45gb10Oi6cdypL2rc
Ed6EkBOIL0ETAqDd4NLGTOeHEpJ3zfxxWqlu5cYrUf4qj7vtXk39ylNmXNNA/4Ci
lU4vQdGAHYX3BIhaRdqryQ==
=JuS2
-----END PGP SIGNATURE-----
Merge tag 'for-net-2025-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci_sync: fix connectable extended advertising when using static random address
- hci_core: fix typos in macros
- hci_core: add missing braces when using macro parameters
- hci_core: replace 'quirks' integer by 'quirk_flags' bitmap
- SMP: If an unallowed command is received consider it a failure
- SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout
- L2CAP: Fix null-ptr-deref in l2cap_sock_resume_cb()
- L2CAP: Fix attempting to adjust outgoing MTU
- btintel: Check if controller is ISO capable on btintel_classify_pkt_type
- btusb: QCA: Fix downloading wrong NVM for WCN6855 GF variant without board ID
* tag 'for-net-2025-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: L2CAP: Fix attempting to adjust outgoing MTU
Bluetooth: btusb: QCA: Fix downloading wrong NVM for WCN6855 GF variant without board ID
Bluetooth: hci_dev: replace 'quirks' integer by 'quirk_flags' bitmap
Bluetooth: hci_core: add missing braces when using macro parameters
Bluetooth: hci_core: fix typos in macros
Bluetooth: SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout
Bluetooth: SMP: If an unallowed command is received consider it a failure
Bluetooth: btintel: Check if controller is ISO capable on btintel_classify_pkt_type
Bluetooth: hci_sync: fix connectable extended advertising when using static random address
Bluetooth: Fix null-ptr-deref in l2cap_sock_resume_cb()
====================
Link: https://patch.msgid.link/20250717142849.537425-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix rxrpc to use connection-level aborts for things that affect the whole
connection, such as the service ID not matching a local service.
Fixes: 57af281e53 ("rxrpc: Tidy up abort generation infrastructure")
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250717074350.3767366-6-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Under some circumstances, such as when a server socket is closing, ABORT
packets will be generated in response to incoming packets. Unfortunately,
this also may include generating aborts in response to incoming aborts -
which may cause a cycle. It appears this may be made possible by giving
the client a multicast address.
Fix this such that rxrpc_reject_packet() will refuse to generate aborts in
response to aborts.
Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com>
cc: LePremierHomme <kwqcheii@proton.me>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250717074350.3767366-5-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a call is released, rxrpc takes the spinlock and removes it from
->recvmsg_q in an effort to prevent racing recvmsg() invocations from
seeing the same call. Now, rxrpc_recvmsg() only takes the spinlock when
actually removing a call from the queue; it doesn't, however, take it in
the lead up to that when it checks to see if the queue is empty. It *does*
hold the socket lock, which prevents a recvmsg/recvmsg race - but this
doesn't prevent sendmsg from ending the call because sendmsg() drops the
socket lock and relies on the call->user_mutex.
Fix this by firstly removing the bit in rxrpc_release_call() that dequeues
the released call and, instead, rely on recvmsg() to simply discard
released calls (done in a preceding fix).
Secondly, rxrpc_notify_socket() is abandoned if the call is already marked
as released rather than trying to be clever by setting both pointers in
call->recvmsg_link to NULL to trick list_empty(). This isn't perfect and
can still race, resulting in a released call on the queue, but recvmsg()
will now clean that up.
Fixes: 17926a7932 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com>
cc: LePremierHomme <kwqcheii@proton.me>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250717074350.3767366-4-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If a call receives an event (such as incoming data), the call gets placed
on the socket's queue and a thread in recvmsg can be awakened to go and
process it. Once the thread has picked up the call off of the queue,
further events will cause it to be requeued, and once the socket lock is
dropped (recvmsg uses call->user_mutex to allow the socket to be used in
parallel), a second thread can come in and its recvmsg can pop the call off
the socket queue again.
In such a case, the first thread will be receiving stuff from the call and
the second thread will be blocked on call->user_mutex. The first thread
can, at this point, process both the event that it picked call for and the
event that the second thread picked the call for and may see the call
terminate - in which case the call will be "released", decoupling the call
from the user call ID assigned to it (RXRPC_USER_CALL_ID in the control
message).
The first thread will return okay, but then the second thread will wake up
holding the user_mutex and, if it sees that the call has been released by
the first thread, it will BUG thusly:
kernel BUG at net/rxrpc/recvmsg.c:474!
Fix this by just dequeuing the call and ignoring it if it is seen to be
already released. We can't tell userspace about it anyway as the user call
ID has become stale.
Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: LePremierHomme <kwqcheii@proton.me>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250717074350.3767366-3-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The rxrpc_assess_MTU_size() function calls down into the IP layer to find
out the MTU size for a route. When accepting an incoming call, this is
called from rxrpc_new_incoming_call() which holds interrupts disabled
across the code that calls down to it. Unfortunately, the IP layer uses
local_bh_enable() which, config dependent, throws a warning if IRQs are
enabled:
WARNING: CPU: 1 PID: 5544 at kernel/softirq.c:387 __local_bh_enable_ip+0x43/0xd0
...
RIP: 0010:__local_bh_enable_ip+0x43/0xd0
...
Call Trace:
<TASK>
rt_cache_route+0x7e/0xa0
rt_set_nexthop.isra.0+0x3b3/0x3f0
__mkroute_output+0x43a/0x460
ip_route_output_key_hash+0xf7/0x140
ip_route_output_flow+0x1b/0x90
rxrpc_assess_MTU_size.isra.0+0x2a0/0x590
rxrpc_new_incoming_peer+0x46/0x120
rxrpc_alloc_incoming_call+0x1b1/0x400
rxrpc_new_incoming_call+0x1da/0x5e0
rxrpc_input_packet+0x827/0x900
rxrpc_io_thread+0x403/0xb60
kthread+0x2f7/0x310
ret_from_fork+0x2a/0x230
ret_from_fork_asm+0x1a/0x30
...
hardirqs last enabled at (23): _raw_spin_unlock_irq+0x24/0x50
hardirqs last disabled at (24): _raw_read_lock_irq+0x17/0x70
softirqs last enabled at (0): copy_process+0xc61/0x2730
softirqs last disabled at (25): rt_add_uncached_list+0x3c/0x90
Fix this by moving the call to rxrpc_assess_MTU_size() out of
rxrpc_init_peer() and further up the stack where it can be done without
interrupts disabled.
It shouldn't be a problem for rxrpc_new_incoming_call() to do it after the
locks are dropped as pmtud is going to be performed by the I/O thread - and
we're in the I/O thread at this point.
Fixes: a2ea9a9072 ("rxrpc: Use irq-disabling spinlocks between app and I/O thread")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com>
cc: LePremierHomme <kwqcheii@proton.me>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250717074350.3767366-2-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
htb_lookup_leaf has a BUG_ON that can trigger with the following:
tc qdisc del dev lo root
tc qdisc add dev lo root handle 1: htb default 1
tc class add dev lo parent 1: classid 1:1 htb rate 64bit
tc qdisc add dev lo parent 1:1 handle 2: netem
tc qdisc add dev lo parent 2:1 handle 3: blackhole
ping -I lo -c1 -W0.001 127.0.0.1
The root cause is the following:
1. htb_dequeue calls htb_dequeue_tree which calls the dequeue handler on
the selected leaf qdisc
2. netem_dequeue calls enqueue on the child qdisc
3. blackhole_enqueue drops the packet and returns a value that is not
just NET_XMIT_SUCCESS
4. Because of this, netem_dequeue calls qdisc_tree_reduce_backlog, and
since qlen is now 0, it calls htb_qlen_notify -> htb_deactivate ->
htb_deactiviate_prios -> htb_remove_class_from_row -> htb_safe_rb_erase
5. As this is the only class in the selected hprio rbtree,
__rb_change_child in __rb_erase_augmented sets the rb_root pointer to
NULL
6. Because blackhole_dequeue returns NULL, netem_dequeue returns NULL,
which causes htb_dequeue_tree to call htb_lookup_leaf with the same
hprio rbtree, and fail the BUG_ON
The function graph for this scenario is shown here:
0) | htb_enqueue() {
0) + 13.635 us | netem_enqueue();
0) 4.719 us | htb_activate_prios();
0) # 2249.199 us | }
0) | htb_dequeue() {
0) 2.355 us | htb_lookup_leaf();
0) | netem_dequeue() {
0) + 11.061 us | blackhole_enqueue();
0) | qdisc_tree_reduce_backlog() {
0) | qdisc_lookup_rcu() {
0) 1.873 us | qdisc_match_from_root();
0) 6.292 us | }
0) 1.894 us | htb_search();
0) | htb_qlen_notify() {
0) 2.655 us | htb_deactivate_prios();
0) 6.933 us | }
0) + 25.227 us | }
0) 1.983 us | blackhole_dequeue();
0) + 86.553 us | }
0) # 2932.761 us | qdisc_warn_nonwc();
0) | htb_lookup_leaf() {
0) | BUG_ON();
------------------------------------------
The full original bug report can be seen here [1].
We can fix this just by returning NULL instead of the BUG_ON,
as htb_dequeue_tree returns NULL when htb_lookup_leaf returns
NULL.
[1] https://lore.kernel.org/netdev/pF5XOOIim0IuEfhI-SOxTgRvNoDwuux7UHKnE_Y5-zVd4wmGvNk2ceHjKb8ORnzw0cGwfmVu42g9dL7XyJLf1NEzaztboTWcm0Ogxuojoeo=@willsroot.io/
Fixes: 512bb43eb5 ("pkt_sched: sch_htb: Optimize WARN_ONs in htb_dequeue_tree() etc.")
Signed-off-by: William Liu <will@willsroot.io>
Signed-off-by: Savino Dicanosa <savy@syst3mfailure.io>
Link: https://patch.msgid.link/20250717022816.221364-1-will@willsroot.io
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Do not offload IGMP/MLD messages as it could lead to IGMP/MLD Reports
being unintentionally flooded to Hosts. Instead, let the bridge decide
where to send these IGMP/MLD messages.
Consider the case where the local host is sending out reports in response
to a remote querier like the following:
mcast-listener-process (IP_ADD_MEMBERSHIP)
\
br0
/ \
swp1 swp2
| |
QUERIER SOME-OTHER-HOST
In the above setup, br0 will want to br_forward() reports for
mcast-listener-process's group(s) via swp1 to QUERIER; but since the
source hwdom is 0, the report is eligible for tx offloading, and is
flooded by hardware to both swp1 and swp2, reaching SOME-OTHER-HOST as
well. (Example and illustration provided by Tobias.)
Fixes: 472111920f ("net: bridge: switchdev: allow the TX data plane forwarding to be offloaded")
Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250716153551.1830255-1-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Assuming the "rx-vlan-filter" feature is enabled on a net device, the
8021q module will automatically add or remove VLAN 0 when the net device
is put administratively up or down, respectively. There are a couple of
problems with the above scheme.
The first problem is a memory leak that can happen if the "rx-vlan-filter"
feature is disabled while the device is running:
# ip link add bond1 up type bond mode 0
# ethtool -K bond1 rx-vlan-filter off
# ip link del dev bond1
When the device is put administratively down the "rx-vlan-filter"
feature is disabled, so the 8021q module will not remove VLAN 0 and the
memory will be leaked [1].
Another problem that can happen is that the kernel can automatically
delete VLAN 0 when the device is put administratively down despite not
adding it when the device was put administratively up since during that
time the "rx-vlan-filter" feature was disabled. null-ptr-unref or
bug_on[2] will be triggered by unregister_vlan_dev() for refcount
imbalance if toggling filtering during runtime:
$ ip link add bond0 type bond mode 0
$ ip link add link bond0 name vlan0 type vlan id 0 protocol 802.1q
$ ethtool -K bond0 rx-vlan-filter off
$ ifconfig bond0 up
$ ethtool -K bond0 rx-vlan-filter on
$ ifconfig bond0 down
$ ip link del vlan0
Root cause is as below:
step1: add vlan0 for real_dev, such as bond, team.
register_vlan_dev
vlan_vid_add(real_dev,htons(ETH_P_8021Q),0) //refcnt=1
step2: disable vlan filter feature and enable real_dev
step3: change filter from 0 to 1
vlan_device_event
vlan_filter_push_vids
ndo_vlan_rx_add_vid //No refcnt added to real_dev vlan0
step4: real_dev down
vlan_device_event
vlan_vid_del(dev, htons(ETH_P_8021Q), 0); //refcnt=0
vlan_info_rcu_free //free vlan0
step5: delete vlan0
unregister_vlan_dev
BUG_ON(!vlan_info); //vlan_info is null
Fix both problems by noting in the VLAN info whether VLAN 0 was
automatically added upon NETDEV_UP and based on that decide whether it
should be deleted upon NETDEV_DOWN, regardless of the state of the
"rx-vlan-filter" feature.
[1]
unreferenced object 0xffff8880068e3100 (size 256):
comm "ip", pid 384, jiffies 4296130254
hex dump (first 32 bytes):
00 20 30 0d 80 88 ff ff 00 00 00 00 00 00 00 00 . 0.............
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc 81ce31fa):
__kmalloc_cache_noprof+0x2b5/0x340
vlan_vid_add+0x434/0x940
vlan_device_event.cold+0x75/0xa8
notifier_call_chain+0xca/0x150
__dev_notify_flags+0xe3/0x250
rtnl_configure_link+0x193/0x260
rtnl_newlink_create+0x383/0x8e0
__rtnl_newlink+0x22c/0xa40
rtnl_newlink+0x627/0xb00
rtnetlink_rcv_msg+0x6fb/0xb70
netlink_rcv_skb+0x11f/0x350
netlink_unicast+0x426/0x710
netlink_sendmsg+0x75a/0xc20
__sock_sendmsg+0xc1/0x150
____sys_sendmsg+0x5aa/0x7b0
___sys_sendmsg+0xfc/0x180
[2]
kernel BUG at net/8021q/vlan.c:99!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 382 Comm: ip Not tainted 6.16.0-rc3 #61 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:unregister_vlan_dev (net/8021q/vlan.c:99 (discriminator 1))
RSP: 0018:ffff88810badf310 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88810da84000 RCX: ffffffffb47ceb9a
RDX: dffffc0000000000 RSI: 0000000000000008 RDI: ffff88810e8b43c8
RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff6cefe80
R10: ffffffffb677f407 R11: ffff88810badf3c0 R12: ffff88810e8b4000
R13: 0000000000000000 R14: ffff88810642a5c0 R15: 000000000000017e
FS: 00007f1ff68c20c0(0000) GS:ffff888163a24000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1ff5dad240 CR3: 0000000107e56000 CR4: 00000000000006f0
Call Trace:
<TASK>
rtnl_dellink (net/core/rtnetlink.c:3511 net/core/rtnetlink.c:3553)
rtnetlink_rcv_msg (net/core/rtnetlink.c:6945)
netlink_rcv_skb (net/netlink/af_netlink.c:2535)
netlink_unicast (net/netlink/af_netlink.c:1314 net/netlink/af_netlink.c:1339)
netlink_sendmsg (net/netlink/af_netlink.c:1883)
____sys_sendmsg (net/socket.c:712 net/socket.c:727 net/socket.c:2566)
___sys_sendmsg (net/socket.c:2622)
__sys_sendmsg (net/socket.c:2652)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
Fixes: ad1afb0039 ("vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)")
Reported-by: syzbot+a8b046e462915c65b10b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a8b046e462915c65b10b
Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250716034504.2285203-2-dongchenchen2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After recent changes in net-next TCP compacts skbs much more
aggressively. This unearthed a bug in TLS where we may try
to operate on an old skb when checking if all skbs in the
queue have matching decrypt state and geometry.
BUG: KASAN: slab-use-after-free in tls_strp_check_rcv+0x898/0x9a0 [tls]
(net/tls/tls_strp.c:436 net/tls/tls_strp.c:530 net/tls/tls_strp.c:544)
Read of size 4 at addr ffff888013085750 by task tls/13529
CPU: 2 UID: 0 PID: 13529 Comm: tls Not tainted 6.16.0-rc5-virtme
Call Trace:
kasan_report+0xca/0x100
tls_strp_check_rcv+0x898/0x9a0 [tls]
tls_rx_rec_wait+0x2c9/0x8d0 [tls]
tls_sw_recvmsg+0x40f/0x1aa0 [tls]
inet_recvmsg+0x1c3/0x1f0
Always reload the queue, fast path is to have the record in the queue
when we wake, anyway (IOW the path going down "if !strp->stm.full_len").
Fixes: 0d87bbd39d ("tls: strp: make sure the TCP skbs do not have overlapping data")
Link: https://patch.msgid.link/20250716143850.1520292-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A new warning in clang [1] points out a place in pep_sock_accept() where
dst is uninitialized then passed as a const pointer to pep_find_pipe():
net/phonet/pep.c:829:37: error: variable 'dst' is uninitialized when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer]
829 | newsk = pep_find_pipe(&pn->hlist, &dst, pipe_handle);
| ^~~:
Move the call to pn_skb_get_dst_sockaddr(), which initializes dst, to
before the call to pep_find_pipe(), so that dst is consistently used
initialized throughout the function.
Cc: stable@vger.kernel.org
Fixes: f7ae8d59f6 ("Phonet: allocate sock from accept syscall rather than soft IRQ")
Link: 00dacf8c22 [1]
Closes: https://github.com/ClangBuiltLinux/linux/issues/2101
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20250715-net-phonet-fix-uninit-const-pointer-v1-1-8efd1bd188b3@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Configuration request only configure the incoming direction of the peer
initiating the request, so using the MTU is the other direction shall
not be used, that said the spec allows the peer responding to adjust:
Bluetooth Core 6.1, Vol 3, Part A, Section 4.5
'Each configuration parameter value (if any is present) in an
L2CAP_CONFIGURATION_RSP packet reflects an ‘adjustment’ to a
configuration parameter value that has been sent (or, in case of
default values, implied) in the corresponding
L2CAP_CONFIGURATION_REQ packet.'
That said adjusting the MTU in the response shall be limited to ERTM
channels only as for older modes the remote stack may not be able to
detect the adjustment causing it to silently drop packets.
Link: https://github.com/bluez/bluez/issues/1422
Link: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/issues/149
Link: https://gitlab.freedesktop.org/pipewire/pipewire/-/issues/4793
Fixes: 042bb9603c ("Bluetooth: L2CAP: Fix L2CAP MTU negotiation")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
- cfg80211: fix double-free introduced earlier
- mac80211: fix RCU iteration in CSA
- iwlwifi: many cleanups (unused FW APIs, PCIe code, WoWLAN)
- mac80211: some work around how FIPS affects wifi, which was
wrong (RC4 is used by TKIP, not only WEP)
- cfg/mac80211: improvements for unsolicated probe response
handling
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmh4xE8ACgkQ10qiO8sP
aAANTA/9HWfjfXTvZ3lEZqmOeJdHxd6xw3uHA732NF+KxoKi8vnO8V63WNVxBzCF
6srSaSi0IkdtnyR/uPqG5AkWej/ouAeDy+FtzoDpP8O6G4e9yjKZBRnvP+O/gkod
ERM4ouLWsumAIZqSqrSAJHgNTOP0/n5MNiLoKHlcjZmXphL2/lM8Qay7kVmWG+vT
tqv/LrwqRiTe/a+O9PKaW2sq6z23CBhEqWf6hhFCaEPVQXQJLBc6FTI4008zP/Rt
aSq672rMwA8ckuVOH0EAX544DMJc736to9XJEuu9EzuOaDZ9FwfqF9nOTMnFDev2
k9qn8Q5vsx+786A6pXw+sarnxhUxnQCYB00MhMe1e1yoB0lv57hZCD1Z0JTBGXPh
8ghmbelYBvkLollie6zUUcdf3G98ADUYvHsR0elBgGADWPHvnbn3ORVcv852X6jS
El1CR3lyvwwHn9USKaR91ESQWFRtLtPG+OmmLRLIG7wdkp0sDjHOpvV7A/7QpFLw
A18QrN+ORNL3Pja4A4dyIpDRQiTKHIjEEm3dAlVDzSWcPpcws5WlfreC1L71XblA
kRl+1I7SZ87rp8d6TXXxc8QmIMA+M4OL6M4FMAJgjn8VdyjlaC6lvl1i+tgdaDik
ISrY/78gnnt93BQmHU4P/syvh+NdpgNN+LgeaikKb1kmjG837Yk=
=g8VQ
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2025-07-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Another set of changes, notably:
- cfg80211: fix double-free introduced earlier
- mac80211: fix RCU iteration in CSA
- iwlwifi: many cleanups (unused FW APIs, PCIe code, WoWLAN)
- mac80211: some work around how FIPS affects wifi, which was
wrong (RC4 is used by TKIP, not only WEP)
- cfg/mac80211: improvements for unsolicated probe response
handling
* tag 'wireless-next-2025-07-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (64 commits)
wifi: cfg80211: fix double free for link_sinfo in nl80211_station_dump()
wifi: cfg80211: fix off channel operation allowed check for MLO
wifi: mac80211: use RCU-safe iteration in ieee80211_csa_finish
wifi: mac80211_hwsim: Update comments in header
wifi: mac80211: parse unsolicited broadcast probe response data
wifi: cfg80211: parse attribute to update unsolicited probe response template
wifi: mac80211: don't use TPE data from assoc response
wifi: mac80211: handle WLAN_HT_ACTION_NOTIFY_CHANWIDTH async
wifi: mac80211: simplify __ieee80211_rx_h_amsdu() loop
wifi: mac80211: don't mark keys for inactive links as uploaded
wifi: mac80211: only assign chanctx in reconfig
wifi: mac80211_hwsim: Declare support for AP scanning
wifi: mac80211: clean up cipher suite handling
wifi: mac80211: don't send keys to driver when fips_enabled
wifi: cfg80211: Fix interface type validation
wifi: mac80211: remove ieee80211_link_unreserve_chanctx() return value
wifi: mac80211: don't unreserve never reserved chanctx
mwl8k: Add missing check after DMA map
wifi: mac80211: make VHT opmode NSS ignore a debug message
wifi: iwlwifi: remove support of several iwl_ppag_table_cmd versions
...
====================
Link: https://patch.msgid.link/20250717094610.20106-47-johannes@sipsolutions.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmh4xIoACgkQ1w0aZmrP
KyGuKg/8CA1NTSW0EIwaxOTbZrbBOrgByEvEGDpvKdekBrY6HNh4lTTjA4yvpjgb
Enh6RhakBKImXDBZQrcbXrLhgoi3Uo5UJ7QzOZM3utytOg8L87E2+VXBdlH+ajMZ
WG3+JaCZTKGadM53RYxzI3hvBbU7YAK8R10Aeqo8hvRV/nf7WXt6x/QB6xs4B6cy
UrxQQBqV6v+2ch/hD4a1ljOplDIEtCpO+2wvWIVu3uCeJjZ50lWQHuVsjEIFQMNh
9AnO/6l+kpstHUwSElctEc9MKtrzreQngeWu9xlrLarmnY4RBHnfNQj1yBMQsVHY
PhYiLBN8D10+QD+UysXO8M3B8vIlU2+oIrCpf63dEbCNYd4gJ8J3sf3eF/C37BRf
RrfdW+ahWooYjiLyILs9R938BZcQY6KM6+4pciimY0soZXmiMKV+kuexvGGf9fDM
YJwLn1hIbM9c4B+DVRV2NLLycLfYIlvymJX3mm/ZejyD1Z+OuYMtybotGxhRgzV8
onkzWcZNerNeGYPig259t4ieitJ7iY2NeBKi23ih5vBFQieUoOA7eO684sN6GKPF
lk8de29pD8V5NStXYqNBkXx4Q3WIhvj088qdyqKEAuRNN3fCNyrY1KlWc6ZhcqQy
5BgHYyJnl3NIOZYuSTINAVHTExD5f0P9arkNDyhVAb8UvjdhPsc=
=5X7Y
-----END PGP SIGNATURE-----
Merge tag 'nf-25-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains Netfilter fixes for net:
1) Three patches to enhance conntrack selftests for resize and clash
resolution, from Florian Westphal.
2) Expand nft_concat_range.sh selftest to improve coverage from error
path, from Florian Westphal.
3) Hide clash bit to userspace from netlink dumps until there is a
good reason to expose, from Florian Westphal.
4) Revert notification for device registration/unregistration for
nftables basechains and flowtables, we decided to go for a better
way to handle this through the nfnetlink_hook infrastructure which
will come via nf-next, patch from Phil Sutter.
5) Fix crash in conntrack due to race related to SLAB_TYPESAFE_BY_RCU
that results in removing a recycled object that is not yet in the
hashes. Move IPS_CONFIRM setting after the object is in the hashes.
From Florian Westphal.
netfilter pull request 25-07-17
* tag 'nf-25-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
Revert "netfilter: nf_tables: Add notifications for hook changes"
netfilter: nf_tables: hide clash bit from userspace
selftests: netfilter: nft_concat_range.sh: send packets to empty set
selftests: netfilter: conntrack_resize.sh: also use udpclash tool
selftests: netfilter: add conntrack clash resolution test case
selftests: netfilter: conntrack_resize.sh: extend resize test
====================
Link: https://patch.msgid.link/20250717095808.41725-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
A crash in conntrack was reported while trying to unlink the conntrack
entry from the hash bucket list:
[exception RIP: __nf_ct_delete_from_lists+172]
[..]
#7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]
#8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]
#9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]
[..]
The nf_conn struct is marked as allocated from slab but appears to be in
a partially initialised state:
ct hlist pointer is garbage; looks like the ct hash value
(hence crash).
ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected
ct->timeout is 30000 (=30s), which is unexpected.
Everything else looks like normal udp conntrack entry. If we ignore
ct->status and pretend its 0, the entry matches those that are newly
allocated but not yet inserted into the hash:
- ct hlist pointers are overloaded and store/cache the raw tuple hash
- ct->timeout matches the relative time expected for a new udp flow
rather than the absolute 'jiffies' value.
If it were not for the presence of IPS_CONFIRMED,
__nf_conntrack_find_get() would have skipped the entry.
Theory is that we did hit following race:
cpu x cpu y cpu z
found entry E found entry E
E is expired <preemption>
nf_ct_delete()
return E to rcu slab
init_conntrack
E is re-inited,
ct->status set to 0
reply tuplehash hnnode.pprev
stores hash value.
cpu y found E right before it was deleted on cpu x.
E is now re-inited on cpu z. cpu y was preempted before
checking for expiry and/or confirm bit.
->refcnt set to 1
E now owned by skb
->timeout set to 30000
If cpu y were to resume now, it would observe E as
expired but would skip E due to missing CONFIRMED bit.
nf_conntrack_confirm gets called
sets: ct->status |= CONFIRMED
This is wrong: E is not yet added
to hashtable.
cpu y resumes, it observes E as expired but CONFIRMED:
<resumes>
nf_ct_expired()
-> yes (ct->timeout is 30s)
confirmed bit set.
cpu y will try to delete E from the hashtable:
nf_ct_delete() -> set DYING bit
__nf_ct_delete_from_lists
Even this scenario doesn't guarantee a crash:
cpu z still holds the table bucket lock(s) so y blocks:
wait for spinlock held by z
CONFIRMED is set but there is no
guarantee ct will be added to hash:
"chaintoolong" or "clash resolution"
logic both skip the insert step.
reply hnnode.pprev still stores the
hash value.
unlocks spinlock
return NF_DROP
<unblocks, then
crashes on hlist_nulls_del_rcu pprev>
In case CPU z does insert the entry into the hashtable, cpu y will unlink
E again right away but no crash occurs.
Without 'cpu y' race, 'garbage' hlist is of no consequence:
ct refcnt remains at 1, eventually skb will be free'd and E gets
destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.
To resolve this, move the IPS_CONFIRMED assignment after the table
insertion but before the unlock.
Pablo points out that the confirm-bit-store could be reordered to happen
before hlist add resp. the timeout fixup, so switch to set_bit and
before_atomic memory barrier to prevent this.
It doesn't matter if other CPUs can observe a newly inserted entry right
before the CONFIRMED bit was set:
Such event cannot be distinguished from above "E is the old incarnation"
case: the entry will be skipped.
Also change nf_ct_should_gc() to first check the confirmed bit.
The gc sequence is:
1. Check if entry has expired, if not skip to next entry
2. Obtain a reference to the expired entry.
3. Call nf_ct_should_gc() to double-check step 1.
nf_ct_should_gc() is thus called only for entries that already failed an
expiry check. After this patch, once the confirmed bit check passes
ct->timeout has been altered to reflect the absolute 'best before' date
instead of a relative time. Step 3 will therefore not remove the entry.
Without this change to nf_ct_should_gc() we could still get this sequence:
1. Check if entry has expired.
2. Obtain a reference.
3. Call nf_ct_should_gc() to double-check step 1:
4 - entry is still observed as expired
5 - meanwhile, ct->timeout is corrected to absolute value on other CPU
and confirm bit gets set
6 - confirm bit is seen
7 - valid entry is removed again
First do check 6), then 4) so the gc expiry check always picks up either
confirmed bit unset (entry gets skipped) or expiry re-check failure for
re-inited conntrack objects.
This change cannot be backported to releases before 5.19. Without
commit 8a75a2c174 ("netfilter: conntrack: remove unconfirmed list")
|= IPS_CONFIRMED line cannot be moved without further changes.
Cc: Razvan Cojocaru <rzvncj@gmail.com>
Link: https://lore.kernel.org/netfilter-devel/20250627142758.25664-1-fw@strlen.de/
Link: https://lore.kernel.org/netfilter-devel/4239da15-83ff-4ca4-939d-faef283471bb@gmail.com/
Fixes: 1397af5bfd ("netfilter: conntrack: remove the percpu dying list")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since "net: gro: use cb instead of skb->network_header", the skb network
header is no longer set in the GRO path.
This breaks fraglist segmentation, which relies on ip_hdr()/tcp_hdr()
to check for address/port changes.
Fix this regression by selectively setting the network header for merged
segment skbs.
Fixes: 186b1ea73a ("net: gro: use cb instead of skb->network_header")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250705150622.10699-1-nbd@nbd.name
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
pmc->idev is still used in ip6_mc_clear_src(), so as mld_clear_delrec()
does, the reference should be put after ip6_mc_clear_src() return.
Fixes: 63ed8de4be ("mld: add mc_lock for protecting per-interface mld data")
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://patch.msgid.link/20250714141957.3301871-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use __skb_queue_purge() instead of re-implementing it. Note that it uses
kfree_skb_reason() instead of kfree_skb() internally, and pass
SKB_DROP_REASON_QUEUE_PURGE drop reason to the kfree_skb tracepoint.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250715120709.3941510-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The CI reported a UaF in tcp_prune_ofo_queue():
BUG: KASAN: slab-use-after-free in tcp_prune_ofo_queue+0x55d/0x660
Read of size 4 at addr ffff8880134729d8 by task socat/20348
CPU: 0 UID: 0 PID: 20348 Comm: socat Not tainted 6.16.0-rc5-virtme #1 PREEMPT(full)
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl+0x82/0xd0
print_address_description.constprop.0+0x2c/0x400
print_report+0xb4/0x270
kasan_report+0xca/0x100
tcp_prune_ofo_queue+0x55d/0x660
tcp_try_rmem_schedule+0x855/0x12e0
tcp_data_queue+0x4dd/0x2260
tcp_rcv_established+0x5e8/0x2370
tcp_v4_do_rcv+0x4ba/0x8c0
__release_sock+0x27a/0x390
release_sock+0x53/0x1d0
tcp_sendmsg+0x37/0x50
sock_write_iter+0x3c1/0x520
vfs_write+0xc09/0x1210
ksys_write+0x183/0x1d0
do_syscall_64+0xc1/0x380
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fcf73ef2337
Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffd4f924708 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fcf73ef2337
RDX: 0000000000002000 RSI: 0000555f11d1a000 RDI: 0000000000000008
RBP: 0000555f11d1a000 R08: 0000000000002000 R09: 0000000000000000
R10: 0000000000000040 R11: 0000000000000246 R12: 0000000000000008
R13: 0000000000002000 R14: 0000555ee1a44570 R15: 0000000000002000
</TASK>
Allocated by task 20348:
kasan_save_stack+0x24/0x50
kasan_save_track+0x14/0x30
__kasan_slab_alloc+0x59/0x70
kmem_cache_alloc_node_noprof+0x110/0x340
__alloc_skb+0x213/0x2e0
tcp_collapse+0x43f/0xff0
tcp_try_rmem_schedule+0x6b9/0x12e0
tcp_data_queue+0x4dd/0x2260
tcp_rcv_established+0x5e8/0x2370
tcp_v4_do_rcv+0x4ba/0x8c0
__release_sock+0x27a/0x390
release_sock+0x53/0x1d0
tcp_sendmsg+0x37/0x50
sock_write_iter+0x3c1/0x520
vfs_write+0xc09/0x1210
ksys_write+0x183/0x1d0
do_syscall_64+0xc1/0x380
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 20348:
kasan_save_stack+0x24/0x50
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x60
__kasan_slab_free+0x38/0x50
kmem_cache_free+0x149/0x330
tcp_prune_ofo_queue+0x211/0x660
tcp_try_rmem_schedule+0x855/0x12e0
tcp_data_queue+0x4dd/0x2260
tcp_rcv_established+0x5e8/0x2370
tcp_v4_do_rcv+0x4ba/0x8c0
__release_sock+0x27a/0x390
release_sock+0x53/0x1d0
tcp_sendmsg+0x37/0x50
sock_write_iter+0x3c1/0x520
vfs_write+0xc09/0x1210
ksys_write+0x183/0x1d0
do_syscall_64+0xc1/0x380
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff888013472900
which belongs to the cache skbuff_head_cache of size 232
The buggy address is located 216 bytes inside of
freed 232-byte region [ffff888013472900, ffff8880134729e8)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13472
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x80000000000040(head|node=0|zone=1)
page_type: f5(slab)
raw: 0080000000000040 ffff88800198fb40 ffffea0000347b10 ffffea00004f5290
raw: 0000000000000000 0000000000120012 00000000f5000000 0000000000000000
head: 0080000000000040 ffff88800198fb40 ffffea0000347b10 ffffea00004f5290
head: 0000000000000000 0000000000120012 00000000f5000000 0000000000000000
head: 0080000000000001 ffffea00004d1c81 00000000ffffffff 00000000ffffffff
head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888013472880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888013472900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888013472980: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
^
ffff888013472a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888013472a80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
Indeed tcp_prune_ofo_queue() is reusing the skb dropped a few lines
above. The caller wants to enqueue 'in_skb', lets check space vs the
latter.
Fixes: 1d2fbaad7c ("tcp: stronger sk_rcvbuf checks")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: syzbot+865aca08c0533171bf6a@syzkaller.appspotmail.com
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/b78d2d9bdccca29021eed9a0e7097dd8dc00f485.1752567053.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The requirement of ->get_rxfh_fields() in ethtool_set_rxfh() is there to
verify that we have no conflict of input_xfrm with the RSS fields
options, there is no point in doing it if input_xfrm is not
supported/requested.
This is under the assumption that a driver that supports input_xfrm will
also support ->get_rxfh_fields(), so add a WARN_ON() to
ethtool_check_ops() to verify it, and remove the op NULL check.
This fixes the following error in mlx4_en, which doesn't support
getting/setting RXFH fields.
$ ethtool --set-rxfh-indir eth2 hfunc xor
Cannot set RX flow hash configuration: Operation not supported
Fixes: 72792461c8 ("net: ethtool: don't mux RXFH via rxnfc callbacks")
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250715140754.489677-1-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 'quirks' member already ran out of bits on some platforms some time
ago. Replace the integer member by a bitmap in order to have enough bits
in future. Replace raw bit operations by accessor macros.
Fixes: ff26b2dd65 ("Bluetooth: Add quirk for broken READ_VOICE_SETTING")
Fixes: 127881334e ("Bluetooth: Add quirk for broken READ_PAGE_SCAN_TYPE")
Suggested-by: Pauli Virtanen <pav@iki.fi>
Tested-by: Ivan Pravdin <ipravdin.official@gmail.com>
Signed-off-by: Kiran K <kiran.k@intel.com>
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This replaces the usage of HCI_ERROR_REMOTE_USER_TERM, which as the name
suggest is to indicate a regular disconnection initiated by an user,
with HCI_ERROR_AUTH_FAILURE to indicate the session has timeout thus any
pairing shall be considered as failed.
Fixes: 1e91c29eb6 ("Bluetooth: Use hci_disconnect for immediate disconnection from SMP")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
If a command is received while a bonding is ongoing consider it a
pairing failure so the session is cleanup properly and the device is
disconnected immediately instead of continuing with other commands that
may result in the session to get stuck without ever completing such as
the case bellow:
> ACL Data RX: Handle 2048 flags 0x02 dlen 21
SMP: Identity Information (0x08) len 16
Identity resolving key[16]: d7e08edef97d3e62cd2331f82d8073b0
> ACL Data RX: Handle 2048 flags 0x02 dlen 21
SMP: Signing Information (0x0a) len 16
Signature key[16]: 1716c536f94e843a9aea8b13ffde477d
Bluetooth: hci0: unexpected SMP command 0x0a from XX:XX:XX:XX:XX:XX
> ACL Data RX: Handle 2048 flags 0x02 dlen 12
SMP: Identity Address Information (0x09) len 7
Address: XX:XX:XX:XX:XX:XX (Intel Corporate)
While accourding to core spec 6.1 the expected order is always BD_ADDR
first first then CSRK:
When using LE legacy pairing, the keys shall be distributed in the
following order:
LTK by the Peripheral
EDIV and Rand by the Peripheral
IRK by the Peripheral
BD_ADDR by the Peripheral
CSRK by the Peripheral
LTK by the Central
EDIV and Rand by the Central
IRK by the Central
BD_ADDR by the Central
CSRK by the Central
When using LE Secure Connections, the keys shall be distributed in the
following order:
IRK by the Peripheral
BD_ADDR by the Peripheral
CSRK by the Peripheral
IRK by the Central
BD_ADDR by the Central
CSRK by the Central
According to the Core 6.1 for commands used for key distribution "Key
Rejected" can be used:
'3.6.1. Key distribution and generation
A device may reject a distributed key by sending the Pairing Failed command
with the reason set to "Key Rejected".
Fixes: b28b494366 ("Bluetooth: Add strict checks for allowed SMP PDUs")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Currently, the connectable flag used by the setup of an extended
advertising instance drives whether we require privacy when trying to pass
a random address to the advertising parameters (Own Address).
If privacy is not required, then it automatically falls back to using the
controller's public address. This can cause problems when using controllers
that do not have a public address set, but instead use a static random
address.
e.g. Assume a BLE controller that does not have a public address set.
The controller upon powering is set with a random static address by default
by the kernel.
< HCI Command: LE Set Random Address (0x08|0x0005) plen 6
Address: E4:AF:26:D8:3E:3A (Static)
> HCI Event: Command Complete (0x0e) plen 4
LE Set Random Address (0x08|0x0005) ncmd 1
Status: Success (0x00)
Setting non-connectable extended advertisement parameters in bluetoothctl
mgmt
add-ext-adv-params -r 0x801 -x 0x802 -P 2M -g 1
correctly sets Own address type as Random
< HCI Command: LE Set Extended Advertising Parameters (0x08|0x0036)
plen 25
...
Own address type: Random (0x01)
Setting connectable extended advertisement parameters in bluetoothctl mgmt
add-ext-adv-params -r 0x801 -x 0x802 -P 2M -g -c 1
mistakenly sets Own address type to Public (which causes to use Public
Address 00:00:00:00:00:00)
< HCI Command: LE Set Extended Advertising Parameters (0x08|0x0036)
plen 25
...
Own address type: Public (0x00)
This causes either the controller to emit an Invalid Parameters error or to
mishandle the advertising.
This patch makes sure that we use the already set static random address
when requesting a connectable extended advertising when we don't require
privacy and our public address is not set (00:00:00:00:00:00).
Fixes: 3fe318ee72 ("Bluetooth: move hci_get_random_address() to hci_sync")
Signed-off-by: Alessandro Gasbarroni <alex.gasbarroni@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
mptcp_disconnect() clears the fallback bit unconditionally, without
touching the associated flags.
The bit clear is safe, as no fallback operation can race with that --
all subflow are already in TCP_CLOSE status thanks to the previous
FASTCLOSE -- but we need to consistently reset all the fallback related
status.
Also acquire the relevant lock, to avoid fouling static analyzers.
Fixes: b29fcfb54c ("mptcp: full disconnect implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250714-net-mptcp-fallback-races-v1-3-391aff963322@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have races similar to the one addressed by the previous patch between
subflow failing and additional subflow creation. They are just harder to
trigger.
The solution is similar. Use a separate flag to track the condition
'socket state prevent any additional subflow creation' protected by the
fallback lock.
The socket fallback makes such flag true, and also receiving or sending
an MP_FAIL option.
The field 'allow_infinite_fallback' is now always touched under the
relevant lock, we can drop the ONCE annotation on write.
Fixes: 478d770008 ("mptcp: send out MP_FAIL when data checksum fails")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250714-net-mptcp-fallback-races-v1-2-391aff963322@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These is no need to check null for idev before return NULL.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250714081732.3109764-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rds_tcp_accept_one() starts with a pretty much verbatim
copy of kernel_accept(). Might as well use the real thing...
That code went into mainline in 2009, kernel_accept()
had been added in Aug 2006, the copyright on rds/tcp_listen.c
is "Copyright (c) 2006 Oracle", so it's entirely possible
that it predates the introduction of kernel_accept().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://patch.msgid.link/20250713180134.GC1880847@ZenIV
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Prior to calling bind() a program may call connect() on a socket to
restrict to a remote peer address.
Using connect() is the normal mechanism to specify a remote network
peer, so we use that here. In MCTP connect() is only used for bound
sockets - send() is not available for MCTP since a tag must be provided
for each message.
The smctp_type must match between connect() and bind() calls.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://patch.msgid.link/20250710-mctp-bind-v4-6-8ec2f6460c56@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ensure that a specific EID (remote or local) bind will match in
preference to a MCTP_ADDR_ANY bind.
This adds infrastructure for binding a socket to receive messages from a
specific remote peer address, a future commit will expose an API for
this.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://patch.msgid.link/20250710-mctp-bind-v4-5-8ec2f6460c56@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When a specific EID is passed as a bind address, it only makes sense to
interpret with an actual network ID, so resolve that to the default
network at bind time.
For bind address of MCTP_ADDR_ANY, we want to be able to capture traffic
to any network and address, so keep the current behaviour of matching
traffic from any network interface (don't interpret MCTP_NET_ANY as
the default network ID).
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://patch.msgid.link/20250710-mctp-bind-v4-3-8ec2f6460c56@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Disallow bind() calls that have the same arguments as existing bound
sockets. Previously multiple sockets could bind() to the same
type/local address, with an arbitrary socket receiving matched messages.
This is only a partial fix, a future commit will define precedence order
for MCTP_ADDR_ANY versus specific EID bind(), which are allowed to exist
together.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://patch.msgid.link/20250710-mctp-bind-v4-2-8ec2f6460c56@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The sock was not being released. Other than leaking, the stale socket
will conflict with subsequent bind() calls in unrelated MCTP tests.
Fixes: 46ee16462f ("net: mctp: test: Add extaddr routing output test")
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://patch.msgid.link/20250710-mctp-bind-v4-1-8ec2f6460c56@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In cfg80211_off_channel_oper_allowed(), the current logic disallows
off-channel operations if any link operates on a radar channel,
assuming such channels cannot be vacated. This assumption holds for
non-MLO interfaces but not for MLO.
With MLO and multi-radio devices, different links may operate on
separate radios. This allows one link to scan off-channel while
another remains on a radar channel. For example, in a 5 GHz
split-phy setup, the lower band can scan while the upper band
stays on a radar channel.
Off-channel operations can be allowed if the radio/link onto which the
input channel falls is different from the radio/link which has an active
radar channel. Therefore, fix cfg80211_off_channel_oper_allowed() by
returning false only if the requested channel maps to the same radio as
an active radar channel. Allow off-channel operations when the requested
channel is on a different radio, as in MLO with multi-radio setups.
Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>
Signed-off-by: Amith A <quic_amitajit@quicinc.com>
Link: https://patch.msgid.link/20250714040742.538550-1-quic_amitajit@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The ieee80211_csa_finish() function currently uses for_each_sdata_link()
to iterate over links of sdata. However, this macro internally uses
wiphy_dereference(), which expects the wiphy->mtx lock to be held.
When ieee80211_csa_finish() is invoked under an RCU read-side critical
section (e.g., under rcu_read_lock()), this leads to a warning from the
RCU debugging framework.
WARNING: suspicious RCU usage
net/mac80211/cfg.c:3830 suspicious rcu_dereference_protected() usage!
This warning is triggered because wiphy_dereference() is not safe to use
without holding the wiphy mutex, and it is being used in an RCU context
without the required locking.
Fix this by introducing and using a new macro, for_each_sdata_link_rcu(),
which performs RCU-safe iteration over sdata links using
list_for_each_entry_rcu() and rcu_dereference(). This ensures that the
link pointers are accessed safely under RCU and eliminates the warning.
Fixes: f600832794 ("wifi: mac80211: restructure tx profile retrieval for MLO MBSSID")
Signed-off-by: Maharaja Kennadyrajan <maharaja.kennadyrajan@oss.qualcomm.com>
Link: https://patch.msgid.link/20250711033846.40455-1-maharaja.kennadyrajan@oss.qualcomm.com
[unindent like the non-RCU macro]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
During commands like channel switch and color change, the updated
unsolicited broadcast probe response template may be provided. However,
this data is not parsed or acted upon in mac80211.
Add support to parse it and set the BSS changed flag
BSS_CHANGED_UNSOL_BCAST_PROBE_RESP so that drivers could take further
action.
Signed-off-by: Yuvarani V <quic_yuvarani@quicinc.com>
Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>
Link: https://patch.msgid.link/20250710-update_unsol_bcast_probe_resp-v2-2-31aca39d3b30@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
At present, the updated unsolicited broadcast probe response template is
not processed during userspace commands such as channel switch or color
change. This leads to an issue where older incorrect unsolicited probe
response is still used during these events.
Add support to parse the netlink attribute and store it so that
mac80211/drivers can use it to set the BSS_CHANGED_UNSOL_BCAST_PROBE_RESP
flag in order to send the updated unsolicited broadcast probe response
templates during these events.
Signed-off-by: Yuvarani V <quic_yuvarani@quicinc.com>
Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>
Link: https://patch.msgid.link/20250710-update_unsol_bcast_probe_resp-v2-1-31aca39d3b30@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since there's no TPE element in the (re)assoc response, trying
to use the data from it just leads to using the defaults, even
though the real values had been set during authentication from
the discovered BSS information.
Fix this by simply not handling the TPE data in assoc response
since it's not intended to be present, if it changes later the
necessary changes will be made by tracking beacons later.
As a side effect, by passing the real frame subtype, now print
a correct value for ML reconfiguration responses.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250709233537.caa1ca853f5a.I588271f386731978163aa9d84ae75d6f79633e16@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If this action frame, with the value of IEEE80211_HT_CHANWIDTH_ANY,
arrives right after a beacon that changed the operational bandwidth from
20 MHz to 40 MHz, then updating the rate control bandwidth to 40 can
race with updating the chanctx width (that happens in the beacon
proccesing) back to 40 MHz:
cpu0 cpu1
ieee80211_rx_mgmt_beacon
ieee80211_config_bw
ieee80211_link_change_chanreq
(*)ieee80211_link_update_chanreq
ieee80211_rx_h_action
(**)ieee80211_sta_cur_vht_bw
(***) ieee80211_recalc_chanctx_chantype
in (**), the maximum between the capability width and the bss width is
returned. But the bss width was just updated to 40 in (*),
so the action frame handling code will increase the width of the rate
control before the chanctx was increased (in ***), leading to a FW error
(at least in iwlwifi driver. But this is wrong regardless).
Fix this by simply handling the action frame async, so it won't race
with the beacon proccessing.
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218632
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250709233537.bb9dc6f36c35.I39782d6077424e075974c3bee4277761494a1527@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
During resume, the driver can call ieee80211_add_gtk_rekey for keys that
are not programmed into the device, e.g. keys of inactive links.
Don't mark such a key as uploaded to avoid removing it later from the
driver/device.
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250709233537.655094412b0b.Iacae31af3ba2a705da0a9baea976c2f799d65dc4@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
At the end of reconfig we are activating all the links that were active
before the error.
During the activation, _ieee80211_link_use_channel will unassign and
re-assign the chanctx from/to the link.
But we only need to do the assign, as we are re-building the state as it
was before the reconfig.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250709233537.6245c3ae7031.Ia5f68992c7c112bea8a426c9339f50c88be3a9ca@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Under the previous commit's assumption that FIPS isn't
supported by hardware, we don't need to modify the
cipher suite list, but just need to use the software
one instead of the driver's in this case, so clean up
the code.
Also fix it to exclude TKIP in this case, since that's
also dependent on RC4.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250709233537.cff427e8f8a5.I744d1ea6a37e3ea55ae8bc3e770acee734eff268@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When fips_enabled is set, don't send any keys to the driver
(including possibly WoWLAN KEK/KCK material), assuming that
no device exists with the necessary certifications. If this
turns out to be false in the future, we can add a HW flag.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250709233537.e5eebc2b19d8.I968ef8c9ffb48d464ada78685bd25d22349fb063@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
All the paths that could return an error are considered
misuses of the function and WARN already, and none of
the callers ever check the return value. Remove it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250709233537.5b436ee3c20c.Ieff61ec510939adb5fe6da4840557b649b3aa820@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If a link has no chanctx, indicating it is an inactive link
that we tracked CSA for, then attempting to unreserve the
reserved chanctx will throw a warning and fail, since there
never was a reserved chanctx. Skip the unreserve.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250709233537.022192f4b1ae.Ib58156ac13e674a9f4d714735be0764a244c0aae@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, TCP stack accepts incoming packet if sizes of receive queues
are below sk->sk_rcvbuf limit.
This can cause memory overshoot if the packet is big, like an 1/2 MB
BIG TCP one.
Refine the check to take into account the incoming skb truesize.
Note that we still accept the packet if the receive queue is empty,
to not completely freeze TCP flows in pathological conditions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These functions to not modify the skb, add a const qualifier.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_measure_rcv_mss() is used to update icsk->icsk_ack.rcv_mss
(tcpi_rcv_mss in tcp_info) and tp->scaling_ratio.
Calling it from tcp_data_queue_ofo() makes sure these
fields are updated, and permits a better tuning
of sk->sk_rcvbuf, in the case a new flow receives many ooo
packets.
Fixes: dfa2f04833 ("tcp: get rid of sysctl_tcp_adv_win_scale")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new SNMP MIB : LINUX_MIB_BEYOND_WINDOW
Incremented when an incoming packet is received beyond the
receiver window.
nstat -az | grep TcpExtBeyondWindow
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, TCP accepts incoming packets which might go beyond the
offered RWIN.
Add to tcp_sequence() the validation of packet end sequence.
Add the corresponding check in the fast path.
We relax this new constraint if the receive queue is empty,
to not freeze flows from buggy peers.
Add a new drop reason : SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A net device has a threaded sysctl that can be used to enable threaded
NAPI polling on all of the NAPI contexts under that device. Allow
enabling threaded NAPI polling at individual NAPI level using netlink.
Extend the netlink operation `napi-set` and allow setting the threaded
attribute of a NAPI. This will enable the threaded polling on a NAPI
context.
Add a test in `nl_netdev.py` that verifies various cases of threaded
NAPI being set at NAPI and at device level.
Tested
./tools/testing/selftests/net/nl_netdev.py
TAP version 13
1..7
ok 1 nl_netdev.empty_check
ok 2 nl_netdev.lo_check
ok 3 nl_netdev.page_pool_check
ok 4 nl_netdev.napi_list_check
ok 5 nl_netdev.dev_set_threaded
ok 6 nl_netdev.napi_set_threaded
ok 7 nl_netdev.nsim_rxq_reset_down
# Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250710211203.3979655-1-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, __mkroute_output overrules the MTU value configured for
broadcast routes.
This buggy behaviour can be reproduced with:
ip link set dev eth1 mtu 9000
ip route del broadcast 192.168.0.255 dev eth1 proto kernel scope link src 192.168.0.2
ip route add broadcast 192.168.0.255 dev eth1 proto kernel scope link src 192.168.0.2 mtu 1500
The maximum packet size should be 1500, but it is actually 8000:
ping -b 192.168.0.255 -s 8000
Fix __mkroute_output to allow MTU values to be configured for
for broadcast routes (to support a mixed-MTU local-area-network).
Signed-off-by: Oscar Maes <oscmaes92@gmail.com>
Link: https://patch.msgid.link/20250710142714.12986-1-oscmaes92@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
x25_terminate_link() has been unused since the last use was removed
in 2020 by:
commit 7eed751b3b ("net/x25: handle additional netdev events")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://patch.msgid.link/20250712205759.278777-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a follow-up for commit eb1ac9ff6c ("ipv6: anycast: Don't
hold RTNL for IPV6_JOIN_ANYCAST.").
We should not add a new device lookup API without netdevice_tracker.
Let's pass netdevice_tracker to dev_get_by_flags_rcu() and rename it
with netdev_ prefix to match other newer APIs.
Note that we always use GFP_ATOMIC for netdev_hold() as it's expected
to be called under RCU.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/netdev/20250708184053.102109f6@kernel.org/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250711051120.2866855-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove a bunch of unused xdr_*decode* functions:
The last use of xdr_decode_netobj() was removed in 2021 by:
commit 7cf96b6d01 ("lockd: Update the NLMv4 SHARE arguments decoder to
use struct xdr_stream")
The last use of xdr_decode_string_inplace() was removed in 2021 by:
commit 3049e974a7 ("lockd: Update the NLMv4 FREE_ALL arguments decoder
to use struct xdr_stream")
The last use of xdr_stream_decode_opaque() was removed in 2024 by:
commit fed8a17c61 ("xdrgen: typedefs should use the built-in string and
opaque functions")
The functions xdr_stream_decode_string() and
xdr_stream_decode_opaque_dup() were both added in 2018 by the
commit 0e779aa703 ("SUNRPC: Add helpers for decoding opaque and string
types")
but never used.
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://lore.kernel.org/r/20250712233006.403226-1-linux@treblig.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Replace the offset-based approach for tracking progress through a bucket
in the TCP table with one based on socket cookies. Remember the cookies
of unprocessed sockets from the last batch and use this list to
pick up where we left off or, in the case that the next socket
disappears between reads, find the first socket after that point that
still exists in the bucket and resume from there.
This approach guarantees that all sockets that existed when iteration
began and continue to exist throughout will be visited exactly once.
Sockets that are added to the table during iteration may or may not be
seen, but if they are they will be seen exactly once.
Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Prepare for the next patch that tracks cookies between iterations by
converting struct sock **batch to union bpf_tcp_iter_batch_item *batch
inside struct bpf_tcp_iter_state.
Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Get rid of the st_bucket_done field to simplify TCP iterator state and
logic. Before, st_bucket_done could be false if bpf_iter_tcp_batch
returned a partial batch; however, with the last patch ("bpf: tcp: Make
sure iter->batch always contains a full bucket snapshot"),
st_bucket_done == true is equivalent to iter->cur_sk == iter->end_sk.
Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Require that iter->batch always contains a full bucket snapshot. This
invariant is important to avoid skipping or repeating sockets during
iteration when combined with the next few patches. Before, there were
two cases where a call to bpf_iter_tcp_batch may only capture part of a
bucket:
1. When bpf_iter_tcp_realloc_batch() returns -ENOMEM.
2. When more sockets are added to the bucket while calling
bpf_iter_tcp_realloc_batch(), making the updated batch size
insufficient.
In cases where the batch size only covers part of a bucket, it is
possible to forget which sockets were already visited, especially if we
have to process a bucket in more than two batches. This forces us to
choose between repeating or skipping sockets, so don't allow this:
1. Stop iteration and propagate -ENOMEM up to userspace if reallocation
fails instead of continuing with a partial batch.
2. Try bpf_iter_tcp_realloc_batch() with GFP_USER just as before, but if
we still aren't able to capture the full bucket, call
bpf_iter_tcp_realloc_batch() again while holding the bucket lock to
guarantee the bucket does not change. On the second attempt use
GFP_NOWAIT since we hold onto the spin lock.
I did some manual testing to exercise the code paths where GFP_NOWAIT is
used and where ERR_PTR(err) is returned. I used the realloc test cases
included later in this series to trigger a scenario where a realloc
happens inside bpf_iter_tcp_batch and made a small code tweak to force
the first realloc attempt to allocate a too-small batch, thus requiring
another attempt with GFP_NOWAIT. Some printks showed both reallocs with
the tests passing:
Jun 27 00:00:53 crow kernel: again GFP_USER
Jun 27 00:00:53 crow kernel: again GFP_NOWAIT
Jun 27 00:00:53 crow kernel: again GFP_USER
Jun 27 00:00:53 crow kernel: again GFP_NOWAIT
With this setup, I also forced each of the bpf_iter_tcp_realloc_batch
calls to return -ENOMEM to ensure that iteration ends and that the
read() in userspace fails.
Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Prepare for the next patch which needs to be able to choose either
GFP_USER or GFP_NOWAIT for calls to bpf_iter_tcp_realloc_batch.
Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
The return value of sock_sendmsg() is signed, and svc_tcp_sendto() wants
a signed value to return.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This ends up returning AUTH_BADCRED when memory allocation fails today.
Fix it to return AUTH_FAILED, which better indicates a failure on the
server.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
rq_accept_statp should point to the location of the accept_status in the
reply. This field is not reset between RPCs so if svc_authenticate or
pg_authenticate return SVC_DENIED without setting the pointer, it could
result in the status being written to the wrong place.
This pointer starts its lifetime as NULL. Reset it on every iteration
so we get consistent behavior if this happens.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
In the case of an unknown error code from svc_authenticate or
pg_authenticate, return AUTH_ERROR with a status of AUTH_FAILED. Also
add the other auth_stat value from RFC 5531, and document all the status
codes.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Convert the svc_wake_up tracepoint into svc_pool_thread_event class.
Have it also record the pool id, and add new tracepoints for when the
thread is already running and for when there are no idle threads.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
csum_partial_copy_to_xdr is only used inside the sunrpc module, so
remove the export.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
csum_partial_copy_to_xdr can handle a checksumming and non-checksumming
case and implements this using a callback, which leads to a lot of
boilerplate code and indirect calls in the fast path.
Switch to storing a need_checksum flag in struct xdr_skb_reader instead
to remove the indirect call and simplify the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The rqst argument to xdr_init_encode_pages is set to NULL by all callers,
and pages is always set to buf->pages. Remove the two arguments and
hardcode the assignments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Rename the existing sha1_init() to sha1_init_raw(), since it conflicts
with the upcoming library function. This will later be removed, but
this keeps the kernel building for the introduction of the library.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250712232329.818226-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
This reverts commit 465b9ee0ee.
Such notifications fit better into core or nfnetlink_hook code,
following the NFNL_MSG_HOOK_GET message format.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its a kernel implementation detail, at least at this time:
We can later decide to revert this patch if there is a compelling
reason, but then we should also remove the ifdef that prevents exposure
of ip_conntrack_status enum IPS_NAT_CLASH value in the uapi header.
Clash entries are not included in dumps (true for both old /proc
and ctnetlink) either. So for now exclude the clash bit when dumping.
Fixes: 7e5c6aa67e ("netfilter: nf_tables: add packets conntrack state to debug trace info")
Link: https://lore.kernel.org/netfilter-devel/aGwf3dCggwBlRKKC@strlen.de/
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Once unix_sock ->path is set, we are guaranteed that its ->path will remain
unchanged (and pinned) until the socket is closed. OTOH, dentry_open()
does not modify the path passed to it.
IOW, there's no need to copy unix_sk(sk)->path in unix_open_file() - we
can just pass it to dentry_open() and be done with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/20250712054157.GZ1880847@ZenIV
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
pending_refcnt to decrement to zero before returning. The pending_refcnt
is decremented by 1 when the skb->destructor function is called,
indicating that the skb has been successfully sent and needs to be
destroyed.
If an error occurs during this process, the tpacket_snd() function will
exit and return error, but pending_refcnt may not yet have decremented to
zero. Assuming the next send operation is executed immediately, but there
are no available frames to be sent in tx_ring (i.e., packet_current_frame
returns NULL), and skb is also NULL, the function will not execute
wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
will enter a do-while loop, waiting for pending_refcnt to be zero. Even
if the previous skb has completed transmission, the skb->destructor
function can only be invoked in the ksoftirqd thread (assuming NAPI
threading is enabled). When both the ksoftirqd thread and the tpacket_snd
operation happen to run on the same CPU, and the CPU trapped in the
do-while loop without yielding, the ksoftirqd thread will not get
scheduled to run. As a result, pending_refcnt will never be reduced to
zero, and the do-while loop cannot exit, eventually leading to a CPU soft
lockup issue.
In fact, skb is true for all but the first iterations of that loop, and
as long as pending_refcnt is not zero, even if incremented by a previous
call, wait_for_completion_interruptible_timeout() should be executed to
yield the CPU, allowing the ksoftirqd thread to be scheduled. Therefore,
the execution condition of this function should be modified to check if
pending_refcnt is not zero, instead of check skb.
- if (need_wait && skb) {
+ if (need_wait && packet_read_pending(&po->tx_ring)) {
As a result, the judgment conditions are duplicated with the end code of
the while loop, and packet_read_pending() is a very expensive function.
Actually, this loop can only exit when ph is NULL, so the loop condition
can be changed to while (1), and in the "ph = NULL" branch, if the
subsequent condition of if is not met, the loop can break directly. Now,
the loop logic remains the same as origin but is clearer and more obvious.
Fixes: 89ed5b5190 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
Cc: stable@kernel.org
Suggested-by: LongJun Tang <tanglongjun@kylinos.cn>
Signed-off-by: Yun Lu <luyun@kylinos.cn>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to the changes in commit 581073f626 ("af_packet: do not call
packet_read_pending() from tpacket_destruct_skb()"), every time
tpacket_destruct_skb() is executed, the skb_completion is marked as
completed. When wait_for_completion_interruptible_timeout() returns
completed, the pending_refcnt has not yet been reduced to zero.
Therefore, when ph is NULL, the wait function may need to be called
multiple times until packet_read_pending() finally returns zero.
We should call sock_sndtimeo() only once, otherwise the SO_SNDTIMEO
constraint could be way off.
Fixes: 581073f626 ("af_packet: do not call packet_read_pending() from tpacket_destruct_skb()")
Cc: stable@kernel.org
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yun Lu <luyun@kylinos.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A race condition can occur when 'agg' is modified in qfq_change_agg
(called during qfq_enqueue) while other threads access it
concurrently. For example, qfq_dump_class may trigger a NULL
dereference, and qfq_delete_class may cause a use-after-free.
This patch addresses the issue by:
1. Moved qfq_destroy_class into the critical section.
2. Added sch_tree_lock protection to qfq_dump_class and
qfq_dump_class_stats.
Fixes: 462dbc9101 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- bump version strings, by Simon Wunderlich
- batman-adv: store hard_iface as iflink private data,
by Matthias Schiffer
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmhv7XkWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoUnbEACJ27/RIB99REd5dRNkX/IgFYe9
S2wKLHeD/kQlYDTm7Qht12kytGWKmipBftW3Wk/Hd30jyE8fzfSw22dz7lAAKZgH
O3R0nYxaZ9EGNBKDWWzrJd6+nDGxbxlvzU6Ce7momnYtSD4gfscLwSiYDhwaL6qI
gXSZ599kDjL1rs5JK28Vb+TlSnvzHYhKmBGuobMHTwGvnE66Xg0UHLcM5VQhm4ue
kuogMDA09ntzroKMTT1IU8Hs9EmnWLdTeXV/KT3RL/WVwpLAMdRhCmuhNo4vEEf9
bvtp2lN+NFV86ObtGEy66w12uT9BvDMUrzmUIToW83Y5qhcHkDYBJMQ/m379nh7h
uYakwkGWdQi9wg6zT3iS3I5qJJiN1SKCwV5uZXqleeY75GdUzDrwfwCWqbdQh3km
kceIH9asENRrIVD9fK7iP5t5s4yge+dMFJIVniYZIe7l58cvsK9/lfTE5j4MU3TH
kf4usLY1v9ST+Q3CA7HCDbxecKF34/V7//W5erKGEs6YklfPhtE4sp7HwATt8Akd
0AB15pc8vx7A+HlZbRvdJTINE6tvRxpAhJc533iKm7peA27oYCjdpUzznahwbh+W
8A+jF8MZVSd2ItyisQDpUbhYE6YWiWzzBX6fHbzSqY4ckQcx6iJrATzYSPDA3OPt
eWF726Qz3IuogSPwjA==
=sHBk
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-pullrequest-20250710' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- batman-adv: store hard_iface as iflink private data,
by Matthias Schiffer
* tag 'batadv-next-pullrequest-20250710' of git://git.open-mesh.org/linux-merge:
batman-adv: store hard_iface as iflink private data
batman-adv: Start new development cycle
====================
Link: https://patch.msgid.link/20250710164501.153872-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_skbedit_params
makes sure there is no discrepancy in tcf_skbedit_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_police_params
makes sure there is no discrepancy in tcf_police_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_pedit_params
makes sure there is no discrepancy in tcf_pedit_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_nat_params
makes sure there is no discrepancy in tcf_nat_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_mpls_params
makes sure there is no discrepancy in tcf_mpls_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_ctinfo_params
makes sure there is no discrepancy in tcf_ctinfo_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 21c167aa0b ("net/sched: act_ctinfo: use percpu stats")
missed that stats_dscp_set, stats_dscp_error and stats_cpmark_set
might be written (and read) locklessly.
Use atomic64_t for these three fields, I doubt act_ctinfo is used
heavily on big SMP hosts anyway.
Fixes: 24ec483cec ("net: sched: Introduce act_ctinfo action")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pedro Tammela <pctammela@mojatatu.com>
Link: https://patch.msgid.link/20250709090204.797558-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_ct_params
makes sure there is no discrepancy in tcf_ct_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_csum_params
makes sure there is no discrepancy in tcf_csum_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Also storing tcf_action into struct tcf_connmark_parms
makes sure there is no discrepancy in tcf_connmark_act().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250709090204.797558-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
netem_enqueue's duplication prevention logic breaks when a netem
resides in a qdisc tree with other netems - this can lead to a
soft lockup and OOM loop in netem_dequeue, as seen in [1].
Ensure that a duplicating netem cannot exist in a tree with other
netems.
Previous approaches suggested in discussions in chronological order:
1) Track duplication status or ttl in the sk_buff struct. Considered
too specific a use case to extend such a struct, though this would
be a resilient fix and address other previous and potential future
DOS bugs like the one described in loopy fun [2].
2) Restrict netem_enqueue recursion depth like in act_mirred with a
per cpu variable. However, netem_dequeue can call enqueue on its
child, and the depth restriction could be bypassed if the child is a
netem.
3) Use the same approach as in 2, but add metadata in netem_skb_cb
to handle the netem_dequeue case and track a packet's involvement
in duplication. This is an overly complex approach, and Jamal
notes that the skb cb can be overwritten to circumvent this
safeguard.
4) Prevent the addition of a netem to a qdisc tree if its ancestral
path contains a netem. However, filters and actions can cause a
packet to change paths when re-enqueued to the root from netem
duplication, leading us to the current solution: prevent a
duplicating netem from inhabiting the same tree as other netems.
[1] https://lore.kernel.org/netdev/8DuRWwfqjoRDLDmBMlIfbrsZg9Gx50DHJc1ilxsEBNe2D6NMoigR_eIRIG0LOjMc3r10nUUZtArXx4oZBIdUfZQrwjcQhdinnMis_0G7VEk=@willsroot.io/
[2] https://lwn.net/Articles/719297/
Fixes: 0afb51e728 ("[PKT_SCHED]: netem: reinsert for duplication")
Reported-by: William Liu <will@willsroot.io>
Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
Signed-off-by: William Liu <will@willsroot.io>
Signed-off-by: Savino Dicanosa <savy@syst3mfailure.io>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20250708164141.875402-1-will@willsroot.io
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.16-rc6-2).
No conflicts.
Adjacent changes:
drivers/net/wireless/mediatek/mt76/mt7925/mcu.c
c701574c54 ("wifi: mt76: mt7925: fix invalid array index in ssid assignment during hw scan")
b3a431fe2e ("wifi: mt76: mt7925: fix off by one in mt7925_mcu_hw_scan()")
drivers/net/wireless/mediatek/mt76/mt7996/mac.c
62da647a2b ("wifi: mt76: mt7996: Add MLO support to mt7996_tx_check_aggr()")
dc66a129ad ("wifi: mt76: add a wrapper for wcid access with validation")
drivers/net/wireless/mediatek/mt76/mt7996/main.c
3dd6f67c66 ("wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl()")
8989d8e90f ("wifi: mt76: mt7996: Do not set wcid.sta to 1 in mt7996_mac_sta_event()")
net/mac80211/cfg.c
58fcb1b428 ("wifi: mac80211: reject VHT opmode for unsupported channel widths")
037dc18ac3 ("wifi: mac80211: add support for storing station S1G capabilities")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use attach_type in bpf_link, and remove it in sockmap_link.
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250710032038.888700-4-chen.dylane@linux.dev
Attach_type will be set when a link is created by user. It is better to
record attach_type in bpf_link generically and have it available
universally for all link types. So add the attach_type field in bpf_link
and move the sleepable field to avoid unnecessary gap padding.
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250710032038.888700-2-chen.dylane@linux.dev
Commit under Fixes tightened up the memory accounting for Netlink
sockets. Looks like the accounting is too strict for some existing
use cases, Marek reported issues with nl80211 / WiFi iw CLI.
To reduce number of iterations Netlink dumps try to allocate
messages based on the size of the buffer passed to previous
recvmsg() calls. If user space uses a larger buffer in recvmsg()
than sk_rcvbuf we will allocate an skb we won't be able to queue.
Make sure we always allow at least one skb to be queued.
Same workaround is already present in netlink_attachskb().
Alternative would be to cap the allocation size to
rcvbuf - rmem_alloc
but as I said, the workaround is already present in other places.
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/9794af18-4905-46c6-b12c-365ea2f05858@samsung.com
Fixes: ae8f160e7e ("netlink: Fix wraparounds of sk->sk_rmem_alloc.")
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711001121.3649033-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We need to allow queuing at least one skb even when skb is
larger than sk->sk_rcvbuf.
The cited commit made a mistake while converting a condition
in netlink_broadcast_deliver().
Let's correct the rmem check for the allow-one-skb rule.
Fixes: ae8f160e7e ("netlink: Fix wraparounds of sk->sk_rmem_alloc.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711053208.2965945-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmhvEc8ACgkQ1w0aZmrP
KyGYmBAAxw/EnlVp9sHWAPBNka6KOQECAb1LrnBVjv5nFZK3jHy8ehb/XjW0Q/rR
ScjTgGJLBVuXMSGfFYBsDezesLYZLs/QK3amsTP9+HtYYGdOxdgvmTPMrUe2mF4W
He/xrvWpVKTV00vnpVvZe5zbWnpw9iDTXbDecxmArqNVu78LOsRFU20S2MPc86ne
wbgJohCS/QvhHPsKBUd83niGtKXR8uaFitLDaCgaaqqF4noXfNxIpdxm2O3NIqu1
2gbEAp3qi1YWHUYHQUxsK7W7JjY8xfd9EFLViRdQCI0aevE01GmwJc4J2oDlZ7Ki
DqRFEYzpwpxqe0YwihAsB72faXMOsWFODwIiojlAue26W5iVLzu+w7hBUiOBiq6E
CEBH2OSL47n3s/Va6e1Gb9W2A24x+Qp+EZEZb0Ci0I0MlstOw0aUutoc/P6F9MxI
v9ApavHkt5/FImeXiTB3j61ZodgduC5WLZ7HCM5EjrCjlZznsJFuxtWSYARyYTUM
aj3U00KRW85WgeuvWj1aTGTiT7dmTsQVo7XllEpzfXWvVxE0ATwfiLRfUQwtUwGx
6SecA49s34AwgdUKMynfPtlkxIzraGjhzJmGdTCA1Hi1fw2TUYniASmHRDKVQvUA
1jYCQucYf+Q1zybosscXepUCtsVSPTZTZE/gQSDCUy6WNvHnv84=
=zECx
-----END PGP SIGNATURE-----
Merge tag 'nf-next-25-07-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next (v2)
The following series contains an initial small batch of Netfilter
updates for net-next:
1) Remove DCCP conntrack support, keep DCCP matches around in order to
avoid breakage when loading ruleset, add Kconfig to wrap the code
so it can be disabled by distributors.
2) Remove buggy code aiming at shrinking netlink deletion event, then
re-add it correctly in another patch. This is to prevent -stable to
pick up on a fix that breaks old userspace. From Phil Sutter.
3) Missing WARN_ON_ONCE() to check for lockdep_commit_lock_is_held()
to uncover bugs. From Fedor Pchelkin.
* tag 'nf-next-25-07-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: adjust lockdep assertions handling
netfilter: nf_tables: Reintroduce shortened deletion notifications
netfilter: nf_tables: Drop dead code from fill_*_info routines
netfilter: conntrack: remove DCCP protocol support
====================
Link: https://patch.msgid.link/20250710010706.2861281-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use addrconf_add_dev() instead of ipv6_find_idev() in
addrconf_gre_config() so that we don't just get the inet6_dev, but also
install the default ff00::/8 multicast route.
Before commit 3e6a0243ff ("gre: Fix again IPv6 link-local address
generation."), the multicast route was created at the end of the
function by addrconf_add_mroute(). But this code path is now only taken
in one particular case (gre devices not bound to a local IP address and
in EUI64 mode). For all other cases, the function exits early and
addrconf_add_mroute() is not called anymore.
Using addrconf_add_dev() instead of ipv6_find_idev() in
addrconf_gre_config(), fixes the problem as it will create the default
multicast route for all gre devices. This also brings
addrconf_gre_config() a bit closer to the normal netdevice IPv6
configuration code (addrconf_dev_config()).
Cc: stable@vger.kernel.org
Fixes: 3e6a0243ff ("gre: Fix again IPv6 link-local address generation.")
Reported-by: Aiden Yang <ling@moedove.com>
Closes: https://lore.kernel.org/netdev/CANR=AhRM7YHHXVxJ4DmrTNMeuEOY87K2mLmo9KMed1JMr20p6g@mail.gmail.com/
Reviewed-by: Gary Guo <gary@garyguo.net>
Tested-by: Gary Guo <gary@garyguo.net>
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/027a923dcb550ad115e6d93ee8bb7d310378bd01.1752070620.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When updating an existing route entry in atrtr_create(), the old device
reference was not being released before assigning the new device,
leading to a device refcount leak. Fix this by calling dev_put() to
release the old device reference before holding the new one.
Fixes: c7f905f0f6 ("[ATALK]: Add missing dev_hold() to atrtr_create().")
Signed-off-by: Kito Xu <veritas501@foxmail.com>
Link: https://patch.msgid.link/tencent_E1A26771CDAB389A0396D1681A90A49E5D09@qq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Implement ETHTOOL_GRXFH over Netlink. The number of flow types is
reasonable (around 20) so report all of them at once for simplicity.
Do not maintain the flow ID mapping with ioctl at the uAPI level.
This gives us a chance to clean up the confusion that come from
RxNFC vs RxFH (flow direction vs hashing) in the ioctl.
Try to align with the names used in ethtool CLI, they seem to have
stood the test of time just fine. One annoyance is that we still
call L4 ports the weird names, but I guess they also apply to IPSec
(where they cover the SPI) so it is what it is.
$ ynl --family ethtool --dump rss-get
{
"header": {
"dev-index": 1,
"dev-name": "enp1s0"
},
"hfunc": 1,
"hkey": b"...",
"indir": [0, 1, ...],
"flow-hash": {
"ether": {"l2da"},
"ah-esp4": {"ip-src", "ip-dst"},
"ah-esp6": {"ip-src", "ip-dst"},
"ah4": {"ip-src", "ip-dst"},
"ah6": {"ip-src", "ip-dst"},
"esp4": {"ip-src", "ip-dst"},
"esp6": {"ip-src", "ip-dst"},
"ip4": {"ip-src", "ip-dst"},
"ip6": {"ip-src", "ip-dst"},
"sctp4": {"ip-src", "ip-dst"},
"sctp6": {"ip-src", "ip-dst"},
"udp4": {"ip-src", "ip-dst"},
"udp6": {"ip-src", "ip-dst"}
"tcp4": {"l4-b-0-1", "l4-b-2-3", "ip-src", "ip-dst"},
"tcp6": {"l4-b-0-1", "l4-b-2-3", "ip-src", "ip-dst"},
},
}
Link: https://patch.msgid.link/20250708220640.2738464-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Looks like some drivers (ena, enetc, fbnic.. there's probably more)
consider ETHER_FLOW to be legitimate target for flow hashing.
I'm not sure how intentional that is from the uAPI perspective
vs just an effect of ethtool IOCTL doing minimal input validation.
But Netlink will do strict validation, so we need to decide whether
we allow this use case or not. I don't see a strong reason against
it, and rejecting it would potentially regress a number of drivers.
So update the comments and flow_type_hashable().
Link: https://patch.msgid.link/20250708220640.2738464-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit 040cef30b5 ("net: ethtool: move get_rxfh callback
under the rss_lock") we're expected to take rss_lock around get.
Switch dump to using the new prep helper and move the locking into it.
Link: https://patch.msgid.link/20250708220640.2738464-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- mt76: firmware recovery improvements, MLO work
- iwlwifi: use embedded PNVM in (to be released) FW images
to fix compatibility issues
- cfg80211/mac80211: extended regulatory info support (6 GHz)
- cfg80211: use "faux device" for regulatory
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmhvskoACgkQ10qiO8sP
aACdthAAiN1wzi9Hrpm2ULqDecEL553ecvfOPAzlYibF4wiulVEdKJd7nFn03udA
n2ylEgiANbTb+yEgDjj6IwQ9w5XmYvlLSGO/wIag3UUSWUYuCC5if/JEq56lom+B
IZFMOFXpvO35jSe1h1v0HuuJvbeZv7KDWMJlA4aSXbLx4Y1juAuQid4/YllUkuXt
QgAziAhF7laNk+8nfQLQ3N1DQytftiDK32vCJ6kJ7ciEhh8qxwT5aVkmE0Q/iOLg
na4TdrcRRMQ7kkgODqksGz1nVrr/0AHyMVi/rQ/YaL1uY8fBxvbs0nmjU8G78LkO
gM/401kySEzgLl6x/xrnaVUupbAtsyYrHYH+4q6Z07UPg1OgYY4G71rJIMp/Hq52
/iQVMOFmbhoIqFLqlvmh92QqZRH+rMf/KnO9Pnh/wVdxGH74ZP0bFYB02bqdaRie
uxtO9lHfnuMedET85D747rmgAqrBJ2t3BqAD++LNdqF530eOaWiBkAPeAbTRSgis
d1BJoDtWL3l0tjAA1ivdUw/fPjqWxffpdDTtzSwRjuFUNE5YPpz3VuKNSurSgvMZ
DCSvxTHf5DLVrg6b4W/YXickD2kArJLpaCwkxzkpyBzh6wyRvxs+b/oZSQRVeKvL
C36I3WNDqARfC29ilYUiz/G8kJry8IQaSuLaMP6X92Fa6Hdl7Kg=
=qkYl
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Quite a bit more work, notably:
- mt76: firmware recovery improvements, MLO work
- iwlwifi: use embedded PNVM in (to be released) FW images
to fix compatibility issues
- cfg80211/mac80211: extended regulatory info support (6 GHz)
- cfg80211: use "faux device" for regulatory
* tag 'wireless-next-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (48 commits)
wifi: mac80211: don't complete management TX on SAE commit
wifi: cfg80211/mac80211: implement dot11ExtendedRegInfoSupport
wifi: mac80211: send extended MLD capa/ops if AP has it
wifi: mac80211: copy first_part into HW scan
wifi: cfg80211: add a flag for the first part of a scan
wifi: mac80211: remove DISALLOW_PUNCTURING_5GHZ code
wifi: cfg80211: only verify part of Extended MLD Capabilities
wifi: nl80211: make nl80211_check_scan_flags() type safe
wifi: cfg80211: hide scan internals
wifi: mac80211: fix deactivated link CSA
wifi: mac80211: add mandatory bitrate support for 6 GHz
wifi: mac80211: remove spurious blank line
wifi: mac80211: verify state before connection
wifi: mac80211: avoid weird state in error path
wifi: iwlwifi: mvm: remove support for iwl_wowlan_info_notif_v4
wifi: iwlwifi: bump minimum API version in BZ
wifi: iwlwifi: mvm: remove unneeded argument
wifi: iwlwifi: mvm: remove MLO GTK rekey code
wifi: iwlwifi: pcie: rename iwl_pci_gen1_2_probe() argument
wifi: iwlwifi: match discrete/integrated to fix some names
...
====================
Link: https://patch.msgid.link/20250710123113.24878-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ND_PRINTK with val > 1 only works when the ND_DEBUG was set in compilation
phase. Replace it with dynamic debug. Convert ND_PRINTK with val <= 1 to
net_{err,warn}_ratelimited, and convert the rest to net_dbg_ratelimited.
Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250708033342.1627636-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch provides a setsockopt method to let applications leverage to
adjust how many descs to be handled at most in one send syscall. It
mitigates the situation where the default value (32) that is too small
leads to higher frequency of triggering send syscall.
Considering the prosperity/complexity the applications have, there is no
absolutely ideal suggestion fitting all cases. So keep 32 as its default
value like before.
The patch does the following things:
- Add XDP_MAX_TX_SKB_BUDGET socket option.
- Set max_tx_budget to 32 by default in the initialization phase as a
per-socket granular control.
- Set the range of max_tx_budget as [32, xs->tx->nentries].
The idea behind this comes out of real workloads in production. We use a
user-level stack with xsk support to accelerate sending packets and
minimize triggering syscalls. When the packets are aggregated, it's not
hard to hit the upper bound (namely, 32). The moment user-space stack
fetches the -EAGAIN error number passed from sendto(), it will loop to try
again until all the expected descs from tx ring are sent out to the driver.
Enlarging the XDP_MAX_TX_SKB_BUDGET value contributes to less frequency of
sendto() and higher throughput/PPS.
Here is what I did in production, along with some numbers as follows:
For one application I saw lately, I suggested using 128 as max_tx_budget
because I saw two limitations without changing any default configuration:
1) XDP_MAX_TX_SKB_BUDGET, 2) socket sndbuf which is 212992 decided by
net.core.wmem_default. As to XDP_MAX_TX_SKB_BUDGET, the scenario behind
this was I counted how many descs are transmitted to the driver at one
time of sendto() based on [1] patch and then I calculated the
possibility of hitting the upper bound. Finally I chose 128 as a
suitable value because 1) it covers most of the cases, 2) a higher
number would not bring evident results. After twisting the parameters,
a stable improvement of around 4% for both PPS and throughput and less
resources consumption were found to be observed by strace -c -p xxx:
1) %time was decreased by 7.8%
2) error counter was decreased from 18367 to 572
[1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250704160138.48677-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In reconfig we add the virtual monitor in 2 cases:
1. If we are resuming (it was deleted on suspend)
2. If it was added after an error but before the reconfig
(due to the last non-monitor interface removal).
In the second case, the removal of the non-monitor interface will succeed
but the addition of the virtual monitor will fail, so we add it in the
reconfig.
The problem is that we mislead the driver to think that this is an existing
interface that is getting re-added - while it is actually a completely new
interface from the drivers' point of view.
Some drivers act differently when a interface is re-added. For example, it
might not initialize things because they were already initialized.
Such drivers will - in this case - be left with a partialy initialized vif.
To fix it, add the virtual monitor after reconfig_complete, so the
driver will know that this is a completely new interface.
Fixes: 3c3e21e744 ("mac80211: destroy virtual monitor interface across suspend")
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250709233451.648d39b041e8.I2e37b68375278987e303d6c00cc5f3d8334d2f96@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This is currently not initialized for a virtual monitor, leading to a
NULL pointer dereference when - for example - iterating over all the
keys of all the vifs.
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250709233400.8dcefe578497.I4c90a00ae3256520e063199d7f6f2580d5451acf@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
To prevent a potential crash in agg_dequeue (net/sched/sch_qfq.c)
when cl->qdisc->ops->peek(cl->qdisc) returns NULL, we check the return
value before using it, similar to the existing approach in sch_hfsc.c.
To avoid code duplication, the following changes are made:
1. Changed qdisc_warn_nonwc(include/net/pkt_sched.h) into a static
inline function.
2. Moved qdisc_peek_len from net/sched/sch_hfsc.c to
include/net/pkt_sched.h so that sch_qfq can reuse it.
3. Applied qdisc_peek_len in agg_dequeue to avoid crashing.
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250705212143.3982664-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If an AF_RXRPC service socket is opened and bound, but calls are
preallocated, then rxrpc_alloc_incoming_call() will oops because the
rxrpc_backlog struct doesn't get allocated until the first preallocation is
made.
Fix this by returning NULL from rxrpc_alloc_incoming_call() if there is no
backlog struct. This will cause the incoming call to be aborted.
Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com>
Suggested-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: LePremierHomme <kwqcheii@proton.me>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Willy Tarreau <w@1wt.eu>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250708211506.2699012-3-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When userspace is using AF_RXRPC to provide a server, it has to preallocate
incoming calls and assign to them call IDs that will be used to thread
related recvmsg() and sendmsg() together. The preallocated call IDs will
automatically be attached to calls as they come in until the pool is empty.
To the kernel, the call IDs are just arbitrary numbers, but userspace can
use the call ID to hold a pointer to prepared structs. In any case, the
user isn't permitted to create two calls with the same call ID (call IDs
become available again when the call ends) and EBADSLT should result from
sendmsg() if an attempt is made to preallocate a call with an in-use call
ID.
However, the cleanup in the error handling will trigger both assertions in
rxrpc_cleanup_call() because the call isn't marked complete and isn't
marked as having been released.
Fix this by setting the call state in rxrpc_service_prealloc_one() and then
marking it as being released before calling the cleanup function.
Fixes: 00e907127e ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: LePremierHomme <kwqcheii@proton.me>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250708211506.2699012-2-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for SIOCINQ ioctl, indicating the length of bytes unread in the
socket. The value is obtained from `vsock_stream_has_data()`.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://patch.msgid.link/20250708-siocinq-v6-2-3775f9a9e359@antgroup.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When hv_sock was originally added, __vsock_stream_recvmsg() and
vsock_stream_has_data() actually only needed to know whether there
is any readable data or not, so hvs_stream_has_data() was written to
return 1 or 0 for simplicity.
However, now hvs_stream_has_data() should return the readable bytes
because vsock_data_ready() -> vsock_stream_has_data() needs to know the
actual bytes rather than a boolean value of 1 or 0.
The SIOCINQ ioctl support also needs hvs_stream_has_data() to return
the readable bytes.
Let hvs_stream_has_data() return the readable bytes of the payload in
the next host-to-guest VMBus hv_sock packet.
Note: there may be multiple incoming hv_sock packets pending in the
VMBus channel's ringbuffer, but so far there is not a VMBus API that
allows us to know all the readable bytes in total without reading and
caching the payload of the multiple packets, so let's just return the
readable bytes of the next single packet. In the future, we'll either
add a VMBus API that allows us to know the total readable bytes without
touching the data in the ringbuffer, or the hv_sock driver needs to
understand the VMBus packet format and parse the packets directly.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://patch.msgid.link/20250708-siocinq-v6-1-3775f9a9e359@antgroup.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When using sockmap for forwarding, the average latency for different packet sizes
after sending 10,000 packets is as follows:
size old(us) new(us)
512 56 55
1472 58 58
1600 106 81
3000 145 105
5000 182 125
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250708054053.39551-1-yangfeng59949@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies(). As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication.
This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:
@depends on patch@
expression E;
@@
-msecs_to_jiffies(E * 1000)
+secs_to_jiffies(E)
-msecs_to_jiffies(E * MSEC_PER_SEC)
+secs_to_jiffies(E)
While here, manually convert a couple timeouts denominated in seconds
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-2-b7817036342f@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies(). As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication.
This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:
@depends on patch@
expression E;
@@
-msecs_to_jiffies(E * 1000)
+secs_to_jiffies(E)
-msecs_to_jiffies(E * MSEC_PER_SEC)
+secs_to_jiffies(E)
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-1-b7817036342f@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a passive flow has not been accepted yet, it is
not wise to increase sk_rcvbuf when receiving ooo packets.
A very busy server might tune down tcp_rmem[1] to better
control how much memory can be used by sockets waiting
in its listeners accept queues.
Fixes: 63ad7dfedf ("tcp: adjust rcvbuf in presence of reorders")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250707213900.1543248-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new device generic parameter to specify clock ID that should
be used by the device for registering DPLL devices and pins.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-5-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Only 8, 16 and 32-bit integers are supported for numeric devlink
parameters. The subsequent patch adds support for DPLL clock ID
that is defined as 64-bit number. Add support for u64 parameter
type.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-4-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It's needed to check the return value of lockdep_commit_lock_is_held(),
otherwise there's no point in this assertion as it doesn't print any
debug information on itself.
Found by Linux Verification Center (linuxtesting.org) with Svace static
analysis tool.
Fixes: b04df3da1b ("netfilter: nf_tables: do not defer rule destruction via call_rcu")
Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Restore commit 28339b21a3 ("netfilter: nf_tables: do not send complete
notification of deletions") and fix it:
- Avoid upfront modification of 'event' variable so the conditionals
become effective.
- Always include NFTA_OBJ_TYPE attribute in object notifications, user
space requires it for proper deserialisation.
- Catch DESTROY events, too.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This practically reverts commit 28339b21a3 ("netfilter: nf_tables: do
not send complete notification of deletions"): The feature was never
effective, due to prior modification of 'event' variable the conditional
early return never happened.
User space also relies upon the current behaviour, so better reintroduce
the shortened deletion notifications once it is fixed.
Fixes: 28339b21a3 ("netfilter: nf_tables: do not send complete notification of deletions")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ioctl(ATMARP_MKIP) allocates struct clip_vcc and set it to
vcc->user_back.
The code assumes that vcc_destroy_socket() passes NULL skb
to vcc->push() when the socket is close()d, and then clip_push()
frees clip_vcc.
However, ioctl(ATMARPD_CTRL) sets NULL to vcc->push() in
atm_init_atmarp(), resulting in memory leak.
Let's serialise two ioctl() by lock_sock() and check vcc->push()
in atm_init_atmarp() to prevent memleak.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250704062416.1613927-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
atmarpd is protected by RTNL since commit f3a0592b37 ("[ATM]: clip
causes unregister hang").
However, it is not enough because to_atmarpd() is called without RTNL,
especially clip_neigh_solicit() / neigh_ops->solicit() is unsleepable.
Also, there is no RTNL dependency around atmarpd.
Let's use a private mutex and RCU to protect access to atmarpd in
to_atmarpd().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250704062416.1613927-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When SAE commit is sent and received in response, there's no
ordering for the SAE confirm messages. As such, don't call
drivers to stop listening on the channel when the confirm
message is still expected.
This fixes an issue if the local confirm is transmitted later
than the AP's confirm, for iwlwifi (and possibly mt76) the
AP's confirm would then get lost since the device isn't on
the channel at the time the AP transmit the confirm.
For iwlwifi at least, this also improves the overall timing
of the authentication handshake (by about 15ms according to
the report), likely since the session protection won't be
aborted and rescheduled.
Note that even before this, mgd_complete_tx() wasn't always
called for each call to mgd_prepare_tx() (e.g. in the case
of WEP key shared authentication), and the current drivers
that have the complete callback don't seem to mind. Document
this as well though.
Reported-by: Jan Hendrik Farr <kernel@jfarr.cc>
Closes: https://lore.kernel.org/all/aB30Ea2kRG24LINR@archlinux/
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213232.12691580e140.I3f1d3127acabcd58348a110ab11044213cf147d3@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Implement dot11ExtendedRegInfoSupport to advertise non-AP station
regulatory power capability as part of regulatory connectivity
element in (Re)Association request frames so that AP can achieve
maximum client connectivity. Control field which was interpreted
using value of 3-bits B5 to B3, now uses value of 4-bits B6 to B3 to
interpret the type of AP. Hence update IEEE80211_HE_6GHZ_OPER_CTRL_REG_INFO
to parse 4-bits control field. If older AP still updates only 3-bits
value of control field, station can still interpret the value as per
section E.2.7 of IEEE 802.11 REVme D7.0 and support the appropriate
AP type.
Also update IEEE80211_6GHZ_CTRL_REG_INDOOR_SP_AP as the value of
standard power AP is changed to 8 instead of 4 so that AP can support both
LPI AP and SP AP to maximize the connectivity with stations. For backward
compatibility, keeping value 4 as old AP by limiting it to SP AP only.
Signed-off-by: Somashekhar Puttagangaiah <somashekhar.puttagangaiah@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213232.90cdef116aad.I85da390fbee59355e3855691933e6a5e55c47ac4@changeid
[fix kernel-doc]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently the code only sends extended MLD capa/ops in
strict mode, but if the AP has it then it should also
be able to parse it. There could be cases where the AP
doesn't have it but we would want to advertise it (e.g.
if the AP supports nothing but we want to have BTM.),
but given the broken deployed APs out there right now
this is the best we can do.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213232.c9b8b3a6ca77.I1153d4283d1fbb9e5db60e7b939cc133a6345db5@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
cfg80211 now reports whether this is the first part of a scan. Copy
that information into the driver request.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.63f6078bd7be.Ia6e5cee945e6d9617c2f427552d89d23c92eee83@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When there are no non-6 GHz channels, then the 6 GHz scan is the first
part of a split scan. Add a boolean denoting whether the scan is the
first part of a scan as it might be useful to drivers for internal
bookkeeping. This flag is also set if the scan is not split.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.07e5a8a452ec.Ibf18f513e507422078fb31b28947e582a20df87a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We verify that the Extended MLD Capabilities are matching between links.
However, some bits are reserved and in particular the Recommended Max
Links subfield may not necessarily match. So only verify the known
subfields that can reliably be expected to be the same. More information
can be found in Table 9-417o, in IEEE P802.11be/D7.0.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.a2fad48dd3e6.Iae1740cd2ac833bc4a64fd2af718e1485158fd42@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The cast from void * here coupled with the boolean argument
on what to cast to is confusing and really not needed, just
split the code and make a type-safe interface. It seems to
even reduce the code size slightly, at least on x86-64.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.bdb3c96570b0.Ia153e6ce06dc9a636ff5bcc1d52468a1afd06e13@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Hide the internal scan fields from mac80211 and drivers, the
'notified' variable is for internal tracking, and the 'info'
is output that's passed to cfg80211_scan_done() and stored
only for delayed userspace notification.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.6a62e41858e2.I004f66e9c087cc6e6ae4a24951cf470961ee9466@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the link is deactivated and the CSA completes, then that
needs to update the link station's bandwidth (only the AP STA
can exist at this point, no TDLS on inactive links) and set
the CSA to no longer be active. Fix this.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.07f120cf687d.I5a868c501ee73fcc2355d61c2ee06e5f444b350f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When a new station is added, ensure that mandatory bit-rates
are enabled for 6 GHz band.
Signed-off-by: Somashekhar Puttagangaiah <somashekhar.puttagangaiah@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.4aecd7f3b85b.I33a54872a3267c9f6155ce537d6c9c2a31c3f117@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ieee80211_prep_connection is supposed to be called when both bitmaps
(valid_links and active_links) are cleared.
Make sure of it and WARN if this is not the case, to avoid weird issues.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250609213231.f616c7b693df.Ie983155627ad0d2e7c19c30ce642915246d0ed9d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The cleanup attribute runs kfree() when the variable goes out of scope.
There is a possibility that the link_elems variable is uninitialized
if the loop ends before an assignment is made to this variable.
This leads to uninitialized variable bug.
Fix this by assigning link_elems to NULL.
Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com>
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.eeacd3738a7b.I0f876fa1359daeec47ab3aef098255a9c23efd70@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If we get to the error path of ieee80211_prep_connection, for example
because of a FW issue, then ieee80211_vif_set_links is called
with 0.
But the call to drv_change_vif_links from ieee80211_vif_update_links
will probably fail as well, for the same reason.
In this case, the valid_links and active_links bitmaps will be reverted
to the value of the failing connection.
Then, in the next connection, due to the logic of
ieee80211_set_vif_links_bitmaps, valid_links will be set to the ID of
the new connection assoc link, but the active_links will remain with the
ID of the old connection's assoc link.
If those IDs are different, we get into a weird state of valid_links and
active_links being different. One of the consequences of this state is
to call drv_change_vif_links with new_links as 0, since the & operation
between the bitmaps will be 0.
Since a removal of a link should always succeed, ignore the return value
of drv_change_vif_links if it was called to only remove links, which is
the case for the ieee80211_prep_connection's error path.
That way, the bitmaps will not be reverted to have the value from the
failing connection and will have 0, so the next connection will have a
good state.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250609213231.ba2011fb435f.Id87ff6dab5e1cf757b54094ac2d714c656165059@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Use DEFINE_MUTEX() to initialize udp_tunnel_gro_type_lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250707091634.311974-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We no longer need to hold RTNL for IPv6 socket options.
Let's remove setsockopt_needs_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-16-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>