Commit Graph

81553 Commits

Author SHA1 Message Date
Roopni Devanathan
407bc77b70 wifi: mac80211: Set RTS threshold on per-radio basis
Add support to get the radio for which RTS threshold needs to be changed
from userspace. Pass on this radio index to underlying drivers as an
additional argument.

A value of -1 indicates radio index is not mentioned and that the
configuration applies to all radio(s) of the wiphy.

Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20250615082312.619639-5-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:27 +02:00
Roopni Devanathan
8959519005 wifi: cfg80211: Report per-radio RTS threshold to userspace
In case of multi-radio wiphys, with per-radio RTS threshold brought
into use, RTS threshold for each radio in a wiphy can be recorded in
wiphy parameter - wiphy_radio_cfg, as an array. Add a new attribute -
NL80211_WIPHY_RADIO_ATTR_RTS_THRESHOLD in nested parameter -
NL80211_ATTR_WIPHY_RADIOS. When a request for getting RTS threshold
for a particular radio is received, parse the radio id and get the
required data. Add this data to the newly added nested attribute
NL80211_WIPHY_RADIO_ATTR_RTS_THRESHOLD. Add support to report this
data to userspace.

Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20250615082312.619639-4-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:27 +02:00
Roopni Devanathan
264637941c wifi: cfg80211: Add Support to Set RTS Threshold for each Radio
Currently, setting RTS threshold is based on per-phy basis, i.e., all the
radios present in a wiphy will take RTS threshold value to be the one sent
from userspace. But each radio in a multi-radio wiphy can have different
RTS threshold requirements.

To extend support to set RTS threshold for each radio, get the radio for
which RTS threshold needs to be changed from the user. Use the attribute
in NL - NL80211_ATTR_WIPHY_RADIO_INDEX, to identify the radio of interest.
Create a new structure - wiphy_radio_cfg and add rts_threshold in it as a
u32 value to store RTS threshold of each radio in a wiphy and allocate
memory for it during wiphy register based on the wiphy.n_radio updated by
drivers. Pass radio id received from the user to mac80211 drivers along
with its corresponding RTS threshold.

Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20250615082312.619639-3-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:27 +02:00
Roopni Devanathan
b74947b4f6 wifi: cfg80211/mac80211: Add support to get radio index
Currently, per-radio attributes are set on per-phy basis, i.e., all the
radios present in a wiphy will take attributes values sent from user. But
each radio in a wiphy can get different values from userspace based on
its requirement.

To extend support to set per-radio attributes, add support to get radio
index from userspace. Add an NL attribute - NL80211_ATTR_WIPHY_RADIO_INDEX,
to get user specified radio index for which attributes should be changed.
Pass this to individual drivers, so that the drivers can use this radio
index to change per-radio attributes when necessary. Currently, per-radio
attributes identified are:
NL80211_ATTR_WIPHY_TX_POWER_LEVEL
NL80211_ATTR_WIPHY_ANTENNA_TX
NL80211_ATTR_WIPHY_ANTENNA_RX
NL80211_ATTR_WIPHY_RETRY_SHORT
NL80211_ATTR_WIPHY_RETRY_LONG
NL80211_ATTR_WIPHY_FRAG_THRESHOLD
NL80211_ATTR_WIPHY_RTS_THRESHOLD
NL80211_ATTR_WIPHY_COVERAGE_CLASS
NL80211_ATTR_TXQ_LIMIT
NL80211_ATTR_TXQ_MEMORY_LIMIT
NL80211_ATTR_TXQ_QUANTUM

By default, the radio index is set to -1. This means the attribute should
be treated as a global configuration. If the user has not specified any
index, then the radio index passed to individual drivers would be -1. This
would indicate that the attribute applies to all radios in that wiphy.

Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com>
Link: https://patch.msgid.link/20250615082312.619639-2-quic_rdevanat@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:27 +02:00
Sarika Sharma
4cb1ce7e25 wifi: mac80211: add link_sta_statistics ops to fill link station statistics
Currently, link station statistics for MLO are filled by mac80211.
But there are some statistics that kept by mac80211 might not be
accurate, so let the driver pre-fill the link statistics. The driver
can fill the values (indicating which field is filled, by setting the
filled bitmapin in link_station structure).
Statistics that driver don't fill are filled by mac80211.

Hence, add link_sta_statistics callback to fill link station statistics
for MLO in sta_set_link_sinfo() by drivers.

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-11-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:27 +02:00
Sarika Sharma
5e9129f574 wifi: mac80211: correct RX stats packet increment for multi-link
Currently, RX stats packets are incremented for deflink member for
non-ML and multi-link(ML) station case. However, for ML station,
packets should be incremented based on the specific link.

Therefore, if a valid link_id is present, fetch the corresponding
link station information and increment the RX packets for that link.
For non-MLO stations, the deflink will still be used.

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-10-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:27 +02:00
Sarika Sharma
505991fba9 wifi: mac80211: extend support to fill link level sinfo structure
Currently, sinfo structure is supported to fill information at
deflink( or one of the links) level for station. This has problems
when applied to fetch multi-link(ML) station information.

Hence, if valid_links are present, support filling link_station
structure for each link.

This will be helpful to check the link related statistics during MLO.

Additionally, TXQ stats for pertid are applicable at station level
not at link level. Therefore check link_id is less then 0, before
filling TXQ stats in pertid stats.

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-9-quic_sarishar@quicinc.com
[fix some indentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:26 +02:00
Sarika Sharma
8af903e454 wifi: cfg80211: clear sinfo->filled for MLO station statistics
Currently, sinfo->filled is for set in sta_set_sinfo() after filling
the corresponding fields in station_info structure for station statistics.

For non-ML stations, the fields are correctly filled from sta->deflink
and corresponding sinfo->filled bit are set, but for MLO any one of
link's data is filled and corresponding sinfo->filled bit is set.

For MLO before embed NL message, fields of sinfo structure like
bytes, packets, signal are updated with accumulated, best, least of all
links data. But some of fields like rssi, pertid don't make much sense
at MLO level.

Hence, to prevent misinterpretation, clear sinfo->filled for fields
which don't make much sense at MLO level. This will prevent filling
misleading values in NL message.

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-8-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:26 +02:00
Sarika Sharma
80b2fa4679 wifi: mac80211: add support to accumulate removed link statistics
Currently, if a link gets removed in between for a station then
directly accumulated data will fall down to sum of other active links.
This will bring inconsistency in station dump statistics.

For instance, let's take Tx packets
- at t=0-> link-0:2 link-1:3 Tx packets => accumulated = 5
- at t=1-> link-0:4 link-1:6 Tx packets => accumulated = 10
let say at t=2, link-0 went down => link-0:0 link-1:7 => accumulated = 7
Here, suddenly accumulated Tx packets will come down to 7 from 10.
This is showing inconsistency.

Therefore, store link-0 data when it went down and add to accumulated
Tx packet = 11.

Hence, store the removed link statistics data in sta structure and
add it in accumulated statistics for consistency.

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-7-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:26 +02:00
Sarika Sharma
49e47223ec wifi: cfg80211: allocate memory for link_station info structure
Currently, station_info structure is passed to fill station statistics
from mac80211/drivers. After NL message send to user space for requested
station statistics, memory for station statistics is freed in cfg80211.
Therefore, memory allocation/free for link station statistics should
also happen in cfg80211 only.

Hence, allocate the memory for link_station structure for all
possible links and free in cfg80211_sinfo_release_content().

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-6-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:26 +02:00
Sarika Sharma
2d226d41db wifi: cfg80211: add statistics for providing overview for MLO station
Currently statistics are handled at link level for multi-link
operation(MLO). There is no provision to check accumulated statistics
for a multi-link(ML) station. Other statistics, such as signal, rates,
are also managed at the link level only.

Statistics such as packets, bytes, signal, rates, etc are useful to
provide overall overview for the ML stations.

Statistics such as packets, bytes are accumulated statistics at MLO level.
However, MLO statistics for rates and signal can not be accumulated since
it won't make much sense. Hence, handle other statistics such as signal,
rates, etc bit differently at MLO level.

The signal could be the best of all links-
e.g. if Link 1 has a signal strength of -70 dBm and Link 2 has -65 dBm,
the signal for MLO will be -65 dBm.

The rate could be determined based on the most recently updated link-
e.g. if link 1 has a rate of 300 Mbps and link 2 has a rate of 450 Mbps,
the MLO rate can be calculated based on the inactivity of each link.
If the inactive time for link 1 is 20 seconds and for link 2 is 10 seconds,
the MLO rate will be the most recently updated rate, which is link 2's
rate of 450 Mbps.

The inactive time, dtim_period and beacon_interval can be taken as the
least value of field from link level.

Similarly, other MLO level applicable fields are handled and the fields
which don't make much sense at MLO level, a subsequent change will handle
to embed NL message.

Hence, add accumulated and other statistics for MLO station if valid links
are present to represent comprehensive overview for the ML stations.

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-5-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:26 +02:00
Sarika Sharma
82d7f841d9 wifi: cfg80211: extend to embed link level statistics in NL message
Currently, statistics is supported at deflink( or one of the links)
level for station. This has problems when applied to multi-link(ML)
connections.

Hence, add changes to support link level statistics to embed NL message
with link related information if valid links are present.

This will be helpful to check the link related statistics during MLO.

The statistics will be embedded into NL message as below:
For non-ML:
cmd->
    NL80211_ATTR_IFINDEX
    NL80211_ATTR_MAC
    NL80211_ATTR_GENERATION
    ....etc
    NL80211_ATTR_STA_INFO | nested
        NL80211_STA_INFO_CONNECTED_TIME,
        NL80211_STA_INFO_STA_FLAGS,
        NL80211_STA_INFO_RX_BYTES,
        NL80211_STA_INFO_TX_BYTES,
        .........etc

For MLO:
cmd ->
    NL80211_ATTR_IFINDEX
    NL80211_ATTR_MAC
    NL80211_ATTR_GENERATION
    .......etc
    NL80211_ATTR_STA_INFO | nested
        NL80211_STA_INFO_CONNECTED_TIME,
        NL80211_STA_INFO_STA_FLAGS,
        ........etc
    NL80211_ATTR_MLO_LINK_ID,
    NL80211_ATTR_MLD_ADDR,
    NL80211_ATTR_MLO_LINKS | nested
        link_id-1 | nested
            NL80211_ATTR_MLO_LINK_ID,
            NL80211_ATTR_MAC,
            NL80211_ATTR_STA_INFO | nested
                NL80211_STA_INFO_RX_BYTES,
                NL80211_STA_INFO_TX_BYTES,
               NL80211_STA_INFO_CONNECTED_TIME,
               ..........etc.
        link_id-2 | nested
            NL80211_ATTR_MLO_LINK_ID,
            NL80211_ATTR_MAC,
            NL80211_ATTR_STA_INFO | nested
                NL80211_STA_INFO_RX_BYTES,
                NL80211_STA_INFO_TX_BYTES,
               NL80211_STA_INFO_CONNECTED_TIME,
                .........etc

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-4-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:26 +02:00
Sarika Sharma
e581b7fe62 wifi: mac80211: add support towards MLO handling of station statistics
Currently, in supporting API's to fill sinfo structure from sta
structure, is mapped to fill the fields from sta->deflink. However,
for multi-link (ML) station, sinfo structure should be filled from
corresponding link_id.

Therefore, add  link_id as an additional argument in supporting API's
for filling sinfo structure correctly. Link_id is set to -1 for non-ML
station and corresponding link_id for ML stations. In supporting API's
for filling sinfo structure, check for link_id, if link_id < 0, fill
the sinfo structure from sta->deflink, otherwise fill from
sta->link[link_id].

Current, changes are done at the deflink level i.e, pass -1 as link_id.
Actual link_id will be added in subsequent patches to support
station statistics for MLO.

Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250528054420.3050133-2-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:19:26 +02:00
Johannes Berg
d87c3ca0f8 wifi: mac80211: finish link init before RCU publish
Since the link/conf pointers can be accessed without any
protection other than RCU, make sure the data is actually
set up before publishing the structures.

Fixes: b2e8434f18 ("wifi: mac80211: set up/tear down client vif links properly")
Link: https://patch.msgid.link/20250624130749.9a308b713c74.I4a80f5eead112a38730939ea591d2e275c721256@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 15:16:43 +02:00
Paolo Abeni
1fd26729e0 bluetooth pull request for net:
- L2CAP: Fix L2CAP MTU negotiation
  - hci_core: Fix use-after-free in vhci_flush()
  - btintel_pcie: Fix potential race condition in firmware download
  - hci_qca: fix unable to load the BT driver
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmhZhnIZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKZX6EACeRMlkRoRe7oUKqfpXw+R+
 RLdYvvco2LVSIwaMq664kHQWAbNcSj2okXcbJqPTXjHR1CZhaFGp6Rufgi1kz363
 r19Ym4eGL2ZN9JlQwSDJqbWMrXxCo5tB4ymNof3yW+2ZPRurqDJDwCjcnhikfG14
 v8ZOQRxDJOJdPSyqtW75qgFNv9Szekg272ZeLjAK/6XSqMpiPzyLsN4TJdij2Wf6
 Ie02w368TQkmJMHly37QE24hDkyZ/UuR0lmTyB0bTCAiDwnTya8oWEUPRwa5I1+S
 1FubHOBUGlx907bEJXZkBow98sCsChg/PNGqO0dsoJD/GJo4U6lUX1Lb/6qWNL+d
 T6PcDLMKRrDcY9ZgPAqSq7sYvzPGjaw+JWTN01okr5mVoVEsh8XDLhEQYPKJ8NzU
 TJy6FHXtZyGuqw21VD9+VbGrOFJMNYUVhUxQZidAaxQbqE7Vgl59Hj08Z6zGad26
 hE8srBlGH7PcJN+DSXX9coYP11bSWVgXKLgmXgtPF3jVYQOch+Vxhcz3kIGSIZe1
 W5qFfJhurHjdFEop6IRAXOJUeZWL8/YEcQJbFcS3Z2pfGAitY438Y18tvHibcvbc
 7c/DMt7586qh1VbQSz0Gf8IeUuXMlAXlVaIYTKTtlURuo1w8+f+cz/PowtSKIjum
 J1N/Jf+1DzCVPO2tazdd1w==
 =vb9E
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2025-06-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - L2CAP: Fix L2CAP MTU negotiation
 - hci_core: Fix use-after-free in vhci_flush()
 - btintel_pcie: Fix potential race condition in firmware download
 - hci_qca: fix unable to load the BT driver

* tag 'for-net-2025-06-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: hci_core: Fix use-after-free in vhci_flush()
  driver: bluetooth: hci_qca:fix unable to load the BT driver
  Bluetooth: L2CAP: Fix L2CAP MTU negotiation
  Bluetooth: btintel_pcie: Fix potential race condition in firmware download
====================

Link: https://patch.msgid.link/20250623165405.227619-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-24 12:40:54 +02:00
Ramya Gnanasekar
140c6a61d8 wifi: mac80211: update radar_required in channel context after channel switch
Currently, when a non-DFS channel is brought up and the bandwidth is
expanded from 80 MHz to 160 MHz, where the primary 80 MHz is non-DFS
and the secondary 80 MHz consists of DFS channels, radar detection
fails if radar occurs in the secondary 80 MHz.

When the channel is switched from 80 MHz to 160 MHz, with the primary
80 MHz being non-DFS and the secondary 80 MHz consisting of DFS
channels, the radar required flag in the channel switch parameters
is set to true. However, when using a reserved channel context,
it is not updated in sdata, which disables radar detection in the
secondary 80 MHz DFS channels.

Update the radar required flag in sdata to fix this issue when using
a reserved channel context.

Signed-off-by: Ramya Gnanasekar <ramya.gnanasekar@oss.qualcomm.com>
Signed-off-by: Ramasamy Kaliappan <ramasamy.kaliappan@oss.qualcomm.com>
Link: https://patch.msgid.link/20250608140324.1687117-1-ramasamy.kaliappan@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 11:37:10 +02:00
Kuniyuki Iwashima
2a5a484184 af_unix: Don't set -ECONNRESET for consumed OOB skb.
Christian Brauner reported that even after MSG_OOB data is consumed,
calling close() on the receiver socket causes the peer's recv() to
return -ECONNRESET:

  1. send() and recv() an OOB data.

    >>> from socket import *
    >>> s1, s2 = socketpair(AF_UNIX, SOCK_STREAM)
    >>> s1.send(b'x', MSG_OOB)
    1
    >>> s2.recv(1, MSG_OOB)
    b'x'

  2. close() for s2 sets ECONNRESET to s1->sk_err even though
     s2 consumed the OOB data

    >>> s2.close()
    >>> s1.recv(10, MSG_DONTWAIT)
    ...
    ConnectionResetError: [Errno 104] Connection reset by peer

Even after being consumed, the skb holding the OOB 1-byte data stays in
the recv queue to mark the OOB boundary and break recv() at that point.

This must be considered while close()ing a socket.

Let's skip the leading consumed OOB skb while checking the -ECONNRESET
condition in unix_release_sock().

Fixes: 314001f0bf ("af_unix: Add OOB support")
Reported-by: Christian Brauner <brauner@kernel.org>
Closes: https://lore.kernel.org/netdev/20250529-sinkt-abfeuern-e7b08200c6b0@brauner/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://patch.msgid.link/20250619041457.1132791-4-kuni1840@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-24 10:10:07 +02:00
Kuniyuki Iwashima
32ca245464 af_unix: Don't leave consecutive consumed OOB skbs.
Jann Horn reported a use-after-free in unix_stream_read_generic().

The following sequences reproduce the issue:

  $ python3
  from socket import *
  s1, s2 = socketpair(AF_UNIX, SOCK_STREAM)
  s1.send(b'x', MSG_OOB)
  s2.recv(1, MSG_OOB)     # leave a consumed OOB skb
  s1.send(b'y', MSG_OOB)
  s2.recv(1, MSG_OOB)     # leave a consumed OOB skb
  s1.send(b'z', MSG_OOB)
  s2.recv(1)              # recv 'z' illegally
  s2.recv(1, MSG_OOB)     # access 'z' skb (use-after-free)

Even though a user reads OOB data, the skb holding the data stays on
the recv queue to mark the OOB boundary and break the next recv().

After the last send() in the scenario above, the sk2's recv queue has
2 leading consumed OOB skbs and 1 real OOB skb.

Then, the following happens during the next recv() without MSG_OOB

  1. unix_stream_read_generic() peeks the first consumed OOB skb
  2. manage_oob() returns the next consumed OOB skb
  3. unix_stream_read_generic() fetches the next not-yet-consumed OOB skb
  4. unix_stream_read_generic() reads and frees the OOB skb

, and the last recv(MSG_OOB) triggers KASAN splat.

The 3. above occurs because of the SO_PEEK_OFF code, which does not
expect unix_skb_len(skb) to be 0, but this is true for such consumed
OOB skbs.

  while (skip >= unix_skb_len(skb)) {
    skip -= unix_skb_len(skb);
    skb = skb_peek_next(skb, &sk->sk_receive_queue);
    ...
  }

In addition to this use-after-free, there is another issue that
ioctl(SIOCATMARK) does not function properly with consecutive consumed
OOB skbs.

So, nothing good comes out of such a situation.

Instead of complicating manage_oob(), ioctl() handling, and the next
ECONNRESET fix by introducing a loop for consecutive consumed OOB skbs,
let's not leave such consecutive OOB unnecessarily.

Now, while receiving an OOB skb in unix_stream_recv_urg(), if its
previous skb is a consumed OOB skb, it is freed.

[0]:
BUG: KASAN: slab-use-after-free in unix_stream_read_actor (net/unix/af_unix.c:3027)
Read of size 4 at addr ffff888106ef2904 by task python3/315

CPU: 2 UID: 0 PID: 315 Comm: python3 Not tainted 6.16.0-rc1-00407-gec315832f6f9 #8 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-4.fc42 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl (lib/dump_stack.c:122)
 print_report (mm/kasan/report.c:409 mm/kasan/report.c:521)
 kasan_report (mm/kasan/report.c:636)
 unix_stream_read_actor (net/unix/af_unix.c:3027)
 unix_stream_read_generic (net/unix/af_unix.c:2708 net/unix/af_unix.c:2847)
 unix_stream_recvmsg (net/unix/af_unix.c:3048)
 sock_recvmsg (net/socket.c:1063 (discriminator 20) net/socket.c:1085 (discriminator 20))
 __sys_recvfrom (net/socket.c:2278)
 __x64_sys_recvfrom (net/socket.c:2291 (discriminator 1) net/socket.c:2287 (discriminator 1) net/socket.c:2287 (discriminator 1))
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f8911fcea06
Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
RSP: 002b:00007fffdb0dccb0 EFLAGS: 00000202 ORIG_RAX: 000000000000002d
RAX: ffffffffffffffda RBX: 00007fffdb0dcdc8 RCX: 00007f8911fcea06
RDX: 0000000000000001 RSI: 00007f8911a5e060 RDI: 0000000000000006
RBP: 00007fffdb0dccd0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000202 R12: 00007f89119a7d20
R13: ffffffffc4653600 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

Allocated by task 315:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 __kasan_slab_alloc (mm/kasan/common.c:348)
 kmem_cache_alloc_node_noprof (./include/linux/kasan.h:250 mm/slub.c:4148 mm/slub.c:4197 mm/slub.c:4249)
 __alloc_skb (net/core/skbuff.c:660 (discriminator 4))
 alloc_skb_with_frags (./include/linux/skbuff.h:1336 net/core/skbuff.c:6668)
 sock_alloc_send_pskb (net/core/sock.c:2993)
 unix_stream_sendmsg (./include/net/sock.h:1847 net/unix/af_unix.c:2256 net/unix/af_unix.c:2418)
 __sys_sendto (net/socket.c:712 (discriminator 20) net/socket.c:727 (discriminator 20) net/socket.c:2226 (discriminator 20))
 __x64_sys_sendto (net/socket.c:2233 (discriminator 1) net/socket.c:2229 (discriminator 1) net/socket.c:2229 (discriminator 1))
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Freed by task 315:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1))
 kasan_save_free_info (mm/kasan/generic.c:579 (discriminator 1))
 __kasan_slab_free (mm/kasan/common.c:271)
 kmem_cache_free (mm/slub.c:4643 (discriminator 3) mm/slub.c:4745 (discriminator 3))
 unix_stream_read_generic (net/unix/af_unix.c:3010)
 unix_stream_recvmsg (net/unix/af_unix.c:3048)
 sock_recvmsg (net/socket.c:1063 (discriminator 20) net/socket.c:1085 (discriminator 20))
 __sys_recvfrom (net/socket.c:2278)
 __x64_sys_recvfrom (net/socket.c:2291 (discriminator 1) net/socket.c:2287 (discriminator 1) net/socket.c:2287 (discriminator 1))
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

The buggy address belongs to the object at ffff888106ef28c0
 which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 68 bytes inside of
 freed 224-byte region [ffff888106ef28c0, ffff888106ef29a0)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff888106ef3cc0 pfn:0x106ef2
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x200000000000040(head|node=0|zone=2)
page_type: f5(slab)
raw: 0200000000000040 ffff8881001d28c0 ffffea000422fe00 0000000000000004
raw: ffff888106ef3cc0 0000000080190010 00000000f5000000 0000000000000000
head: 0200000000000040 ffff8881001d28c0 ffffea000422fe00 0000000000000004
head: ffff888106ef3cc0 0000000080190010 00000000f5000000 0000000000000000
head: 0200000000000001 ffffea00041bbc81 00000000ffffffff 00000000ffffffff
head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888106ef2800: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
 ffff888106ef2880: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
>ffff888106ef2900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 ffff888106ef2980: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
 ffff888106ef2a00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 314001f0bf ("af_unix: Add OOB support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Link: https://patch.msgid.link/20250619041457.1132791-2-kuni1840@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-24 10:10:06 +02:00
Lachlan Hodges
7a3750ff0f wifi: mac80211: fix beacon interval calculation overflow
As we are converting from TU to usecs, a beacon interval of
100*1024 usecs will lead to integer wrapping. To fix change
to use a u32.

Fixes: 057d5f4ba1 ("mac80211: sync dtim_count to TSF")
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250621123209.511796-1-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24 09:04:55 +02:00
Ido Schimmel
7544f3f5b0 bridge: mcast: Fix use-after-free during router port configuration
The bridge maintains a global list of ports behind which a multicast
router resides. The list is consulted during forwarding to ensure
multicast packets are forwarded to these ports even if the ports are not
member in the matching MDB entry.

When per-VLAN multicast snooping is enabled, the per-port multicast
context is disabled on each port and the port is removed from the global
router port list:

 # ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1
 # ip link add name dummy1 up master br1 type dummy
 # ip link set dev dummy1 type bridge_slave mcast_router 2
 $ bridge -d mdb show | grep router
 router ports on br1: dummy1
 # ip link set dev br1 type bridge mcast_vlan_snooping 1
 $ bridge -d mdb show | grep router

However, the port can be re-added to the global list even when per-VLAN
multicast snooping is enabled:

 # ip link set dev dummy1 type bridge_slave mcast_router 0
 # ip link set dev dummy1 type bridge_slave mcast_router 2
 $ bridge -d mdb show | grep router
 router ports on br1: dummy1

Since commit 4b30ae9adb ("net: bridge: mcast: re-implement
br_multicast_{enable, disable}_port functions"), when per-VLAN multicast
snooping is enabled, multicast disablement on a port will disable the
per-{port, VLAN} multicast contexts and not the per-port one. As a
result, a port will remain in the global router port list even after it
is deleted. This will lead to a use-after-free [1] when the list is
traversed (when adding a new port to the list, for example):

 # ip link del dev dummy1
 # ip link add name dummy2 up master br1 type dummy
 # ip link set dev dummy2 type bridge_slave mcast_router 2

Similarly, stale entries can also be found in the per-VLAN router port
list. When per-VLAN multicast snooping is disabled, the per-{port, VLAN}
contexts are disabled on each port and the port is removed from the
per-VLAN router port list:

 # ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1
 # ip link add name dummy1 up master br1 type dummy
 # bridge vlan add vid 2 dev dummy1
 # bridge vlan global set vid 2 dev br1 mcast_snooping 1
 # bridge vlan set vid 2 dev dummy1 mcast_router 2
 $ bridge vlan global show dev br1 vid 2 | grep router
       router ports: dummy1
 # ip link set dev br1 type bridge mcast_vlan_snooping 0
 $ bridge vlan global show dev br1 vid 2 | grep router

However, the port can be re-added to the per-VLAN list even when
per-VLAN multicast snooping is disabled:

 # bridge vlan set vid 2 dev dummy1 mcast_router 0
 # bridge vlan set vid 2 dev dummy1 mcast_router 2
 $ bridge vlan global show dev br1 vid 2 | grep router
       router ports: dummy1

When the VLAN is deleted from the port, the per-{port, VLAN} multicast
context will not be disabled since multicast snooping is not enabled
on the VLAN. As a result, the port will remain in the per-VLAN router
port list even after it is no longer member in the VLAN. This will lead
to a use-after-free [2] when the list is traversed (when adding a new
port to the list, for example):

 # ip link add name dummy2 up master br1 type dummy
 # bridge vlan add vid 2 dev dummy2
 # bridge vlan del vid 2 dev dummy1
 # bridge vlan set vid 2 dev dummy2 mcast_router 2

Fix these issues by removing the port from the relevant (global or
per-VLAN) router port list in br_multicast_port_ctx_deinit(). The
function is invoked during port deletion with the per-port multicast
context and during VLAN deletion with the per-{port, VLAN} multicast
context.

Note that deleting the multicast router timer is not enough as it only
takes care of the temporary multicast router states (1 or 3) and not the
permanent one (2).

[1]
BUG: KASAN: slab-out-of-bounds in br_multicast_add_router.part.0+0x3f1/0x560
Write of size 8 at addr ffff888004a67328 by task ip/384
[...]
Call Trace:
 <TASK>
 dump_stack_lvl+0x6f/0xa0
 print_address_description.constprop.0+0x6f/0x350
 print_report+0x108/0x205
 kasan_report+0xdf/0x110
 br_multicast_add_router.part.0+0x3f1/0x560
 br_multicast_set_port_router+0x74e/0xac0
 br_setport+0xa55/0x1870
 br_port_slave_changelink+0x95/0x120
 __rtnl_newlink+0x5e8/0xa40
 rtnl_newlink+0x627/0xb00
 rtnetlink_rcv_msg+0x6fb/0xb70
 netlink_rcv_skb+0x11f/0x350
 netlink_unicast+0x426/0x710
 netlink_sendmsg+0x75a/0xc20
 __sock_sendmsg+0xc1/0x150
 ____sys_sendmsg+0x5aa/0x7b0
 ___sys_sendmsg+0xfc/0x180
 __sys_sendmsg+0x124/0x1c0
 do_syscall_64+0xbb/0x360
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

[2]
BUG: KASAN: slab-use-after-free in br_multicast_add_router.part.0+0x378/0x560
Read of size 8 at addr ffff888009f00840 by task bridge/391
[...]
Call Trace:
 <TASK>
 dump_stack_lvl+0x6f/0xa0
 print_address_description.constprop.0+0x6f/0x350
 print_report+0x108/0x205
 kasan_report+0xdf/0x110
 br_multicast_add_router.part.0+0x378/0x560
 br_multicast_set_port_router+0x6f9/0xac0
 br_vlan_process_options+0x8b6/0x1430
 br_vlan_rtm_process_one+0x605/0xa30
 br_vlan_rtm_process+0x396/0x4c0
 rtnetlink_rcv_msg+0x2f7/0xb70
 netlink_rcv_skb+0x11f/0x350
 netlink_unicast+0x426/0x710
 netlink_sendmsg+0x75a/0xc20
 __sock_sendmsg+0xc1/0x150
 ____sys_sendmsg+0x5aa/0x7b0
 ___sys_sendmsg+0xfc/0x180
 __sys_sendmsg+0x124/0x1c0
 do_syscall_64+0xbb/0x360
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fixes: 2796d846d7 ("net: bridge: vlan: convert mcast router global option to per-vlan entry")
Fixes: 4b30ae9adb ("net: bridge: mcast: re-implement br_multicast_{enable, disable}_port functions")
Reported-by: syzbot+7bfa4b72c6a5da128d32@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/684c18bd.a00a0220.279073.000b.GAE@google.com/T/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250619182228.1656906-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 18:19:10 -07:00
Eric Dumazet
935b67675a net: make sk->sk_rcvtimeo lockless
Followup of commit 285975dd67 ("net: annotate data-races around
sk->sk_{rcv|snd}timeo").

Remove lock_sock()/release_sock() from ksmbd_tcp_rcv_timeout()
and add READ_ONCE()/WRITE_ONCE() where it is needed.

Also SO_RCVTIMEO_OLD and SO_RCVTIMEO_NEW can call sock_set_timeout()
without holding the socket lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250620155536.335520-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 17:05:12 -07:00
Eric Dumazet
3169e36ae1 net: make sk->sk_sndtimeo lockless
Followup of commit 285975dd67 ("net: annotate data-races around
sk->sk_{rcv|snd}timeo").

Remove lock_sock()/release_sock() from sock_set_sndtimeo(),
and add READ_ONCE()/WRITE_ONCE() where it is needed.

Also SO_SNDTIMEO_OLD and SO_SNDTIMEO_NEW can call sock_set_timeout()
without holding the socket lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250620155536.335520-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 17:05:11 -07:00
Eric Dumazet
c51da3f7a1 net: remove sock_i_uid()
Difference between sock_i_uid() and sk_uid() is that
after sock_orphan(), sock_i_uid() returns GLOBAL_ROOT_UID
while sk_uid() returns the last cached sk->sk_uid value.

None of sock_i_uid() callers care about this.

Use sk_uid() which is much faster and inlined.

Note that diag/dump users are calling sock_i_ino() and
can not see the full benefit yet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://patch.msgid.link/20250620133001.4090592-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 17:04:03 -07:00
Eric Dumazet
e84a4927a4 net: annotate races around sk->sk_uid
sk->sk_uid can be read while another thread changes its
value in sockfs_setattr().

Add sk_uid(const struct sock *sk) helper to factorize the needed
READ_ONCE() annotations, and add corresponding WRITE_ONCE()
where needed.

Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Link: https://patch.msgid.link/20250620133001.4090592-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 17:04:03 -07:00
Arnd Bergmann
b630c781bc caif: reduce stack size, again
I tried to fix the stack usage in this function a couple of years ago,
but there is still a problem with the latest gcc versions in some
configurations:

net/caif/cfctrl.c:553:1: error: the frame size of 1296 bytes is larger than 1280 bytes [-Werror=frame-larger-than=]

Reduce this once again, with a separate cfctrl_link_setup() function that
holds the bulk of all the local variables. It also turns out that the
param[] array that takes up a large portion of the stack is write-only
and can be left out here.

Fixes: ce6289661b ("caif: reduce stack size with KASAN")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20250620112244.3425554-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 16:58:43 -07:00
Pranav Tyagi
b04202d606 net/sched: replace strncpy with strscpy
Replace the deprecated strncpy() with the two-argument version of
strscpy() as the destination is an array
and buffer should be NUL-terminated.

Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250620103653.6957-1-pranav.tyagi03@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 16:56:51 -07:00
Pranav Tyagi
ae2402bf88 net/smc: replace strncpy with strscpy
Replace the deprecated strncpy() with two-argument version of
strscpy() as the destination is an array
and should be NUL-terminated.

Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250620102559.6365-1-pranav.tyagi03@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 16:56:21 -07:00
Breno Leitao
f599020702 net: netpoll: Initialize UDP checksum field before checksumming
commit f1fce08e63 ("netpoll: Eliminate redundant assignment") removed
the initialization of the UDP checksum, which was wrong and broke
netpoll IPv6 transmission due to bad checksumming.

udph->check needs to be set before calling csum_ipv6_magic().

Fixes: f1fce08e63 ("netpoll: Eliminate redundant assignment")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250620-netpoll_fix-v1-1-f9f0b82bc059@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 13:14:51 -07:00
Kory Maincent
96c16c59b7 ethtool: pse-pd: Add missing linux/export.h include
Fix missing linux/export.h header include in net/ethtool/pse-pd.c to resolve
build warning reported by the kernel test robot.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506200024.T3O0FWeR-lkp@intel.com/
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250619162547.1989468-1-kory.maincent@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 13:13:01 -07:00
Nikhil Jha
9c19b3315c sunrpc: fix loop in gss seqno cache
There was a silly bug in the initial implementation where a loop
variable was not incremented. This commit increments the loop variable.

This bug is somewhat tricky to catch because it can only happen on loops
of two or more. If it is hit, it locks up a kernel thread in an infinite
loop.

Signed-off-by: Nikhil Jha <njha@janestreet.com>
Tested-by: Nikhil Jha <njha@janestreet.com>
Fixes: 08d6ee6d8a ("sunrpc: implement rfc2203 rpcsec_gss seqnum cache")
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-06-23 11:01:15 -04:00
Jens Axboe
5be5726e1a Merge branch 'timestamp-for-jens' of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next into for-6.17/io_uring
Pull networking side timestamp prep patch from Jakub.

* 'timestamp-for-jens' of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next:
  net: timestamp: add helper returning skb's tx tstamp
2025-06-23 08:59:38 -06:00
Kuniyuki Iwashima
1d6123102e Bluetooth: hci_core: Fix use-after-free in vhci_flush()
syzbot reported use-after-free in vhci_flush() without repro. [0]

From the splat, a thread close()d a vhci file descriptor while
its device was being used by iotcl() on another thread.

Once the last fd refcnt is released, vhci_release() calls
hci_unregister_dev(), hci_free_dev(), and kfree() for struct
vhci_data, which is set to hci_dev->dev->driver_data.

The problem is that there is no synchronisation after unlinking
hdev from hci_dev_list in hci_unregister_dev().  There might be
another thread still accessing the hdev which was fetched before
the unlink operation.

We can use SRCU for such synchronisation.

Let's run hci_dev_reset() under SRCU and wait for its completion
in hci_unregister_dev().

Another option would be to restore hci_dev->destruct(), which was
removed in commit 587ae086f6 ("Bluetooth: Remove unused
hci-destruct cb").  However, this would not be a good solution, as
we should not run hci_unregister_dev() while there are in-flight
ioctl() requests, which could lead to another data-race KCSAN splat.

Note that other drivers seem to have the same problem, for exmaple,
virtbt_remove().

[0]:
BUG: KASAN: slab-use-after-free in skb_queue_empty_lockless include/linux/skbuff.h:1891 [inline]
BUG: KASAN: slab-use-after-free in skb_queue_purge_reason+0x99/0x360 net/core/skbuff.c:3937
Read of size 8 at addr ffff88807cb8d858 by task syz.1.219/6718

CPU: 1 UID: 0 PID: 6718 Comm: syz.1.219 Not tainted 6.16.0-rc1-syzkaller-00196-g08207f42d3ff #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:408 [inline]
 print_report+0xd2/0x2b0 mm/kasan/report.c:521
 kasan_report+0x118/0x150 mm/kasan/report.c:634
 skb_queue_empty_lockless include/linux/skbuff.h:1891 [inline]
 skb_queue_purge_reason+0x99/0x360 net/core/skbuff.c:3937
 skb_queue_purge include/linux/skbuff.h:3368 [inline]
 vhci_flush+0x44/0x50 drivers/bluetooth/hci_vhci.c:69
 hci_dev_do_reset net/bluetooth/hci_core.c:552 [inline]
 hci_dev_reset+0x420/0x5c0 net/bluetooth/hci_core.c:592
 sock_do_ioctl+0xd9/0x300 net/socket.c:1190
 sock_ioctl+0x576/0x790 net/socket.c:1311
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:907 [inline]
 __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fcf5b98e929
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fcf5c7b9038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fcf5bbb6160 RCX: 00007fcf5b98e929
RDX: 0000000000000000 RSI: 00000000400448cb RDI: 0000000000000009
RBP: 00007fcf5ba10b39 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007fcf5bbb6160 R15: 00007ffd6353d528
 </TASK>

Allocated by task 6535:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
 poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394
 kasan_kmalloc include/linux/kasan.h:260 [inline]
 __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4359
 kmalloc_noprof include/linux/slab.h:905 [inline]
 kzalloc_noprof include/linux/slab.h:1039 [inline]
 vhci_open+0x57/0x360 drivers/bluetooth/hci_vhci.c:635
 misc_open+0x2bc/0x330 drivers/char/misc.c:161
 chrdev_open+0x4c9/0x5e0 fs/char_dev.c:414
 do_dentry_open+0xdf0/0x1970 fs/open.c:964
 vfs_open+0x3b/0x340 fs/open.c:1094
 do_open fs/namei.c:3887 [inline]
 path_openat+0x2ee5/0x3830 fs/namei.c:4046
 do_filp_open+0x1fa/0x410 fs/namei.c:4073
 do_sys_openat2+0x121/0x1c0 fs/open.c:1437
 do_sys_open fs/open.c:1452 [inline]
 __do_sys_openat fs/open.c:1468 [inline]
 __se_sys_openat fs/open.c:1463 [inline]
 __x64_sys_openat+0x138/0x170 fs/open.c:1463
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 6535:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576
 poison_slab_object mm/kasan/common.c:247 [inline]
 __kasan_slab_free+0x62/0x70 mm/kasan/common.c:264
 kasan_slab_free include/linux/kasan.h:233 [inline]
 slab_free_hook mm/slub.c:2381 [inline]
 slab_free mm/slub.c:4643 [inline]
 kfree+0x18e/0x440 mm/slub.c:4842
 vhci_release+0xbc/0xd0 drivers/bluetooth/hci_vhci.c:671
 __fput+0x44c/0xa70 fs/file_table.c:465
 task_work_run+0x1d1/0x260 kernel/task_work.c:227
 exit_task_work include/linux/task_work.h:40 [inline]
 do_exit+0x6ad/0x22e0 kernel/exit.c:955
 do_group_exit+0x21c/0x2d0 kernel/exit.c:1104
 __do_sys_exit_group kernel/exit.c:1115 [inline]
 __se_sys_exit_group kernel/exit.c:1113 [inline]
 __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1113
 x64_sys_call+0x21ba/0x21c0 arch/x86/include/generated/asm/syscalls_64.h:232
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The buggy address belongs to the object at ffff88807cb8d800
 which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 88 bytes inside of
 freed 1024-byte region [ffff88807cb8d800, ffff88807cb8dc00)

Fixes: bf18c7118c ("Bluetooth: vhci: Free driver_data on file release")
Reported-by: syzbot+2faa4825e556199361f9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f62d64848fc4c7c30cd6
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-23 10:59:29 -04:00
Greg Kroah-Hartman
63dafeb392 Merge 6.16-rc3 into driver-core-next
We need the driver-core fixes that are in 6.16-rc3 into here as well
to build on top of.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-23 07:53:36 +02:00
Eric Dumazet
b993ea46b3 atm: clip: prevent NULL deref in clip_push()
Blamed commit missed that vcc_destroy_socket() calls
clip_push() with a NULL skb.

If clip_devs is NULL, clip_push() then crashes when reading
skb->truesize.

Fixes: 93a2014afb ("atm: fix a UAF in lec_arp_clear_vccs()")
Reported-by: syzbot+1316233c4c6803382a8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/68556f59.a00a0220.137b3.004e.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Gengming Liu <l.dmxcsnsbh@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-06-22 19:31:14 +01:00
Linus Torvalds
739a6c93cc nfsd-6.16 fixes:
- Two fixes for commits in the nfsd-6.16 merge
 - One fix for the recently-added NFSD netlink facility
 - One fix for a remote SunRPC crasher
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmhWykQACgkQM2qzM29m
 f5cfIA/+MtqpF8iHCNWBk2mmgQHtfS78qhQexEsQ69F1k90+bzCJ9Mu7AZT67xRZ
 2E6Fm3gyJuFPaJ9YOGewyD6OV/eE0SioPQteZIIwUVclKcoyVtqJpc57A/aQfMJm
 Te1bPWYrc5jfxoF/QzkOjzd4KWXOej2nIqsybFIDK5uTe7gn9QpPiFL8ePRuf7bJ
 5oWBTZPpmuWpDcebQrz56OT0bvvrCNH9EV9TZNYwVa/B7yl3UNHwM2YB5O9m2Rl7
 GUeWytc2Y7WgdDJe9poPwNrYyXFb8JX4pCbS54q5wCRLYLDgo8PsgsFCAFfgjx0l
 ArATKDhzfYK0azASDtJma2eCJkrm4/nXRNFON7xucN79pYRh9GChcmhhfubb8Yo3
 HopUaCSLOgy7PNsbbz3lNREn5uBpomeK1Mb88/UkJ/d8fmKGIO1ITTlarazYv2w5
 GgmzZSBS43VdUmipqNyddIMg2uy7uaeAQZpr4jHVFVctPypjXCn/zIrhG2mmjRhW
 M7PCJqtFj0a7DM48LA6r9K8N4Yi7EoSQKNAKfwduE++XXYWgBk9blaF/7om/fIRJ
 VJm/RfiuFuyGaxCLd7WpPpASeS/nq410BMV5/kxol+UgiCxk8xjsvU7y8OWLOl7S
 rUend5Yei7/h0NGoSmQPnBoYnb1YuTeMMzjfaRWSsV5gbvVDejQ=
 =1C8u
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Two fixes for commits in the nfsd-6.16 merge

 - One fix for the recently-added NFSD netlink facility

 - One fix for a remote SunRPC crasher

* tag 'nfsd-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  sunrpc: handle SVC_GARBAGE during svc auth processing as auth error
  nfsd: use threads array as-is in netlink interface
  SUNRPC: Cleanup/fix initial rq_pages allocation
  NFSD: Avoid corruption of a referring call list
2025-06-21 09:20:15 -07:00
Wang Liang
091d019adc net/smc: remove unused function smc_lo_supports_v2
The smcd_ops->supports_v2 is only called in smcd_register_dev(), which
calls function smcd_supports_v2 for ism. For loopback-ism, function
smc_lo_supports_v2 is unused, remove it.

Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250619030854.1536676-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-21 08:26:42 -07:00
Jakub Kicinski
72792461c8 net: ethtool: don't mux RXFH via rxnfc callbacks
All drivers have been converted. Stop using the rxnfc fallbacks
for Rx Flow Hashing configuration.

Joe pointed out in earlier review that in ethtool_set_rxfh()
we need both .get_rxnfc and .get_rxfh_fields, because we need
both the ring count and flow hashing (because we call
ethtool_check_flow_types()). IOW the existing check added
for transitioning drivers was buggy.

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250618203823.1336156-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-21 07:55:00 -07:00
Simon Horman
433dce0692 rds: Correct spelling
Correct spelling as flagged by codespell.
With these changes in place codespell only flags false positives
in net/rds.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20250619-rds-minor-v1-2-86d2ee3a98b9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-21 07:35:39 -07:00
Simon Horman
6e307a873d rds: Correct endian annotation of port and addr assignments
Correct the endianness annotation of port assignments:

  A host byte order value (RDS_TCP_PORT) is correctly converted to
  network byte order (big endian) using htons. But it is then cast back to
  host byte order before assigning to a variable that expects a big endian
  value.  Address this by dropping the cast.

  This is not a bug because, while the endian annotation is changed by
  this patch, the assigned value is unchanged.

Also correct the endianness of address assignment.

  A host byte order value (INADDR_ANY) is incorrectly assigned as-is to
  a variable that expects a big endian value. Address this by converting
  the value to network byte order (big endian).

  This is not a bug because INADDR_ANY is 0, which is isomorphic
  with regards to endian conversions. IOW, while the endian annotation
  is changed by this patch, the assigned value is unchanged.

Incorrect endian annotations appear to date back to IPv4-only code added
by commit 70041088e3 ("RDS: Add TCP transport to RDS").

Flagged by Sparse.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20250619-rds-minor-v1-1-86d2ee3a98b9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-21 07:35:38 -07:00
Frédéric Danis
042bb9603c Bluetooth: L2CAP: Fix L2CAP MTU negotiation
OBEX download from iPhone is currently slow due to small packet size
used to transfer data which doesn't follow the MTU negotiated during
L2CAP connection, i.e. 672 bytes instead of 32767:

  < ACL Data TX: Handle 11 flags 0x00 dlen 12
      L2CAP: Connection Request (0x02) ident 18 len 4
        PSM: 4103 (0x1007)
        Source CID: 72
  > ACL Data RX: Handle 11 flags 0x02 dlen 16
      L2CAP: Connection Response (0x03) ident 18 len 8
        Destination CID: 14608
        Source CID: 72
        Result: Connection successful (0x0000)
        Status: No further information available (0x0000)
  < ACL Data TX: Handle 11 flags 0x00 dlen 27
      L2CAP: Configure Request (0x04) ident 20 len 19
        Destination CID: 14608
        Flags: 0x0000
        Option: Maximum Transmission Unit (0x01) [mandatory]
          MTU: 32767
        Option: Retransmission and Flow Control (0x04) [mandatory]
          Mode: Enhanced Retransmission (0x03)
          TX window size: 63
          Max transmit: 3
          Retransmission timeout: 2000
          Monitor timeout: 12000
          Maximum PDU size: 1009
  > ACL Data RX: Handle 11 flags 0x02 dlen 26
      L2CAP: Configure Request (0x04) ident 72 len 18
        Destination CID: 72
        Flags: 0x0000
        Option: Retransmission and Flow Control (0x04) [mandatory]
          Mode: Enhanced Retransmission (0x03)
          TX window size: 32
          Max transmit: 255
          Retransmission timeout: 0
          Monitor timeout: 0
          Maximum PDU size: 65527
        Option: Frame Check Sequence (0x05) [mandatory]
          FCS: 16-bit FCS (0x01)
  < ACL Data TX: Handle 11 flags 0x00 dlen 29
      L2CAP: Configure Response (0x05) ident 72 len 21
        Source CID: 14608
        Flags: 0x0000
        Result: Success (0x0000)
        Option: Maximum Transmission Unit (0x01) [mandatory]
          MTU: 672
        Option: Retransmission and Flow Control (0x04) [mandatory]
          Mode: Enhanced Retransmission (0x03)
          TX window size: 32
          Max transmit: 255
          Retransmission timeout: 2000
          Monitor timeout: 12000
          Maximum PDU size: 1009
  > ACL Data RX: Handle 11 flags 0x02 dlen 32
      L2CAP: Configure Response (0x05) ident 20 len 24
        Source CID: 72
        Flags: 0x0000
        Result: Success (0x0000)
        Option: Maximum Transmission Unit (0x01) [mandatory]
          MTU: 32767
        Option: Retransmission and Flow Control (0x04) [mandatory]
          Mode: Enhanced Retransmission (0x03)
          TX window size: 63
          Max transmit: 3
          Retransmission timeout: 2000
          Monitor timeout: 12000
          Maximum PDU size: 1009
        Option: Frame Check Sequence (0x05) [mandatory]
          FCS: 16-bit FCS (0x01)
  ...
  > ACL Data RX: Handle 11 flags 0x02 dlen 680
      Channel: 72 len 676 ctrl 0x0202 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
      I-frame: Unsegmented TxSeq 1 ReqSeq 2
  < ACL Data TX: Handle 11 flags 0x00 dlen 13
      Channel: 14608 len 9 ctrl 0x0204 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
      I-frame: Unsegmented TxSeq 2 ReqSeq 2
  > ACL Data RX: Handle 11 flags 0x02 dlen 680
      Channel: 72 len 676 ctrl 0x0304 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
      I-frame: Unsegmented TxSeq 2 ReqSeq 3

The MTUs are negotiated for each direction. In this traces 32767 for
iPhone->localhost and no MTU for localhost->iPhone, which based on
'4.4 L2CAP_CONFIGURATION_REQ' (Core specification v5.4, Vol. 3, Part
A):

  The only parameters that should be included in the
  L2CAP_CONFIGURATION_REQ packet are those that require different
  values than the default or previously agreed values.
  ...
  Any missing configuration parameters are assumed to have their
  most recently explicitly or implicitly accepted values.

and '5.1 Maximum transmission unit (MTU)':

  If the remote device sends a positive L2CAP_CONFIGURATION_RSP
  packet it should include the actual MTU to be used on this channel
  for traffic flowing into the local device.
  ...
  The default value is 672 octets.

is set by BlueZ to 672 bytes.

It seems that the iPhone used the lowest negotiated value to transfer
data to the localhost instead of the negotiated one for the incoming
direction.

This could be fixed by using the MTU negotiated for the other
direction, if exists, in the L2CAP_CONFIGURATION_RSP.
This allows to use segmented packets as in the following traces:

  < ACL Data TX: Handle 11 flags 0x00 dlen 12
        L2CAP: Connection Request (0x02) ident 22 len 4
          PSM: 4103 (0x1007)
          Source CID: 72
  < ACL Data TX: Handle 11 flags 0x00 dlen 27
        L2CAP: Configure Request (0x04) ident 24 len 19
          Destination CID: 2832
          Flags: 0x0000
          Option: Maximum Transmission Unit (0x01) [mandatory]
            MTU: 32767
          Option: Retransmission and Flow Control (0x04) [mandatory]
            Mode: Enhanced Retransmission (0x03)
            TX window size: 63
            Max transmit: 3
            Retransmission timeout: 2000
            Monitor timeout: 12000
            Maximum PDU size: 1009
  > ACL Data RX: Handle 11 flags 0x02 dlen 26
        L2CAP: Configure Request (0x04) ident 15 len 18
          Destination CID: 72
          Flags: 0x0000
          Option: Retransmission and Flow Control (0x04) [mandatory]
            Mode: Enhanced Retransmission (0x03)
            TX window size: 32
            Max transmit: 255
            Retransmission timeout: 0
            Monitor timeout: 0
            Maximum PDU size: 65527
          Option: Frame Check Sequence (0x05) [mandatory]
            FCS: 16-bit FCS (0x01)
  < ACL Data TX: Handle 11 flags 0x00 dlen 29
        L2CAP: Configure Response (0x05) ident 15 len 21
          Source CID: 2832
          Flags: 0x0000
          Result: Success (0x0000)
          Option: Maximum Transmission Unit (0x01) [mandatory]
            MTU: 32767
          Option: Retransmission and Flow Control (0x04) [mandatory]
            Mode: Enhanced Retransmission (0x03)
            TX window size: 32
            Max transmit: 255
            Retransmission timeout: 2000
            Monitor timeout: 12000
            Maximum PDU size: 1009
  > ACL Data RX: Handle 11 flags 0x02 dlen 32
        L2CAP: Configure Response (0x05) ident 24 len 24
          Source CID: 72
          Flags: 0x0000
          Result: Success (0x0000)
          Option: Maximum Transmission Unit (0x01) [mandatory]
            MTU: 32767
          Option: Retransmission and Flow Control (0x04) [mandatory]
            Mode: Enhanced Retransmission (0x03)
            TX window size: 63
            Max transmit: 3
            Retransmission timeout: 2000
            Monitor timeout: 12000
            Maximum PDU size: 1009
          Option: Frame Check Sequence (0x05) [mandatory]
            FCS: 16-bit FCS (0x01)
  ...
  > ACL Data RX: Handle 11 flags 0x02 dlen 1009
        Channel: 72 len 1005 ctrl 0x4202 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
        I-frame: Start (len 21884) TxSeq 1 ReqSeq 2
  > ACL Data RX: Handle 11 flags 0x02 dlen 1009
        Channel: 72 len 1005 ctrl 0xc204 [PSM 4103 mode Enhanced Retransmission (0x03)] {chan 8}
        I-frame: Continuation TxSeq 2 ReqSeq 2

This has been tested with kernel 5.4 and BlueZ 5.77.

Cc: stable@vger.kernel.org
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-20 11:54:48 -04:00
Kavita Kavita
7c598c653a wifi: cfg80211: Add support for link reconfiguration negotiation offload to driver
In the case of SME-in-driver, the driver can internally choose to
update the links based on the AP MLD recommendation and do link
reconfiguration negotiation with AP MLD.
(e.g., After the driver processing the BSS Transition Management request
frame received from the AP MLD with Neighbor Report containing
Multi-Link element with recommended links information chooses to do link
reconfiguration negotiation with AP MLD).

To support this, extend cfg80211_mlo_reconf_add_done() and
NL80211_CMD_ASSOC_MLO_RECONF to indicate added links information for
driver-initiated link reconfiguration requests. For removed links,
the driver indicates links information using the
NL80211_CMD_LINKS_REMOVED event for driver-initiated cases, the same as
supplicant initiated cases.

For the driver-initiated case, cfg80211 will receive link
reconfiguration result asynchronously from driver so holding BSSes of
the accepted add links is needed in the event path. Also, no need of
unhold call for the rejected add link BSSes since there was no hold call
happened previously.

Once the supplicant receives the NL80211_CMD_ASSOC_MLO_RECONF event,
it needs to process the information about newly added links and install
per-link group keys (e.g., GTK/IGTK/BIGTK etc.).

In case of the SME-in-driver, using a vendor interface etc. to notify
the supplicant to initiate a link reconfiguration request and then
supplicant sending command to the cfg80211 can lead to race conditions.
The correct design to avoid this is that the driver indicates the
cfg80211 directly with the results of the link reconfiguration
negotiation.

Signed-off-by: Kavita Kavita <quic_kkavita@quicinc.com>
Link: https://patch.msgid.link/20250604105757.2542-3-quic_kkavita@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-20 10:46:31 +02:00
Rameshkumar Sundaram
2eb7c1baf4 wifi: mac80211: Fix bssid_indicator for MBSSID in AP mode
Currently, in ieee80211_assign_beacon() mbssid count is updated as link's
bssid_indicator. mbssid count is the total number of MBSSID elements in
the beacon instead of Max BSSID indicator of the Multiple BSS set.
This will result in drivers obtaining an invalid bssid_indicator for BSSes
in a Multiple BSS set.
Fix this by updating link's bssid_indicator from MBSSID element for
Transmitting BSS and update the same for all of its Non-Transmitting BSSes.

Fixes: dde78aa520 ("mac80211: update bssid_indicator in ieee80211_assign_beacon")
Signed-off-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com>
Link: https://patch.msgid.link/20250530040940.3188537-1-rameshkumar.sundaram@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-20 10:45:19 +02:00
Raj Kumar Bhagat
c9172fae4b wifi: mac80211: Allow scan on a radio while operating on DFS on another radio
Currently, in multi-radio wiphy cases, if one radio is operating on a DFS
channel, -EBUSY is returned even when a scan is requested on a different
radio. Because of this, an MLD AP with one radio (link) on a DFS channel
and Automatic Channel Selection (ACS) on another radio (link) cannot be
brought up.

In multi-radio wiphy cases, multiple radios are grouped under a single
wiphy. Hence, if a radio is operating on a DFS channel and a scan is
requested on a different radio of the same wiphy, the scan can be allowed
simultaneously without impacting the DFS operations.

Add logic to check the underlying radio used for the requested scan. If the
radio on which DFS is already running is not being used, allow the scan
operation; otherwise, return -EBUSY.

Signed-off-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Link: https://patch.msgid.link/20250527-mlo-dfs-acs-v2-3-92c2f37c81d9@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-20 10:44:05 +02:00
Aditya Kumar Singh
fe8582dbb4 wifi: mac80211: Allow DFS/CSA on a radio if scan is ongoing on another radio
Currently, in multi-radio wiphy cases, if a scan is ongoing on one radio,
-EBUSY is returned when DFS or a channel switch is initiated on another
radio. Because of this, an MLD AP with one radio (link) in an ongoing scan
cannot initiate DFS or a channel switch on another radio (link).

In multi-radio wiphy cases, multiple radios are grouped under a single
wiphy. Hence, if a scan is ongoing on one underlying radio and DFS or a
channel switch is requested on a different underlying radio of the same
wiphy, these operations can be allowed simultaneously.

Add logic to check the underlying radio used for the ongoing scan. If the
radio on which DFS or a channel switch is requested is not being used for
the scan, allow the operation; otherwise, return -EBUSY.

Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>
Co-developed-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Signed-off-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Link: https://patch.msgid.link/20250527-mlo-dfs-acs-v2-2-92c2f37c81d9@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-20 10:44:05 +02:00
Vasanthakumar Thiagarajan
df42bfc96e wifi: cfg80211: Add utility API to get radio index from channel
Add utility API cfg80211_get_radio_idx_by_chan() to retrieve the radio
index corresponding to a given channel in a multi-radio wiphy.

This utility function can be used when we want to check the radio-specific
data for a channel in a multi-radio wiphy. For example, it can help
determine the radio index required to handle a scan request. This index
can then be used to decide whether the scan can proceed without
interfering with ongoing DFS operations on another radio.

Signed-off-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
Co-developed-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Signed-off-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com>
Link: https://patch.msgid.link/20250527-mlo-dfs-acs-v2-1-92c2f37c81d9@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-20 10:44:05 +02:00
Jianbo Liu
b05d42eefa xfrm: hold device only for the asynchronous decryption
The dev_hold() on skb->dev during packet reception was originally
added to prevent the device from being released prematurely during
asynchronous decryption operations.

As current hardware can offload decryption, this asynchronous path is
not always utilized. This often results in a pattern of dev_hold()
immediately followed by dev_put() for each packet, creating
unnecessary reference counting overhead detrimental to performance.

This patch optimizes this by skipping the dev_hold() and subsequent
dev_put() when asynchronous decryption is not being performed.

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-06-20 10:39:19 +02:00
Jakub Kicinski
77f08133bc Merge branch 'ref_tracker-add-ability-to-register-a-debugfs-file-for-a-ref_tracker_dir'
Jeff Layton says:

====================
ref_tracker: add ability to register a debugfs file for a ref_tracker_dir

For those just joining in, this series adds a new top-level
"ref_tracker" debugfs directory, and has each ref_tracker_dir register a
file in there as part of its initialization. It also adds the ability to
register a symlink with a more human-usable name that points to the
file, and does some general cleanup of how the ref_tracker object names
are handled.

v14: https://lore.kernel.org/20250610-reftrack-dbgfs-v14-0-efb532861428@kernel.org
v13: https://lore.kernel.org/20250603-reftrack-dbgfs-v13-0-7b2a425019d8@kernel.org
v12: https://lore.kernel.org/20250529-reftrack-dbgfs-v12-0-11b93c0c0b6e@kernel.org
v11: https://lore.kernel.org/20250528-reftrack-dbgfs-v11-0-94ae0b165841@kernel.org
v10: https://lore.kernel.org/20250527-reftrack-dbgfs-v10-0-dc55f7705691@kernel.org
v9: https://lore.kernel.org/20250509-reftrack-dbgfs-v9-0-8ab888a4524d@kernel.org
v8: https://lore.kernel.org/20250507-reftrack-dbgfs-v8-0-607717d3bb98@kernel.org
v7: https://lore.kernel.org/20250505-reftrack-dbgfs-v7-0-f78c5d97bcca@kernel.org
v6: https://lore.kernel.org/20250430-reftrack-dbgfs-v6-0-867c29aff03a@kernel.org
v5: https://lore.kernel.org/20250428-reftrack-dbgfs-v5-0-1cbbdf2038bd@kernel.org
v4: https://lore.kernel.org/20250418-reftrack-dbgfs-v4-0-5ca5c7899544@kernel.org
v3: https://lore.kernel.org/20250417-reftrack-dbgfs-v3-0-c3159428c8fb@kernel.org
v2: https://lore.kernel.org/20250415-reftrack-dbgfs-v2-0-b18c4abd122f@kernel.org
v1: https://lore.kernel.org/20250414-reftrack-dbgfs-v1-0-f03585832203@kernel.org
====================

Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-0-24fc37ead144@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 17:02:07 -07:00
Jeff Layton
707bd05be7 ref_tracker: eliminate the ref_tracker_dir name field
Now that we have dentries and the ability to create meaningful symlinks
to them, don't keep a name string in each tracker. Switch the output
format to print "class@address", and drop the name field.

Also, add a kerneldoc header for ref_tracker_dir_init().

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-9-24fc37ead144@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 17:02:04 -07:00
Jeff Layton
8f2079f8da net: add symlinks to ref_tracker_dir for netns
After assigning the inode number to the namespace, use it to create a
unique name for each netns refcount tracker with the ns.inum and
net_cookie values in it, and register a symlink to the debugfs file for
it.

init_net is registered before the ref_tracker dir is created, so add a
late_initcall() to register its files and symlinks.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-8-24fc37ead144@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 17:02:04 -07:00
Jeff Layton
aa7d26c3c3 ref_tracker: add a static classname string to each ref_tracker_dir
A later patch in the series will be adding debugfs files for each
ref_tracker that get created in ref_tracker_dir_init(). The format will
be "class@%px". The current "name" string can vary between
ref_tracker_dir objects of the same type, so it's not suitable for this
purpose.

Add a new "class" string to the ref_tracker dir that describes the
the type of object (sans any individual info for that object).

Also, in the i915 driver, gate the creation of debugfs files on whether
the dentry pointer is still set to NULL. CI has shown that the
ref_tracker_dir can be initialized more than once.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20250618-reftrack-dbgfs-v15-4-24fc37ead144@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 17:02:04 -07:00
Breno Leitao
6ad7969a36 netpoll: Extract IPv6 address retrieval function
Extract the IPv6 address retrieval logic from netpoll_setup() into
a dedicated helper function netpoll_take_ipv6() to improve code
organization and readability.

The function handles obtaining the local IPv6 address from the
network device, including proper address type matching between
local and remote addresses (link-local vs global), and includes
appropriate error handling when IPv6 is not supported or no
suitable address is available.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250618-netpoll_ip_ref-v1-3-c2ac00fe558f@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 16:15:35 -07:00
Breno Leitao
3699f992e8 netpoll: extract IPv4 address retrieval into helper function
Move the IPv4 address retrieval logic from netpoll_setup() into a
separate netpoll_take_ipv4() function to improve code organization
and readability. This change consolidates the IPv4-specific logic
and error handling into a dedicated function while maintaining
the same functionality.

No functional changes.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250618-netpoll_ip_ref-v1-2-c2ac00fe558f@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 16:15:35 -07:00
Breno Leitao
76d30b51e8 netpoll: Extract carrier wait function
Extract the carrier waiting logic into a dedicated helper function
netpoll_wait_carrier() to improve code readability and reduce
duplication in netpoll_setup().

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250618-netpoll_ip_ref-v1-1-c2ac00fe558f@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 16:15:35 -07:00
Wang Liang
c3ee72ded0 net/smc: remove unused input parameters in smc_buf_get_slot
The input parameter "compressed_bufsize" of smc_buf_get_slot is unused,
remove it.

Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250618103342.1423913-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 16:15:34 -07:00
Eric Dumazet
f64bd2045d tcp: tcp_time_to_recover() cleanup
tcp_time_to_recover() does not need the @flag argument.

Its first parameter can be marked const, and of tcp_sock type.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250618091246.1260322-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 16:15:34 -07:00
Nicolas Escande
c7d78566bb neighbour: add support for NUD_PERMANENT proxy entries
As discussesd before in [0] proxy entries (which are more configuration
than runtime data) should stay when the link (carrier) goes does down.
This is what happens for regular neighbour entries.

So lets fix this by:
  - storing in proxy entries the fact that it was added as NUD_PERMANENT
  - not removing NUD_PERMANENT proxy entries when the carrier goes down
    (same as how it's done in neigh_flush_dev() for regular neigh entries)

[0]: https://lore.kernel.org/netdev/c584ef7e-6897-01f3-5b80-12b53f7b4bf4@kernel.org/

Signed-off-by: Nicolas Escande <nico.escande@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250617141334.3724863-1-nico.escande@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 16:15:34 -07:00
Jakub Kicinski
62deb67fc5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.16-rc3).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 13:00:24 -07:00
Jakub Kicinski
dccf87ea49 More fixes:
- ath12k
    - avoid busy-waiting
    - activate correct number of links
  - iwlwifi
    - iwldvm regression (lots of warnings)
    - iwlmld merge damage regression (crash)
    - fix build with some old gcc versions
  - carl9170: don't talk to device w/o FW [syzbot]
  - ath6kl: remove bad FW WARN [syzbot]
  - ieee80211: use variable-length arrays [syzbot]
  - mac80211
    - remove WARN on delayed beacon update [syzbot]
    - drop OCB frames with invalid source [syzbot]
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmhSg10ACgkQ10qiO8sP
 aAD10w//fdel2PXhAT6weaNRXmpUStR/TWCcKeUy6pko1BanxVbJNK9DPsCmg9Ym
 ApDewdsmTF600VAW+/JpoHCckzIGhx8Q/i5DJ0Xi/sbLweFMeDcmAsHkqVu11wTn
 gatTXlwP0/iXCL/EEtuS8fS1vs7HF29Q4QVNHPtOEaoNSm7LpE20Ukve4fBB8eXr
 oG3l7bNxW4STIFJL1+EWIXXv+tuyfORsIoD65N+NephJvTYvdKocWjjjDgnsLrPR
 HOW7Yli9G7xebXvDhdUVzdDjG0KEcarEHbyIosiPCv/xSnFZlrq4pg2zz8psHG4H
 MxeHlDgKR2gSdAeX207vB2yZ884yrRtA2XuQ7t8JrVATvXI6LE9puB+P0ssqLsvz
 jO9LjU9cfg8EoaBnK1w6vjL+OH3z2D1D8CvrkRjAUxZUKcBfk/NIQu4zsei5GuL5
 fGDSowLiMUMMyecvKbLggxMEwp3qTs9zTb4w5Vn+QtbVFjj8+HgEb4AHrQM9tr/y
 GavBZvCnKqEnv/VgkTh2Z9KfVznFfXKi3N0XR6gotVDueMeG6fkXvwFTSg6dW0UX
 c1vHJTZ0H0AHelWoSz8/1FO6NXmqU76hnW7EwcYkwkGJ3/tji1Bzyt6ihknIttKl
 AVYno+nmAC1WJNzKxJDFE+HX/F1RwowKlek/nLUMU1Ti4kzgRWo=
 =Q6JR
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2025-06-18' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
More fixes:
 - ath12k
   - avoid busy-waiting
   - activate correct number of links
 - iwlwifi
   - iwldvm regression (lots of warnings)
   - iwlmld merge damage regression (crash)
   - fix build with some old gcc versions
 - carl9170: don't talk to device w/o FW [syzbot]
 - ath6kl: remove bad FW WARN [syzbot]
 - ieee80211: use variable-length arrays [syzbot]
 - mac80211
   - remove WARN on delayed beacon update [syzbot]
   - drop OCB frames with invalid source [syzbot]

* tag 'wireless-2025-06-18' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: iwlwifi: Fix incorrect logic on cmd_ver range checking
  wifi: iwlwifi: dvm: restore n_no_reclaim_cmds setting
  wifi: iwlwifi: cfg: Limit cb_size to valid range
  wifi: iwlwifi: restore missing initialization of async_handlers_list (again)
  wifi: ath6kl: remove WARN on bad firmware input
  wifi: carl9170: do not ping device which has failed to load firmware
  wifi: ath12k: don't wait when there is no vdev started
  wifi: ath12k: don't use static variables in ath12k_wmi_fw_stats_process()
  wifi: ath12k: avoid burning CPU while waiting for firmware stats
  wifi: ath12k: fix documentation on firmware stats
  wifi: ath12k: don't activate more links than firmware supports
  wifi: ath12k: update link active in case two links fall on the same MAC
  wifi: ath12k: support WMI_MLO_LINK_SET_ACTIVE_CMDID command
  wifi: ath12k: update freq range for each hardware mode
  wifi: ath12k: parse and save sbs_lower_band_end_freq from WMI_SERVICE_READY_EXT2_EVENTID event
  wifi: ath12k: parse and save hardware mode info from WMI_SERVICE_READY_EXT_EVENTID event for later use
  wifi: ath12k: Avoid CPU busy-wait by handling VDEV_STAT and BCN_STAT
  wifi: mac80211: don't WARN for late channel/color switch
  wifi: mac80211: drop invalid source address OCB frames
  wifi: remove zero-length arrays
====================

Link: https://patch.msgid.link/20250618210642.35805-6-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 08:38:41 -07:00
Eric Dumazet
d03b79f459 net: atm: fix /proc/net/atm/lec handling
/proc/net/atm/lec must ensure safety against dev_lec[] changes.

It appears it had dev_put() calls without prior dev_hold(),
leading to imbalance and UAF.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Francois Romieu <romieu@fr.zoreil.com> # Minor atm contributor
Link: https://patch.msgid.link/20250618140844.1686882-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 08:36:31 -07:00
Eric Dumazet
d13a3824bf net: atm: add lec_mutex
syzbot found its way in net/atm/lec.c, and found an error path
in lecd_attach() could leave a dangling pointer in dev_lec[].

Add a mutex to protect dev_lecp[] uses from lecd_attach(),
lec_vcc_attach() and lec_mcast_attach().

Following patch will use this mutex for /proc/net/atm/lec.

BUG: KASAN: slab-use-after-free in lecd_attach net/atm/lec.c:751 [inline]
BUG: KASAN: slab-use-after-free in lane_ioctl+0x2224/0x23e0 net/atm/lec.c:1008
Read of size 8 at addr ffff88807c7b8e68 by task syz.1.17/6142

CPU: 1 UID: 0 PID: 6142 Comm: syz.1.17 Not tainted 6.16.0-rc1-syzkaller-00239-g08215f5486ec #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
Call Trace:
 <TASK>
  __dump_stack lib/dump_stack.c:94 [inline]
  dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
  print_address_description mm/kasan/report.c:408 [inline]
  print_report+0xcd/0x680 mm/kasan/report.c:521
  kasan_report+0xe0/0x110 mm/kasan/report.c:634
  lecd_attach net/atm/lec.c:751 [inline]
  lane_ioctl+0x2224/0x23e0 net/atm/lec.c:1008
  do_vcc_ioctl+0x12c/0x930 net/atm/ioctl.c:159
  sock_do_ioctl+0x118/0x280 net/socket.c:1190
  sock_ioctl+0x227/0x6b0 net/socket.c:1311
  vfs_ioctl fs/ioctl.c:51 [inline]
  __do_sys_ioctl fs/ioctl.c:907 [inline]
  __se_sys_ioctl fs/ioctl.c:893 [inline]
  __x64_sys_ioctl+0x18e/0x210 fs/ioctl.c:893
  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
  do_syscall_64+0xcd/0x4c0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
 </TASK>

Allocated by task 6132:
  kasan_save_stack+0x33/0x60 mm/kasan/common.c:47
  kasan_save_track+0x14/0x30 mm/kasan/common.c:68
  poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
  __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:394
  kasan_kmalloc include/linux/kasan.h:260 [inline]
  __do_kmalloc_node mm/slub.c:4328 [inline]
  __kvmalloc_node_noprof+0x27b/0x620 mm/slub.c:5015
  alloc_netdev_mqs+0xd2/0x1570 net/core/dev.c:11711
  lecd_attach net/atm/lec.c:737 [inline]
  lane_ioctl+0x17db/0x23e0 net/atm/lec.c:1008
  do_vcc_ioctl+0x12c/0x930 net/atm/ioctl.c:159
  sock_do_ioctl+0x118/0x280 net/socket.c:1190
  sock_ioctl+0x227/0x6b0 net/socket.c:1311
  vfs_ioctl fs/ioctl.c:51 [inline]
  __do_sys_ioctl fs/ioctl.c:907 [inline]
  __se_sys_ioctl fs/ioctl.c:893 [inline]
  __x64_sys_ioctl+0x18e/0x210 fs/ioctl.c:893
  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
  do_syscall_64+0xcd/0x4c0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 6132:
  kasan_save_stack+0x33/0x60 mm/kasan/common.c:47
  kasan_save_track+0x14/0x30 mm/kasan/common.c:68
  kasan_save_free_info+0x3b/0x60 mm/kasan/generic.c:576
  poison_slab_object mm/kasan/common.c:247 [inline]
  __kasan_slab_free+0x51/0x70 mm/kasan/common.c:264
  kasan_slab_free include/linux/kasan.h:233 [inline]
  slab_free_hook mm/slub.c:2381 [inline]
  slab_free mm/slub.c:4643 [inline]
  kfree+0x2b4/0x4d0 mm/slub.c:4842
  free_netdev+0x6c5/0x910 net/core/dev.c:11892
  lecd_attach net/atm/lec.c:744 [inline]
  lane_ioctl+0x1ce8/0x23e0 net/atm/lec.c:1008
  do_vcc_ioctl+0x12c/0x930 net/atm/ioctl.c:159
  sock_do_ioctl+0x118/0x280 net/socket.c:1190
  sock_ioctl+0x227/0x6b0 net/socket.c:1311
  vfs_ioctl fs/ioctl.c:51 [inline]
  __do_sys_ioctl fs/ioctl.c:907 [inline]
  __se_sys_ioctl fs/ioctl.c:893 [inline]
  __x64_sys_ioctl+0x18e/0x210 fs/ioctl.c:893

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot+8b64dec3affaed7b3af5@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6852c6f6.050a0220.216029.0018.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250618140844.1686882-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 08:36:13 -07:00
Krzysztof Kozlowski
fc27ab4890 NFC: nci: uart: Set tty->disc_data only in success path
Setting tty->disc_data before opening the NCI device means we need to
clean it up on error paths.  This also opens some short window if device
starts sending data, even before NCIUARTSETDRIVER IOCTL succeeded
(broken hardware?).  Close the window by exposing tty->disc_data only on
the success path, when opening of the NCI device and try_module_get()
succeeds.

The code differs in error path in one aspect: tty->disc_data won't be
ever assigned thus NULL-ified.  This however should not be relevant
difference, because of "tty->disc_data=NULL" in nci_uart_tty_open().

Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Fixes: 9961127d4b ("NFC: nci: add generic uart support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20250618073649.25049-2-krzysztof.kozlowski@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 08:33:54 -07:00
Kuniyuki Iwashima
10876da918 calipso: Fix null-ptr-deref in calipso_req_{set,del}attr().
syzkaller reported a null-ptr-deref in sock_omalloc() while allocating
a CALIPSO option.  [0]

The NULL is of struct sock, which was fetched by sk_to_full_sk() in
calipso_req_setattr().

Since commit a1a5344ddb ("tcp: avoid two atomic ops for syncookies"),
reqsk->rsk_listener could be NULL when SYN Cookie is returned to its
client, as hinted by the leading SYN Cookie log.

Here are 3 options to fix the bug:

  1) Return 0 in calipso_req_setattr()
  2) Return an error in calipso_req_setattr()
  3) Alaways set rsk_listener

1) is no go as it bypasses LSM, but 2) effectively disables SYN Cookie
for CALIPSO.  3) is also no go as there have been many efforts to reduce
atomic ops and make TCP robust against DDoS.  See also commit 3b24d854cb
("tcp/dccp: do not touch listener sk_refcnt under synflood").

As of the blamed commit, SYN Cookie already did not need refcounting,
and no one has stumbled on the bug for 9 years, so no CALIPSO user will
care about SYN Cookie.

Let's return an error in calipso_req_setattr() and calipso_req_delattr()
in the SYN Cookie case.

This can be reproduced by [1] on Fedora and now connect() of nc times out.

[0]:
TCP: request_sock_TCPv6: Possible SYN flooding on port [::]:20002. Sending cookies.
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
CPU: 3 UID: 0 PID: 12262 Comm: syz.1.2611 Not tainted 6.14.0 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:read_pnet include/net/net_namespace.h:406 [inline]
RIP: 0010:sock_net include/net/sock.h:655 [inline]
RIP: 0010:sock_kmalloc+0x35/0x170 net/core/sock.c:2806
Code: 89 d5 41 54 55 89 f5 53 48 89 fb e8 25 e3 c6 fd e8 f0 91 e3 00 48 8d 7b 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 26 01 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b
RSP: 0018:ffff88811af89038 EFLAGS: 00010216
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff888105266400
RDX: 0000000000000006 RSI: ffff88800c890000 RDI: 0000000000000030
RBP: 0000000000000050 R08: 0000000000000000 R09: ffff88810526640e
R10: ffffed1020a4cc81 R11: ffff88810526640f R12: 0000000000000000
R13: 0000000000000820 R14: ffff888105266400 R15: 0000000000000050
FS:  00007f0653a07640(0000) GS:ffff88811af80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f863ba096f4 CR3: 00000000163c0005 CR4: 0000000000770ef0
PKRU: 80000000
Call Trace:
 <IRQ>
 ipv6_renew_options+0x279/0x950 net/ipv6/exthdrs.c:1288
 calipso_req_setattr+0x181/0x340 net/ipv6/calipso.c:1204
 calipso_req_setattr+0x56/0x80 net/netlabel/netlabel_calipso.c:597
 netlbl_req_setattr+0x18a/0x440 net/netlabel/netlabel_kapi.c:1249
 selinux_netlbl_inet_conn_request+0x1fb/0x320 security/selinux/netlabel.c:342
 selinux_inet_conn_request+0x1eb/0x2c0 security/selinux/hooks.c:5551
 security_inet_conn_request+0x50/0xa0 security/security.c:4945
 tcp_v6_route_req+0x22c/0x550 net/ipv6/tcp_ipv6.c:825
 tcp_conn_request+0xec8/0x2b70 net/ipv4/tcp_input.c:7275
 tcp_v6_conn_request+0x1e3/0x440 net/ipv6/tcp_ipv6.c:1328
 tcp_rcv_state_process+0xafa/0x52b0 net/ipv4/tcp_input.c:6781
 tcp_v6_do_rcv+0x8a6/0x1a40 net/ipv6/tcp_ipv6.c:1667
 tcp_v6_rcv+0x505e/0x5b50 net/ipv6/tcp_ipv6.c:1904
 ip6_protocol_deliver_rcu+0x17c/0x1da0 net/ipv6/ip6_input.c:436
 ip6_input_finish+0x103/0x180 net/ipv6/ip6_input.c:480
 NF_HOOK include/linux/netfilter.h:314 [inline]
 NF_HOOK include/linux/netfilter.h:308 [inline]
 ip6_input+0x13c/0x6b0 net/ipv6/ip6_input.c:491
 dst_input include/net/dst.h:469 [inline]
 ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline]
 ip6_rcv_finish+0xb6/0x490 net/ipv6/ip6_input.c:69
 NF_HOOK include/linux/netfilter.h:314 [inline]
 NF_HOOK include/linux/netfilter.h:308 [inline]
 ipv6_rcv+0xf9/0x490 net/ipv6/ip6_input.c:309
 __netif_receive_skb_one_core+0x12e/0x1f0 net/core/dev.c:5896
 __netif_receive_skb+0x1d/0x170 net/core/dev.c:6009
 process_backlog+0x41e/0x13b0 net/core/dev.c:6357
 __napi_poll+0xbd/0x710 net/core/dev.c:7191
 napi_poll net/core/dev.c:7260 [inline]
 net_rx_action+0x9de/0xde0 net/core/dev.c:7382
 handle_softirqs+0x19a/0x770 kernel/softirq.c:561
 do_softirq.part.0+0x36/0x70 kernel/softirq.c:462
 </IRQ>
 <TASK>
 do_softirq arch/x86/include/asm/preempt.h:26 [inline]
 __local_bh_enable_ip+0xf1/0x110 kernel/softirq.c:389
 local_bh_enable include/linux/bottom_half.h:33 [inline]
 rcu_read_unlock_bh include/linux/rcupdate.h:919 [inline]
 __dev_queue_xmit+0xc2a/0x3c40 net/core/dev.c:4679
 dev_queue_xmit include/linux/netdevice.h:3313 [inline]
 neigh_hh_output include/net/neighbour.h:523 [inline]
 neigh_output include/net/neighbour.h:537 [inline]
 ip6_finish_output2+0xd69/0x1f80 net/ipv6/ip6_output.c:141
 __ip6_finish_output net/ipv6/ip6_output.c:215 [inline]
 ip6_finish_output+0x5dc/0xd60 net/ipv6/ip6_output.c:226
 NF_HOOK_COND include/linux/netfilter.h:303 [inline]
 ip6_output+0x24b/0x8d0 net/ipv6/ip6_output.c:247
 dst_output include/net/dst.h:459 [inline]
 NF_HOOK include/linux/netfilter.h:314 [inline]
 NF_HOOK include/linux/netfilter.h:308 [inline]
 ip6_xmit+0xbbc/0x20d0 net/ipv6/ip6_output.c:366
 inet6_csk_xmit+0x39a/0x720 net/ipv6/inet6_connection_sock.c:135
 __tcp_transmit_skb+0x1a7b/0x3b40 net/ipv4/tcp_output.c:1471
 tcp_transmit_skb net/ipv4/tcp_output.c:1489 [inline]
 tcp_send_syn_data net/ipv4/tcp_output.c:4059 [inline]
 tcp_connect+0x1c0c/0x4510 net/ipv4/tcp_output.c:4148
 tcp_v6_connect+0x156c/0x2080 net/ipv6/tcp_ipv6.c:333
 __inet_stream_connect+0x3a7/0xed0 net/ipv4/af_inet.c:677
 tcp_sendmsg_fastopen+0x3e2/0x710 net/ipv4/tcp.c:1039
 tcp_sendmsg_locked+0x1e82/0x3570 net/ipv4/tcp.c:1091
 tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1358
 inet6_sendmsg+0xb9/0x150 net/ipv6/af_inet6.c:659
 sock_sendmsg_nosec net/socket.c:718 [inline]
 __sock_sendmsg+0xf4/0x2a0 net/socket.c:733
 __sys_sendto+0x29a/0x390 net/socket.c:2187
 __do_sys_sendto net/socket.c:2194 [inline]
 __se_sys_sendto net/socket.c:2190 [inline]
 __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2190
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xc3/0x1d0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f06553c47ed
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f0653a06fc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00007f0655605fa0 RCX: 00007f06553c47ed
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000b
RBP: 00007f065545db38 R08: 0000200000000140 R09: 000000000000001c
R10: f7384d4ea84b01bd R11: 0000000000000246 R12: 0000000000000000
R13: 00007f0655605fac R14: 00007f0655606038 R15: 00007f06539e7000
 </TASK>
Modules linked in:

[1]:
dnf install -y selinux-policy-targeted policycoreutils netlabel_tools procps-ng nmap-ncat
mount -t selinuxfs none /sys/fs/selinux
load_policy
netlabelctl calipso add pass doi:1
netlabelctl map del default
netlabelctl map add default address:::1 protocol:calipso,1
sysctl net.ipv4.tcp_syncookies=2
nc -l ::1 80 &
nc ::1 80

Fixes: e1adea9270 ("calipso: Allow request sockets to be relabelled by the lsm.")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Reported-by: John Cheung <john.cs.hey@gmail.com>
Closes: https://lore.kernel.org/netdev/CAP=Rh=MvfhrGADy+-WJiftV2_WzMH4VEhEFmeT28qY+4yxNu4w@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://patch.msgid.link/20250617224125.17299-1-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19 08:33:09 -07:00
Jeff Layton
94d10a4dba sunrpc: handle SVC_GARBAGE during svc auth processing as auth error
tianshuo han reported a remotely-triggerable crash if the client sends a
kernel RPC server a specially crafted packet. If decoding the RPC reply
fails in such a way that SVC_GARBAGE is returned without setting the
rq_accept_statp pointer, then that pointer can be dereferenced and a
value stored there.

If it's the first time the thread has processed an RPC, then that
pointer will be set to NULL and the kernel will crash. In other cases,
it could create a memory scribble.

The server sunrpc code treats a SVC_GARBAGE return from svc_authenticate
or pg_authenticate as if it should send a GARBAGE_ARGS reply. RFC 5531
says that if authentication fails that the RPC should be rejected
instead with a status of AUTH_ERR.

Handle a SVC_GARBAGE return as an AUTH_ERROR, with a reason of
AUTH_BADCRED instead of returning GARBAGE_ARGS in that case. This
sidesteps the whole problem of touching the rpc_accept_statp pointer in
this situation and avoids the crash.

Cc: stable@kernel.org
Fixes: 29cd2927fb ("SUNRPC: Fix encoding of accepted but unsuccessful RPC replies")
Reported-by: tianshuo han <hantianshuo233@gmail.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-06-19 09:35:45 -04:00
Christian Brauner
804d679449 pidfs: remove pidfs_{get,put}_pid()
Now that we stash persistent information in struct pid there's no need
to play volatile games with pinning struct pid via dentries in pidfs.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-8-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19 14:28:24 +02:00
Guillaume Nault
d3623dd5bd ipv6: Simplify link-local address generation for IPv6 GRE.
Since commit 3e6a0243ff ("gre: Fix again IPv6 link-local address
generation."), addrconf_gre_config() has stopped handling IP6GRE
devices specially and just calls the regular addrconf_addr_gen()
function to create their link-local IPv6 addresses.

We can thus avoid using addrconf_gre_config() for IP6GRE devices and
use the normal IPv6 initialisation path instead (that is, jump directly
to addrconf_dev_config() in addrconf_init_auto_addrs()).

See commit 3e6a0243ff ("gre: Fix again IPv6 link-local address
generation.") for a deeper explanation on how and why GRE devices
started handling their IPv6 link-local address generation specially,
why it was a problem, and why this is not even necessary in most cases
(especially for GRE over IPv6).

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/a9144be9c7ec3cf09f25becae5e8fdf141fde9f6.1750075076.git.gnault@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-19 12:11:40 +02:00
Kory Maincent (Dent Project)
eeb0c8f72f net: ethtool: Add PSE port priority support feature
This patch expands the status information provided by ethtool for PSE c33
with current port priority and max port priority. It also adds a call to
pse_ethtool_set_prio() to configure the PSE port priority.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-8-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 19:00:17 -07:00
Kory Maincent (Dent Project)
1176978ed8 net: ethtool: Add support for new power domains index description
Report the index of the newly introduced PSE power domain to the user,
enabling improved management of the power budget for PSE devices.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-5-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 19:00:16 -07:00
Kory Maincent (Dent Project)
fc0e6db309 net: pse-pd: Add support for reporting events
Add support for devm_pse_irq_helper() to register PSE interrupts and report
events such as over-current or over-temperature conditions. This follows a
similar approach to the regulator API but also sends notifications using a
dedicated PSE ethtool netlink socket.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-2-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 19:00:15 -07:00
David Arinzon
cea465a96a devlink: Add new "enable_phc" generic device param
Add a new device generic parameter to enable/disable the
PHC (PTP Hardware Clock) functionality in the device associated
with the devlink instance.

Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250617110545.5659-6-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:57:29 -07:00
Stanislav Fomichev
3a321b6b1f net: remove redundant ASSERT_RTNL() in queue setup functions
The existing netdev_ops_assert_locked() already asserts that either
the RTNL lock or the per-device lock is held, making the explicit
ASSERT_RTNL() redundant.

Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-5-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:53:51 -07:00
Stanislav Fomichev
1ead750109 udp_tunnel: remove rtnl_lock dependency
Drivers that are using ops lock and don't depend on RTNL lock
still need to manage it because udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).

Cover more places in v4:

- netlink
  - udp_tunnel_notify_add_rx_port (ndo_open)
    - triggers udp_tunnel_nic_device_sync_work
  - udp_tunnel_notify_del_rx_port (ndo_stop)
    - triggers udp_tunnel_nic_device_sync_work
  - udp_tunnel_get_rx_info (__netdev_update_features)
    - triggers NETDEV_UDP_TUNNEL_PUSH_INFO
  - udp_tunnel_drop_rx_info (__netdev_update_features)
    - triggers NETDEV_UDP_TUNNEL_DROP_INFO
  - udp_tunnel_nic_reset_ntf (ndo_open)

- notifiers
  - udp_tunnel_nic_netdevice_event, depending on the event:
    - triggers NETDEV_UDP_TUNNEL_PUSH_INFO
    - triggers NETDEV_UDP_TUNNEL_DROP_INFO

- ethnl_tunnel_info_reply_size
- udp_tunnel_nic_set_port_priv (two intel drivers)

Cc: Michael Chan <michael.chan@broadcom.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-4-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:53:51 -07:00
David Wei
dbe0ca8da1 tcp: fix passive TFO socket having invalid NAPI ID
There is a bug with passive TFO sockets returning an invalid NAPI ID 0
from SO_INCOMING_NAPI_ID. Normally this is not an issue, but zero copy
receive relies on a correct NAPI ID to process sockets on the right
queue.

Fix by adding a sk_mark_napi_id_set().

Fixes: e5907459ce ("tcp: Record Rx hash and NAPI ID in tcp_child_process")
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250617212102.175711-5-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 18:30:51 -07:00
Haixia Qu
f82727adcf tipc: fix null-ptr-deref when acquiring remote ip of ethernet bearer
The reproduction steps:
1. create a tun interface
2. enable l2 bearer
3. TIPC_NL_UDP_GET_REMOTEIP with media name set to tun

tipc: Started in network mode
tipc: Node identity 8af312d38a21, cluster identity 4711
tipc: Enabled bearer <eth:syz_tun>, priority 1
Oops: general protection fault
KASAN: null-ptr-deref in range
CPU: 1 UID: 1000 PID: 559 Comm: poc Not tainted 6.16.0-rc1+ #117 PREEMPT
Hardware name: QEMU Ubuntu 24.04 PC
RIP: 0010:tipc_udp_nl_dump_remoteip+0x4a4/0x8f0

the ub was in fact a struct dev.

when bid != 0 && skip_cnt != 0, bearer_list[bid] may be NULL or
other media when other thread changes it.

fix this by checking media_id.

Fixes: 832629ca5c ("tipc: add UDP remoteip dump to netlink API")
Signed-off-by: Haixia Qu <hxqu@hillstonenet.com>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20250617055624.2680-1-hxqu@hillstonenet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 14:19:50 -07:00
Simon Horman
a9874d961e nfc: Remove checks for nla_data returning NULL
The implementation of nla_data is as follows:

static inline void *nla_data(const struct nlattr *nla)
{
	return (char *) nla + NLA_HDRLEN;
}

Excluding the case where nla is exactly -NLA_HDRLEN, it will not return
NULL. And it seems misleading to assume that it can, other than in this
corner case. So drop checks for this condition.

Flagged by Smatch.

Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250617-nfc-null-data-v1-1-c7525ead2e95@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18 14:17:32 -07:00
Neal Cardwell
d0fa59897e tcp: fix tcp_packet_delayed() for tcp_is_non_sack_preventing_reopen() behavior
After the following commit from 2024:

commit e37ab73736 ("tcp: fix to allow timestamp undo if no retransmits were sent")

...there was buggy behavior where TCP connections without SACK support
could easily see erroneous undo events at the end of fast recovery or
RTO recovery episodes. The erroneous undo events could cause those
connections to suffer repeated loss recovery episodes and high
retransmit rates.

The problem was an interaction between the non-SACK behavior on these
connections and the undo logic. The problem is that, for non-SACK
connections at the end of a loss recovery episode, if snd_una ==
high_seq, then tcp_is_non_sack_preventing_reopen() holds steady in
CA_Recovery or CA_Loss, but clears tp->retrans_stamp to 0. Then upon
the next ACK the "tcp: fix to allow timestamp undo if no retransmits
were sent" logic saw the tp->retrans_stamp at 0 and erroneously
concluded that no data was retransmitted, and erroneously performed an
undo of the cwnd reduction, restoring cwnd immediately to the value it
had before loss recovery.  This caused an immediate burst of traffic
and build-up of queues and likely another immediate loss recovery
episode.

This commit fixes tcp_packet_delayed() to ignore zero retrans_stamp
values for non-SACK connections when snd_una is at or above high_seq,
because tcp_is_non_sack_preventing_reopen() clears retrans_stamp in
this case, so it's not a valid signal that we can undo.

Note that the commit named in the Fixes footer restored long-present
behavior from roughly 2005-2019, so apparently this bug was present
for a while during that era, and this was simply not caught.

Fixes: e37ab73736 ("tcp: fix to allow timestamp undo if no retransmits were sent")
Reported-by: Eric Wheeler <netdev@lists.ewheeler.net>
Closes: https://lore.kernel.org/netdev/64ea9333-e7f9-0df-b0f2-8d566143acab@ewheeler.net/
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-06-18 09:04:34 +01:00
Kuniyuki Iwashima
7851263998 atm: Revert atm_account_tx() if copy_from_iter_full() fails.
In vcc_sendmsg(), we account skb->truesize to sk->sk_wmem_alloc by
atm_account_tx().

It is expected to be reverted by atm_pop_raw() later called by
vcc->dev->ops->send(vcc, skb).

However, vcc_sendmsg() misses the same revert when copy_from_iter_full()
fails, and then we will leak a socket.

Let's factorise the revert part as atm_return_tx() and call it in
the failure path.

Note that the corresponding sk_wmem_alloc operation can be found in
alloc_tx() as of the blamed commit.

  $ git blame -L:alloc_tx net/atm/common.c c55fa3cccbc2c~

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Simon Horman <horms@kernel.org>
Closes: https://lore.kernel.org/netdev/20250614161959.GR414686@horms.kernel.org/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250616182147.963333-3-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:42:44 -07:00
Tejun Heo
fd0406e5ca net: tcp: tsq: Convert from tasklet to BH workqueue
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts TCP Small Queues implementation from tasklet to BH
workqueue.

Semantically, this is an equivalent conversion and there shouldn't be any
user-visible behavior changes. While workqueue's queueing and execution
paths are a bit heavier than tasklet's, unless the work item is being queued
every packet, the difference hopefully shouldn't matter.

My experience with the networking stack is very limited and this patch
definitely needs attention from someone who actually understands networking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/aFBeJ38AS1ZF3Dq5@slm.duckdns.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:29:21 -07:00
Kuniyuki Iwashima
6dbb0d97c5 mpls: Use rcu_dereference_rtnl() in mpls_route_input_rcu().
As syzbot reported [0], mpls_route_input_rcu() can be called
from mpls_getroute(), where is under RTNL.

net->mpls.platform_label is only updated under RTNL.

Let's use rcu_dereference_rtnl() in mpls_route_input_rcu() to
silence the splat.

[0]:
WARNING: suspicious RCU usage
6.15.0-rc7-syzkaller-00082-g5cdb2c77c4c3 #0 Not tainted
 ----------------------------
net/mpls/af_mpls.c:84 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by syz.2.4451/17730:
 #0: ffffffff9012a3e8 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock net/core/rtnetlink.c:80 [inline]
 #0: ffffffff9012a3e8 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg+0x371/0xe90 net/core/rtnetlink.c:6961

stack backtrace:
CPU: 1 UID: 0 PID: 17730 Comm: syz.2.4451 Not tainted 6.15.0-rc7-syzkaller-00082-g5cdb2c77c4c3 #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x16c/0x1f0 lib/dump_stack.c:120
 lockdep_rcu_suspicious+0x166/0x260 kernel/locking/lockdep.c:6865
 mpls_route_input_rcu+0x1d4/0x200 net/mpls/af_mpls.c:84
 mpls_getroute+0x621/0x1ea0 net/mpls/af_mpls.c:2381
 rtnetlink_rcv_msg+0x3c9/0xe90 net/core/rtnetlink.c:6964
 netlink_rcv_skb+0x16d/0x440 net/netlink/af_netlink.c:2534
 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
 netlink_unicast+0x53a/0x7f0 net/netlink/af_netlink.c:1339
 netlink_sendmsg+0x8d1/0xdd0 net/netlink/af_netlink.c:1883
 sock_sendmsg_nosec net/socket.c:712 [inline]
 __sock_sendmsg net/socket.c:727 [inline]
 ____sys_sendmsg+0xa98/0xc70 net/socket.c:2566
 ___sys_sendmsg+0x134/0x1d0 net/socket.c:2620
 __sys_sendmmsg+0x200/0x420 net/socket.c:2709
 __do_sys_sendmmsg net/socket.c:2736 [inline]
 __se_sys_sendmmsg net/socket.c:2733 [inline]
 __x64_sys_sendmmsg+0x9c/0x100 net/socket.c:2733
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xcd/0x230 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f0a2818e969
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f0a28f52038 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00007f0a283b5fa0 RCX: 00007f0a2818e969
RDX: 0000000000000003 RSI: 0000200000000080 RDI: 0000000000000003
RBP: 00007f0a28210ab1 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f0a283b5fa0 R15: 00007ffce5e9f268
 </TASK>

Fixes: 0189197f44 ("mpls: Basic routing support")
Reported-by: syzbot+8a583bdd1a5cc0b0e068@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/68507981.a70a0220.395abc.01ef.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250616201532.1036568-1-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:21:59 -07:00
Petr Machata
96e8f5a9fe net: ipv6: Add ip6_mr_output()
Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code today. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. Thus MC
routing configuration needs to be kept in sync with the VXLAN FDB and MDB.
Ideally, the VXLAN packets would be routed by the MC routing code instead.

To that end, this patch adds support to route locally generated multicast
packets. The newly-added routines do largely what ip6_mr_input() and
ip6_mr_forward() do: make an MR cache lookup to find where to send the
packets, and use ip6_output() to send each of them. When no cache entry is
found, the packet is punted to the daemon for resolution.

Similarly to the IPv4 case in a previous patch, the new logic is contingent
on a newly-added IP6CB flag being set.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/3bcc034a3ab4d3c291072fff38f78d7fbbeef4e6.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:46 -07:00
Petr Machata
1b02f4475d net: ipv6: ip6mr: Split ip6mr_forward2() in two
Some of the work of ip6mr_forward2() is specific to IPMR forwarding, and
should not take place on the output path. In order to allow reuse of the
common parts, extract out of the function a helper,
ip6mr_prepare_forward().

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/8932bd5c0fbe3f662b158803b8509604fa7bc113.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:46 -07:00
Petr Machata
094f39d5e8 net: ipv6: ip6mr: Make ip6mr_forward2() void
Nobody uses the return value, so convert the function to void.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/e0bee259da0da58da96647ea9e21e6360c8f7e11.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:45 -07:00
Petr Machata
3365afd3ab net: ipv6: ip6mr: Fix in/out netdev to pass to the FORWARD chain
The netfilter hook is invoked with skb->dev for input netdevice, and
vif_dev for output netdevice. However at the point of invocation, skb->dev
is already set to vif_dev, and MR-forwarded packets are reported with
in=out:

 # ip6tables -A FORWARD -j LOG --log-prefix '[forw]'
 # cd tools/testing/selftests/net/forwarding
 # ./router_multicast.sh
 # dmesg | fgrep '[forw]'
 [ 1670.248245] [forw]IN=v5 OUT=v5 [...]

For reference, IPv4 MR code shows in and out as appropriate.
Fix by caching skb->dev and using the updated value for output netdev.

Fixes: 7bc570c8b4 ("[IPV6] MROUTE: Support multicast forwarding.")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/3141ae8386fbe13fef4b793faa75e6bae58d798a.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:45 -07:00
Petr Machata
f78c75d84f net: ipv6: Add a flags argument to ip6tunnel_xmit(), udp_tunnel6_xmit_skb()
ip6tunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IP6CB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel6_xmit_skb() as well.

In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IPv6 multicast routing.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/acb4f9f3e40c3a931236c3af08a720b017fbfbfb.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:45 -07:00
Petr Machata
6a7d88ca15 net: ipv6: Make udp_tunnel6_xmit_skb() void
The function always returns zero, thus the return value does not carry any
signal. Just make it void.

Most callers already ignore the return value. However:

- Refold arguments of the call from sctp_v6_xmit() so that they fit into
  the 80-column limit.

- tipc_udp_xmit() initializes err from the return value, but that should
  already be always zero at that point. So there's no practical change, but
  elision of the assignment prompts a couple more tweaks to clean up the
  function.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/7facacf9d8ca3ca9391a4aee88160913671b868d.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:45 -07:00
Petr Machata
35bec72a24 net: ipv4: Add ip_mr_output()
Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code today. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. Thus MC
routing configuration needs to be kept in sync with the VXLAN FDB and MDB.
Ideally, the VXLAN packets would be routed by the MC routing code instead.

To that end, this patch adds support to route locally generated multicast
packets. The newly-added routines do largely what ip_mr_input() and
ip_mr_forward() do: make an MR cache lookup to find where to send the
packets, and use ip_mc_output() to send each of them. When no cache entry
is found, the packet is punted to the daemon for resolution.

However, an installation that uses a VXLAN underlay netdevice for which it
also has matching MC routes, would get a different routing with this patch.
Previously, the MC packets would be delivered directly to the underlay
port, whereas now they would be MC-routed. In order to avoid this change in
behavior, introduce an IPCB flag. Only if the flag is set will
ip_mr_output() actually engage, otherwise it reverts to ip_mc_output().

This code is based on work by Roopa Prabhu and Nikolay Aleksandrov.

Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/0aadbd49330471c0f758d54afb05eb3b6e3a6b65.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:45 -07:00
Petr Machata
b2e653bcff net: ipv4: ipmr: Split ipmr_queue_xmit() in two
Some of the work of ipmr_queue_xmit() is specific to IPMR forwarding, and
should not take place on the output path. In order to allow reuse of the
common parts, split the function into two: the ipmr_prepare_xmit() helper
that takes care of the common bits, and the ipmr_queue_fwd_xmit(), which
invokes the former and encapsulates the whole forwarding algorithm.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/4e8db165572a4f8bd29a723a801e854e9d20df4d.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:45 -07:00
Petr Machata
3b7bc938e0 net: ipv4: ipmr: ipmr_queue_xmit(): Drop local variable `dev'
The variable is used for caching of rt->dst.dev. The netdevice referenced
therein does not change during the scope of validity of that local. At the
same time, the local is only used twice, and each of these uses will end up
in a different function in the following patches, further eliminating any
use the local could have had.

Drop the local altogether and inline the uses.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/c80600a4b51679fe78f429ccb6d60892c2f9e4de.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:45 -07:00
Petr Machata
e3411e326f net: ipv4: Add a flags argument to iptunnel_xmit(), udp_tunnel_xmit_skb()
iptunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IPCB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel_xmit_skb() as well.

In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IP multicast routing.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/89c9daf9f2dc088b6b92ccebcc929f51742de91f.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 18:18:44 -07:00
Álvaro Fernández Rojas
ef07df397a net: dsa: tag_brcm: add support for legacy FCS tags
Add support for legacy Broadcom FCS tags, which are similar to
DSA_TAG_PROTO_BRCM_LEGACY.
BCM5325 and BCM5365 switches require including the original FCS value and
length, as opposed to BCM63xx switches.
Adding the original FCS value and length to DSA_TAG_PROTO_BRCM_LEGACY would
impact performance of BCM63xx switches, so it's better to create a new tag.

Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250614080000.1884236-3-noltari@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 17:52:01 -07:00
Álvaro Fernández Rojas
a4daaf063f net: dsa: tag_brcm: legacy: reorganize functions
Move brcm_leg_tag_rcv() definition to top.
This function is going to be shared between two different tags.

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com>
Link: https://patch.msgid.link/20250614080000.1884236-2-noltari@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 17:51:59 -07:00
Neal Cardwell
db16319efc tcp: remove RFC3517/RFC6675 tcp_clear_retrans_hints_partial()
Now that we have removed the RFC3517/RFC6675 hints,
tcp_clear_retrans_hints_partial() is empty, and can be removed.

Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250615001435.2390793-4-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 16:19:04 -07:00
Neal Cardwell
ba4618885b tcp: remove RFC3517/RFC6675 hint state: lost_skb_hint, lost_cnt_hint
Now that obsolete RFC3517/RFC6675 TCP loss detection has been removed,
we can remove the somewhat complex and intrusive code to maintain its
hint state: lost_skb_hint and lost_cnt_hint.

This commit makes tcp_clear_retrans_hints_partial() empty. We will
remove tcp_clear_retrans_hints_partial() and its call sites in the
next commit.

Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250615001435.2390793-3-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 16:19:04 -07:00
Neal Cardwell
1c120191dc tcp: remove obsolete and unused RFC3517/RFC6675 loss recovery code
RACK-TLP loss detection has been enabled as the default loss detection
algorithm for Linux TCP since 2018, in:

 commit b38a51fec1 ("tcp: disable RFC6675 loss detection")

In case users ran into unexpected bugs or performance regressions,
that commit allowed Linux system administrators to revert to using
RFC3517/RFC6675 loss recovery by setting net.ipv4.tcp_recovery to 0.

In the seven years since 2018, our team has not heard reports of
anyone reverting Linux TCP to use RFC3517/RFC6675 loss recovery, and
we can't find any record in web searches of such a revert.

RACK-TLP was published as a standards-track RFC, RFC8985, in February
2021.

Several other major TCP implementations have default-enabled RACK-TLP
at this point as well.

RACK-TLP offers several significant performance advantages over
RFC3517/RFC6675 loss recovery, including much better performance in
the common cases of tail drops, lost retransmissions, and reordering.

It is now time to remove the obsolete and unused RFC3517/RFC6675 loss
recovery code. This will allow a substantial simplification of the
Linux TCP code base, and removes 12 bytes of state in every tcp_sock
for 64-bit machines (8 bytes on 32-bit machines).

To arrange the commits in reasonable sizes, this patch series is split
into 3 commits. The following 2 commits remove bookkeeping state and
code that is no longer needed after this removal of RFC3517/RFC6675
loss recovery.

Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250615001435.2390793-2-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 16:18:48 -07:00
Hyunwoo Kim
b160766e26 net/sched: fix use-after-free in taprio_dev_notifier
Since taprio’s taprio_dev_notifier() isn’t protected by an
RCU read-side critical section, a race with advance_sched()
can lead to a use-after-free.

Adding rcu_read_lock() inside taprio_dev_notifier() prevents this.

Fixes: fed87cc671 ("net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations")
Cc: stable@vger.kernel.org
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/aEzIYYxt0is9upYG@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 16:14:04 -07:00
Jakub Kicinski
68a4abb1ef linux-can-fixes-for-6.16-20250617
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEn/sM2K9nqF/8FWzzDHRl3/mQkZwFAmhRji0THG1rbEBwZW5n
 dXRyb25peC5kZQAKCRAMdGXf+ZCRnIhnB/4sPyASpIMeCZh20cL8WHFKhz1F2daS
 znxvsL3sQZfol4a6pEc4/lYHEcpODitUhF4sPbMgU7qlSEysZbEsfuYZw1RLy8j9
 bsIS+LqRqSslwS2/XuLwOjXly4+bgPwc4wj2RwH9+D9TlaLkZ02NZAqDNkidPxm/
 8udZxMqorVIUhqsckfsmIqVonllz7l3InSZ3gYeXVLXAMTgsvperjVtN/YivdbPm
 H1T6b6qopUEmIqDdsKelb1D9URMoiW9exDX05OpSzxZWWeTPZyJclOraKu6Otf4W
 Jd7IVgOv03prcOcerlXZoD6tEC0Amsl1RdkPiX6csAa6m63XvxjHIJyH
 =hk/l
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-6.16-20250617' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2025-06-17

The patch is by Brett Werling, and fixes the power regulator retrieval
during probe of the tcan4x5x glue code for the m_can driver.

* tag 'linux-can-fixes-for-6.16-20250617' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: tcan4x5x: fix power regulator retrieval during probe
  openvswitch: Allocate struct ovs_pcpu_storage dynamically
====================

Link: https://patch.msgid.link/20250617155123.2141584-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 15:48:38 -07:00
Mina Almasry
6f793a1d05 net: netmem: fix skb_ensure_writable with unreadable skbs
skb_ensure_writable should succeed when it's trying to write to the
header of the unreadable skbs, so it doesn't need an unconditional
skb_frags_readable check. The preceding pskb_may_pull() call will
succeed if write_len is within the head and fail if we're trying to
write to the unreadable payload, so we don't need an additional check.

Removing this check restores DSCP functionality with unreadable skbs as
it's called from dscp_tg.

Cc: willemb@google.com
Cc: asml.silence@gmail.com
Fixes: 65249feb6b ("net: add support for skbs with unreadable frags")
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250615200733.520113-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 15:48:20 -07:00
Jakub Kicinski
b9ebe0cd5d Merge branch 'io_uring-cmd-for-tx-timestamps'
Pavel Begunkov says:

====================
io_uring cmd for tx timestamps (part)

Apply the networking helpers for the io_uring timestamp API.
====================

Link: https://patch.msgid.link/cover.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 15:24:30 -07:00
Pavel Begunkov
2410251cde net: timestamp: add helper returning skb's tx tstamp
Add a helper function skb_get_tx_timestamp() that returns a tx timestamp
associated with an error queue skb.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/702357dd8936ef4c0d3864441e853bfe3224a677.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17 15:24:23 -07:00
Heiko Carstens
8a56977051 s390/drivers: Explicitly include <linux/export.h>
Explicitly include <linux/export.h> in files which contain an
EXPORT_SYMBOL().

See commit a934a57a42 ("scripts/misc-check: check missing #include
<linux/export.h> when W=1") for more details.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-06-17 18:18:02 +02:00
Sebastian Andrzej Siewior
7b4ac12cc9 openvswitch: Allocate struct ovs_pcpu_storage dynamically
PERCPU_MODULE_RESERVE defines the maximum size that can by used for the
per-CPU data size used by modules. This is 8KiB.

Commit 035fcdc4d2 ("openvswitch: Merge three per-CPU structures into
one") restructured the per-CPU memory allocation for the module and
moved the separate alloc_percpu() invocations at module init time to a
static per-CPU variable which is allocated by the module loader.

The size of the per-CPU data section for openvswitch is 6488 bytes which
is ~80% of the available per-CPU memory. Together with a few other
modules it is easy to exhaust the available 8KiB of memory.

Allocate ovs_pcpu_storage dynamically at module init time.

Reported-by: Gal Pressman <gal@nvidia.com>
Closes: https://lore.kernel.org/all/c401e017-f8db-4f57-a1cd-89beb979a277@nvidia.com
Fixes: 035fcdc4d2 ("openvswitch: Merge three per-CPU structures into one")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250613123629.-XSoQTCu@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-17 14:47:46 +02:00
Johannes Berg
d19bac3d4e wifi: mac80211: don't WARN for late channel/color switch
There's really no value in the WARN stack trace etc., the reason
for this happening isn't directly related to the calling function
anyway. Also, syzbot has been observing it constantly, and there's
no way we can resolve it there - those systems are just slow.

Instead print an error message (once) and add a comment about what
really causes this message.

Reported-by: syzbot+468656785707b0e995df@syzkaller.appspotmail.com
Reported-by: syzbot+18c783c5cf6a781e3e2c@syzkaller.appspotmail.com
Reported-by: syzbot+d5924d5cffddfccab68e@syzkaller.appspotmail.com
Reported-by: syzbot+7d73d99525d1ff7752ef@syzkaller.appspotmail.com
Reported-by: syzbot+8e6e002c74d1927edaf5@syzkaller.appspotmail.com
Reported-by: syzbot+97254a3b10c541879a65@syzkaller.appspotmail.com
Reported-by: syzbot+dfd1fd46a1960ad9c6ec@syzkaller.appspotmail.com
Reported-by: syzbot+85e0b8d12d9ca877d806@syzkaller.appspotmail.com
Link: https://patch.msgid.link/20250617104902.146e10919be1.I85f352ca4a2dce6f556e5ff45ceaa5f3769cb5ce@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-17 14:30:06 +02:00
Johannes Berg
d1b1a5eb27 wifi: mac80211: drop invalid source address OCB frames
In OCB, don't accept frames from invalid source addresses
(and in particular don't try to create stations for them),
drop the frames instead.

Reported-by: syzbot+8b512026a7ec10dcbdd9@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/6788d2d9.050a0220.20d369.0028.GAE@google.com/
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Tested-by: syzbot+8b512026a7ec10dcbdd9@syzkaller.appspotmail.com
Link: https://patch.msgid.link/20250616171838.7433379cab5d.I47444d63c72a0bd58d2e2b67bb99e1fea37eec6f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-17 14:29:49 +02:00
Jiri Slaby (SUSE)
2b5eac0f8c tty: introduce and use tty_port_tty_vhangup() helper
This code (tty_get -> vhangup -> tty_put) is repeated on few places.
Introduce a helper similar to tty_port_tty_hangup() (asynchronous) to
handle even vhangup (synchronous).

And use it on those places.

In fact, reuse the tty_port_tty_hangup()'s code and call tty_vhangup()
depending on a new bool parameter.

Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: David Lin <dtwlin@gmail.com>
Cc: Johan Hovold <johan@kernel.org>
Cc: Alex Elder <elder@kernel.org>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20250611100319.186924-2-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-17 13:42:33 +02:00
Thomas Weißschuh
2fbe82037a sysfs: treewide: switch back to bin_attribute::read()/write()
The bin_attribute argument of bin_attribute::read() is now const.
This makes the _new() callbacks unnecessary. Switch all users back.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20250530-sysfs-const-bin_attr-final-v3-3-724bfcf05b99@weissschuh.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-17 10:44:13 +02:00
Ido Schimmel
a2840d4e25 seg6: Allow End.X behavior to accept an oif
Extend the End.X behavior to accept an output interface as an optional
attribute and make use of it when resolving a route. This is needed when
user space wants to use a link-local address as the nexthop address.

Before:

 # ip route add 2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 oif eth0 dev sr6
 # ip route add 2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6
 $ ip -6 route show
 2001:db8:1::/64  encap seg6local action End.X nh6 fe80::1 dev sr6 metric 1024 pref medium
 2001:db8:2::/64  encap seg6local action End.X nh6 2001:db8:10::1 dev sr6 metric 1024 pref medium

After:

 # ip route add 2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 oif eth0 dev sr6
 # ip route add 2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6
 $ ip -6 route show
 2001:db8:1::/64  encap seg6local action End.X nh6 fe80::1 oif eth0 dev sr6 metric 1024 pref medium
 2001:db8:2::/64  encap seg6local action End.X nh6 2001:db8:10::1 dev sr6 metric 1024 pref medium

Note that the oif attribute is not dumped to user space when it was not
specified (as an oif of 0) since each entry keeps track of the optional
attributes that it parsed during configuration (see struct
seg6_local_lwt::parsed_optattrs).

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250612122323.584113-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16 15:31:14 -07:00
Ido Schimmel
3159671855 seg6: Call seg6_lookup_any_nexthop() from End.X behavior
seg6_lookup_nexthop() is a wrapper around seg6_lookup_any_nexthop().
Change End.X behavior to invoke seg6_lookup_any_nexthop() directly so
that we would not need to expose the new output interface argument
outside of the seg6local module.

No functional changes intended.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250612122323.584113-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16 15:31:14 -07:00
Ido Schimmel
01c411238c seg6: Extend seg6_lookup_any_nexthop() with an oif argument
seg6_lookup_any_nexthop() is called by the different endpoint behaviors
(e.g., End, End.X) to resolve an IPv6 route. Extend the function with an
output interface argument so that it could be used to resolve a route
with a certain output interface. This will be used by subsequent patches
that will extend the End.X behavior with an output interface as an
optional argument.

ip6_route_input_lookup() cannot be used when an output interface is
specified as it ignores this parameter. Similarly, calling
ip6_pol_route() when a table ID was not specified (e.g., End.X behavior)
is wrong.

Therefore, when an output interface is specified without a table ID,
resolve the route using ip6_route_output() which will take the output
interface into account.

Note that no endpoint behavior currently passes both a table ID and an
output interface, so the oif argument passed to ip6_pol_route() is
always zero and there are no functional changes in this regard.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250612122323.584113-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16 15:31:14 -07:00
Breno Leitao
ccc7edf0ad netpoll: move netpoll_print_options to netconsole
Move netpoll_print_options() from net/core/netpoll.c to
drivers/net/netconsole.c and make it static. This function is only used
by netconsole, so there's no need to export it or keep it in the public
netpoll API.

This reduces the netpoll API surface and improves code locality
by keeping netconsole-specific functionality within the netconsole
driver.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250613-rework-v3-4-0752bf2e6912@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16 15:18:33 -07:00
Breno Leitao
5a34c9a853 netpoll: relocate netconsole-specific functions to netconsole module
Move netpoll_parse_ip_addr() and netpoll_parse_options() from the generic
netpoll module to the netconsole module where they are actually used.

These functions were originally placed in netpoll but are only consumed by
netconsole. This refactoring improves code organization by:

 - Removing unnecessary exported symbols from netpoll
 - Making netpoll_parse_options() static (no longer needs global visibility)
 - Reducing coupling between netpoll and netconsole modules

The functions remain functionally identical - this is purely a code
reorganization to better reflect their actual usage patterns. Here are
the changes:

 1) Move both functions from netpoll to netconsole
 2) Add static to netpoll_parse_options()
 3) Removed the EXPORT_SYMBOL()

PS: This diff does not change the function format, so, it is easy to
review, but, checkpatch will not be happy. A follow-up patch will
address the current issues reported by checkpatch.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250613-rework-v3-3-0752bf2e6912@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16 15:18:33 -07:00
Breno Leitao
afb023329c netpoll: expose netpoll logging macros in public header
Move np_info(), np_err(), and np_notice() macros from internal
implementation to the public netpoll header file to make them
available for use by netpoll consumers.

These logging macros provide consistent formatting for netpoll-related
messages by automatically prefixing log output with the netpoll instance
name.

The goal is to use the exact same format that is being displayed today,
instead of creating something netconsole-specific.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250613-rework-v3-2-0752bf2e6912@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16 15:18:33 -07:00
Breno Leitao
260948993a netpoll: remove __netpoll_cleanup from exported API
Since commit 97714695ef ("net: netconsole: Defer netpoll cleanup to
avoid lock release during list traversal"), netconsole no longer uses
__netpoll_cleanup(). With no remaining users, remove this function
from the exported netpoll API.

The function remains available internally within netpoll for use by
netpoll_cleanup().

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250613-rework-v3-1-0752bf2e6912@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16 15:18:33 -07:00
Yajun Deng
0c17270f9b net: sysfs: Implement is_visible for phys_(port_id, port_name, switch_id)
phys_port_id_show, phys_port_name_show and phys_switch_id_show would
return -EOPNOTSUPP if the netdev didn't implement the corresponding
method.

There is no point in creating these files if they are unsupported.

Put these attributes in netdev_phys_group and implement the is_visible
method. make phys_(port_id, port_name, switch_id) invisible if the netdev
dosen't implement the corresponding method.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250612142707.4644-1-yajun.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-14 11:26:59 -07:00
Qiu Yutan
0051ea4aca net: arp: use kfree_skb_reason() in arp_rcv()
Replace kfree_skb() with kfree_skb_reason() in arp_rcv().

Signed-off-by: Qiu Yutan <qiu.yutan@zte.com.cn>
Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
Link: https://patch.msgid.link/20250612110259698Q2KNNOPQhnIApRskKN3Hi@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-13 17:30:02 -07:00
Yonghong Song
4fc012daf9 bpf: Fix an issue in bpf_prog_test_run_xdp when page size greater than 4K
The bpf selftest xdp_adjust_tail/xdp_adjust_frags_tail_grow failed on
arm64 with 64KB page:
   xdp_adjust_tail/xdp_adjust_frags_tail_grow:FAIL

In bpf_prog_test_run_xdp(), the xdp->frame_sz is set to 4K, but later on
when constructing frags, with 64K page size, the frag data_len could
be more than 4K. This will cause problems in bpf_xdp_frags_increase_tail().

To fix the failure, the xdp->frame_sz is set to be PAGE_SIZE so kernel
can test different page size properly. With the kernel change, the user
space and bpf prog needs adjustment. Currently, the MAX_SKB_FRAGS default
value is 17, so for 4K page, the maximum packet size will be less than 68K.
To test 64K page, a bigger maximum packet size than 68K is desired. So two
different functions are implemented for subtest xdp_adjust_frags_tail_grow.
Depending on different page size, different data input/output sizes are used
to adapt with different page size.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250612035032.2207498-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-12 19:07:51 -07:00
Hari Kalavakunta
8e16170ae9 net: ncsi: Fix buffer overflow in fetching version id
In NC-SI spec v1.2 section 8.4.44.2, the firmware name doesn't
need to be null terminated while its size occupies the full size
of the field. Fix the buffer overflow issue by adding one
additional byte for null terminator.

Signed-off-by: Hari Kalavakunta <kalavakunta.hari.prasad@gmail.com>
Reviewed-by: Paul Fertser <fercerpav@gmail.com>
Link: https://patch.msgid.link/20250610193338.1368-1-kalavakunta.hari.prasad@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 18:21:59 -07:00
Benjamin Coddington
8b3ac9faba SUNRPC: Cleanup/fix initial rq_pages allocation
While investigating some reports of memory-constrained NUMA machines
failing to mount v3 and v4.0 nfs mounts, we found that svc_init_buffer()
was not attempting to retry allocations from the bulk page allocator.
Typically, this results in a single page allocation being returned and
the mount attempt fails with -ENOMEM.  A retry would have allowed the mount
to succeed.

Additionally, it seems that the bulk allocation in svc_init_buffer() is
redundant because svc_alloc_arg() will perform the required allocation and
does the correct thing to retry the allocations.

The call to allocate memory in svc_alloc_arg() drops the preferred node
argument, but I expect we'll still allocate on the preferred node because
the allocation call happens within the svc thread context, which chooses
the node with memory closest to the current thread's execution.

This patch cleans out the bulk allocation in svc_init_buffer() to allow
svc_alloc_arg() to handle the allocation/retry logic for rq_pages.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: ed603bcf4f ("sunrpc: Replace the rq_pages array with dynamically-allocated memory")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-06-12 20:37:33 -04:00
Jakub Kicinski
9bb00786fc net: ethtool: add dedicated callbacks for getting and setting rxfh fields
We mux multiple calls to the drivers via the .get_nfc and .set_nfc
callbacks. This is slightly inconvenient to the drivers as they
have to de-mux them back. It will also be awkward for netlink code
to construct struct ethtool_rxnfc when it wants to get info about
RX Flow Hash, from the RSS module.

Add dedicated driver callbacks. Create struct ethtool_rxfh_fields
which contains only data relevant to RXFH. Maintain the names of
the fields to avoid having to heavily modify the drivers.

For now support both callbacks, once all drivers are converted
ethtool_*et_rxfh_fields() will stop using the rxnfc callbacks.

Link: https://patch.msgid.link/20250611145949.2674086-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 17:16:20 -07:00
Jakub Kicinski
fac4b41741 net: ethtool: require drivers to opt into the per-RSS ctx RXFH
RX Flow Hashing supports using different configuration for different
RSS contexts. Only two drivers seem to support it. Make sure we
uniformly error out for drivers which don't.

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250611145949.2674086-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 17:16:19 -07:00
Jakub Kicinski
2a644c5cec net: ethtool: remove the duplicated handling from rxfh and rxnfc
Now that the handles have been separated - remove the RX Flow Hash
handling from rxnfc functions and vice versa.

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250611145949.2674086-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 17:16:19 -07:00
Jakub Kicinski
f4f1265355 net: ethtool: copy the rxfh flow handling
RX Flow Hash configuration uses the same argument structure
as flow filters. This is probably why ethtool IOCTL handles
them together. The more checks we add the more convoluted
this code is getting (as some of the checks apply only
to flow filters and others only to the hashing).

Copy the code to separate the handling. This is an exact
copy, the next change will remove unnecessary handling.

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250611145949.2674086-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 17:16:19 -07:00
Jakub Kicinski
535de52801 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.16-rc2).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 10:09:10 -07:00
Linus Torvalds
27605c8c0f Including fixes from bluetooth and wireless.
Current release - regressions:
 
  - af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD
 
 Current release - new code bugs:
 
  - eth: airoha: correct enable mask for RX queues 16-31
 
  - veth: prevent NULL pointer dereference in veth_xdp_rcv when peer
    disappears under traffic
 
  - ipv6: move fib6_config_validate() to ip6_route_add(), prevent invalid
    routes
 
 Previous releases - regressions:
 
  - phy: phy_caps: don't skip better duplex match on non-exact match
 
  - dsa: b53: fix untagged traffic sent via cpu tagged with VID 0
 
  - Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused transient
    packet loss, exact reason not fully understood, yet
 
 Previous releases - always broken:
 
  - net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6)
 
  - sched: sfq: fix a potential crash on gso_skb handling
 
  - Bluetooth: intel: improve rx buffer posting to avoid causing issues
    in the firmware
 
  - eth: intel: i40e: make reset handling robust against multiple requests
 
  - eth: mlx5: ensure FW pages are always allocated on the local NUMA
    node, even when device is configure to 'serve' another node
 
  - wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850,
    prevent kernel crashes
 
  - wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
    for 3 sec if fw_stats_done is not set
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmhK/3IACgkQMUZtbf5S
 IruE5A//RdwiBW/pqoMIiRKLA3HZeUA/beYOl4DwVf8WFQNUIqdboeAi6k4yFrS+
 SykKN0s1z8fW45lA46iFv3sR0QKYGln/v/cANsqojYqKBD3PF42dRifFlEAIz2M5
 fnXK1VHPJOFK/OBOyKiiW3R6mFv+v9epZM8BKED77vFy7osDV2zkObePeE8/34B7
 yVAr6JNTpB5Ex4ziG+e/6tFF6IX9RJLBl4fkRRynLDSsb1NFuy39LxPsxRQPxnzo
 tlfHfxEFl5qDNGondUoSxmp38HoO6MRofWp1d1GZoBbTXi0gXV26I5WaaBHBqPkm
 jZ7AtIMQq2+JuEg0y4dFFRehZLwLEMuhvlbacbIOKNBngVIsploBzvbG3ntWuUa4
 Z5VFayQXumsHB5g7+vEFK6vCPaIpatKt419JsFXogNvVmmQzghALFlSymm/WbyGL
 Bj3R448xGDJw+2zDAXSH/nMMHkRaQd2Ptj2czvJ0Y7Fj8bxJgH0okaHOBrk9RQTQ
 bdUGCiMY84p6WI7rKDkFyyohMxppdYsY8A9hSdGgpqvu7dZi5yGmzz1Sp9+uSfSF
 Lj61am4LSvRsIuTP5cdqmTBn3mZS5R49hvJsFddgXRhF+Y9gB7LSm0sypZbuOEKD
 m9ijKcNETglzer0iMCwAVrIbDHGjqqHS74DkRzsuPsQ8kaCjsno=
 =0mtm
 -----END PGP SIGNATURE-----

Merge tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth and wireless.

  Current release - regressions:

   - af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD

  Current release - new code bugs:

   - eth: airoha: correct enable mask for RX queues 16-31

   - veth: prevent NULL pointer dereference in veth_xdp_rcv when peer
     disappears under traffic

   - ipv6: move fib6_config_validate() to ip6_route_add(), prevent
     invalid routes

  Previous releases - regressions:

   - phy: phy_caps: don't skip better duplex match on non-exact match

   - dsa: b53: fix untagged traffic sent via cpu tagged with VID 0

   - Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused
     transient packet loss, exact reason not fully understood, yet

  Previous releases - always broken:

   - net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6)

   - sched: sfq: fix a potential crash on gso_skb handling

   - Bluetooth: intel: improve rx buffer posting to avoid causing issues
     in the firmware

   - eth: intel: i40e: make reset handling robust against multiple
     requests

   - eth: mlx5: ensure FW pages are always allocated on the local NUMA
     node, even when device is configure to 'serve' another node

   - wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850,
     prevent kernel crashes

   - wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
     for 3 sec if fw_stats_done is not set"

* tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
  selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context
  net: ethtool: Don't check if RSS context exists in case of context 0
  af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD.
  ipv6: Move fib6_config_validate() to ip6_route_add().
  net: drv: netdevsim: don't napi_complete() from netpoll
  net/mlx5: HWS, Add error checking to hws_bwc_rule_complex_hash_node_get()
  veth: prevent NULL pointer dereference in veth_xdp_rcv
  net_sched: remove qdisc_tree_flush_backlog()
  net_sched: ets: fix a race in ets_qdisc_change()
  net_sched: tbf: fix a race in tbf_change()
  net_sched: red: fix a race in __red_change()
  net_sched: prio: fix a race in prio_tune()
  net_sched: sch_sfq: reject invalid perturb period
  net: phy: phy_caps: Don't skip better duplex macth on non-exact match
  MAINTAINERS: Update Kuniyuki Iwashima's email address.
  selftests: net: add test case for NAT46 looping back dst
  net: clear the dst when changing skb protocol
  net/mlx5e: Fix number of lanes to UNKNOWN when using data_rate_oper
  net/mlx5e: Fix leak of Geneve TLV option object
  net/mlx5: HWS, make sure the uplink is the last destination
  ...
2025-06-12 09:50:36 -07:00
Jakub Kicinski
d5705afbac Another quick round of updates:
- revert mwifiex HT40 that was causing issues
  - many ath10k/ath11k/ath12k fixes
  - re-add some iwlwifi code I lost in a merge
  - use kfree_sensitive() on an error path in cfg80211
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmhKjoYACgkQ10qiO8sP
 aAADgQ//fAOMAGzuuyxy3KwxtytpYWq/k0jb3HmHct135qxteoOSv/ah0/+nvYFD
 4BNAkDa44hqAP5ynWYgGQIqssJ0WkkZFooCzMpb3mzsN5sONy7XfkqG0M8RIC3xC
 d28nt5zDufKt+0QtWUq9pUHamm6f+4kG+LQa9kGSlUNJ3wHUMSsONTgC7T8Rpb3u
 CxW5vyeIp0OJDKN65qsN1iGqzzA5hF7j4jX2BH+NF/8eoztY3t5C/o0mpRaHqY/d
 RWB9Sm5TmIXKnEHvy8CxIwm4+5goEdRi1ua/xJAC/SWmLm3NEEQPotJASnP3+xky
 1Ft2EEGkYJHExYnGZaAHjykVY1JGNZos5gitp13325iFGLy9CyeCQ87Uml/k0uw9
 k3xZKLzbCwIr1gPy6gTn0ai2V2P3CLmDuuvKiulIvOkfVsXbtB6zCxgansmVW8Xx
 WklAX7ZMUxSvI628FFAbGW2Gt5OSPQOXRIsk4LdMO6JQs85mMRux69rIM69F4aEV
 8Kean7BjT/7RZGyd5VKjkDwYvOlh2z6+UUzShudqW7otsdA2P+W3+6yu2zPq8oC/
 KUQTHtb9wrYux1CdSOovmmIVnc2NALegKLP6rf3VlVF+51TIAPMz8QhDA3N29ArZ
 /lhImuPhDpva08pgmFyT6WkHb3lj3KAQsp/O5gLHpLKhjkSdhi8=
 =wXKD
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2025-06-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Another quick round of updates:

 - revert mwifiex HT40 that was causing issues
 - many ath10k/ath11k/ath12k fixes
 - re-add some iwlwifi code I lost in a merge
 - use kfree_sensitive() on an error path in cfg80211

* tag 'wireless-2025-06-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: cfg80211: use kfree_sensitive() for connkeys cleanup
  wifi: iwlwifi: fix merge damage related to iwl_pci_resume
  Revert "wifi: mwifiex: Fix HT40 bandwidth issue."
  wifi: ath12k: fix uaf in ath12k_core_init()
  wifi: ath12k: Fix hal_reo_cmd_status kernel-doc
  wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850
  wifi: ath11k: validate ath11k_crypto_mode on top of ath11k_core_qmi_firmware_ready
  wifi: ath11k: consistently use ath11k_mac_get_fw_stats()
  wifi: ath11k: move locking outside of ath11k_mac_get_fw_stats()
  wifi: ath11k: adjust unlock sequence in ath11k_update_stats_event()
  wifi: ath11k: move some firmware stats related functions outside of debugfs
  wifi: ath11k: don't wait when there is no vdev started
  wifi: ath11k: don't use static variables in ath11k_debugfs_fw_stats_process()
  wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
  wil6210: fix support for sparrow chipsets
  wifi: ath10k: Avoid vdev delete timeout when firmware is already down
  ath10k: snoc: fix unbalanced IRQ enable in crash recovery
====================

Link: https://patch.msgid.link/20250612082519.11447-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 08:16:47 -07:00
Gal Pressman
d78ebc772c net: ethtool: Don't check if RSS context exists in case of context 0
Context 0 (default context) always exists, there is no need to check
whether it exists or not when adding a flow steering rule.

The existing check fails when creating a flow steering rule for context
0 as it is not stored in the rss_ctx xarray.

For example:
$ ethtool --config-ntuple eth2 flow-type tcp4 dst-ip 194.237.147.23 dst-port 19983 context 0 loc 618
rmgr: Cannot insert RX class rule: Invalid argument
Cannot insert classification rule

An example usecase for this could be:
- A high-priority rule (loc 0) directing specific port traffic to
  context 0.
- A low-priority rule (loc 1) directing all other TCP traffic to context
  1.

This is a user-visible regression that was caught in our testing
environment, it was not reported by a user yet.

Fixes: de7f7582df ("net: ethtool: prevent flow steering to RSS contexts which don't exist")
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20250612071958.1696361-2-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 08:15:35 -07:00
Jakub Kicinski
d5441acae7 bluetooth pull request for net:
- eir: Fix NULL pointer deference on eir_get_service_data
  - eir: Fix possible crashes on eir_create_adv_data
  - hci_sync: Fix broadcast/PA when using an existing instance
  - ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
  - ISO: Fix not using bc_sid as advertisement SID
  - MGMT: Fix sparse errors
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmhJ66MZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKfp/D/0VTEMF4PiA2eLHIPSwyIHr
 pvpz3nY1WE84lAVL0VKNJalA15dk6TVs3Vxgns62BHLdajBOmYPpuJGXaSERBfLB
 t5eb4nU9rx9F7+SW8zVLNwtnn5bTENNYKQIjfLmslDQQGfOjeaUP5sO/rIcLEiO3
 0rEi55pE4nM6S2wUcmQlhWPC6tr3vIptg4lAz3MWlATDuUnkLjJ3rzEZdkg2kt39
 2VJGNxXEG7sBrwv+coO3ROe54YSOrb+gvd9HOL0vq3MVBcvncCRqc7TuBlYi7/5C
 p+WdEyG26FgS/TzdgMJKuVISQp6kNKulbuRhsnD2XZA3Gik+t+79Ex9haYW+HLDS
 AWQNBm1FgYdCc4LsAxKfwGdvp8wAx1ci1vLNniYVTelyUAc5LosEZ/15DCCyTKdK
 9zXEAfxwn72dLVtryVIRKqDR39QVqsxDSuV9ydgXzPJWwjisHX3AB01EqN5PGjYH
 aspNgMGfYL9zSw6N1LQ+99M+/JLbvLs7b4jui4CbD3EI7nxN0YqOcKlHw7vEje5s
 auU/UEL7DgWOzHTxCcidwATuV79pfx0CRSwsXaPLV1yA9lhS5AYdpBlsRB+wRFbN
 vhpw8dwj/WCM0GVYnG87BU3mriyfNgaERTVA2nLKZXvn+cRkVBUkLwBV3Jpi7vQZ
 cJ22gcrRj7uYotfvyCHv9g==
 =dulg
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2025-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - eir: Fix NULL pointer deference on eir_get_service_data
 - eir: Fix possible crashes on eir_create_adv_data
 - hci_sync: Fix broadcast/PA when using an existing instance
 - ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
 - ISO: Fix not using bc_sid as advertisement SID
 - MGMT: Fix sparse errors

* tag 'for-net-2025-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: MGMT: Fix sparse errors
  Bluetooth: ISO: Fix not using bc_sid as advertisement SID
  Bluetooth: ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
  Bluetooth: eir: Fix possible crashes on eir_create_adv_data
  Bluetooth: hci_sync: Fix broadcast/PA when using an existing instance
  Bluetooth: Fix NULL pointer deference on eir_get_service_data
====================

Link: https://patch.msgid.link/20250611204944.1559356-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 08:13:48 -07:00
Kuniyuki Iwashima
43fb2b30ee af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD.
Before the cited commit, the kernel unconditionally embedded SCM
credentials to skb for embryo sockets even when both the sender
and listener disabled SO_PASSCRED and SO_PASSPIDFD.

Now, the credentials are added to skb only when configured by the
sender or the listener.

However, as reported in the link below, it caused a regression for
some programs that assume credentials are included in every skb,
but sometimes not now.

The only problematic scenario would be that a socket starts listening
before setting the option.  Then, there will be 2 types of non-small
race window, where a client can send skb without credentials, which
the peer receives as an "invalid" message (and aborts the connection
it seems ?):

  Client                    Server
  ------                    ------
                            s1.listen()  <-- No SO_PASS{CRED,PIDFD}
  s2.connect()
  s2.send()  <-- w/o cred
                            s1.setsockopt(SO_PASS{CRED,PIDFD})
  s2.send()  <-- w/  cred

or

  Client                    Server
  ------                    ------
                            s1.listen()  <-- No SO_PASS{CRED,PIDFD}
  s2.connect()
  s2.send()  <-- w/o cred
                            s3, _ = s1.accept()  <-- Inherit cred options
  s2.send()  <-- w/o cred                            but not set yet

                            s3.setsockopt(SO_PASS{CRED,PIDFD})
  s2.send()  <-- w/  cred

It's unfortunate that buggy programs depend on the behaviour,
but let's restore the previous behaviour.

Fixes: 3f84d577b7 ("af_unix: Inherit sk_flags at connect().")
Reported-by: Jacek Łuczak <difrost.kernel@gmail.com>
Closes: https://lore.kernel.org/all/68d38b0b-1666-4974-85d4-15575789c8d4@gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Tested-by: Christian Heusel <christian@heusel.eu>
Tested-by: André Almeida <andrealmeid@igalia.com>
Tested-by: Jacek Łuczak <difrost.kernel@gmail.com>
Link: https://patch.msgid.link/20250611202758.3075858-1-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 08:13:06 -07:00
Kuniyuki Iwashima
b3979e3d2f ipv6: Move fib6_config_validate() to ip6_route_add().
syzkaller created an IPv6 route from a malformed packet, which has
a prefix len > 128, triggering the splat below. [0]

This is a similar issue fixed by commit 586ceac9ac ("ipv6: Restore
fib6_config validation for SIOCADDRT.").

The cited commit removed fib6_config validation from some callers
of ip6_add_route().

Let's move the validation back to ip6_route_add() and
ip6_route_multipath_add().

[0]:
UBSAN: array-index-out-of-bounds in ./include/net/ipv6.h:616:34
index 20 is out of range for type '__u8 [16]'
CPU: 1 UID: 0 PID: 7444 Comm: syz.0.708 Not tainted 6.16.0-rc1-syzkaller-g19272b37aa4f #0 PREEMPT
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff80078a80>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:132
[<ffffffff8000327a>] show_stack+0x30/0x3c arch/riscv/kernel/stacktrace.c:138
[<ffffffff80061012>] __dump_stack lib/dump_stack.c:94 [inline]
[<ffffffff80061012>] dump_stack_lvl+0x12e/0x1a6 lib/dump_stack.c:120
[<ffffffff800610a6>] dump_stack+0x1c/0x24 lib/dump_stack.c:129
[<ffffffff8001c0ea>] ubsan_epilogue+0x14/0x46 lib/ubsan.c:233
[<ffffffff819ba290>] __ubsan_handle_out_of_bounds+0xf6/0xf8 lib/ubsan.c:455
[<ffffffff85b363a4>] ipv6_addr_prefix include/net/ipv6.h:616 [inline]
[<ffffffff85b363a4>] ip6_route_info_create+0x8f8/0x96e net/ipv6/route.c:3793
[<ffffffff85b635da>] ip6_route_add+0x2a/0x1aa net/ipv6/route.c:3889
[<ffffffff85b02e08>] addrconf_prefix_route+0x2c4/0x4e8 net/ipv6/addrconf.c:2487
[<ffffffff85b23bb2>] addrconf_prefix_rcv+0x1720/0x1e62 net/ipv6/addrconf.c:2878
[<ffffffff85b92664>] ndisc_router_discovery+0x1a06/0x3504 net/ipv6/ndisc.c:1570
[<ffffffff85b99038>] ndisc_rcv+0x500/0x600 net/ipv6/ndisc.c:1874
[<ffffffff85bc2c18>] icmpv6_rcv+0x145e/0x1e0a net/ipv6/icmp.c:988
[<ffffffff85af6798>] ip6_protocol_deliver_rcu+0x18a/0x1976 net/ipv6/ip6_input.c:436
[<ffffffff85af8078>] ip6_input_finish+0xf4/0x174 net/ipv6/ip6_input.c:480
[<ffffffff85af8262>] NF_HOOK include/linux/netfilter.h:317 [inline]
[<ffffffff85af8262>] NF_HOOK include/linux/netfilter.h:311 [inline]
[<ffffffff85af8262>] ip6_input+0x16a/0x70c net/ipv6/ip6_input.c:491
[<ffffffff85af8dcc>] ip6_mc_input+0x5c8/0x1268 net/ipv6/ip6_input.c:588
[<ffffffff85af6112>] dst_input include/net/dst.h:469 [inline]
[<ffffffff85af6112>] ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline]
[<ffffffff85af6112>] NF_HOOK include/linux/netfilter.h:317 [inline]
[<ffffffff85af6112>] NF_HOOK include/linux/netfilter.h:311 [inline]
[<ffffffff85af6112>] ipv6_rcv+0x5ae/0x6e0 net/ipv6/ip6_input.c:309
[<ffffffff85087e84>] __netif_receive_skb_one_core+0x106/0x16e net/core/dev.c:5977
[<ffffffff85088104>] __netif_receive_skb+0x2c/0x144 net/core/dev.c:6090
[<ffffffff850883c6>] netif_receive_skb_internal net/core/dev.c:6176 [inline]
[<ffffffff850883c6>] netif_receive_skb+0x1aa/0xbf2 net/core/dev.c:6235
[<ffffffff8328656e>] tun_rx_batched.isra.0+0x430/0x686 drivers/net/tun.c:1485
[<ffffffff8329ed3a>] tun_get_user+0x2952/0x3d6c drivers/net/tun.c:1938
[<ffffffff832a21e0>] tun_chr_write_iter+0xc4/0x21c drivers/net/tun.c:1984
[<ffffffff80b9b9ae>] new_sync_write fs/read_write.c:593 [inline]
[<ffffffff80b9b9ae>] vfs_write+0x56c/0xa9a fs/read_write.c:686
[<ffffffff80b9c2be>] ksys_write+0x126/0x228 fs/read_write.c:738
[<ffffffff80b9c42e>] __do_sys_write fs/read_write.c:749 [inline]
[<ffffffff80b9c42e>] __se_sys_write fs/read_write.c:746 [inline]
[<ffffffff80b9c42e>] __riscv_sys_write+0x6e/0x94 fs/read_write.c:746
[<ffffffff80076912>] syscall_handler+0x94/0x118 arch/riscv/include/asm/syscall.h:112
[<ffffffff8637e31e>] do_trap_ecall_u+0x396/0x530 arch/riscv/kernel/traps.c:341
[<ffffffff863a69e2>] handle_exception+0x146/0x152 arch/riscv/kernel/entry.S:197

Fixes: fa76c1674f ("ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().")
Reported-by: syzbot+4c2358694722d304c44e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6849b8c3.a00a0220.1eb5f5.00f0.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250611193551.2999991-1-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 08:12:35 -07:00
Eric Dumazet
d92adacdd8 net_sched: ets: fix a race in ets_qdisc_change()
Gerrard Tai reported a race condition in ETS, whenever SFQ perturb timer
fires at the wrong time.

The race is as follows:

CPU 0                                 CPU 1
[1]: lock root
[2]: qdisc_tree_flush_backlog()
[3]: unlock root
 |
 |                                    [5]: lock root
 |                                    [6]: rehash
 |                                    [7]: qdisc_tree_reduce_backlog()
 |
[4]: qdisc_put()

This can be abused to underflow a parent's qlen.

Calling qdisc_purge_queue() instead of qdisc_tree_flush_backlog()
should fix the race, because all packets will be purged from the qdisc
before releasing the lock.

Fixes: b05972f01e ("net: sched: tbf: don't call qdisc_put() while holding tree lock")
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Suggested-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250611111515.1983366-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 08:05:50 -07:00
Eric Dumazet
43eb466041 net_sched: tbf: fix a race in tbf_change()
Gerrard Tai reported a race condition in TBF, whenever SFQ perturb timer
fires at the wrong time.

The race is as follows:

CPU 0                                 CPU 1
[1]: lock root
[2]: qdisc_tree_flush_backlog()
[3]: unlock root
 |
 |                                    [5]: lock root
 |                                    [6]: rehash
 |                                    [7]: qdisc_tree_reduce_backlog()
 |
[4]: qdisc_put()

This can be abused to underflow a parent's qlen.

Calling qdisc_purge_queue() instead of qdisc_tree_flush_backlog()
should fix the race, because all packets will be purged from the qdisc
before releasing the lock.

Fixes: b05972f01e ("net: sched: tbf: don't call qdisc_put() while holding tree lock")
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Suggested-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://patch.msgid.link/20250611111515.1983366-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 08:05:50 -07:00
Eric Dumazet
85a3e0ede3 net_sched: red: fix a race in __red_change()
Gerrard Tai reported a race condition in RED, whenever SFQ perturb timer
fires at the wrong time.

The race is as follows:

CPU 0                                 CPU 1
[1]: lock root
[2]: qdisc_tree_flush_backlog()
[3]: unlock root
 |
 |                                    [5]: lock root
 |                                    [6]: rehash
 |                                    [7]: qdisc_tree_reduce_backlog()
 |
[4]: qdisc_put()

This can be abused to underflow a parent's qlen.

Calling qdisc_purge_queue() instead of qdisc_tree_flush_backlog()
should fix the race, because all packets will be purged from the qdisc
before releasing the lock.

Fixes: 0c8d13ac96 ("net: sched: red: delay destroying child qdisc on replace")
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Suggested-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250611111515.1983366-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 08:05:49 -07:00
Eric Dumazet
d35acc1be3 net_sched: prio: fix a race in prio_tune()
Gerrard Tai reported a race condition in PRIO, whenever SFQ perturb timer
fires at the wrong time.

The race is as follows:

CPU 0                                 CPU 1
[1]: lock root
[2]: qdisc_tree_flush_backlog()
[3]: unlock root
 |
 |                                    [5]: lock root
 |                                    [6]: rehash
 |                                    [7]: qdisc_tree_reduce_backlog()
 |
[4]: qdisc_put()

This can be abused to underflow a parent's qlen.

Calling qdisc_purge_queue() instead of qdisc_tree_flush_backlog()
should fix the race, because all packets will be purged from the qdisc
before releasing the lock.

Fixes: 7b8e0b6e65 ("net: sched: prio: delay destroying child qdiscs on change")
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Suggested-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250611111515.1983366-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 08:05:49 -07:00
Eric Dumazet
7ca52541c0 net_sched: sch_sfq: reject invalid perturb period
Gerrard Tai reported that SFQ perturb_period has no range check yet,
and this can be used to trigger a race condition fixed in a separate patch.

We want to make sure ctl->perturb_period * HZ will not overflow
and is positive.

Tested:

tc qd add dev lo root sfq perturb -10   # negative value : error
Error: sch_sfq: invalid perturb period.

tc qd add dev lo root sfq perturb 1000000000 # too big : error
Error: sch_sfq: invalid perturb period.

tc qd add dev lo root sfq perturb 2000000 # acceptable value
tc -s -d qd sh dev lo
qdisc sfq 8005: root refcnt 2 limit 127p quantum 64Kb depth 127 flows 128 divisor 1024 perturb 2000000sec
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250611083501.1810459-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12 08:03:08 -07:00
Leon Romanovsky
c0f21029f1 xfrm: always initialize offload path
Offload path is used for GRO with SW IPsec, and not just for HW
offload. So initialize it anyway.

Fixes: 585b64f5a6 ("xfrm: delay initialization of offload path till its actually requested")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Closes: https://lore.kernel.org/all/aEGW_5HfPqU1rFjl@krikkit
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-06-12 07:07:14 +02:00
Jakub Kicinski
ba9db6f907 net: clear the dst when changing skb protocol
A not-so-careful NAT46 BPF program can crash the kernel
if it indiscriminately flips ingress packets from v4 to v6:

  BUG: kernel NULL pointer dereference, address: 0000000000000000
    ip6_rcv_core (net/ipv6/ip6_input.c:190:20)
    ipv6_rcv (net/ipv6/ip6_input.c:306:8)
    process_backlog (net/core/dev.c:6186:4)
    napi_poll (net/core/dev.c:6906:9)
    net_rx_action (net/core/dev.c:7028:13)
    do_softirq (kernel/softirq.c:462:3)
    netif_rx (net/core/dev.c:5326:3)
    dev_loopback_xmit (net/core/dev.c:4015:2)
    ip_mc_finish_output (net/ipv4/ip_output.c:363:8)
    NF_HOOK (./include/linux/netfilter.h:314:9)
    ip_mc_output (net/ipv4/ip_output.c:400:5)
    dst_output (./include/net/dst.h:459:9)
    ip_local_out (net/ipv4/ip_output.c:130:9)
    ip_send_skb (net/ipv4/ip_output.c:1496:8)
    udp_send_skb (net/ipv4/udp.c:1040:8)
    udp_sendmsg (net/ipv4/udp.c:1328:10)

The output interface has a 4->6 program attached at ingress.
We try to loop the multicast skb back to the sending socket.
Ingress BPF runs as part of netif_rx(), pushes a valid v6 hdr
and changes skb->protocol to v6. We enter ip6_rcv_core which
tries to use skb_dst(). But the dst is still an IPv4 one left
after IPv4 mcast output.

Clear the dst in all BPF helpers which change the protocol.
Try to preserve metadata dsts, those may carry non-routing
metadata.

Cc: stable@vger.kernel.org
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Fixes: d219df60a7 ("bpf: Add ipip6 and ip6ip decap support for bpf_skb_adjust_room()")
Fixes: 1b00e0dfe7 ("bpf: update skb->protocol in bpf_skb_net_grow")
Fixes: 6578171a7f ("bpf: add bpf_skb_change_proto helper")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250610001245.1981782-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-11 17:02:29 -07:00
Luiz Augusto von Dentz
7dd38ba4ac Bluetooth: MGMT: Fix sparse errors
This fixes the following errors:

net/bluetooth/mgmt.c:5400:59: sparse: sparse: incorrect type in argument 3
(different base types) @@     expected unsigned short [usertype] handle @@
got restricted __le16 [usertype] monitor_handle @@
net/bluetooth/mgmt.c:5400:59: sparse:     expected unsigned short [usertype] handle
net/bluetooth/mgmt.c:5400:59: sparse:     got restricted __le16 [usertype] monitor_handle

Fixes: e6ed54e86a ("Bluetooth: MGMT: Fix UAF on mgmt_remove_adv_monitor_complete")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506060347.ux2O1p7L-lkp@intel.com/
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-11 16:39:25 -04:00
Luiz Augusto von Dentz
5842c01a9e Bluetooth: ISO: Fix not using bc_sid as advertisement SID
Currently bc_sid is being ignore when acting as Broadcast Source role,
so this fix it by passing the bc_sid and then use it when programming
the PA:

< HCI Command: LE Set Exte.. (0x08|0x0036) plen 25
        Handle: 0x01
        Properties: 0x0000
        Min advertising interval: 140.000 msec (0x00e0)
        Max advertising interval: 140.000 msec (0x00e0)
        Channel map: 37, 38, 39 (0x07)
        Own address type: Random (0x01)
        Peer address type: Public (0x00)
        Peer address: 00:00:00:00:00:00 (OUI 00-00-00)
        Filter policy: Allow Scan Request from Any, Allow Connect Request from Any (0x00)
        TX power: Host has no preference (0x7f)
        Primary PHY: LE 1M (0x01)
        Secondary max skip: 0x00
        Secondary PHY: LE 2M (0x02)
        SID: 0x01
        Scan request notifications: Disabled (0x00)

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-11 16:29:55 -04:00
Luiz Augusto von Dentz
2df108c227 Bluetooth: ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
BT_SK_PA_SYNC is only valid for Broadcast Sinks which means socket used
for Broadcast Sources wouldn't be able to use the likes of getpeername
to read out the sockaddr_iso_bc fields which may have been update (e.g.
bc_sid).

Fixes: 0a766a0aff ("Bluetooth: ISO: Fix getpeername not returning sockaddr_iso_bc fields")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-11 16:29:39 -04:00
Luiz Augusto von Dentz
47c0390226 Bluetooth: eir: Fix possible crashes on eir_create_adv_data
eir_create_adv_data may attempt to add EIR_FLAGS and EIR_TX_POWER
without checking if that would fit.

Link: https://github.com/bluez/bluez/issues/1117#issuecomment-2958244066
Fixes: 01ce70b0a2 ("Bluetooth: eir: Move EIR/Adv Data functions to its own file")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-11 16:29:22 -04:00
Luiz Augusto von Dentz
5725bc6082 Bluetooth: hci_sync: Fix broadcast/PA when using an existing instance
When using and existing adv_info instance for broadcast source it
needs to be updated to periodic first before it can be reused, also in
case the existing instance already have data hci_set_adv_instance_data
cannot be used directly since it would overwrite the existing data so
this reappend the original data after the Broadcast ID, if one was
generated.

Example:

bluetoothctl># Add PBP to EA so it can be later referenced as the BIS ID
bluetoothctl> advertise.service 0x1856 0x00 0x00
bluetoothctl> advertise on
...
< HCI Command: LE Set Extended Advertising Data (0x08|0x0037) plen 13
        Handle: 0x01
        Operation: Complete extended advertising data (0x03)
        Fragment preference: Minimize fragmentation (0x01)
        Data length: 0x09
        Service Data: Public Broadcast Announcement (0x1856)
          Data[2]: 0000
        Flags: 0x06
          LE General Discoverable Mode
          BR/EDR Not Supported
...
bluetoothctl># Attempt to acquire Broadcast Source transport
bluetoothctl>transport.acquire /org/bluez/hci0/pac_bcast0/fd0
...
< HCI Command: LE Set Extended Advertising Data (0x08|0x0037) plen 255
        Handle: 0x01
        Operation: Complete extended advertising data (0x03)
        Fragment preference: Minimize fragmentation (0x01)
        Data length: 0x0e
        Service Data: Broadcast Audio Announcement (0x1852)
        Broadcast ID: 11371620 (0xad8464)
        Service Data: Public Broadcast Announcement (0x1856)
          Data[2]: 0000
        Flags: 0x06
          LE General Discoverable Mode
          BR/EDR Not Supported

Link: https://github.com/bluez/bluez/issues/1117
Fixes: eca0ae4aea ("Bluetooth: Add initial implementation of BIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-11 16:27:29 -04:00
Luiz Augusto von Dentz
20a2aa01f5 Bluetooth: Fix NULL pointer deference on eir_get_service_data
The len parameter is considered optional so it can be NULL so it cannot
be used for skipping to next entry of EIR_SERVICE_DATA.

Fixes: 8f9ae5b3ae ("Bluetooth: eir: Add helpers for managing service data")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-11 15:59:07 -04:00
Charalampos Mitrodimas
7f12c33850 net, bpf: Fix RCU usage in task_cls_state() for BPF programs
The commit ee971630f2 ("bpf: Allow some trace helpers for all prog
types") made bpf_get_cgroup_classid_curr helper available to all BPF
program types, not just networking programs.

This helper calls __task_get_classid() which internally calls
task_cls_state() requiring rcu_read_lock_bh_held(). This works
in networking/tc context where RCU BH is held, but triggers an RCU
warning when called from other contexts like BPF syscall programs
that run under rcu_read_lock_trace():

  WARNING: suspicious RCU usage
  6.15.0-rc4-syzkaller-g079e5c56a5c4 #0 Not tainted
  -----------------------------
  net/core/netclassid_cgroup.c:24 suspicious rcu_dereference_check() usage!

Fix this by also accepting rcu_read_lock_held() and
rcu_read_lock_trace_held() as valid RCU contexts in the
task_cls_state() function. This ensures the helper works correctly
in all needed RCU contexts where it might be called, regular RCU,
RCU BH (for networking), and RCU trace (for BPF syscall programs).

Fixes: ee971630f2 ("bpf: Allow some trace helpers for all prog types")
Reported-by: syzbot+b4169a1cfb945d2ed0ec@syzkaller.appspotmail.com
Signed-off-by: Charalampos Mitrodimas <charmitro@posteo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250611-rcu-fix-task_cls_state-v3-1-3d30e1de753f@posteo.net
Closes: https://syzkaller.appspot.com/bug?extid=b4169a1cfb945d2ed0ec
2025-06-11 21:30:29 +02:00
Al Viro
fe3c5120d6 devpts, sunrpc, hostfs: don't bother with ->d_op
Default ->d_op being simple_dentry_operations is equivalent to leaving
it NULL and putting DCACHE_DONTCACHE into ->s_d_flags.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-11 13:40:04 -04:00
Jiayuan Chen
178f6a5c8c bpf, ktls: Fix data corruption when using bpf_msg_pop_data() in ktls
When sending plaintext data, we initially calculated the corresponding
ciphertext length. However, if we later reduced the plaintext data length
via socket policy, we failed to recalculate the ciphertext length.

This results in transmitting buffers containing uninitialized data during
ciphertext transmission.

This causes uninitialized bytes to be appended after a complete
"Application Data" packet, leading to errors on the receiving end when
parsing TLS record.

Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20250609020910.397930-2-jiayuan.chen@linux.dev
2025-06-11 16:59:42 +02:00
Christian Brauner
9b0240b3cc
netns: use stable inode number for initial mount ns
Apart from the network and mount namespace all other namespaces expose a
stable inode number and userspace has been relying on that for a very
long time now. It's very much heavily used API. Align the network
namespace and use a stable inode number from the reserved procfs inode
number space so this is consistent across all namespaces.

Link: https://lore.kernel.org/20250606-work-nsfs-v1-2-b8749c9a8844@kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-11 11:59:08 +02:00
Zilin Guan
f87586598f wifi: cfg80211: use kfree_sensitive() for connkeys cleanup
The nl80211_parse_connkeys() function currently uses kfree() to release
the 'result' structure in error handling paths. However, if an error
occurs due to result->def being less than 0, the 'result' structure may
contain sensitive information.

To prevent potential leakage of sensitive data, replace kfree() with
kfree_sensitive() when freeing 'result'. This change aligns with the
approach used in its caller, nl80211_join_ibss(), enhancing the overall
security of the wireless subsystem.

Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Link: https://patch.msgid.link/20250523110156.4017111-1-zilin@seu.edu.cn
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-11 11:36:56 +02:00
Al Viro
05fb0e6664 new helper: set_default_d_op()
... to be used instead of manually assigning to ->s_d_op.
All in-tree filesystem converted (and field itself is renamed,
so any out-of-tree ones in need of conversion will be caught
by compiler).

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-10 22:21:16 -04:00
Samiullah Khawaja
689883de94 net: stop napi kthreads when THREADED napi is disabled
Once the THREADED napi is disabled, the napi kthread should also be
stopped. Keeping the kthread intact after disabling THREADED napi makes
the PID of this kthread show up in the output of netlink 'napi-get' and
ps -ef output.

The is discussed in the patch below:
https://lore.kernel.org/all/20250502191548.559cc416@kernel.org

NAPI kthread should stop only if,

- There are no pending napi poll scheduled for this thread.
- There are no new napi poll scheduled for this thread while it has
  stopped.
- The ____napi_schedule can correctly fallback to the softirq for napi
  polling.

Since napi_schedule_prep provides mutual exclusion over STATE_SCHED bit,
it is safe to unset the STATE_THREADED when SCHED_THREADED is set or the
SCHED bit is not set. SCHED_THREADED being set means that SCHED is
already set and the kthread owns this napi.

To disable threaded napi, unset STATE_THREADED bit safely if
SCHED_THREADED is set or SCHED is unset. Once STATE_THREADED is unset
safely then wait for the kthread to unset the SCHED_THREADED bit so it
safe to stop the kthread.

Add a new test in nl_netdev to verify this behaviour.

Tested:
 ./tools/testing/selftests/net/nl_netdev.py
 TAP version 13
 1..6
 ok 1 nl_netdev.empty_check
 ok 2 nl_netdev.lo_check
 ok 3 nl_netdev.page_pool_check
 ok 4 nl_netdev.napi_list_check
 ok 5 nl_netdev.dev_set_threaded
 ok 6 nl_netdev.nsim_rxq_reset_down
 # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0

Ran neper for 300 seconds and did enable/disable of thread napi in a
loop continuously.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Link: https://patch.msgid.link/20250609173015.3851695-1-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10 17:52:03 -07:00
Jakub Kicinski
34355b6712 linux-can-next-for-6.17-20250610
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEn/sM2K9nqF/8FWzzDHRl3/mQkZwFAmhH/swTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRAMdGXf+ZCRnONCCACa16bTW53gBzmiTxdEgUJ/h+gQuR8G
 Fj+yOYIWNZY/YOExa40ldApu3iB9UAB0D+FOly4Wv5zYDct6yNBxqtZjbkTFMaoi
 3i+SSrRLNtIxgGs1KgJKVPis8mhCqiBL0aGoJDGyRiye6hotECDyQWvlGM3lMGUr
 wdMDQW2xyKOWvm++jXijkUMyKThmI7czlSH8al+JU9KcAO9hiUlGzejdI56KUIMW
 TRlg2QSK9CfIzgUP4RQughbF59/8Xbq3LOidu50xMad2wiOJj0IUHB0h6LoAshnS
 jFy4Ox4Gw5hcmdaEKazjYEtq3nQeZ6wct7jThw02e4D9h0ac2MCVhphk
 =Pt9d
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-6.17-20250610' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2025-06-10

The first 4 patches are by Vincent Mailhol and prepare the CAN netlink
interface for the introduction of CAN XL configuration.

Geert Uytterhoeven's patch updates the CAN networking documentation.

The last 2 patched are by Davide Caratti and introduce skb drop
reasons in the receive path of several CAN protocols.

* tag 'linux-can-next-for-6.17-20250610' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
  can: add drop reasons in CAN protocols receive path
  can: add drop reasons in the receive path of AF_CAN
  documentation: networking: can: Document alloc_candev_mqs()
  can: netlink: can_changelink(): rename tdc_mask into fd_tdc_flag_provided
  can: bittiming: rename can_tdc_is_enabled() into can_fd_tdc_is_enabled()
  can: bittiming: rename CAN_CTRLMODE_TDC_MASK into CAN_CTRLMODE_FD_TDC_MASK
  can: netlink: replace tabulation by space in assignment
====================

Link: https://patch.msgid.link/20250610094933.1593081-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10 15:44:47 -07:00
Willem de Bruijn
561939ed44 net: remove unused sock_enable_timestamps
This function was introduced in commit 783da70e83 ("net: add
sock_enable_timestamps"), with one caller in rxrpc.

That only caller was removed in commit 7903d4438b ("rxrpc: Don't use
received skbuff timestamps").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250609153254.3504909-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10 14:43:40 -07:00
Jiayuan Chen
76be5fae32 bpf, sockmap: Fix psock incorrectly pointing to sk
We observed an issue from the latest selftest: sockmap_redir where
sk_psock(psock->sk) != psock in the backlog. The root cause is the special
behavior in sockmap_redir - it frequently performs map_update() and
map_delete() on the same socket. During map_update(), we create a new
psock and during map_delete(), we eventually free the psock via rcu_work
in sk_psock_drop(). However, pending workqueues might still exist and not
be processed yet. If users immediately perform another map_update(), a new
psock will be allocated for the same sk, resulting in two psocks pointing
to the same sk.

When the pending workqueue is later triggered, it uses the old psock to
access sk for I/O operations, which is incorrect.

Timing Diagram:

cpu0                        cpu1

map_update(sk):
    sk->psock = psock1
    psock1->sk = sk
map_delete(sk):
   rcu_work_free(psock1)

map_update(sk):
    sk->psock = psock2
    psock2->sk = sk
                            workqueue:
                                wakeup with psock1, but the sk of psock1
                                doesn't belong to psock1
rcu_handler:
    clean psock1
    free(psock1)

Previously, we used reference counting to address the concurrency issue
between backlog and sock_map_close(). This logic remains necessary as it
prevents the sk from being freed while processing the backlog. But this
patch prevents pending backlogs from using a psock after it has been
stopped.

Note: We cannot call cancel_delayed_work_sync() in map_delete() since this
might be invoked in BPF context by BPF helper, and the function may sleep.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20250609025908.79331-1-jiayuan.chen@linux.dev
2025-06-10 18:16:15 +02:00
Davide Caratti
81807451c2 can: add drop reasons in CAN protocols receive path
sock_queue_rcv_skb() can fail because of lack of memory resources: use
drop reasons and pass the receiving socket to the tracepoint, so that
it's possible to better locate/debug such events.

Tested with:

| # modprobe vcan echo=1
| # ip link add name vcan2 type vcan
| # ip link set dev vcan2 up
| # ./netlayer/tst-proc 1 &
| # bg
| # while true ; do perf record -e skb:kfree_skb -aR  -- \
| > ./raw/tst-raw-sendto vcan2 ; perf script ; done
| [...]
| tst-raw-sendto 10942 [000] 506428.431856: skb:kfree_skb: skbaddr=0xffff97cec38b4200 rx_sk=0xffff97cf0f75a800 protocol=12 location=raw_rcv+0x20e reason: SOCKET_RCVBUF

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/20250604160605.1005704-3-dcaratti@redhat.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-06-10 10:23:56 +02:00
Davide Caratti
127c49624a can: add drop reasons in the receive path of AF_CAN
Besides the existing pr_warn_once(), use skb drop reasons in case AF_CAN
layer drops non-conformant CAN{,FD,XL} frames, or conformant frames
received by "wrong" devices, so that it's possible to debug (and count)
such events using existing tracepoints:

| # perf record -e skb:kfree_skb -aR -- ./drv/canfdtest -v -g -l 1 vcan0
| # perf script
| [...]
| canfdtest  1123 [000]  3893.271264: skb:kfree_skb: skbaddr=0xffff975703c9f700 rx_sk=(nil) protocol=12 location=can_rcv+0x4b  reason: CAN_RX_INVALID_FRAME

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/20250604160605.1005704-2-dcaratti@redhat.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-06-10 10:23:30 +02:00
Jakub Kicinski
fdd9ebccfc bluetooth pull request for net:
- MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
  - MGMT: Protect mgmt_pending list with its own lock
  - hci_core: fix list_for_each_entry_rcu usage
  - btintel_pcie: Increase the tx and rx descriptor count
  - btintel_pcie: Reduce driver buffer posting to prevent race condition
  - btintel_pcie: Fix driver not posting maximum rx buffers
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmhB65gZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKXuaEACPXWNUOViPFPE85M1Y/VGA
 Hw4uDO9x25XySBk740NRT3qkYS8pWZa8SujQZa0ijqklrggosnz3q7QdwiRow5Cv
 CLqCZiuQDtekXV8K9xa66K8rt2iUxMDnQRzNW32Pe0OW6Xy2RFiYqC7ZVpFomXBj
 2vMj+aNRwbdzvKStEQTxWCISdCkP7XSuOdWS/wnAFyiSThgr4R8PByLQZ9P2J5xj
 KfLBs+QzwHCc1hGbO7odTVqyv+UN3v82aN2fmyusdgBYBJ9ymLMV1gpBm/B4oGI7
 /zXbU9bZWL+uis+pB3k9MQnaytc32v1ODFyqY8Ua1slE4Qzwz7OKB/8TP9MeOO1s
 MzzIYuAK2KJ6C5mxyIBRVMcbdX2GgiwVIXJBWesuqoZc0H1En+eSpoKNzfoX16Ul
 hMc8pCfvpKXaqo9KOJMldr5Yg4iKV83Am7zNUB1ka6TymM8NUx56gbF50tYDlOXY
 TGYpli8OQF4x5/tWRh9AE+DxgYa4sVrDiQncvnSMlmlyBGf/wCczCjaFwRlGM9Wu
 MZPi2zm0lwa1F6T358uOyJRbcFawaV39AGHo37SrCFOvPIKC+c6iTYqLHWLeq6V6
 mXlUn4BrTrt7TUqFpBIUcN0LOOLKgxr7Oa8UAhhCfn8LLsFvryuTEbNtxOqvFLQP
 4ZUyJFMjUnVAr5PMPjyJ3w==
 =VZN1
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2025-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
 - MGMT: Protect mgmt_pending list with its own lock
 - hci_core: fix list_for_each_entry_rcu usage
 - btintel_pcie: Increase the tx and rx descriptor count
 - btintel_pcie: Reduce driver buffer posting to prevent race condition
 - btintel_pcie: Fix driver not posting maximum rx buffers

* tag 'for-net-2025-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: MGMT: Protect mgmt_pending list with its own lock
  Bluetooth: MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
  Bluetooth: btintel_pcie: Reduce driver buffer posting to prevent race condition
  Bluetooth: btintel_pcie: Increase the tx and rx descriptor count
  Bluetooth: btintel_pcie: Fix driver not posting maximum rx buffers
  Bluetooth: hci_core: fix list_for_each_entry_rcu usage
====================

Link: https://patch.msgid.link/20250605191136.904411-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-09 15:47:30 -07:00
Eric Dumazet
82ffbe7776 net_sched: sch_sfq: fix a potential crash on gso_skb handling
SFQ has an assumption of always being able to queue at least one packet.

However, after the blamed commit, sch->q.len can be inflated by packets
in sch->gso_skb, and an enqueue() on an empty SFQ qdisc can be followed
by an immediate drop.

Fix sfq_drop() to properly clear q->tail in this situation.

Tested:

ip netns add lb
ip link add dev to-lb type veth peer name in-lb netns lb
ethtool -K to-lb tso off                 # force qdisc to requeue gso_skb
ip netns exec lb ethtool -K in-lb gro on # enable NAPI
ip link set dev to-lb up
ip -netns lb link set dev in-lb up
ip addr add dev to-lb 192.168.20.1/24
ip -netns lb addr add dev in-lb 192.168.20.2/24
tc qdisc replace dev to-lb root sfq limit 100

ip netns exec lb netserver

netperf -H 192.168.20.2 -l 100 &
netperf -H 192.168.20.2 -l 100 &
netperf -H 192.168.20.2 -l 100 &
netperf -H 192.168.20.2 -l 100 &

Fixes: a53851e2c3 ("net: sched: explicit locking in gso_cpu fallback")
Reported-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Closes: https://lore.kernel.org/netdev/9da42688-bfaa-4364-8797-e9271f3bdaef@hetzner-cloud.de/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250606165127.3629486-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-09 14:21:36 -07:00
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Linus Torvalds
2c7e4a2663 Including fixes from CAN, wireless, Bluetooth, and Netfilter.
Current release - regressions:
 
  - Revert "kunit: configs: Enable CONFIG_INIT_STACK_ALL_PATTERN
    in all_tests", makes kunit error out if compiler is old
 
  - wifi: iwlwifi: mvm: fix assert on suspend
 
  - rxrpc: fix return from none_validate_challenge()
 
 Current release - new code bugs:
 
  - ovpn: couple of fixes for socket cleanup and UDP-tunnel teardown
 
  - can: kvaser_pciefd: refine error prone echo_skb_max handling logic
 
  - fix net_devmem_bind_dmabuf() stub when DEVMEM not compiled
 
  - eth: airoha: fixes for config / accel in bridge mode
 
 Previous releases - regressions:
 
  - Bluetooth: hci_qca: move the SoC type check to the right place,
    fix GPIO integration
 
  - prevent a NULL deref in rtnl_create_link() after locking changes
 
  - fix udp gso skb_segment after pull from frag_list
 
  - hv_netvsc: fix potential deadlock in netvsc_vf_setxdp()
 
 Previous releases - always broken:
 
  - netfilter:
    - nf_nat: also check reverse tuple to obtain clashing entry
    - nf_set_pipapo_avx2: fix initial map fill (zeroing)
 
  - fix the helper for incremental update of packet checksums after
    modifying the IP address, used by ILA and BPF
 
  - eth: stmmac: prevent div by 0 when clock rate is misconfigured
 
  - eth: ice: fix Tx scheduler handling of XDP and changing queue count
 
  - eth: b53: fix support for the RGMII interface when delays configured
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmhBv5kACgkQMUZtbf5S
 Irs/DA/+PIh7a33iVcsGIcmWtpnGp+18id1tSLnYGUGx1cW6zxutPD8rb6BsAN84
 KR+XVsbMDUehIa10xPoF2L5mX5YujEiPSkjP8eE2KJKDLGpDtYNOyOWKT21yudnd
 4EVF5JQoEbWHrkHMKF97tla84QLd5fFtgsvejVeZtQYSIDOteNGfra4Jly8iiR+J
 i9k+HdB0CNEKVvvibQZjZ5CrkpmdNPmB9UoJ59bG15q2+vXdzOPm/CCNo//9ZQJB
 I8O40nu16msRRVA9nc2V/Tp98fTk9dnDpTSyWiBlNCut9g9ftx456Ew+tjobMRIT
 yeh+q9+1z3YHjGJB8P1FGmMZWK3tbrwyqjFGqpSjr7juucFok9kxAaRPqrQxga7H
 Yxq3RegeNqukEAV39ZE14TL765Jy+XXF1uTHhNBkUADlNJVKnZygSk78/Ut2nDvQ
 vkfoto+CfKny5qkSbTk8KKv1rZu3xwewoOjlcdkHlOBoouCjPOxTC7yxTZgUZB5c
 yap0jQsedJct4OAA+O7IGLCmf3KrJ0H32HbWEY68mpTEd+4Df5vAWiIi7vmVJmk3
 DX9JWmu5A5yjNMhOEsBQU98gkNw366aA/E8dr+lEfp3AoqDrmdbG3l8+qqhqYnb+
 nnL1sNiQH1griZwQBUROAhrtXnYlYsAsZi+cv23Q0hQiGIvIC2Q=
 =sRQt
 -----END PGP SIGNATURE-----

Merge tag 'net-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from CAN, wireless, Bluetooth, and Netfilter.

  Current release - regressions:

   - Revert "kunit: configs: Enable CONFIG_INIT_STACK_ALL_PATTERN in
     all_tests", makes kunit error out if compiler is old

   - wifi: iwlwifi: mvm: fix assert on suspend

   - rxrpc: fix return from none_validate_challenge()

  Current release - new code bugs:

   - ovpn: couple of fixes for socket cleanup and UDP-tunnel teardown

   - can: kvaser_pciefd: refine error prone echo_skb_max handling logic

   - fix net_devmem_bind_dmabuf() stub when DEVMEM not compiled

   - eth: airoha: fixes for config / accel in bridge mode

  Previous releases - regressions:

   - Bluetooth: hci_qca: move the SoC type check to the right place, fix
     GPIO integration

   - prevent a NULL deref in rtnl_create_link() after locking changes

   - fix udp gso skb_segment after pull from frag_list

   - hv_netvsc: fix potential deadlock in netvsc_vf_setxdp()

  Previous releases - always broken:

   - netfilter:
       - nf_nat: also check reverse tuple to obtain clashing entry
       - nf_set_pipapo_avx2: fix initial map fill (zeroing)

   - fix the helper for incremental update of packet checksums after
     modifying the IP address, used by ILA and BPF

   - eth:
       - stmmac: prevent div by 0 when clock rate is misconfigured
       - ice: fix Tx scheduler handling of XDP and changing queue count
       - eth: fix support for the RGMII interface when delays configured"

* tag 'net-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
  calipso: unlock rcu before returning -EAFNOSUPPORT
  seg6: Fix validation of nexthop addresses
  net: prevent a NULL deref in rtnl_create_link()
  net: annotate data-races around cleanup_net_task
  selftests: drv-net: tso: make bkg() wait for socat to quit
  selftests: drv-net: tso: fix the GRE device name
  selftests: drv-net: add configs for the TSO test
  wireguard: device: enable threaded NAPI
  netlink: specs: rt-link: decode ip6gre
  netlink: specs: rt-link: add missing byte-order properties
  net: wwan: mhi_wwan_mbim: use correct mux_id for multiplexing
  wifi: cfg80211/mac80211: correctly parse S1G beacon optional elements
  net: dsa: b53: do not touch DLL_IQQD on bcm53115
  net: dsa: b53: allow RGMII for bcm63xx RGMII ports
  net: dsa: b53: do not configure bcm63xx's IMP port interface
  net: dsa: b53: do not enable RGMII delay on bcm63xx
  net: dsa: b53: do not enable EEE on bcm63xx
  net: ti: icssg-prueth: Fix swapped TX stats for MII interfaces.
  selftests: netfilter: nft_nat.sh: add test for reverse clash with nat
  netfilter: nf_nat: also check reverse tuple to obtain clashing entry
  ...
2025-06-05 12:34:55 -07:00
Luiz Augusto von Dentz
6fe26f694c Bluetooth: MGMT: Protect mgmt_pending list with its own lock
This uses a mutex to protect from concurrent access of mgmt_pending
list which can cause crashes like:

==================================================================
BUG: KASAN: slab-use-after-free in hci_sock_get_channel+0x60/0x68 net/bluetooth/hci_sock.c:91
Read of size 2 at addr ffff0000c48885b2 by task syz.4.334/7318

CPU: 0 UID: 0 PID: 7318 Comm: syz.4.334 Not tainted 6.15.0-rc7-syzkaller-g187899f4124a #0 PREEMPT
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call trace:
 show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:466 (C)
 __dump_stack+0x30/0x40 lib/dump_stack.c:94
 dump_stack_lvl+0xd8/0x12c lib/dump_stack.c:120
 print_address_description+0xa8/0x254 mm/kasan/report.c:408
 print_report+0x68/0x84 mm/kasan/report.c:521
 kasan_report+0xb0/0x110 mm/kasan/report.c:634
 __asan_report_load2_noabort+0x20/0x2c mm/kasan/report_generic.c:379
 hci_sock_get_channel+0x60/0x68 net/bluetooth/hci_sock.c:91
 mgmt_pending_find+0x7c/0x140 net/bluetooth/mgmt_util.c:223
 pending_find net/bluetooth/mgmt.c:947 [inline]
 remove_adv_monitor+0x44/0x1a4 net/bluetooth/mgmt.c:5445
 hci_mgmt_cmd+0x780/0xc00 net/bluetooth/hci_sock.c:1712
 hci_sock_sendmsg+0x544/0xbb0 net/bluetooth/hci_sock.c:1832
 sock_sendmsg_nosec net/socket.c:712 [inline]
 __sock_sendmsg net/socket.c:727 [inline]
 sock_write_iter+0x25c/0x378 net/socket.c:1131
 new_sync_write fs/read_write.c:591 [inline]
 vfs_write+0x62c/0x97c fs/read_write.c:684
 ksys_write+0x120/0x210 fs/read_write.c:736
 __do_sys_write fs/read_write.c:747 [inline]
 __se_sys_write fs/read_write.c:744 [inline]
 __arm64_sys_write+0x7c/0x90 fs/read_write.c:744
 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
 invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767
 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600

Allocated by task 7037:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x40/0x78 mm/kasan/common.c:68
 kasan_save_alloc_info+0x44/0x54 mm/kasan/generic.c:562
 poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
 __kasan_kmalloc+0x9c/0xb4 mm/kasan/common.c:394
 kasan_kmalloc include/linux/kasan.h:260 [inline]
 __do_kmalloc_node mm/slub.c:4327 [inline]
 __kmalloc_noprof+0x2fc/0x4c8 mm/slub.c:4339
 kmalloc_noprof include/linux/slab.h:909 [inline]
 sk_prot_alloc+0xc4/0x1f0 net/core/sock.c:2198
 sk_alloc+0x44/0x3ac net/core/sock.c:2254
 bt_sock_alloc+0x4c/0x300 net/bluetooth/af_bluetooth.c:148
 hci_sock_create+0xa8/0x194 net/bluetooth/hci_sock.c:2202
 bt_sock_create+0x14c/0x24c net/bluetooth/af_bluetooth.c:132
 __sock_create+0x43c/0x91c net/socket.c:1541
 sock_create net/socket.c:1599 [inline]
 __sys_socket_create net/socket.c:1636 [inline]
 __sys_socket+0xd4/0x1c0 net/socket.c:1683
 __do_sys_socket net/socket.c:1697 [inline]
 __se_sys_socket net/socket.c:1695 [inline]
 __arm64_sys_socket+0x7c/0x94 net/socket.c:1695
 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
 invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767
 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600

Freed by task 6607:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x40/0x78 mm/kasan/common.c:68
 kasan_save_free_info+0x58/0x70 mm/kasan/generic.c:576
 poison_slab_object mm/kasan/common.c:247 [inline]
 __kasan_slab_free+0x68/0x88 mm/kasan/common.c:264
 kasan_slab_free include/linux/kasan.h:233 [inline]
 slab_free_hook mm/slub.c:2380 [inline]
 slab_free mm/slub.c:4642 [inline]
 kfree+0x17c/0x474 mm/slub.c:4841
 sk_prot_free net/core/sock.c:2237 [inline]
 __sk_destruct+0x4f4/0x760 net/core/sock.c:2332
 sk_destruct net/core/sock.c:2360 [inline]
 __sk_free+0x320/0x430 net/core/sock.c:2371
 sk_free+0x60/0xc8 net/core/sock.c:2382
 sock_put include/net/sock.h:1944 [inline]
 mgmt_pending_free+0x88/0x118 net/bluetooth/mgmt_util.c:290
 mgmt_pending_remove+0xec/0x104 net/bluetooth/mgmt_util.c:298
 mgmt_set_powered_complete+0x418/0x5cc net/bluetooth/mgmt.c:1355
 hci_cmd_sync_work+0x204/0x33c net/bluetooth/hci_sync.c:334
 process_one_work+0x7e8/0x156c kernel/workqueue.c:3238
 process_scheduled_works kernel/workqueue.c:3319 [inline]
 worker_thread+0x958/0xed8 kernel/workqueue.c:3400
 kthread+0x5fc/0x75c kernel/kthread.c:464
 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:847

Fixes: a380b6cff1 ("Bluetooth: Add generic mgmt helper API")
Closes: https://syzkaller.appspot.com/bug?extid=0a7039d5d9986ff4ecec
Closes: https://syzkaller.appspot.com/bug?extid=cc0cc52e7f43dc9e6df1
Reported-by: syzbot+0a7039d5d9986ff4ecec@syzkaller.appspotmail.com
Tested-by: syzbot+0a7039d5d9986ff4ecec@syzkaller.appspotmail.com
Tested-by: syzbot+cc0cc52e7f43dc9e6df1@syzkaller.appspotmail.com
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-05 14:54:57 -04:00
Luiz Augusto von Dentz
e6ed54e86a Bluetooth: MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
This reworks MGMT_OP_REMOVE_ADV_MONITOR to not use mgmt_pending_add to
avoid crashes like bellow:

==================================================================
BUG: KASAN: slab-use-after-free in mgmt_remove_adv_monitor_complete+0xe5/0x540 net/bluetooth/mgmt.c:5406
Read of size 8 at addr ffff88801c53f318 by task kworker/u5:5/5341

CPU: 0 UID: 0 PID: 5341 Comm: kworker/u5:5 Not tainted 6.15.0-syzkaller-10402-g4cb6c8af8591 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Workqueue: hci0 hci_cmd_sync_work
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:408 [inline]
 print_report+0xd2/0x2b0 mm/kasan/report.c:521
 kasan_report+0x118/0x150 mm/kasan/report.c:634
 mgmt_remove_adv_monitor_complete+0xe5/0x540 net/bluetooth/mgmt.c:5406
 hci_cmd_sync_work+0x261/0x3a0 net/bluetooth/hci_sync.c:334
 process_one_work kernel/workqueue.c:3238 [inline]
 process_scheduled_works+0xade/0x17b0 kernel/workqueue.c:3321
 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402
 kthread+0x711/0x8a0 kernel/kthread.c:464
 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

Allocated by task 5987:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
 poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394
 kasan_kmalloc include/linux/kasan.h:260 [inline]
 __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4358
 kmalloc_noprof include/linux/slab.h:905 [inline]
 kzalloc_noprof include/linux/slab.h:1039 [inline]
 mgmt_pending_new+0x65/0x240 net/bluetooth/mgmt_util.c:252
 mgmt_pending_add+0x34/0x120 net/bluetooth/mgmt_util.c:279
 remove_adv_monitor+0x103/0x1b0 net/bluetooth/mgmt.c:5454
 hci_mgmt_cmd+0x9c9/0xef0 net/bluetooth/hci_sock.c:1719
 hci_sock_sendmsg+0x6ca/0xef0 net/bluetooth/hci_sock.c:1839
 sock_sendmsg_nosec net/socket.c:712 [inline]
 __sock_sendmsg+0x219/0x270 net/socket.c:727
 sock_write_iter+0x258/0x330 net/socket.c:1131
 new_sync_write fs/read_write.c:593 [inline]
 vfs_write+0x548/0xa90 fs/read_write.c:686
 ksys_write+0x145/0x250 fs/read_write.c:738
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 5989:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576
 poison_slab_object mm/kasan/common.c:247 [inline]
 __kasan_slab_free+0x62/0x70 mm/kasan/common.c:264
 kasan_slab_free include/linux/kasan.h:233 [inline]
 slab_free_hook mm/slub.c:2380 [inline]
 slab_free mm/slub.c:4642 [inline]
 kfree+0x18e/0x440 mm/slub.c:4841
 mgmt_pending_foreach+0xc9/0x120 net/bluetooth/mgmt_util.c:242
 mgmt_index_removed+0x10d/0x2f0 net/bluetooth/mgmt.c:9366
 hci_sock_bind+0xbe9/0x1000 net/bluetooth/hci_sock.c:1314
 __sys_bind_socket net/socket.c:1810 [inline]
 __sys_bind+0x2c3/0x3e0 net/socket.c:1841
 __do_sys_bind net/socket.c:1846 [inline]
 __se_sys_bind net/socket.c:1844 [inline]
 __x64_sys_bind+0x7a/0x90 net/socket.c:1844
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 66bd095ab5 ("Bluetooth: advmon offload MSFT remove monitor")
Closes: https://syzkaller.appspot.com/bug?extid=feb0dc579bbe30a13190
Reported-by: syzbot+feb0dc579bbe30a13190@syzkaller.appspotmail.com
Tested-by: syzbot+feb0dc579bbe30a13190@syzkaller.appspotmail.com
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-05 14:54:35 -04:00
Pauli Virtanen
308a3a8ce8 Bluetooth: hci_core: fix list_for_each_entry_rcu usage
Releasing + re-acquiring RCU lock inside list_for_each_entry_rcu() loop
body is not correct.

Fix by taking the update-side hdev->lock instead.

Fixes: c7eaf80bfb ("Bluetooth: Fix hci_link_tx_to RCU lock usage")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-05 14:53:13 -04:00
Eric Dumazet
3cae906e1a calipso: unlock rcu before returning -EAFNOSUPPORT
syzbot reported that a recent patch forgot to unlock rcu
in the error path.

Adopt the convention that netlbl_conn_setattr() is already using.

Fixes: 6e9f2df1c5 ("calipso: Don't call calipso functions for AF_INET sk.")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://patch.msgid.link/20250604133826.1667664-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-05 08:03:38 -07:00
Ido Schimmel
7632fedb26 seg6: Fix validation of nexthop addresses
The kernel currently validates that the length of the provided nexthop
address does not exceed the specified length. This can lead to the
kernel reading uninitialized memory if user space provided a shorter
length than the specified one.

Fix by validating that the provided length exactly matches the specified
one.

Fixes: d1df6fd8a1 ("ipv6: sr: define core operations for seg6local lightweight tunnel")
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250604113252.371528-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-05 08:03:17 -07:00
Eric Dumazet
feafc73f3e net: prevent a NULL deref in rtnl_create_link()
At the time rtnl_create_link() is running, dev->netdev_ops is NULL,
we must not use netdev_lock_ops() or risk a NULL deref if
CONFIG_NET_SHAPER is defined.

Use netif_set_group() instead of dev_set_group().

 RIP: 0010:netdev_need_ops_lock include/net/netdev_lock.h:33 [inline]
 RIP: 0010:netdev_lock_ops include/net/netdev_lock.h:41 [inline]
 RIP: 0010:dev_set_group+0xc0/0x230 net/core/dev_api.c:82
Call Trace:
 <TASK>
  rtnl_create_link+0x748/0xd10 net/core/rtnetlink.c:3674
  rtnl_newlink_create+0x25c/0xb00 net/core/rtnetlink.c:3813
  __rtnl_newlink net/core/rtnetlink.c:3940 [inline]
  rtnl_newlink+0x16d6/0x1c70 net/core/rtnetlink.c:4055
  rtnetlink_rcv_msg+0x7cf/0xb70 net/core/rtnetlink.c:6944
  netlink_rcv_skb+0x208/0x470 net/netlink/af_netlink.c:2534
  netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
  netlink_unicast+0x75b/0x8d0 net/netlink/af_netlink.c:1339
  netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883
  sock_sendmsg_nosec net/socket.c:712 [inline]

Reported-by: syzbot+9fc858ba0312b42b577e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6840265f.a00a0220.d4325.0009.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 7e4d784f58 ("net: hold netdev instance lock during rtnetlink operations")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250604105815.1516973-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-05 08:03:00 -07:00
Eric Dumazet
535caaca92 net: annotate data-races around cleanup_net_task
from_cleanup_net() reads cleanup_net_task locklessly.

Add READ_ONCE()/WRITE_ONCE() annotations to avoid
a potential KCSAN warning, even if the race is harmless.

Fixes: 0734d7c3d9 ("net: expedite synchronize_net() for cleanup_net()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250604093928.1323333-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-05 08:02:26 -07:00
Paolo Abeni
4d401c5534 Couple of quick fixes:
- iwlwifi/iwlmld crash on certain error paths
  - iwlwifi/iwlmld regulatory data mixup
  - iwlwifi/iwlmld suspend/resume fix
  - iwlwifi MSI (without -X) fix
  - cfg80211/mac80211 S1G parsing fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmhBaO0ACgkQ10qiO8sP
 aABINw/9ExZdFxO6EDoiNMZbAOjMdNRxt64s7hCjy/CWz5f+qPCmS09mk+LBi+vx
 +1sAIoDIg0YfYjEt+yB7/+dAU66gX85y0wZfGVJPuMfzOvabzCnsNS8++yFIBdLr
 C24I+MRsJU3q+CMlSAkeyCN4jkKTRlVUJtQB1iteK/MYOUxvkRwuZAxfNMHVRq0L
 VWY+8nuzDeYMllnomznsOL4RhcIUXzcEkJA9Vd7ae3KEuw5gicOBiDh+sDl7hYVE
 bW9zDrcXIT2CUrF9ZVErIiUKuCJONTQzGrs1DAFSvnhY3C3IpV4L7UiQwaUChaZv
 THr0Ui4S059mDMZeoX+06RA1/+3nlf8J+yLTBS0JzeyCABxIgEyM55MtnYq6vQk9
 SsHnWLdwMDxVIAveZkP1ukCF7LhY6TybT8AibrmOXvAeNzjTq3D+Yc66BRyBTFq7
 P73wG4gcFd8nz9p5ZiRjqN0VTLi70N0ERUmWEv9LqURD9tZCWqkp6rNX3Dib9ZBV
 knj2PaMdHPbv1ovmqmVGv82mxE1Ke/zF7uaLi6bK/h03ST3cT7wUZCmNL46XorOo
 pQsmSr0E9Z1UlEFVJjDnSiI/m8YaDUKBWUBK72cOy6GmI7KKMQiG0WgZhRdgoi0P
 IdPqg6fl9EFPznY7exghDafV+Op84g+SDqpi4PslXbBVRze7KJk=
 =Ai0E
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2025-06-05' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Couple of quick fixes:
 - iwlwifi/iwlmld crash on certain error paths
 - iwlwifi/iwlmld regulatory data mixup
 - iwlwifi/iwlmld suspend/resume fix
 - iwlwifi MSI (without -X) fix
 - cfg80211/mac80211 S1G parsing fixes

* tag 'wireless-2025-06-05' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: cfg80211/mac80211: correctly parse S1G beacon optional elements
  wifi: iwlwifi: mld: Move regulatory domain initialization
  wifi: iwlwifi: pcie: fix non-MSIX handshake register
  wifi: iwlwifi: mld: avoid panic on init failure
  wifi: iwlwifi: mvm: fix assert on suspend
====================

Link: https://patch.msgid.link/20250605095443.17874-6-johannes@sipsolutions.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-05 15:19:33 +02:00
Lachlan Hodges
1e1f706fc2 wifi: cfg80211/mac80211: correctly parse S1G beacon optional elements
S1G beacons are not traditional beacons but a type of extension frame.
Extension frames contain the frame control and duration fields, followed
by zero or more optional fields before the frame body. These optional
fields are distinct from the variable length elements.

The presence of optional fields is indicated in the frame control field.
To correctly locate the elements offset, the frame control must be parsed
to identify which optional fields are present. Currently, mac80211 parses
S1G beacons based on fixed assumptions about the frame layout, without
inspecting the frame control field. This can result in incorrect offsets
to the "variable" portion of the frame.

Properly parse S1G beacon frames by using the field lengths defined in
IEEE 802.11-2024, section 9.3.4.3, ensuring that the elements offset is
calculated accurately.

Fixes: 9eaffe5078 ("cfg80211: convert S1G beacon to scan results")
Fixes: cd418ba63f ("mac80211: convert S1G beacon to scan results")
Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com>
Link: https://patch.msgid.link/20250603053538.468562-1-lachlan.hodges@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-05 11:32:16 +02:00
Florian Westphal
50d9ce9679 netfilter: nf_nat: also check reverse tuple to obtain clashing entry
The logic added in the blamed commit was supposed to only omit nat source
port allocation if neither the existing nor the new entry are subject to
NAT.

However, its not enough to lookup the conntrack based on the proposed
tuple, we must also check the reverse direction.

Otherwise there are esoteric cases where the collision is in the reverse
direction because that colliding connection has a port rewrite, but the
new entry doesn't.  In this case, we only check the new entry and then
erronously conclude that no clash exists anymore.

 The existing (udp) tuple is:
  a:p -> b:P, with nat translation to s:P, i.e. pure daddr rewrite,
  reverse tuple in conntrack table is s:P -> a:p.

When another UDP packet is sent directly to s, i.e. a:p->s:P, this is
correctly detected as a colliding entry: tuple is taken by existing reply
tuple in reverse direction.

But the colliding conntrack is only searched for with unreversed
direction, and we can't find such entry matching a:p->s:P.

The incorrect conclusion is that the clashing entry has timed out and
that no port address translation is required.

Such conntrack will then be discarded at nf_confirm time because the
proposed reverse direction clashes with an existing mapping in the
conntrack table.

Search for the reverse tuple too, this will then check the NAT bits of
the colliding entry and triggers port reallocation.

Followp patch extends nft_nat.sh selftest to cover this scenario.

The IPS_SEQ_ADJUST change is also a bug fix:
Instead of checking for SEQ_ADJ this tested for SEEN_REPLY and ASSURED
by accident -- _BIT is only for use with the test_bit() API.

This bug has little consequence in practice, because the sequence number
adjustments are only useful for TCP which doesn't support clash resolution.

The existing test case (conntrack_reverse_clash.sh) exercise a race
condition path (parallel conntrack creation on different CPUs), so
the colliding entries have neither SEEN_REPLY nor ASSURED set.

Thanks to Yafang Shao and Shaun Brady for an initial investigation
of this bug.

Fixes: d8f84a9bc7 ("netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash")
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1795
Reported-by: Yafang Shao <laoar.shao@gmail.com>
Reported-by: Shaun Brady <brady.1345@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-06-05 10:50:05 +02:00
Florian Westphal
ea77c397bf netfilter: nf_set_pipapo_avx2: fix initial map fill
If the first field doesn't cover the entire start map, then we must zero
out the remainder, else we leak those bits into the next match round map.

The early fix was incomplete and did only fix up the generic C
implementation.

A followup patch adds a test case to nft_concat_range.sh.

Fixes: 791a615b7a ("netfilter: nf_set_pipapo: fix initial map fill")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-06-05 10:49:58 +02:00
Linus Torvalds
5abc7438f1 NFS Clent Updates for Linux 6.16
New Features:
   * Implement the Sunrpc rfc2203 rpcsec_gss sequence number cache
   * Add support for FALLOC_FL_ZERO_RANGE on NFS v4.2
   * Add a localio sysfs attribute
 
 Stable Fixes:
   * Fix double-unlock bug in nfs_return_empty_folio()
   * Don't check for OPEN feature support in v4.1
   * Always probe for LOCALIO support asynchronously
   * Prevent hang on NFS mounts with xprtsec=[m]tls
 
 Other Bugfixes:
   * xattr handlers should check for absent nfs filehandles
   * Fix setattr caching of TIME_[MODIFY|ACCESS]_SET when timestamps are delegated
   * Fix listxattr to return selinux security labels
   * Connect to NFSv3 DS using TLS if MDS connection uses TLS
   * Clear SB_RDONLY before getting a superblock, and ignore when remounting
   * Fix incorrect handling of NFS error codes in nfs4_do_mkdir()
   * Various nfs_localio fixes from Neil Brown that include fixing an
       rcu compilation error found by older gcc versions.
   * Update stats on flexfiles pNFS DSes when receiving NFS4ERR_DELAY
 
 Cleanups:
   * Add a refcount tracker for struct net in the nfs_client
   * Allow FREE_STATEID to clean up delegations
   * Always set NLINK even if the server doesn't support it
   * Cleanups to the NFS folio writeback code
   * Remove dead code from xs_tcp_tls_setup_socket()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmg/YkAACgkQ18tUv7Cl
 QOuGpQ/+OuG/xkVX6j7FerUcdbVhcZ+5jDUKC0cNe6EeFeFRjgqsdFB0uqH+AgJh
 DlxEJuXTMq+9mcptl0rjrOn0tj7dlTpgZowp3kWdK3bX1zSI2jBEJjnz3xVzjBQx
 3lbmF/UAIaHv5bPVc9aF8mioaj5DSRKWTBLTg7iOM1ol1DqgHK/M0q2D7d2n1yB4
 WYGI7LlAWSBGV4PvEkhHW6PwVPDSqECPBvIxd1obq8TSNl+YZlmVxCoJ99+zVqWf
 dvaDOwfs5x+YEQH/+N/XWdc38QiCGfu7H79qGHShWB8t/KT4axxmjVs2fT7xtUsv
 yN3fb77rlFOCJaPLRF549/4EJqHYMWmFDKIMUZ7YC1vEBCG4B1kQUqarA5eCbsAi
 s/rxBs1VNKeev/RecDaViAeH3XZoVU1rNyIBJjOuWgNlC5wnbF+An3zE0m8MAXxO
 Vh7wQSH3GZEY+VCR6ljwLhIv6+tvSVQxEZKUUjfVQXp5UuNwN3wKa+sW6li+FBl6
 uV6lJcmdUffrurNhvSghIiSQGDkerHUVhSltgtj5FnmRp/AM95Z850t5a7qqc7Cv
 duks9siLLaeC4K5W+AOcKLWXho1dJMIPWUej3ErCiHWnA20QiNXsQN4QoimkDKqf
 9SYdcl6UECqV5MzIa/L7cW96S3K0acrq+8ofJCjN3A8M0pcTGgU=
 =5DFQ
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.16-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS clent updates from Anna Schumaker:
 "New Features:

   - Implement the Sunrpc rfc2203 rpcsec_gss sequence number cache

   - Add support for FALLOC_FL_ZERO_RANGE on NFS v4.2

   - Add a localio sysfs attribute

  Stable Fixes:

   - Fix double-unlock bug in nfs_return_empty_folio()

   - Don't check for OPEN feature support in v4.1

   - Always probe for LOCALIO support asynchronously

   - Prevent hang on NFS mounts with xprtsec=[m]tls

  Other Bugfixes:

   - xattr handlers should check for absent nfs filehandles

   - Fix setattr caching of TIME_[MODIFY|ACCESS]_SET when timestamps are
     delegated

   - Fix listxattr to return selinux security labels

   - Connect to NFSv3 DS using TLS if MDS connection uses TLS

   - Clear SB_RDONLY before getting a superblock, and ignore when
     remounting

   - Fix incorrect handling of NFS error codes in nfs4_do_mkdir()

   - Various nfs_localio fixes from Neil Brown that include fixing an
     rcu compilation error found by older gcc versions.

   - Update stats on flexfiles pNFS DSes when receiving NFS4ERR_DELAY

  Cleanups:

   - Add a refcount tracker for struct net in the nfs_client

   - Allow FREE_STATEID to clean up delegations

   - Always set NLINK even if the server doesn't support it

   - Cleanups to the NFS folio writeback code

   - Remove dead code from xs_tcp_tls_setup_socket()"

* tag 'nfs-for-6.16-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (30 commits)
  flexfiles/pNFS: update stats on NFS4ERR_DELAY for v4.1 DSes
  nfs_localio: change nfsd_file_put_local() to take a pointer to __rcu pointer
  nfs_localio: protect race between nfs_uuid_put() and nfs_close_local_fh()
  nfs_localio: duplicate nfs_close_local_fh()
  nfs_localio: simplify interface to nfsd for getting nfsd_file
  nfs_localio: always hold nfsd net ref with nfsd_file ref
  nfs_localio: use cmpxchg() to install new nfs_file_localio
  SUNRPC: Remove dead code from xs_tcp_tls_setup_socket()
  SUNRPC: Prevent hang on NFS mount with xprtsec=[m]tls
  nfs: fix incorrect handling of large-number NFS errors in nfs4_do_mkdir()
  nfs: ignore SB_RDONLY when remounting nfs
  nfs: clear SB_RDONLY before getting superblock
  NFS: always probe for LOCALIO support asynchronously
  pnfs/flexfiles: connect to NFSv3 DS using TLS if MDS connection uses TLS
  NFS: add localio to sysfs
  nfs: use writeback_iter directly
  nfs: refactor nfs_do_writepage
  nfs: don't return AOP_WRITEPAGE_ACTIVATE from nfs_do_writepage
  nfs: fold nfs_page_async_flush into nfs_do_writepage
  NFSv4: Always set NLINK even if the server doesn't support it
  ...
2025-06-03 16:13:32 -07:00
Sabrina Dubroca
7eb11c0ab7 xfrm: state: use a consistent pcpu_id in xfrm_state_find
If we get preempted during xfrm_state_find, we could run
xfrm_state_look_at using a different pcpu_id than the one
xfrm_state_find saw. This could lead to ignoring states that should
have matched, and triggering acquires on a CPU that already has a pcpu
state.

    xfrm_state_find starts on CPU1
    pcpu_id = 1
    lookup starts
    <preemption, we're now on CPU2>
    xfrm_state_look_at pcpu_id = 2
       finds a state
found:
    best->pcpu_num != pcpu_id (2 != 1)
    if (!x && !error && !acquire_in_progress) {
        ...
        xfrm_state_alloc
        xfrm_init_tempstate
        ...

This can be avoided by passing the original pcpu_id down to all
xfrm_state_look_at() calls.

Also switch to raw_smp_processor_id, disabling preempting just to
re-enable it immediately doesn't really make sense.

Fixes: 1ddf9916ac ("xfrm: Add support for per cpu xfrm state handling.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-06-03 15:39:57 +02:00
Sabrina Dubroca
94d077c331 xfrm: state: initialize state_ptrs earlier in xfrm_state_find
In case of preemption, xfrm_state_look_at will find a different
pcpu_id and look up states for that other CPU. If we matched a state
for CPU2 in the state_cache while the lookup started on CPU1, we will
jump to "found", but the "best" state that we got will be ignored and
we will enter the "acquire" block. This block uses state_ptrs, which
isn't initialized at this point.

Let's initialize state_ptrs just after taking rcu_read_lock. This will
also prevent a possible misuse in the future, if someone adjusts this
function.

Reported-by: syzbot+7ed9d47e15e88581dc5b@syzkaller.appspotmail.com
Fixes: e952837f3d ("xfrm: state: fix out-of-bounds read during lookup")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-06-03 15:39:48 +02:00
Linus Torvalds
0fb34422b5 vfs-6.16-rc1.netfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPUAAKCRCRxhvAZXjc
 ouMEAQCrviYPG/WMtPTH7nBIbfVQTfNEXt/TvN7u7OjXb+RwRAEAwe9tLy4GrS/t
 GuvUPWAthbhs77LTvxj6m3Gf49BOVgQ=
 =6FqN
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull netfs updates from Christian Brauner:

 - The main API document has been extensively updated/rewritten

 - Fix an oops in write-retry due to mis-resetting the I/O iterator

 - Fix the recording of transferred bytes for short DIO reads

 - Fix a request's work item to not require a reference, thereby
   avoiding the need to get rid of it in BH/IRQ context

 - Fix waiting and waking to be consistent about the waitqueue used

 - Remove NETFS_SREQ_SEEK_DATA_READ, NETFS_INVALID_WRITE,
   NETFS_ICTX_WRITETHROUGH, NETFS_READ_HOLE_CLEAR,
   NETFS_RREQ_DONT_UNLOCK_FOLIOS, and NETFS_RREQ_BLOCKED

 - Reorder structs to eliminate holes

 - Remove netfs_io_request::ractl

 - Only provide proc_link field if CONFIG_PROC_FS=y

 - Remove folio_queue::marks3

 - Fix undifferentiation of DIO reads from unbuffered reads

* tag 'vfs-6.16-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  netfs: Fix undifferentiation of DIO reads from unbuffered reads
  netfs: Fix wait/wake to be consistent about the waitqueue used
  netfs: Fix the request's work item to not require a ref
  netfs: Fix setting of transferred bytes with short DIO reads
  netfs: Fix oops in write-retry from mis-resetting the subreq iterator
  fs/netfs: remove unused flag NETFS_RREQ_BLOCKED
  fs/netfs: remove unused flag NETFS_RREQ_DONT_UNLOCK_FOLIOS
  folio_queue: remove unused field `marks3`
  fs/netfs: declare field `proc_link` only if CONFIG_PROC_FS=y
  fs/netfs: remove `netfs_io_request.ractl`
  fs/netfs: reorder struct fields to eliminate holes
  fs/netfs: remove unused enum choice NETFS_READ_HOLE_CLEAR
  fs/netfs: remove unused flag NETFS_ICTX_WRITETHROUGH
  fs/netfs: remove unused source NETFS_INVALID_WRITE
  fs/netfs: remove unused flag NETFS_SREQ_SEEK_DATA_READ
2025-06-02 15:04:06 -07:00
Shiming Cheng
3382a1ed7f net: fix udp gso skb_segment after pull from frag_list
Commit a1e40ac5b5 ("net: gso: fix udp gso fraglist segmentation after
pull from frag_list") detected invalid geometry in frag_list skbs and
redirects them from skb_segment_list to more robust skb_segment. But some
packets with modified geometry can also hit bugs in that code. We don't
know how many such cases exist. Addressing each one by one also requires
touching the complex skb_segment code, which risks introducing bugs for
other types of skbs. Instead, linearize all these packets that fail the
basic invariants on gso fraglist skbs. That is more robust.

If only part of the fraglist payload is pulled into head_skb, it will
always cause exception when splitting skbs by skb_segment. For detailed
call stack information, see below.

Valid SKB_GSO_FRAGLIST skbs
- consist of two or more segments
- the head_skb holds the protocol headers plus first gso_size
- one or more frag_list skbs hold exactly one segment
- all but the last must be gso_size

Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can
modify fraglist skbs, breaking these invariants.

In extreme cases they pull one part of data into skb linear. For UDP,
this  causes three payloads with lengths of (11,11,10) bytes were
pulled tail to become (12,10,10) bytes.

The skbs no longer meets the above SKB_GSO_FRAGLIST conditions because
payload was pulled into head_skb, it needs to be linearized before pass
to regular skb_segment.

    skb_segment+0xcd0/0xd14
    __udp_gso_segment+0x334/0x5f4
    udp4_ufo_fragment+0x118/0x15c
    inet_gso_segment+0x164/0x338
    skb_mac_gso_segment+0xc4/0x13c
    __skb_gso_segment+0xc4/0x124
    validate_xmit_skb+0x9c/0x2c0
    validate_xmit_skb_list+0x4c/0x80
    sch_direct_xmit+0x70/0x404
    __dev_queue_xmit+0x64c/0xe5c
    neigh_resolve_output+0x178/0x1c4
    ip_finish_output2+0x37c/0x47c
    __ip_finish_output+0x194/0x240
    ip_finish_output+0x20/0xf4
    ip_output+0x100/0x1a0
    NF_HOOK+0xc4/0x16c
    ip_forward+0x314/0x32c
    ip_rcv+0x90/0x118
    __netif_receive_skb+0x74/0x124
    process_backlog+0xe8/0x1a4
    __napi_poll+0x5c/0x1f8
    net_rx_action+0x154/0x314
    handle_softirqs+0x154/0x4b8

    [118.376811] [C201134] rxq0_pus: [name:bug&]kernel BUG at net/core/skbuff.c:4278!
    [118.376829] [C201134] rxq0_pus: [name:traps&]Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
    [118.470774] [C201134] rxq0_pus: [name:mrdump&]Kernel Offset: 0x178cc00000 from 0xffffffc008000000
    [118.470810] [C201134] rxq0_pus: [name:mrdump&]PHYS_OFFSET: 0x40000000
    [118.470827] [C201134] rxq0_pus: [name:mrdump&]pstate: 60400005 (nZCv daif +PAN -UAO)
    [118.470848] [C201134] rxq0_pus: [name:mrdump&]pc : [0xffffffd79598aefc] skb_segment+0xcd0/0xd14
    [118.470900] [C201134] rxq0_pus: [name:mrdump&]lr : [0xffffffd79598a5e8] skb_segment+0x3bc/0xd14
    [118.470928] [C201134] rxq0_pus: [name:mrdump&]sp : ffffffc008013770

Fixes: a1e40ac5b5 ("gso: fix udp gso fraglist segmentation after pull from frag_list")
Signed-off-by: Shiming Cheng <shiming.cheng@mediatek.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-06-02 12:41:33 +01:00
Matthias Schiffer
7dc284702b batman-adv: store hard_iface as iflink private data
By passing the hard_iface to netdev_master_upper_dev_link() as private
data, we can iterate over hardifs of a mesh interface more efficiently
using netdev_for_each_lower_private*() (instead of iterating over the
global hardif list). In addition, this will enable resolving a hardif
from its netdev using netdev_lower_dev_get_private() and getting rid of
the global list altogether in the following patches.

A similar approach can be seen in the bonding driver.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-05-31 10:41:11 +02:00
Simon Wunderlich
2b05db6b8a batman-adv: Start new development cycle
This version will contain all the (major or even only minor) changes for
Linux 6.17.

The version number isn't a semantic version number with major and minor
information. It is just encoding the year of the expected publishing as
Linux -rc1 and the number of published versions this year (starting at 0).

Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-05-31 10:37:44 +02:00
Paul Chaignon
ead7f9b8de bpf: Fix L4 csum update on IPv6 in CHECKSUM_COMPLETE
In Cilium, we use bpf_csum_diff + bpf_l4_csum_replace to, among other
things, update the L4 checksum after reverse SNATing IPv6 packets. That
use case is however not currently supported and leads to invalid
skb->csum values in some cases. This patch adds support for IPv6 address
changes in bpf_l4_csum_update via a new flag.

When calling bpf_l4_csum_replace in Cilium, it ends up calling
inet_proto_csum_replace_by_diff:

    1:  void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb,
    2:                                       __wsum diff, bool pseudohdr)
    3:  {
    4:      if (skb->ip_summed != CHECKSUM_PARTIAL) {
    5:          csum_replace_by_diff(sum, diff);
    6:          if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr)
    7:              skb->csum = ~csum_sub(diff, skb->csum);
    8:      } else if (pseudohdr) {
    9:          *sum = ~csum_fold(csum_add(diff, csum_unfold(*sum)));
    10:     }
    11: }

The bug happens when we're in the CHECKSUM_COMPLETE state. We've just
updated one of the IPv6 addresses. The helper now updates the L4 header
checksum on line 5. Next, it updates skb->csum on line 7. It shouldn't.

For an IPv6 packet, the updates of the IPv6 address and of the L4
checksum will cancel each other. The checksums are set such that
computing a checksum over the packet including its checksum will result
in a sum of 0. So the same is true here when we update the L4 checksum
on line 5. We'll update it as to cancel the previous IPv6 address
update. Hence skb->csum should remain untouched in this case.

The same bug doesn't affect IPv4 packets because, in that case, three
fields are updated: the IPv4 address, the IP checksum, and the L4
checksum. The change to the IPv4 address and one of the checksums still
cancel each other in skb->csum, but we're left with one checksum update
and should therefore update skb->csum accordingly. That's exactly what
inet_proto_csum_replace_by_diff does.

This special case for IPv6 L4 checksums is also described atop
inet_proto_csum_replace16, the function we should be using in this case.

This patch introduces a new bpf_l4_csum_replace flag, BPF_F_IPV6,
to indicate that we're updating the L4 checksum of an IPv6 packet. When
the flag is set, inet_proto_csum_replace_by_diff will skip the
skb->csum update.

Fixes: 7d672345ed ("bpf: add generic bpf_csum_diff helper")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/96a6bc3a443e6f0b21ff7b7834000e17fb549e05.1748509484.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-30 19:53:51 -07:00
Paul Chaignon
6043b794c7 net: Fix checksum update for ILA adj-transport
During ILA address translations, the L4 checksums can be handled in
different ways. One of them, adj-transport, consist in parsing the
transport layer and updating any found checksum. This logic relies on
inet_proto_csum_replace_by_diff and produces an incorrect skb->csum when
in state CHECKSUM_COMPLETE.

This bug can be reproduced with a simple ILA to SIR mapping, assuming
packets are received with CHECKSUM_COMPLETE:

  $ ip a show dev eth0
  14: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
      link/ether 62:ae:35:9e:0f:8d brd ff:ff:ff:ff:ff:ff link-netnsid 0
      inet6 3333:0:0:1::c078/64 scope global
         valid_lft forever preferred_lft forever
      inet6 fd00:10:244:1::c078/128 scope global nodad
         valid_lft forever preferred_lft forever
      inet6 fe80::60ae:35ff:fe9e:f8d/64 scope link proto kernel_ll
         valid_lft forever preferred_lft forever
  $ ip ila add loc_match fd00:10:244:1 loc 3333:0:0:1 \
      csum-mode adj-transport ident-type luid dev eth0

Then I hit [fd00:10:244:1::c078]:8000 with a server listening only on
[3333:0:0:1::c078]:8000. With the bug, the SYN packet is dropped with
SKB_DROP_REASON_TCP_CSUM after inet_proto_csum_replace_by_diff changed
skb->csum. The translation and drop are visible on pwru [1] traces:

  IFACE   TUPLE                                                        FUNC
  eth0:9  [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp)  ipv6_rcv
  eth0:9  [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp)  ip6_rcv_core
  eth0:9  [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp)  nf_hook_slow
  eth0:9  [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp)  inet_proto_csum_replace_by_diff
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     tcp_v6_early_demux
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ip6_route_input
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ip6_input
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ip6_input_finish
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ip6_protocol_deliver_rcu
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     raw6_local_deliver
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     ipv6_raw_deliver
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     tcp_v6_rcv
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     __skb_checksum_complete
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     kfree_skb_reason(SKB_DROP_REASON_TCP_CSUM)
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     skb_release_head_state
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     skb_release_data
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     skb_free_head
  eth0:9  [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp)     kfree_skbmem

This is happening because inet_proto_csum_replace_by_diff is updating
skb->csum when it shouldn't. The L4 checksum is updated such that it
"cancels" the IPv6 address change in terms of checksum computation, so
the impact on skb->csum is null.

Note this would be different for an IPv4 packet since three fields
would be updated: the IPv4 address, the IP checksum, and the L4
checksum. Two would cancel each other and skb->csum would still need
to be updated to take the L4 checksum change into account.

This patch fixes it by passing an ipv6 flag to
inet_proto_csum_replace_by_diff, to skip the skb->csum update if we're
in the IPv6 case. Note the behavior of the only other user of
inet_proto_csum_replace_by_diff, the BPF subsystem, is left as is in
this patch and fixed in the subsequent patch.

With the fix, using the reproduction from above, I can confirm
skb->csum is not touched by inet_proto_csum_replace_by_diff and the TCP
SYN proceeds to the application after the ILA translation.

Link: https://github.com/cilium/pwru [1]
Fixes: 65d7ab8de5 ("net: Identifier Locator Addressing module")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/b5539869e3550d46068504feb02d37653d939c0b.1748509484.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-30 19:53:51 -07:00
Saurabh Sengar
3ec5233049 hv_netvsc: fix potential deadlock in netvsc_vf_setxdp()
The MANA driver's probe registers netdevice via the following call chain:

mana_probe()
  register_netdev()
    register_netdevice()

register_netdevice() calls notifier callback for netvsc driver,
holding the netdev mutex via netdev_lock_ops().

Further this netvsc notifier callback end up attempting to acquire the
same lock again in dev_xdp_propagate() leading to deadlock.

netvsc_netdev_event()
  netvsc_vf_setxdp()
    dev_xdp_propagate()

This deadlock was not observed so far because net_shaper_ops was never set,
and thus the lock was effectively a no-op in this case. Fix this by using
netif_xdp_propagate() instead of dev_xdp_propagate() to avoid recursive
locking in this path.

And, since no deadlock is observed on the other path which is via
netvsc_probe, add the lock exclusivly for that path.

Also, clean up the unregistration path by removing the unnecessary call to
netvsc_vf_setxdp(), since unregister_netdevice_many_notify() already
performs this cleanup via dev_xdp_uninstall().

Fixes: 97246d6d21 ("net: hold netdev instance lock during ndo_bpf")
Cc: stable@vger.kernel.org
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Tested-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1748513910-23963-1-git-send-email-ssengar@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-30 19:31:25 -07:00
Pranjal Shrivastava
c1f4cb8a8d net: Fix net_devmem_bind_dmabuf for non-devmem configs
Fix the signature of the net_devmem_bind_dmabuf API for
CONFIG_NET_DEVMEM=n.

Fixes: bd61848900 ("net: devmem: Implement TX path")
Signed-off-by: Pranjal Shrivastava <praan@google.com>
Link: https://patch.msgid.link/20250528211058.1826608-1-praan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-30 19:23:36 -07:00
Álvaro Fernández Rojas
efdddc4484 net: dsa: tag_brcm: legacy: fix pskb_may_pull length
BRCM_LEG_PORT_ID was incorrectly used for pskb_may_pull length.
The correct check is BRCM_LEG_TAG_LEN + VLAN_HLEN, or 10 bytes.

Fixes: 964dbf186e ("net: dsa: tag_brcm: add support for legacy tags")
Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250529124406.2513779-1-noltari@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-30 19:20:18 -07:00
Luiz Augusto von Dentz
03dba9cea7 Bluetooth: L2CAP: Fix not responding with L2CAP_CR_LE_ENCRYPTION
Depending on the security set the response to L2CAP_LE_CONN_REQ shall be
just L2CAP_CR_LE_ENCRYPTION if only encryption when BT_SECURITY_MEDIUM
is selected since that means security mode 2 which doesn't require
authentication which is something that is covered in the qualification
test L2CAP/LE/CFC/BV-25-C.

Link: https://github.com/bluez/bluez/issues/1270
Fixes: 27e2d4c8d2 ("Bluetooth: Add basic LE L2CAP connect request receiving support")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-30 13:29:42 -04:00
Dmitry Antipov
03f1700b9b Bluetooth: MGMT: reject malformed HCI_CMD_SYNC commands
In 'mgmt_hci_cmd_sync()', check whether the size of parameters passed
in 'struct mgmt_cp_hci_cmd_sync' matches the total size of the data
(i.e. 'sizeof(struct mgmt_cp_hci_cmd_sync)' plus trailing bytes).
Otherwise, large invalid 'params_len' will cause 'hci_cmd_sync_alloc()'
to do 'skb_put_data()' from an area beyond the one actually passed to
'mgmt_hci_cmd_sync()'.

Reported-by: syzbot+5fe2d5bfbfbec0b675a0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5fe2d5bfbfbec0b675a0
Fixes: 827af4787e ("Bluetooth: MGMT: Add initial implementation of MGMT_OP_HCI_CMD_SYNC")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-30 13:29:30 -04:00
Charalampos Mitrodimas
f29ccaa07c net: tipc: fix refcount warning in tipc_aead_encrypt
syzbot reported a refcount warning [1] caused by calling get_net() on
a network namespace that is being destroyed (refcount=0). This happens
when a TIPC discovery timer fires during network namespace cleanup.

The recently added get_net() call in commit e279024617 ("net/tipc:
fix slab-use-after-free Read in tipc_aead_encrypt_done") attempts to
hold a reference to the network namespace. However, if the namespace
is already being destroyed, its refcount might be zero, leading to the
use-after-free warning.

Replace get_net() with maybe_get_net(), which safely checks if the
refcount is non-zero before incrementing it. If the namespace is being
destroyed, return -ENODEV early, after releasing the bearer reference.

[1]: https://lore.kernel.org/all/68342b55.a70a0220.253bc2.0091.GAE@google.com/T/#m12019cf9ae77e1954f666914640efa36d52704a2

Reported-by: syzbot+f0c4a4aba757549ae26c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68342b55.a70a0220.253bc2.0091.GAE@google.com/T/#m12019cf9ae77e1954f666914640efa36d52704a2
Fixes: e279024617 ("net/tipc: fix slab-use-after-free Read in tipc_aead_encrypt_done")
Signed-off-by: Charalampos Mitrodimas <charmitro@posteo.net>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20250527-net-tipc-warning-v2-1-df3dc398a047@posteo.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-29 12:07:26 +02:00
David Howells
fd579a2ebb rxrpc: Fix return from none_validate_challenge()
Fix the return value of none_validate_challenge() to be explicitly true
(which indicates the source packet should simply be discarded) rather than
implicitly true (because rxrpc_abort_conn() always returns -EPROTO which
gets converted to true).

Note that this change doesn't change the behaviour of the code (which is
correct by accident) and, in any case, we *shouldn't* get a CHALLENGE
packet to an rxnull connection (ie. no security).

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lists.infradead.org/pipermail/linux-afs/2025-April/009738.html
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/10720.1748358103@warthog.procyon.org.uk
Fixes: 5800b1cf3f ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-29 12:03:21 +02:00
Dong Chenchen
271683bb2c page_pool: Fix use-after-free in page_pool_recycle_in_ring
syzbot reported a uaf in page_pool_recycle_in_ring:

BUG: KASAN: slab-use-after-free in lock_release+0x151/0xa30 kernel/locking/lockdep.c:5862
Read of size 8 at addr ffff8880286045a0 by task syz.0.284/6943

CPU: 0 UID: 0 PID: 6943 Comm: syz.0.284 Not tainted 6.13.0-rc3-syzkaller-gdfa94ce54f41 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0x169/0x550 mm/kasan/report.c:489
 kasan_report+0x143/0x180 mm/kasan/report.c:602
 lock_release+0x151/0xa30 kernel/locking/lockdep.c:5862
 __raw_spin_unlock_bh include/linux/spinlock_api_smp.h:165 [inline]
 _raw_spin_unlock_bh+0x1b/0x40 kernel/locking/spinlock.c:210
 spin_unlock_bh include/linux/spinlock.h:396 [inline]
 ptr_ring_produce_bh include/linux/ptr_ring.h:164 [inline]
 page_pool_recycle_in_ring net/core/page_pool.c:707 [inline]
 page_pool_put_unrefed_netmem+0x748/0xb00 net/core/page_pool.c:826
 page_pool_put_netmem include/net/page_pool/helpers.h:323 [inline]
 page_pool_put_full_netmem include/net/page_pool/helpers.h:353 [inline]
 napi_pp_put_page+0x149/0x2b0 net/core/skbuff.c:1036
 skb_pp_recycle net/core/skbuff.c:1047 [inline]
 skb_free_head net/core/skbuff.c:1094 [inline]
 skb_release_data+0x6c4/0x8a0 net/core/skbuff.c:1125
 skb_release_all net/core/skbuff.c:1190 [inline]
 __kfree_skb net/core/skbuff.c:1204 [inline]
 sk_skb_reason_drop+0x1c9/0x380 net/core/skbuff.c:1242
 kfree_skb_reason include/linux/skbuff.h:1263 [inline]
 __skb_queue_purge_reason include/linux/skbuff.h:3343 [inline]

root cause is:

page_pool_recycle_in_ring
  ptr_ring_produce
    spin_lock(&r->producer_lock);
    WRITE_ONCE(r->queue[r->producer++], ptr)
      //recycle last page to pool
				page_pool_release
				  page_pool_scrub
				    page_pool_empty_ring
				      ptr_ring_consume
				      page_pool_return_page  //release all page
				  __page_pool_destroy
				     free_percpu(pool->recycle_stats);
				     free(pool) //free

     spin_unlock(&r->producer_lock); //pool->ring uaf read
  recycle_stat_inc(pool, ring);

page_pool can be free while page pool recycle the last page in ring.
Add producer-lock barrier to page_pool_release to prevent the page
pool from being free before all pages have been recycled.

recycle_stat_inc() is empty when CONFIG_PAGE_POOL_STATS is not
enabled, which will trigger Wempty-body build warning. Add definition
for pool stat macro to fix warning.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/netdev/20250513083123.3514193-1-dongchenchen2@huawei.com
Fixes: ff7d6b27f8 ("page_pool: refurbish version of page_pool code")
Reported-by: syzbot+204a4382fcb3311f3858@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=204a4382fcb3311f3858
Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250527114152.3119109-1-dongchenchen2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-28 19:19:36 -07:00
Tengteng Yang
8542d6fac2 Fix sock_exceed_buf_limit not being triggered in __sk_mem_raise_allocated
When a process under memory pressure is not part of any cgroup and
the charged flag is false, trace_sock_exceed_buf_limit was not called
as expected.

This regression was introduced by commit 2def8ff3fd ("sock:
Code cleanup on __sk_mem_raise_allocated()"). The fix changes the
default value of charged to true while preserving existing logic.

Fixes: 2def8ff3fd ("sock: Code cleanup on __sk_mem_raise_allocated()")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Tengteng Yang <yangtengteng@bytedance.com>
Link: https://patch.msgid.link/20250527030419.67693-1-yangtengteng@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-28 19:07:53 -07:00
Linus Torvalds
90b83efa67 bpf-next-6.16
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmg3NqgACgkQ6rmadz2v
 bTpNUQ/8DPeYtn3nskpsP2OwFy6O3hhfCe6gjOAmUVSk000xbG+AcI/h1DnGZWgk
 xlVcEs93ekzUzHd7k1+RJ2c5yDLXieLJAtb66rbFU1enkxs2cWlcWSKE6K/gaoh3
 G1BCARVlKwtrJhrVrsXtYP/eGZxKRSUZFK7xhtCk7lp7sRI3xkTLE+FJBcDkTJ6W
 HwF14i3zO+BkqNGdFwwlASCCqRItSNBBiM3KjW1DbETOTfAKlvCTrcgdUiODqxhF
 PNnULW+xmICABDFlKfDMlUAGNlSHKjiI3+g31LdblA5eyEhIqiCRgBGFYoCnsluk
 qUauRSie61KqC7fxN3qVpC3bXJfD1td7uIvoqSkDLtTv8a5+HAoiohzi1qBzCayl
 LAGkBYewAfDtdDDjNY38JLH2RCdyY6zG9DhqghPHdPlM7zj7L5zZgj34igEwesMM
 mfj9TuFFF99yfX5UUeSxKpDGR1eO4Ew0p7tg8CRs8Fqh6AIQSmboREZrsncVRCTS
 4SDHSI4KcO4LO2pEKzy+X4dewganN7aESnQG34iG0liyvDDwJOgUnDWLRwPLas7k
 3b/zIfBLxOJpA5R+0hhAMtjMA4NgyKJf4yFZwEieuasQjvzwTApi24YhZ/b3HSEB
 2Dp8kHEEbwezv0OFFz/fJ88dNQnrDmtJ+QByN/liA8kj4Yuh2+Q=
 =j3t8
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Fix and improve BTF deduplication of identical BTF types (Alan
   Maguire and Andrii Nakryiko)

 - Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and
   Alexis Lothoré)

 - Support load-acquire and store-release instructions in BPF JIT on
   riscv64 (Andrea Parri)

 - Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton
   Protopopov)

 - Streamline allowed helpers across program types (Feng Yang)

 - Support atomic update for hashtab of BPF maps (Hou Tao)

 - Implement json output for BPF helpers (Ihor Solodrai)

 - Several s390 JIT fixes (Ilya Leoshkevich)

 - Various sockmap fixes (Jiayuan Chen)

 - Support mmap of vmlinux BTF data (Lorenz Bauer)

 - Support BPF rbtree traversal and list peeking (Martin KaFai Lau)

 - Tests for sockmap/sockhash redirection (Michal Luczaj)

 - Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko)

 - Add support for dma-buf iterators in BPF (T.J. Mercier)

 - The verifier support for __bpf_trap() (Yonghong Song)

* tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits)
  bpf, arm64: Remove unused-but-set function and variable.
  selftests/bpf: Add tests with stack ptr register in conditional jmp
  bpf: Do not include stack ptr register in precision backtracking bookkeeping
  selftests/bpf: enable many-args tests for arm64
  bpf, arm64: Support up to 12 function arguments
  bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
  bpf: Avoid __bpf_prog_ret0_warn when jit fails
  bpftool: Add support for custom BTF path in prog load/loadall
  selftests/bpf: Add unit tests with __bpf_trap() kfunc
  bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
  bpf: Remove special_kfunc_set from verifier
  selftests/bpf: Add test for open coded dmabuf_iter
  selftests/bpf: Add test for dmabuf_iter
  bpf: Add open coded dmabuf iterator
  bpf: Add dmabuf iterator
  dma-buf: Rename debugfs symbols
  bpf: Fix error return value in bpf_copy_from_user_dynptr
  libbpf: Use mmap to parse vmlinux BTF from sysfs
  selftests: bpf: Add a test for mmapable vmlinux BTF
  btf: Allow mmap of vmlinux btf
  ...
2025-05-28 15:52:42 -07:00
Linus Torvalds
1b98f357da Networking changes for 6.16.
Core
 ----
 
  - Implement the Device Memory TCP transmit path, allowing zero-copy
    data transmission on top of TCP from e.g. GPU memory to the wire.
 
  - Move all the IPv6 routing tables management outside the RTNL scope,
    under its own lock and RCU. The route control path is now 3x times
    faster.
 
  - Convert queue related netlink ops to instance lock, reducing
    again the scope of the RTNL lock. This improves the control plane
    scalability.
 
  - Refactor the software crc32c implementation, removing unneeded
    abstraction layers and improving significantly the related
    micro-benchmarks.
 
  - Optimize the GRO engine for UDP-tunneled traffic, for a 10%
    performance improvement in related stream tests.
 
  - Cover more per-CPU storage with local nested BH locking; this is a
    prep work to remove the current per-CPU lock in local_bh_disable()
    on PREMPT_RT.
 
  - Introduce and use nlmsg_payload helper, combining buffer bounds
    verification with accessing payload carried by netlink messages.
 
 Netfilter
 ---------
 
  - Rewrite the procfs conntrack table implementation, improving
    considerably the dump performance. A lot of user-space tools
    still use this interface.
 
  - Implement support for wildcard netdevice in netdev basechain
    and flowtables.
 
  - Integrate conntrack information into nft trace infrastructure.
 
  - Export set count and backend name to userspace, for better
    introspection.
 
 BPF
 ---
 
  - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
    programs and can be controlled in similar way to traditional qdiscs
    using the "tc qdisc" command.
 
  - Refactor the UDP socket iterator, addressing long standing issues
    WRT duplicate hits or missed sockets.
 
 Protocols
 ---------
 
  - Improve TCP receive buffer auto-tuning and increase the default
    upper bound for the receive buffer; overall this improves the single
    flow maximum thoughput on 200Gbs link by over 60%.
 
  - Add AFS GSSAPI security class to AF_RXRPC; it provides transport
    security for connections to the AFS fileserver and VL server.
 
  - Improve TCP multipath routing, so that the sources address always
    matches the nexthop device.
 
  - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
    and thus preventing DoS caused by passing around problematic FDs.
 
  - Retire DCCP socket. DCCP only receives updates for bugs, and major
    distros disable it by default. Its removal allows for better
    organisation of TCP fields to reduce the number of cache lines hit
    in the fast path.
 
  - Extend TCP drop-reason support to cover PAWS checks.
 
 Driver API
 ----------
 
  - Reorganize PTP ioctl flag support to require an explicit opt-in for
    the drivers, avoiding the problem of drivers not rejecting new
    unsupported flags.
 
  - Converted several device drivers to timestamping APIs.
 
  - Introduce per-PHY ethtool dump helpers, improving the support for
    dump operations targeting PHYs.
 
 Tests and tooling
 -----------------
 
  - Add support for classic netlink in user space C codegen, so that
    ynl-c can now read, create and modify links, routes addresses and
    qdisc layer configuration.
 
  - Add ynl sub-types for binary attributes, allowing ynl-c to output
    known struct instead of raw binary data, clarifying the classic
    netlink output.
 
  - Extend MPTCP selftests to improve the code-coverage.
 
  - Add tests for XDP tail adjustment in AF_XDP.
 
 New hardware / drivers
 ----------------------
 
  - OpenVPN virtual driver: offload OpenVPN data channels processing
    to the kernel-space, increasing the data transfer throughput WRT
    the user-space implementation.
 
  - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
 
  - Broadcom asp-v3.0 ethernet driver.
 
  - AMD Renoir ethernet device.
 
  - ReakTek MT9888 2.5G ethernet PHY driver.
 
  - Aeonsemi 10G C45 PHYs driver.
 
 Drivers
 -------
 
  - Ethernet high-speed NICs:
    - nVidia/Mellanox (mlx5):
      - refactor the stearing table handling to reduce significantly
        the amount of memory used
      - add support for complex matches in H/W flow steering
      - improve flow streeing error handling
      - convert to netdev instance locking
    - Intel (100G, ice, igb, ixgbe, idpf):
      - ice: add switchdev support for LLDP traffic over VF
      - ixgbe: add firmware manipulation and regions devlink support
      - igb: introduce support for frame transmission premption
      - igb: adds persistent NAPI configuration
      - idpf: introduce RDMA support
      - idpf: add initial PTP support
    - Meta (fbnic):
      - extend hardware stats coverage
      - add devlink dev flash support
    - Broadcom (bnxt):
      - add support for RX-side device memory TCP
    - Wangxun (txgbe):
      - implement support for udp tunnel offload
      - complete PTP and SRIOV support for AML 25G/10G devices
 
  - Ethernet NICs embedded and virtual:
    - Google (gve):
      - add device memory TCP TX support
    - Amazon (ena):
      - support persistent per-NAPI config
    - Airoha:
      - add H/W support for L2 traffic offload
      - add per flow stats for flow offloading
    - RealTek (rtl8211): add support for WoL magic packet
    - Synopsys (stmmac):
      - dwmac-socfpga 1000BaseX support
      - add Loongson-2K3000 support
      - introduce support for hardware-accelerated VLAN stripping
    - Broadcom (bcmgenet):
      - expose more H/W stats
    - Freescale (enetc, dpaa2-eth):
      - enetc: add MAC filter, VLAN filter RSS and loopback support
      - dpaa2-eth: convert to H/W timestamping APIs
    - vxlan: convert FDB table to rhashtable, for better scalabilty
    - veth: apply qdisc backpressure on full ring to reduce TX drops
 
  - Ethernet switches:
    - Microchip (kzZ88x3): add ETS scheduler support
 
  - Ethernet PHYs:
    - RealTek (rtl8211):
      - add support for WoL magic packet
      - add support for PHY LEDs
 
  - CAN:
    - Adds RZ/G3E CANFD support to the rcar_canfd driver.
    - Preparatory work for CAN-XL support.
    - Add self-tests framework with support for CAN physical interfaces.
 
  - WiFi:
    - mac80211:
      - scan improvements with multi-link operation (MLO)
    - Qualcomm (ath12k):
      - enable AHB support for IPQ5332
      - add monitor interface support to QCN9274
      - add multi-link operation support to WCN7850
      - add 802.11d scan offload support to WCN7850
      - monitor mode for WCN7850, better 6 GHz regulatory
    - Qualcomm (ath11k):
      - restore hibernation support
    - MediaTek (mt76):
      - WiFi-7 improvements
      - implement support for mt7990
    - Intel (iwlwifi):
      - enhanced multi-link single-radio (EMLSR) support on 5 GHz links
      - rework device configuration
    - RealTek (rtw88):
      - improve throughput for RTL8814AU
    - RealTek (rtw89):
      - add multi-link operation support
      - STA/P2P concurrency improvements
      - support different SAR configs by antenna
 
  - Bluetooth:
    - introduce HCI Driver protocol
    - btintel_pcie: do not generate coredump for diagnostic events
    - btusb: add HCI Drv commands for configuring altsetting
    - btusb: add RTL8851BE device 0x0bda:0xb850
    - btusb: add new VID/PID 13d3/3584 for MT7922
    - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
    - btnxpuart: implement host-wakeup feature
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmg3D64SHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkcIsQAK2eEc+BxQer975wzvtMg6gF9eoex4a+
 rZ7jxfDzDtNvTauoQsrpehDZp0FnySaVGCU36lHGB2OvDnhCpPc5hXzKDWQpOuqQ
 SHrGG3/6FTbdTG/HfHUcbNyrUzIf53SADSObiQ3qg4gyEQ3sCpcOKtVtMcU8rvsY
 /HqMnsJWFaROUMjMtCcnUSgjmeY9kBvha3sTXUqgeRugEOCvZD7z4rpqFIcQqHw7
 e2Fi8dwIXEYNxqPp6MRq2qdyUTewCRruE8ZIMAFuhtfYeMElUZMPlqlMENX3AzTQ
 cr0EgwcFOUxRA7oZRxhoBNBsVXavtSpQr4ZDoWplxP4aQ37n5tc1E9Q72axpB/Og
 FbJRl6GvWYnCd8071BczgmfHlKaTAigPvt2Z4r6JjM5I/Bij/IZ3k+On1OTuOAj/
 EqfFkdZ0a5cfKrwUMP+oSGtSAywkMVUtnIKJlZeRbjSj2432sCfe2jVAlS8ELM43
 3LUgXYrAKtA87g171LlsRu5EEpI5QmqPb+i5LpPlEXe2TJEgPisyfecJ3NafF/2+
 j575lm+TFNm9NTNhGGjDPEvw0djI5wSGGMe9J4gC74eWi6s5t6C4cuUf84TKWdwR
 x+9H0IB7rfFncAwXHJuUUtzd+fPHaYzs5dDGbSgMQOXr1cr1wlubCK8mQ1r/Wt/a
 3GjFIOQKW2Q5
 =t/Tz
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Implement the Device Memory TCP transmit path, allowing zero-copy
     data transmission on top of TCP from e.g. GPU memory to the wire.

   - Move all the IPv6 routing tables management outside the RTNL scope,
     under its own lock and RCU. The route control path is now 3x times
     faster.

   - Convert queue related netlink ops to instance lock, reducing again
     the scope of the RTNL lock. This improves the control plane
     scalability.

   - Refactor the software crc32c implementation, removing unneeded
     abstraction layers and improving significantly the related
     micro-benchmarks.

   - Optimize the GRO engine for UDP-tunneled traffic, for a 10%
     performance improvement in related stream tests.

   - Cover more per-CPU storage with local nested BH locking; this is a
     prep work to remove the current per-CPU lock in local_bh_disable()
     on PREMPT_RT.

   - Introduce and use nlmsg_payload helper, combining buffer bounds
     verification with accessing payload carried by netlink messages.

  Netfilter:

   - Rewrite the procfs conntrack table implementation, improving
     considerably the dump performance. A lot of user-space tools still
     use this interface.

   - Implement support for wildcard netdevice in netdev basechain and
     flowtables.

   - Integrate conntrack information into nft trace infrastructure.

   - Export set count and backend name to userspace, for better
     introspection.

  BPF:

   - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
     programs and can be controlled in similar way to traditional qdiscs
     using the "tc qdisc" command.

   - Refactor the UDP socket iterator, addressing long standing issues
     WRT duplicate hits or missed sockets.

  Protocols:

   - Improve TCP receive buffer auto-tuning and increase the default
     upper bound for the receive buffer; overall this improves the
     single flow maximum thoughput on 200Gbs link by over 60%.

   - Add AFS GSSAPI security class to AF_RXRPC; it provides transport
     security for connections to the AFS fileserver and VL server.

   - Improve TCP multipath routing, so that the sources address always
     matches the nexthop device.

   - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
     and thus preventing DoS caused by passing around problematic FDs.

   - Retire DCCP socket. DCCP only receives updates for bugs, and major
     distros disable it by default. Its removal allows for better
     organisation of TCP fields to reduce the number of cache lines hit
     in the fast path.

   - Extend TCP drop-reason support to cover PAWS checks.

  Driver API:

   - Reorganize PTP ioctl flag support to require an explicit opt-in for
     the drivers, avoiding the problem of drivers not rejecting new
     unsupported flags.

   - Converted several device drivers to timestamping APIs.

   - Introduce per-PHY ethtool dump helpers, improving the support for
     dump operations targeting PHYs.

  Tests and tooling:

   - Add support for classic netlink in user space C codegen, so that
     ynl-c can now read, create and modify links, routes addresses and
     qdisc layer configuration.

   - Add ynl sub-types for binary attributes, allowing ynl-c to output
     known struct instead of raw binary data, clarifying the classic
     netlink output.

   - Extend MPTCP selftests to improve the code-coverage.

   - Add tests for XDP tail adjustment in AF_XDP.

  New hardware / drivers:

   - OpenVPN virtual driver: offload OpenVPN data channels processing to
     the kernel-space, increasing the data transfer throughput WRT the
     user-space implementation.

   - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.

   - Broadcom asp-v3.0 ethernet driver.

   - AMD Renoir ethernet device.

   - ReakTek MT9888 2.5G ethernet PHY driver.

   - Aeonsemi 10G C45 PHYs driver.

  Drivers:

   - Ethernet high-speed NICs:
       - nVidia/Mellanox (mlx5):
           - refactor the steering table handling to significantly
             reduce the amount of memory used
           - add support for complex matches in H/W flow steering
           - improve flow streeing error handling
           - convert to netdev instance locking
       - Intel (100G, ice, igb, ixgbe, idpf):
           - ice: add switchdev support for LLDP traffic over VF
           - ixgbe: add firmware manipulation and regions devlink support
           - igb: introduce support for frame transmission premption
           - igb: adds persistent NAPI configuration
           - idpf: introduce RDMA support
           - idpf: add initial PTP support
       - Meta (fbnic):
           - extend hardware stats coverage
           - add devlink dev flash support
       - Broadcom (bnxt):
           - add support for RX-side device memory TCP
       - Wangxun (txgbe):
           - implement support for udp tunnel offload
           - complete PTP and SRIOV support for AML 25G/10G devices

   - Ethernet NICs embedded and virtual:
       - Google (gve):
           - add device memory TCP TX support
       - Amazon (ena):
           - support persistent per-NAPI config
       - Airoha:
           - add H/W support for L2 traffic offload
           - add per flow stats for flow offloading
       - RealTek (rtl8211): add support for WoL magic packet
       - Synopsys (stmmac):
           - dwmac-socfpga 1000BaseX support
           - add Loongson-2K3000 support
           - introduce support for hardware-accelerated VLAN stripping
       - Broadcom (bcmgenet):
           - expose more H/W stats
       - Freescale (enetc, dpaa2-eth):
           - enetc: add MAC filter, VLAN filter RSS and loopback support
           - dpaa2-eth: convert to H/W timestamping APIs
       - vxlan: convert FDB table to rhashtable, for better scalabilty
       - veth: apply qdisc backpressure on full ring to reduce TX drops

   - Ethernet switches:
       - Microchip (kzZ88x3): add ETS scheduler support

   - Ethernet PHYs:
       - RealTek (rtl8211):
           - add support for WoL magic packet
           - add support for PHY LEDs

   - CAN:
       - Adds RZ/G3E CANFD support to the rcar_canfd driver.
       - Preparatory work for CAN-XL support.
       - Add self-tests framework with support for CAN physical interfaces.

   - WiFi:
       - mac80211:
           - scan improvements with multi-link operation (MLO)
       - Qualcomm (ath12k):
           - enable AHB support for IPQ5332
           - add monitor interface support to QCN9274
           - add multi-link operation support to WCN7850
           - add 802.11d scan offload support to WCN7850
           - monitor mode for WCN7850, better 6 GHz regulatory
       - Qualcomm (ath11k):
           - restore hibernation support
       - MediaTek (mt76):
           - WiFi-7 improvements
           - implement support for mt7990
       - Intel (iwlwifi):
           - enhanced multi-link single-radio (EMLSR) support on 5 GHz links
           - rework device configuration
       - RealTek (rtw88):
           - improve throughput for RTL8814AU
       - RealTek (rtw89):
           - add multi-link operation support
           - STA/P2P concurrency improvements
           - support different SAR configs by antenna

   - Bluetooth:
       - introduce HCI Driver protocol
       - btintel_pcie: do not generate coredump for diagnostic events
       - btusb: add HCI Drv commands for configuring altsetting
       - btusb: add RTL8851BE device 0x0bda:0xb850
       - btusb: add new VID/PID 13d3/3584 for MT7922
       - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
       - btnxpuart: implement host-wakeup feature"

* tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits)
  selftests/bpf: Fix bpf selftest build warning
  selftests: netfilter: Fix skip of wildcard interface test
  net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames
  net: openvswitch: Fix the dead loop of MPLS parse
  calipso: Don't call calipso functions for AF_INET sk.
  selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem
  net_sched: hfsc: Address reentrant enqueue adding class to eltree twice
  octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback
  octeontx2-pf: QOS: Perform cache sync on send queue teardown
  net: mana: Add support for Multi Vports on Bare metal
  net: devmem: ncdevmem: remove unused variable
  net: devmem: ksft: upgrade rx test to send 1K data
  net: devmem: ksft: add 5 tuple FS support
  net: devmem: ksft: add exit_wait to make rx test pass
  net: devmem: ksft: add ipv4 support
  net: devmem: preserve sockc_err
  page_pool: fix ugly page_pool formatting
  net: devmem: move list_add to net_devmem_bind_dmabuf.
  selftests: netfilter: nft_queue.sh: include file transfer duration in log message
  net: phy: mscc: Fix memory leak when using one step timestamping
  ...
2025-05-28 15:24:36 -07:00
Chuck Lever
111f9e4b0d SUNRPC: Remove dead code from xs_tcp_tls_setup_socket()
xs_tcp_tls_finish_connecting() already marks the upper xprt
connected, so the same code in xs_tcp_tls_setup_socket() is
never executed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28 17:17:14 -04:00
Chuck Lever
0bd2f6b899 SUNRPC: Prevent hang on NFS mount with xprtsec=[m]tls
Engineers at Hammerspace noticed that sometimes mounting with
"xprtsec=tls" hangs for a minute or so, and then times out, even
when the NFS server is reachable and responsive.

kTLS shuts off data_ready callbacks if strp->msg_ready is set to
mitigate data_ready callbacks when a full TLS record is not yet
ready to be read from the socket.

Normally msg_ready is clear when the first TLS record arrives on
a socket. However, I observed that sometimes tls_setsockopt() sets
strp->msg_ready, and that prevents forward progress because
tls_data_ready() becomes a no-op.

Moreover, Jakub says: "If there's a full record queued at the time
when [tlshd] passes the socket back to the kernel, it's up to the
reader to read the already queued data out." So SunRPC cannot
expect a data_ready call when ingress data is already waiting.

Add an explicit poll after SunRPC's upper transport is set up to
pick up any data that arrived after the TLS handshake but before
transport set-up is complete.

Reported-by: Steve Sears <sjs@hammerspace.com>
Suggested-by: Jakub Kacinski <kuba@kernel.org>
Fixes: 75eb6af7ac ("SUNRPC: Add a TCP-with-TLS RPC transport class")
Tested-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28 17:17:14 -04:00
Linus Torvalds
2c26b68cd5 NFSD 6.16 Release Notes
The marquee feature for this release is that the limit on the
 maximum rsize and wsize has been raised to 4MB. The default remains
 at 1MB, but risk-seeking administrators now have the ability to try
 larger I/O sizes with NFS clients that support them. Eventually the
 default setting will be increased when we have confidence that this
 change will not have negative impact.
 
 With v6.16, NFSD now has its own debugfs file system where we can
 add experimental features and make them available outside of our
 development community without impacting production deployments. The
 first experimental setting added is one that makes all NFS READ
 operations use vfs_iter_read() instead of the NFSD splice actor. The
 plan is to eventually retire the splice actor, as that will enable a
 number of new capabilities such as the use of struct bio_vec from the
 top to the bottom of the NFSD stack.
 
 Jeff Layton contributed a number of observability improvements. The
 use of dprintk() in a number of high-traffic code paths has been
 replaced with static trace points.
 
 This release sees the continuation of efforts to harden the NFSv4.2
 COPY operation. Soon, the restriction on async COPY operations can
 be lifted.
 
 Many thanks to the contributors, reviewers, testers, and bug
 reporters who participated during the v6.16 development cycle.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmg1yGUACgkQM2qzM29m
 f5frMA//TJTbSWiM7qBX1GhVMNr1lxQcjU4BPKo0qZfEtwV06F2BB9mWgDU+BIQh
 AcGfMZUmNWAnhOTOYvwqyW6dnX+1yt8sBsCZ/1ctY30A4JH4AgG5sdZS7BUrlEEr
 bGDMUCaPnvQ3maeDjMlefe7Xv/rUhj9TVhXmDkt4vf/jCde2JODTB/z8n7WeAxYJ
 eOvmr/n5z6VI5Q67M7b5/xqofBEaEoq9P5UEgn61ThfeR0bMlrklm/avDCbbNIH8
 6n7Z3tjzllK1CAjEmwHalq4LRbMX5FHWzNkyJw+wtviXS18J5vCAvRe+JDoykusu
 L2bgXT8bBUqy46eO4WKEOJtEqVQhIsRFx/8ku1iTLrpDWlwrR4mHVyObEDkkdlMX
 EyBQ4svg2OxCXSyy5O8oggzU0TWVJStIjbIEHbJYusWLU7HxxFveBwqwzYHXLtip
 WKm6N2ANqQi1du+Pc6xmgXo9svA5Vk+DQjljm1Y5up9dhi2K9cvCIHjwFsZ+E0VL
 XqXJ2YgIQb3oXK7FttzLOiDrpX1OX82sTIbgdcPcfT7lP+ej7uiHMBPmdPwgaZIU
 EbIp0ThoTkh8/VRMDcWIt+B6SEhmb5vY3Zgz9Lcf2J0PM1fuYJ67L7xGTviFX7Ci
 DpohiCgceb6PHYeIuarayF86tPJGF8Vb7XvQZej2Ybv8QdxLFg8=
 =FbeG
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "The marquee feature for this release is that the limit on the maximum
  rsize and wsize has been raised to 4MB. The default remains at 1MB,
  but risk-seeking administrators now have the ability to try larger I/O
  sizes with NFS clients that support them. Eventually the default
  setting will be increased when we have confidence that this change
  will not have negative impact.

  With v6.16, NFSD now has its own debugfs file system where we can add
  experimental features and make them available outside of our
  development community without impacting production deployments. The
  first experimental setting added is one that makes all NFS READ
  operations use vfs_iter_read() instead of the NFSD splice actor. The
  plan is to eventually retire the splice actor, as that will enable a
  number of new capabilities such as the use of struct bio_vec from the
  top to the bottom of the NFSD stack.

  Jeff Layton contributed a number of observability improvements. The
  use of dprintk() in a number of high-traffic code paths has been
  replaced with static trace points.

  This release sees the continuation of efforts to harden the NFSv4.2
  COPY operation. Soon, the restriction on async COPY operations can be
  lifted.

  Many thanks to the contributors, reviewers, testers, and bug reporters
  who participated during the v6.16 development cycle"

* tag 'nfsd-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (60 commits)
  xdrgen: Fix code generated for counted arrays
  SUNRPC: Bump the maximum payload size for the server
  NFSD: Add a "default" block size
  NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro
  NFSD: Remove NFSD_BUFSIZE
  sunrpc: Remove the RPCSVC_MAXPAGES macro
  svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages
  svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages
  sunrpc: Adjust size of socket's receive page array dynamically
  SUNRPC: Remove svc_rqst :: rq_vec
  SUNRPC: Remove svc_fill_write_vector()
  NFSD: Use rqstp->rq_bvec in nfsd_iter_write()
  SUNRPC: Export xdr_buf_to_bvec()
  NFSD: De-duplicate the svc_fill_write_vector() call sites
  NFSD: Use rqstp->rq_bvec in nfsd_iter_read()
  sunrpc: Replace the rq_bvec array with dynamically-allocated memory
  sunrpc: Replace the rq_pages array with dynamically-allocated memory
  sunrpc: Remove backchannel check in svc_init_buffer()
  sunrpc: Add a helper to derive maxpages from sv_max_mesg
  svcrdma: Reduce the number of rdma_rw contexts per-QP
  ...
2025-05-28 12:21:12 -07:00
Paolo Abeni
f6bd8faeb1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes to prepare for the 6.16 net-next PR.

No conflicts nor adjacent changes.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28 10:11:15 +02:00
Faicker Mo
0bdc924bfb net: openvswitch: Fix the dead loop of MPLS parse
The unexpected MPLS packet may not end with the bottom label stack.
When there are many stacks, The label count value has wrapped around.
A dead loop occurs, soft lockup/CPU stuck finally.

stack backtrace:
UBSAN: array-index-out-of-bounds in /build/linux-0Pa0xK/linux-5.15.0/net/openvswitch/flow.c:662:26
index -1 is out of range for type '__be32 [3]'
CPU: 34 PID: 0 Comm: swapper/34 Kdump: loaded Tainted: G           OE   5.15.0-121-generic #131-Ubuntu
Hardware name: Dell Inc. PowerEdge C6420/0JP9TF, BIOS 2.12.2 07/14/2021
Call Trace:
 <IRQ>
 show_stack+0x52/0x5c
 dump_stack_lvl+0x4a/0x63
 dump_stack+0x10/0x16
 ubsan_epilogue+0x9/0x36
 __ubsan_handle_out_of_bounds.cold+0x44/0x49
 key_extract_l3l4+0x82a/0x840 [openvswitch]
 ? kfree_skbmem+0x52/0xa0
 key_extract+0x9c/0x2b0 [openvswitch]
 ovs_flow_key_extract+0x124/0x350 [openvswitch]
 ovs_vport_receive+0x61/0xd0 [openvswitch]
 ? kernel_init_free_pages.part.0+0x4a/0x70
 ? get_page_from_freelist+0x353/0x540
 netdev_port_receive+0xc4/0x180 [openvswitch]
 ? netdev_port_receive+0x180/0x180 [openvswitch]
 netdev_frame_hook+0x1f/0x40 [openvswitch]
 __netif_receive_skb_core.constprop.0+0x23a/0xf00
 __netif_receive_skb_list_core+0xfa/0x240
 netif_receive_skb_list_internal+0x18e/0x2a0
 napi_complete_done+0x7a/0x1c0
 bnxt_poll+0x155/0x1c0 [bnxt_en]
 __napi_poll+0x30/0x180
 net_rx_action+0x126/0x280
 ? bnxt_msix+0x67/0x80 [bnxt_en]
 handle_softirqs+0xda/0x2d0
 irq_exit_rcu+0x96/0xc0
 common_interrupt+0x8e/0xa0
 </IRQ>

Fixes: fbdcdd78da ("Change in Openvswitch to support MPLS label depth of 3 in ingress direction")
Signed-off-by: Faicker Mo <faicker.mo@zenlayer.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/259D3404-575D-4A6D-B263-1DF59A67CF89@zenlayer.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28 09:03:02 +02:00
Kuniyuki Iwashima
6e9f2df1c5 calipso: Don't call calipso functions for AF_INET sk.
syzkaller reported a null-ptr-deref in txopt_get(). [0]

The offset 0x70 was of struct ipv6_txoptions in struct ipv6_pinfo,
so struct ipv6_pinfo was NULL there.

However, this never happens for IPv6 sockets as inet_sk(sk)->pinet6
is always set in inet6_create(), meaning the socket was not IPv6 one.

The root cause is missing validation in netlbl_conn_setattr().

netlbl_conn_setattr() switches branches based on struct
sockaddr.sa_family, which is passed from userspace.  However,
netlbl_conn_setattr() does not check if the address family matches
the socket.

The syzkaller must have called connect() for an IPv6 address on
an IPv4 socket.

We have a proper validation in tcp_v[46]_connect(), but
security_socket_connect() is called in the earlier stage.

Let's copy the validation to netlbl_conn_setattr().

[0]:
Oops: general protection fault, probably for non-canonical address 0xdffffc000000000e: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
CPU: 2 UID: 0 PID: 12928 Comm: syz.9.1677 Not tainted 6.12.0 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:txopt_get include/net/ipv6.h:390 [inline]
RIP: 0010:
Code: 02 00 00 49 8b ac 24 f8 02 00 00 e8 84 69 2a fd e8 ff 00 16 fd 48 8d 7d 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 53 02 00 00 48 8b 6d 70 48 85 ed 0f 84 ab 01 00
RSP: 0018:ffff88811b8afc48 EFLAGS: 00010212
RAX: dffffc0000000000 RBX: 1ffff11023715f8a RCX: ffffffff841ab00c
RDX: 000000000000000e RSI: ffffc90007d9e000 RDI: 0000000000000070
RBP: 0000000000000000 R08: ffffed1023715f9d R09: ffffed1023715f9e
R10: ffffed1023715f9d R11: 0000000000000003 R12: ffff888123075f00
R13: ffff88810245bd80 R14: ffff888113646780 R15: ffff888100578a80
FS:  00007f9019bd7640(0000) GS:ffff8882d2d00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f901b927bac CR3: 0000000104788003 CR4: 0000000000770ef0
PKRU: 80000000
Call Trace:
 <TASK>
 calipso_sock_setattr+0x56/0x80 net/netlabel/netlabel_calipso.c:557
 netlbl_conn_setattr+0x10c/0x280 net/netlabel/netlabel_kapi.c:1177
 selinux_netlbl_socket_connect_helper+0xd3/0x1b0 security/selinux/netlabel.c:569
 selinux_netlbl_socket_connect_locked security/selinux/netlabel.c:597 [inline]
 selinux_netlbl_socket_connect+0xb6/0x100 security/selinux/netlabel.c:615
 selinux_socket_connect+0x5f/0x80 security/selinux/hooks.c:4931
 security_socket_connect+0x50/0xa0 security/security.c:4598
 __sys_connect_file+0xa4/0x190 net/socket.c:2067
 __sys_connect+0x12c/0x170 net/socket.c:2088
 __do_sys_connect net/socket.c:2098 [inline]
 __se_sys_connect net/socket.c:2095 [inline]
 __x64_sys_connect+0x73/0xb0 net/socket.c:2095
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xaa/0x1b0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f901b61a12d
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f9019bd6fa8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 00007f901b925fa0 RCX: 00007f901b61a12d
RDX: 000000000000001c RSI: 0000200000000140 RDI: 0000000000000003
RBP: 00007f901b701505 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f901b5b62a0 R15: 00007f9019bb7000
 </TASK>
Modules linked in:

Fixes: ceba1832b1 ("calipso: Set the calipso socket label to match the secattr.")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Reported-by: John Cheung <john.cs.hey@gmail.com>
Closes: https://lore.kernel.org/netdev/CAP=Rh=M1LzunrcQB1fSGauMrJrhL6GGps5cPAKzHJXj6GQV+-g@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://patch.msgid.link/20250522221858.91240-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28 08:59:09 +02:00
Pedro Tammela
ac9fe7dd8e net_sched: hfsc: Address reentrant enqueue adding class to eltree twice
Savino says:
    "We are writing to report that this recent patch
    (141d34391a) [1]
    can be bypassed, and a UAF can still occur when HFSC is utilized with
    NETEM.

    The patch only checks the cl->cl_nactive field to determine whether
    it is the first insertion or not [2], but this field is only
    incremented by init_vf [3].

    By using HFSC_RSC (which uses init_ed) [4], it is possible to bypass the
    check and insert the class twice in the eltree.
    Under normal conditions, this would lead to an infinite loop in
    hfsc_dequeue for the reasons we already explained in this report [5].

    However, if TBF is added as root qdisc and it is configured with a
    very low rate,
    it can be utilized to prevent packets from being dequeued.
    This behavior can be exploited to perform subsequent insertions in the
    HFSC eltree and cause a UAF."

To fix both the UAF and the infinite loop, with netem as an hfsc child,
check explicitly in hfsc_enqueue whether the class is already in the eltree
whenever the HFSC_RSC flag is set.

[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=141d34391abbb315d68556b7c67ad97885407547
[2] https://elixir.bootlin.com/linux/v6.15-rc5/source/net/sched/sch_hfsc.c#L1572
[3] https://elixir.bootlin.com/linux/v6.15-rc5/source/net/sched/sch_hfsc.c#L677
[4] https://elixir.bootlin.com/linux/v6.15-rc5/source/net/sched/sch_hfsc.c#L1574
[5] https://lore.kernel.org/netdev/8DuRWwfqjoRDLDmBMlIfbrsZg9Gx50DHJc1ilxsEBNe2D6NMoigR_eIRIG0LOjMc3r10nUUZtArXx4oZBIdUfZQrwjcQhdinnMis_0G7VEk=@willsroot.io/T/#u

Fixes: 37d9cf1a3c ("sched: Fix detection of empty queues in child qdiscs")
Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
Reported-by: William Liu <will@willsroot.io>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Link: https://patch.msgid.link/20250522181448.1439717-2-pctammela@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28 08:55:35 +02:00
Mina Almasry
85cea17d15 net: devmem: preserve sockc_err
Preserve the error code returned by sock_cmsg_send and return that on
err.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250523230524.1107879-4-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-27 19:19:35 -07:00
Mina Almasry
170ebc60b7 page_pool: fix ugly page_pool formatting
Minor cleanup; this line is badly formatted.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250523230524.1107879-3-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-27 19:19:35 -07:00
Mina Almasry
88e47c93b3 net: devmem: move list_add to net_devmem_bind_dmabuf.
It's annoying for the list_add to be outside net_devmem_bind_dmabuf, but
the list_del is in net_devmem_unbind_dmabuf. Make it consistent by
having both the list_add/del be inside the net_devmem_[un]bind_dmabuf.

Cc: ap420073@gmail.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Tested-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250523230524.1107879-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-27 19:19:35 -07:00
Christoph Hellwig
33f1b3677a sctp: mark sctp_do_peeloff static
sctp_do_peeloff is only used inside of net/sctp/socket.c,
so mark it static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20250526054745.2329201-1-hch@lst.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-27 18:18:55 -07:00
Saeed Mahameed
cb575e5e9f net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR
GENERIC_ALLOCATOR is a non-prompt kconfig, meaning users can't enable it
selectively. All kconfig users of GENERIC_ALLOCATOR select it, except of
NET_DEVMEM which only depends on it, there is no easy way to turn
GENERIC_ALLOCATOR on unless we select other unnecessary configs that
will select it.

Instead of depending on it, select it when NET_DEVMEM is enabled.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/1747950086-1246773-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-27 17:31:42 -07:00
Zilin Guan
c8ef20fe72 tipc: use kfree_sensitive() for aead cleanup
The tipc_aead_free() function currently uses kfree() to release the aead
structure. However, this structure contains sensitive information, such
as key's SALT value, which should be securely erased from memory to
prevent potential leakage.

To enhance security, replace kfree() with kfree_sensitive() when freeing
the aead structure. This change ensures that sensitive data is explicitly
cleared before memory deallocation, aligning with the approach used in
tipc_aead_init() and adhering to best practices for handling confidential
information.

Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20250523114717.4021518-1-zilin@seu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-27 17:31:42 -07:00
Linus Torvalds
5e8bbb2caa Another set of timer API cleanups:
- Convert init_timer*(), try_to_del_timer_sync() and
    destroy_timer_on_stack() over to the canonical timer_*() namespace
    convention.
 
 There are is another large converstion pending, which has not been included
 because it would have caused a gazillion of merge conflicts in next. The
 conversion scripts will be run towards the end of the merge window and a
 pull request sent once all conflict dependencies have been merged.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmgzgTkTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodwVD/97rF1Juqm1JZNIZPN/vMqwCxRoUkc6
 tsK0+UC7UXusuJadxJ+Bsv25iPF+qejnThMU+SQ5yTVj/PNfxOe0WPdCEGGiL8Ye
 2JCk6GqSOB/360SlLmtR1B1xHDwsuuUcQTz0w57CH66HRV5vpoWSMSwj/ypy+8nU
 PlgjItaxdCKa9NJ+SUJZPWIxRkt/PsA1kwlV1OcxkgB++IiIHQEbPxECq9mlzWXF
 b4Sq/Sdf2OmEePN+DYoey4fneRwJnkjkeX/o+CqosCPHRIiWUlSu5W/lU5IYojM3
 s3XpMNNg/z8PMXR4JA2VaPYWLUZyBOs+3dM7Y6Am+z55EoxMxfzg6pGx2tfM4ftl
 vF8wG3Z1c9MmpLk+P9LatNvfHeVLNve8KgOLa5phMDQ/El/a8KqLu6HmRDPONvKp
 d6iXdPq1CP8P6jOtlFfzLmKPShgEcp+Zz9W3CaQR/0ZJEsEqrpKOLzdT86hJhBV0
 mBCdzixmGtKAh0BdPdmg2FCLScqER3HKIJhZSdV8I+jSETIHCuMiIfbMXR7iwm/H
 R1/ayvxrbc1mPseo28scqvo7m6cn5BFBxIUf4Sokp52ZCapz1v2aWzo4vHI0cTgT
 ZOjlTrf+fgYLn1dqdD45TJiQPnmRrw4dU+WWSFRFJY2qjfyucj80vdqdkE5zkp5b
 UPomlVimG4ccPg==
 =FHGU
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "Another set of timer API cleanups:

    - Convert init_timer*(), try_to_del_timer_sync() and
      destroy_timer_on_stack() over to the canonical timer_*()
      namespace convention.

  There is another large conversion pending, which has not been included
  because it would have caused a gazillion of merge conflicts in next.
  The conversion scripts will be run towards the end of the merge window
  and a pull request sent once all conflict dependencies have been
  merged"

* tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack()
  treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try()
  timers: Rename init_timers() as timers_init()
  timers: Rename NEXT_TIMER_MAX_DELTA as TIMER_NEXT_MAX_DELTA
  timers: Rename __init_timer_on_stack() as __timer_init_on_stack()
  timers: Rename __init_timer() as __timer_init()
  timers: Rename init_timer_on_stack_key() as timer_init_key_on_stack()
  timers: Rename init_timer_key() as timer_init_key()
2025-05-27 08:31:21 -07:00
Bui Quang Minh
28fcb4b56f xsk: add missing virtual address conversion for page
In commit 7ead4405e0 ("xsk: convert xdp_copy_frags_from_zc() to use
page_pool_dev_alloc()"), when converting from netmem to page, I missed a
call to page_address() around skb_frag_page(frag) to get the virtual
address of the page. This commit uses skb_frag_address() helper to fix
the issue.

Fixes: 7ead4405e0 ("xsk: convert xdp_copy_frags_from_zc() to use page_pool_dev_alloc()")
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20250522040115.5057-1-minhquangbui99@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 11:46:47 +02:00
Stanislav Fomichev
d8d85ef0a6 af_packet: move notifier's packet_dev_mc out of rcu critical section
Syzkaller reports the following issue:

 BUG: sleeping function called from invalid context at kernel/locking/mutex.c:578
 __mutex_lock+0x106/0xe80 kernel/locking/mutex.c:746
 team_change_rx_flags+0x38/0x220 drivers/net/team/team_core.c:1781
 dev_change_rx_flags net/core/dev.c:9145 [inline]
 __dev_set_promiscuity+0x3f8/0x590 net/core/dev.c:9189
 netif_set_promiscuity+0x50/0xe0 net/core/dev.c:9201
 dev_set_promiscuity+0x126/0x260 net/core/dev_api.c:286 packet_dev_mc net/packet/af_packet.c:3698 [inline]
 packet_dev_mclist_delete net/packet/af_packet.c:3722 [inline]
 packet_notifier+0x292/0xa60 net/packet/af_packet.c:4247
 notifier_call_chain+0x1b3/0x3e0 kernel/notifier.c:85
 call_netdevice_notifiers_extack net/core/dev.c:2214 [inline]
 call_netdevice_notifiers net/core/dev.c:2228 [inline]
 unregister_netdevice_many_notify+0x15d8/0x2330 net/core/dev.c:11972
 rtnl_delete_link net/core/rtnetlink.c:3522 [inline]
 rtnl_dellink+0x488/0x710 net/core/rtnetlink.c:3564
 rtnetlink_rcv_msg+0x7cf/0xb70 net/core/rtnetlink.c:6955
 netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534

Calling `PACKET_ADD_MEMBERSHIP` on an ops-locked device can trigger
the `NETDEV_UNREGISTER` notifier, which may require disabling promiscuous
and/or allmulti mode. Both of these operations require acquiring
the netdev instance lock.

Move the call to `packet_dev_mc` outside of the RCU critical section.
The `mclist` modifications (add, del, flush, unregister) are protected by
the RTNL, not the RCU. The RCU only protects the `sklist` and its
associated `sks`. The delayed operation on the `mclist` entry remains
within the RTNL.

Reported-by: syzbot+b191b5ccad8d7a986286@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b191b5ccad8d7a986286
Fixes: ad7c7b2172 ("net: hold netdev instance lock during sysfs operations")
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250522031129.3247266-1-stfomichev@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 11:36:26 +02:00
Michal Luczaj
5ec40864aa vsock: Move lingering logic to af_vsock core
Lingering should be transport-independent in the long run. In preparation
for supporting other transports, as well as the linger on shutdown(), move
code to core.

Generalize by querying vsock_transport::unsent_bytes(), guard against the
callback being unimplemented. Do not pass sk_lingertime explicitly. Pull
SOCK_LINGER check into vsock_linger().

Flatten the function. Remove the nested block by inverting the condition:
return early on !timeout.

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250522-vsock-linger-v6-2-2ad00b0e447e@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 11:05:21 +02:00
Michal Luczaj
1c39f5dbbf vsock/virtio: Linger on unsent data
Currently vsock's lingering effectively boils down to waiting (or timing
out) until packets are consumed or dropped by the peer; be it by receiving
the data, closing or shutting down the connection.

To align with the semantics described in the SO_LINGER section of man
socket(7) and to mimic AF_INET's behaviour more closely, change the logic
of a lingering close(): instead of waiting for all data to be handled,
block until data is considered sent from the vsock's transport point of
view. That is until worker picks the packets for processing and decrements
virtio_vsock_sock::bytes_unsent down to 0.

Note that (some interpretation of) lingering was always limited to
transports that called virtio_transport_wait_close() on transport release.
This does not change, i.e. under Hyper-V and VMCI no lingering would be
observed.

The implementation does not adhere strictly to man page's interpretation of
SO_LINGER: shutdown() will not trigger the lingering. This follows AF_INET.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250522-vsock-linger-v6-1-2ad00b0e447e@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 11:05:21 +02:00
Kees Cook
ae9fcd5a0f net: core: Convert dev_set_mac_address_user() to use struct sockaddr_storage
Convert callers of dev_set_mac_address_user() to use struct
sockaddr_storage. Add sanity checks on dev->addr_len usage.

Signed-off-by: Kees Cook <kees@kernel.org>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-8-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 08:25:43 +02:00
Kees Cook
6b12e0a3c3 rtnetlink: do_setlink: Use struct sockaddr_storage
Instead of a heap allocating a variably sized struct sockaddr and lying
about the type in the call to netif_set_mac_address(), use a stack
allocated struct sockaddr_storage. This lets us drop the cast and avoid
the allocation.

Putting "ss" on the stack means it will get a reused stack slot since
it is the same size (128B) as other existing single-scope stack variables,
like the vfinfo array (128B), so no additional stack space is used by
this function.

Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-7-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 08:25:43 +02:00
Kees Cook
9ca6804ab7 net: core: Convert dev_set_mac_address() to struct sockaddr_storage
All users of dev_set_mac_address() are now using a struct sockaddr_storage.
Convert the internal data type to struct sockaddr_storage, drop the casts,
and update pointer types.

Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-6-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 08:25:43 +02:00
Kees Cook
7da6117ea1 ieee802154: Use struct sockaddr_storage with dev_set_mac_address()
Switch to struct sockaddr_storage for calling dev_set_mac_address(). Add
a temporary cast to struct sockaddr, which will be removed in a
subsequent patch.

Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-4-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 08:25:43 +02:00
Kees Cook
db586cad6f net/ncsi: Use struct sockaddr_storage for pending_mac
To avoid future casting with coming API type changes, switch struct
ncsi_dev_priv::pending_mac to a full struct sockaddr_storage.

Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-3-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 08:25:42 +02:00
Kees Cook
161972650d net: core: Switch netif_set_mac_address() to struct sockaddr_storage
In order to avoid passing around struct sockaddr that has a size the
compiler cannot reason about (nor track at runtime), convert
netif_set_mac_address() to take struct sockaddr_storage. This is just a
cast conversion, so there is are no binary changes. Following patches
will make actual allocation changes.

Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-2-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 08:25:42 +02:00
Kees Cook
ed449ddbd8 net: core: Convert inet_addr_is_any() to sockaddr_storage
All the callers of inet_addr_is_any() have a sockaddr_storage-backed
sockaddr. Avoid casts and switch prototype to the actual object being
used.

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-1-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27 08:25:42 +02:00
Baris Can Goral
5bccdc51f9 replace strncpy with strscpy_pad
The strncpy() function is actively dangerous to use since it may not
NULL-terminate the destination string, resulting in potential memory
content exposures, unbounded reads, or crashes.
Link: https://github.com/KSPP/linux/issues/90

In addition, strscpy_pad is more appropriate because it also zero-fills
any remaining space in the destination if the source is shorter than
the provided buffer size.

Signed-off-by: Baris Can Goral <goralbaris@gmail.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20250521161036.14489-1-goralbaris@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26 22:28:44 +02:00
Linus Torvalds
c5bfc48d54 vfs-6.16-rc1.coredump
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
 oliqAQCVdrBn7D2+dB04hjefFq6W6LhyLGrtCCliflicN5SyxAD+PHHiB9nFKe6J
 xQkaNArCJjPd2QEx73aGjHzi3UQq6Qs=
 =Pk9c
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull coredump updates from Christian Brauner:
 "This adds support for sending coredumps over an AF_UNIX socket. It
  also makes (implicit) use of the new SO_PEERPIDFD ability to hand out
  pidfds for reaped peer tasks

  The new coredump socket will allow userspace to not have to rely on
  usermode helpers for processing coredumps and provides a saf way to
  handle them instead of relying on super privileged coredumping helpers

  This will also be significantly more lightweight since the kernel
  doens't have to do a fork()+exec() for each crashing process to spawn
  a usermodehelper. Instead the kernel just connects to the AF_UNIX
  socket and userspace can process it concurrently however it sees fit.
  Support for userspace is incoming starting with systemd-coredump

  There's more work coming in that direction next cycle. The rest below
  goes into some details and background

  Coredumping currently supports two modes:

   (1) Dumping directly into a file somewhere on the filesystem.

   (2) Dumping into a pipe connected to a usermode helper process
       spawned as a child of the system_unbound_wq or kthreadd

  For simplicity I'm mostly ignoring (1). There's probably still some
  users of (1) out there but processing coredumps in this way can be
  considered adventurous especially in the face of set*id binaries

  The most common option should be (2) by now. It works by allowing
  userspace to put a string into /proc/sys/kernel/core_pattern like:

          |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

  The "|" at the beginning indicates to the kernel that a pipe must be
  used. The path following the pipe indicator is a path to a binary that
  will be spawned as a usermode helper process. Any additional
  parameters pass information about the task that is generating the
  coredump to the binary that processes the coredump

  In the example the core_pattern shown causes the kernel to spawn
  systemd-coredump as a usermode helper. There's various conceptual
  consequences of this (non-exhaustive list):

   - systemd-coredump is spawned with file descriptor number 0 (stdin)
     connected to the read-end of the pipe. All other file descriptors
     are closed. That specifically includes 1 (stdout) and 2 (stderr).

     This has already caused bugs because userspace assumed that this
     cannot happen (Whether or not this is a sane assumption is
     irrelevant)

   - systemd-coredump will be spawned as a child of system_unbound_wq.
     So it is not a child of any userspace process and specifically not
     a child of PID 1. It cannot be waited upon and is in a weird hybrid
     upcall which are difficult for userspace to control correctly

   - systemd-coredump is spawned with full kernel privileges. This
     necessitates all kinds of weird privilege dropping excercises in
     userspace to make this safe

   - A new usermode helper has to be spawned for each crashing process

  This adds a new mode:

   (3) Dumping into an AF_UNIX socket

  Userspace can set /proc/sys/kernel/core_pattern to:

          @/path/to/coredump.socket

  The "@" at the beginning indicates to the kernel that an AF_UNIX
  coredump socket will be used to process coredumps

  The coredump socket must be located in the initial mount namespace.
  When a task coredumps it opens a client socket in the initial network
  namespace and connects to the coredump socket:

   - The coredump server uses SO_PEERPIDFD to get a stable handle on the
     connected crashing task. The retrieved pidfd will provide a stable
     reference even if the crashing task gets SIGKILLed while generating
     the coredump. That is a huge attack vector right now

   - By setting core_pipe_limit non-zero userspace can guarantee that
     the crashing task cannot be reaped behind it's back and thus
     process all necessary information in /proc/<pid>. The SO_PEERPIDFD
     can be used to detect whether /proc/<pid> still refers to the same
     process

     The core_pipe_limit isn't used to rate-limit connections to the
     socket. This can simply be done via AF_UNIX socket directly

   - The pidfd for the crashing task will contain information how the
     task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
     PIDFD_INFO_COREDUMP which can be used to retreive the coredump
     information

     If the coredump gets a new coredump client connection the kernel
     guarantees that PIDFD_INFO_COREDUMP information is available.

     Currently the following information is provided in the new
     @coredump_mask extension to struct pidfd_info:

      * PIDFD_COREDUMPED is raised if the task did actually coredump

      * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping
        (e.g., undumpable)

      * PIDFD_COREDUMP_USER is raised if this is a regular coredump and
        doesn't need special care by the coredump server

      * PIDFD_COREDUMP_ROOT is raised if the generated coredump should
        be treated as sensitive and the coredump server should restrict
        access to the generated coredump to sufficiently privileged
        users"

* tag 'vfs-6.16-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  mips, net: ensure that SOCK_COREDUMP is defined
  selftests/coredump: add tests for AF_UNIX coredumps
  selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
  coredump: validate socket name as it is written
  coredump: show supported coredump modes
  pidfs, coredump: add PIDFD_INFO_COREDUMP
  coredump: add coredump socket
  coredump: reflow dump helpers a little
  coredump: massage do_coredump()
  coredump: massage format_corename()
2025-05-26 11:17:01 -07:00
Linus Torvalds
7d7a103d29 vfs-6.16-rc1.pidfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
 ov4zAP4yfqKBAz6eMt9CzDgHCdVQJ9Nuur1EiRdot3maPzHTcQEA2hVkJrvVo1Y/
 jCVAf7gmGX1Uu6nCUF6Vjy35g8i20gA=
 =nzYS
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull pidfs updates from Christian Brauner:
 "Features:

   - Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
     socket option

     SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for
     the process that called connect() or listen(). This is heavily used
     to safely authenticate clients in userspace avoiding security bugs
     due to pid recycling races (dbus, polkit, systemd, etc.)

     SO_PEERPIDFD currently doesn't support handing out pidfds if the
     sk->sk_peer_pid thread-group leader has already been reaped. In
     this case it currently returns EINVAL. Userspace still wants to get
     a pidfd for a reaped process to have a stable handle it can pass
     on. This is especially useful now that it is possible to retrieve
     exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s
     PIDFD_INFO_EXIT flag

     Another summary has been provided by David Rheinsberg:

      > A pidfd can outlive the task it refers to, and thus user-space
      > must already be prepared that the task underlying a pidfd is
      > gone at the time they get their hands on the pidfd. For
      > instance, resolving the pidfd to a PID via the fdinfo must be
      > prepared to read `-1`.
      >
      > Despite user-space knowing that a pidfd might be stale, several
      > kernel APIs currently add another layer that checks for this. In
      > particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was
      > already reaped, but returns a stale pidfd if the task is reaped
      > immediately after the respective alive-check.
      >
      > This has the unfortunate effect that user-space now has two ways
      > to check for the exact same scenario: A syscall might return
      > EINVAL/ESRCH/... *or* the pidfd might be stale, even though
      > there is no particular reason to distinguish both cases. This
      > also propagates through user-space APIs, which pass on pidfds.
      > They must be prepared to pass on `-1` *or* the pidfd, because
      > there is no guaranteed way to get a stale pidfd from the kernel.
      >
      > Userspace must already deal with a pidfd referring to a reaped
      > task as the task may exit and get reaped at any time will there
      > are still many pidfds referring to it

     In order to allow handing out reaped pidfd SO_PEERPIDFD needs to
     ensure that PIDFD_INFO_EXIT information is available whenever a
     pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi
     promises that reaped pidfds are only handed out if it is guaranteed
     that the caller sees the exit information:

     TEST_F(pidfd_info, success_reaped)
     {
             struct pidfd_info info = {
                     .mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
             };

             /*
              * Process has already been reaped and PIDFD_INFO_EXIT been set.
              * Verify that we can retrieve the exit status of the process.
              */
             ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
             ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
             ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
             ASSERT_TRUE(WIFEXITED(info.exit_code));
             ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
     }

     To hand out pidfds for reaped processes we thus allocate a pidfs
     entry for the relevant sk->sk_peer_pid at the time the
     sk->sk_peer_pid is stashed and drop it when the socket is
     destroyed. This guarantees that exit information will always be
     recorded for the sk->sk_peer_pid task and we can hand out pidfds
     for reaped processes

   - Hand a pidfd to the coredump usermode helper process

     Give userspace a way to instruct the kernel to install a pidfd for
     the crashing process into the process started as a usermode helper.
     There's still tricky race-windows that cannot be easily or
     sometimes not closed at all by userspace. There's various ways like
     looking at the start time of a process to make sure that the
     usermode helper process is started after the crashing process but
     it's all very very brittle and fraught with peril

     The crashed-but-not-reaped process can be killed by userspace
     before coredump processing programs like systemd-coredump have had
     time to manually open a PIDFD from the PID the kernel provides
     them, which means they can be tricked into reading from an
     arbitrary process, and they run with full privileges as they are
     usermode helper processes

     Even if that specific race-window wouldn't exist it's still the
     safest and cleanest way to let the kernel provide the pidfd
     directly instead of requiring userspace to do it manually. In
     parallel with this commit we already have systemd adding support
     for this in [1]

     When the usermode helper process is forked we install a pidfd file
     descriptor three into the usermode helper's file descriptor table
     so it's available to the exec'd program

     Since usermode helpers are either children of the system_unbound_wq
     workqueue or kthreadd we know that the file descriptor table is
     empty and can thus always use three as the file descriptor number

     Note, that we'll install a pidfd for the thread-group leader even
     if a subthread is calling do_coredump(). We know that task linkage
     hasn't been removed yet and even if this @current isn't the actual
     thread-group leader we know that the thread-group leader cannot be
     reaped until
     @current has exited

   - Allow telling when a task has not been found from finding the wrong
     task when creating a pidfd

     We currently report EINVAL whenever a struct pid has no tasked
     attached anymore thereby conflating two concepts:

      (1) The task has already been reaped

      (2) The caller requested a pidfd for a thread-group leader but the
          pid actually references a struct pid that isn't used as a
          thread-group leader

     This is causing issues for non-threaded workloads as in where they
     expect ESRCH to be reported, not EINVAL

     So allow userspace to reliably distinguish between (1) and (2)

   - Make it possible to detect when a pidfs entry would outlive the
     struct pid it pinned

   - Add a range of new selftests

  Cleanups:

   - Remove unneeded NULL check from pidfd_prepare() for passed struct
     pid

   - Avoid pointless reference count bump during release_task()

  Fixes:

   - Various fixes to the pidfd and coredump selftests

   - Fix error handling for replace_fd() when spawning coredump usermode
     helper"

* tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  pidfs: detect refcount bugs
  coredump: hand a pidfd to the usermode coredump helper
  coredump: fix error handling for replace_fd()
  pidfs: move O_RDWR into pidfs_alloc_file()
  selftests: coredump: Raise timeout to 2 minutes
  selftests: coredump: Fix test failure for slow machines
  selftests: coredump: Properly initialize pointer
  net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
  pidfs: get rid of __pidfd_prepare()
  net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
  pidfs: register pid in pidfs
  net, pidfd: report EINVAL for ESRCH
  release_task: kill the no longer needed get/put_pid(thread_pid)
  pidfs: ensure consistent ENOENT/ESRCH reporting
  exit: move wake_up_all() pidfd waiters into __unhash_process()
  selftest/pidfd: add test for thread-group leader pidfd open for thread
  pidfd: improve uapi when task isn't found
  pidfd: remove unneeded NULL check from pidfd_prepare()
  selftests/pidfd: adapt to recent changes
2025-05-26 10:30:02 -07:00
Stefano Garzarella
45ca7e9f07 vsock/virtio: fix rx_bytes accounting for stream sockets
In `struct virtio_vsock_sock`, we maintain two counters:
- `rx_bytes`: used internally to track how many bytes have been read.
  This supports mechanisms like .stream_has_data() and sock_rcvlowat().
- `fwd_cnt`: used for the credit mechanism to inform available receive
  buffer space to the remote peer.

These counters are updated via virtio_transport_inc_rx_pkt() and
virtio_transport_dec_rx_pkt().

Since the beginning with commit 06a8fc7836 ("VSOCK: Introduce
virtio_vsock_common.ko"), we call virtio_transport_dec_rx_pkt() in
virtio_transport_stream_do_dequeue() only when we consume the entire
packet, so partial reads, do not update `rx_bytes` and `fwd_cnt`.

This is fine for `fwd_cnt`, because we still have space used for the
entire packet, and we don't want to update the credit for the other
peer until we free the space of the entire packet. However, this
causes `rx_bytes` to be stale on partial reads.

Previously, this didn’t cause issues because `rx_bytes` was used only by
.stream_has_data(), and any unread portion of a packet implied data was
still available. However, since commit 93b8088766
("virtio/vsock: fix logic which reduces credit update messages"), we now
rely on `rx_bytes` to determine if a credit update should be sent when
the data in the RX queue drops below SO_RCVLOWAT value.

This patch fixes the accounting by updating `rx_bytes` with the number
of bytes actually read, even on partial reads, while leaving `fwd_cnt`
untouched until the packet is fully consumed. Also introduce a new
`buf_used` counter to check that the remote peer is honoring the given
credit; this was previously done via `rx_bytes`.

Fixes: 93b8088766 ("virtio/vsock: fix logic which reduces credit update messages")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250521121705.196379-1-sgarzare@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26 19:04:21 +02:00
Paolo Abeni
f5b60d6a57 netfilter pull request 25-05-23
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmgwd00ACgkQ1w0aZmrP
 KyEfwA//RXQ3i8PCa7lKHxDRhVzG3rEvgXRmiXeNd+JjzsCnybBb7+wRf3dtBGWT
 +1s44Utx1JqosWxCVBulqYC5bqSC66789l5X2jhYJmUZxRrbcsqPngwnIrjb/XeK
 ZJM62wiRhkBQED7yZLGy+y4VHQiG8CEMt16AOQHk863aruWv1tT7up90CTtzA545
 4GF/grU3FC0PsoTLwzWyvqsWK+9uk3Y4Tifp5hU3w6uRD9EjX5tHCZlXXSqOF5gu
 KT26OYsePYXhJVZIwDf2oVLGi0EVTPB9IFxZSNgLqyXqu2ILAb9OwRNVTNfTP7Pg
 1RWJWmgqvRNs9OM2ecifYgQf/AfvCL0Cja1BJOjmvtICuGegrYH7G5YYQsMl9CoE
 7jBoTzpToSASat5+dwoz81Bvzh447dYxRE2VmbxmRTTWToQYS1KGBPc9e3u/n5Rr
 ruh8tRZ3/R0Fy+YLDkrJst3grh5RLITbuyu4ElJMArPU50mLTVYxKd6nA3BqwB5G
 1GmLfCzvQH3e6PKz6CNke1AytVDy/wLTXtcbLnze2Muaj4AqhtOe5Q8ypnOO0Vyk
 PsJ6U3rm2asd3GE9+AIx8gZBv8yCu1w9CiwLK8ybT2NETb2dEnqPgWeDyT7rpcaD
 sQOPsBE1q/TEp9gofbYCHBm5E2mX9UP7Q6EHCTekrI97xLq8Q2M=
 =fBhd
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next,
specifically 26 patches: 5 patches adding/updating selftests,
4 fixes, 3 PREEMPT_RT fixes, and 14 patches to enhance nf_tables):

1) Improve selftest coverage for pipapo 4 bit group format, from
   Florian Westphal.

2) Fix incorrect dependencies when compiling a kernel without
   legacy ip{6}tables support, also from Florian.

3) Two patches to fix nft_fib vrf issues, including selftest updates
   to improve coverage, also from Florian Westphal.

4) Fix incorrect nesting in nft_tunnel's GENEVE support, from
   Fernando F. Mancera.

5) Three patches to fix PREEMPT_RT issues with nf_dup infrastructure
   and nft_inner to match in inner headers, from Sebastian Andrzej Siewior.

6) Integrate conntrack information into nft trace infrastructure,
   from Florian Westphal.

7) A series of 13 patches to allow to specify wildcard netdevice in
   netdev basechain and flowtables, eg.

   table netdev filter {
       chain ingress {
           type filter hook ingress devices = { eth0, eth1, vlan* } priority 0; policy accept;
       }
   }

   This also allows for runtime hook registration on NETDEV_{UN}REGISTER
   event, from Phil Sutter.

netfilter pull request 25-05-23

* tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: (26 commits)
  selftests: netfilter: Torture nftables netdev hooks
  netfilter: nf_tables: Add notifications for hook changes
  netfilter: nf_tables: Support wildcard netdev hook specs
  netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc()
  netfilter: nf_tables: Handle NETDEV_CHANGENAME events
  netfilter: nf_tables: Wrap netdev notifiers
  netfilter: nf_tables: Respect NETDEV_REGISTER events
  netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events
  netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook
  netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook()
  netfilter: nf_tables: Introduce nft_register_flowtable_ops()
  netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}()
  netfilter: nf_tables: Introduce functions freeing nft_hook objects
  netfilter: nf_tables: add packets conntrack state to debug trace info
  netfilter: conntrack: make nf_conntrack_id callable without a module dependency
  netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit
  netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx
  netfilter: nf_dup{4, 6}: Move duplication check to task_struct
  netfilter: nft_tunnel: fix geneve_opt dump
  selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs
  ...
====================

Link: https://patch.msgid.link/20250523132712.458507-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26 18:53:41 +02:00
Paolo Abeni
fdb061195f ipsec-next-2025-05-23
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmgwJa4ACgkQrB3Eaf9P
 W7d34A//V3NukN6UNAUKd+MbH80eXCEbNSNIuVUstfr0S71qTCxovLX58u+oQztb
 43mx/NsnF38TzNFWVyVzF4vcr/n0DS/Da3P5pJEjoewIYSDrz/WfOum6VpVIUsZ/
 kLCDZlIoX/fBPFZDPHMmsDXDemAdrtr8CuK72NUH10vKDuGKSUG0NElqDieDBEsA
 y/fqgBsyxQXi9cMdRxf+DLDK/hzqyaJmVj8B8WEcFtYXJ4RE6+jfLgAaTE6J7V5W
 fYACTu/IcdtgEEm2U7wlow66oIjqqGReuWUzV9zHGJNCB9+da6L4dbGtzlRmOPdn
 kI1PIALFWT2HbKnJOJJbaThO6zES1rMOm3PsWt7iVewCT8HuhAa9kDV0xzdcLQE1
 +REfo8dXW9f5hRUrSuqpJFUArkCHWHLhQEcmTHaF0b2RveC/hd9rOyKIfae+fgIP
 5uLU2DpwafDgw5UCjsQTLyQ5M6icO8wFgM7vKAUJWyI1Pck1ktf7Ic6+KQRNjWiv
 Q7ImwpSdLH2bZpIbIKDnIcyZg3CMBIQ88cdsYi0+ckgDQ0hMf6ZrXRseXKRe0P/M
 gKgBOoXIJBF7niJQTDqHjsmnYGvvhZysIJNQLf4BZFYOeF5L9OduP6ywqMe5pFKt
 QAsJSZw/+SibheLEYQAzvyLD6VdMXaxeOAHlPylRRpl9vEX0l04=
 =GRVJ
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
1) Remove some unnecessary strscpy_pad() size arguments.
   From Thorsten Blum.

2) Correct use of xso.real_dev on bonding offloads.
   Patchset from Cosmin Ratiu.

3) Add hardware offload configuration to XFRM_MSG_MIGRATE.
   From Chiachang Wang.

4) Refactor migration setup during cloning. This was
   done after the clone was created. Now it is done
   in the cloning function itself.
   From Chiachang Wang.

5) Validate assignment of maximal possible SEQ number.
   Prevent from setting to the maximum sequrnce number
   as this would cause for traffic drop.
   From Leon Romanovsky.

6) Prevent configuration of interface index when offload
   is used. Hardware can't handle this case.i
   From Leon Romanovsky.

7) Always use kfree_sensitive() for SA secret zeroization.
   From Zilin Guan.

ipsec-next-2025-05-23

* tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  xfrm: use kfree_sensitive() for SA secret zeroization
  xfrm: prevent configuration of interface index when offload is used
  xfrm: validate assignment of maximal possible SEQ number
  xfrm: Refactor migration setup during the cloning process
  xfrm: Migrate offload configuration
  bonding: Fix multiple long standing offload races
  bonding: Mark active offloaded xfrm_states
  xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}
  xfrm: Remove unneeded device check from validate_xmit_xfrm
  xfrm: Use xdo.dev instead of xdo.real_dev
  net/mlx5: Avoid using xso.real_dev unnecessarily
  xfrm: Remove unnecessary strscpy_pad() size arguments
====================

Link: https://patch.msgid.link/20250523075611.3723340-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26 18:32:48 +02:00
Jeremy Kerr
0a9b2c9fd1 net: mctp: use nlmsg_payload() for netlink message data extraction
Jakub suggests:

> I have a different request :) Matt, once this ends up in net-next
> (end of this week) could you refactor it to use nlmsg_payload() ?
> It doesn't exist in net but this is exactly why it was added.

This refactors the additions to both mctp_dump_addrinfo(), and
mctp_rtm_getneigh() - two cases where we're calling nlh_data() on an
an incoming netlink message, without a prior nlmsg_parse().

For the neigh.c case, we cannot hit the failure where the nlh does not
contain a full ndmsg at present, as the core handler
(net/core/neighbour.c, neigh_get()) has already validated the size
through neigh_valid_req_get(), and would have failed the get operation
before the MCTP hander is called.

However, relying on that is a bit fragile, so apply the nlmsg_payload
refector here too.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20250521-mctp-nlmsg-payload-v2-1-e85df160c405@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26 17:38:27 +02:00
Linus Torvalds
6d5b940e1e vfs-6.16-rc1.async.dir
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBN6wAKCRCRxhvAZXjc
 ok32AQD9DTiSCAoVg+7s+gSBuLTi8drPTN++mCaxdTqRh5WpRAD9GVyrGQT0s6LH
 eo9bm8d1TAYjilEWM0c0K0TxyQ7KcAA=
 =IW7H
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs directory lookup updates from Christian Brauner:
 "This contains cleanups for the lookup_one*() family of helpers.

  We expose a set of functions with names containing "lookup_one_len"
  and others without the "_len". This difference has nothing to do with
  "len". It's rater a historical accident that can be confusing.

  The functions without "_len" take a "mnt_idmap" pointer. This is found
  in the "vfsmount" and that is an important question when choosing
  which to use: do you have a vfsmount, or are you "inside" the
  filesystem. A related question is "is permission checking relevant
  here?".

  nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len
  functions. They pass nop_mnt_idmap and refuse to work on filesystems
  which have any other idmap.

  This work changes nfsd and cachefile to use the lookup_one family of
  functions and to explictily pass &nop_mnt_idmap which is consistent
  with all other vfs interfaces used where &nop_mnt_idmap is explicitly
  passed.

  The remaining uses of the "_one" functions do not require permission
  checks so these are renamed to be "_noperm" and the permission
  checking is removed.

  This series also changes these lookup function to take a qstr instead
  of separate name and len. In many cases this simplifies the call"

* tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  VFS: change lookup_one_common and lookup_noperm_common to take a qstr
  Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS
  VFS: rename lookup_one_len family to lookup_noperm and remove permission check
  cachefiles: Use lookup_one() rather than lookup_one_len()
  nfsd: Use lookup_one() rather than lookup_one_len()
  VFS: improve interface for lookup_one functions
2025-05-26 08:02:43 -07:00
Qiu Yutan
e45b7196df net: neigh: use kfree_skb_reason() in neigh_resolve_output() and neigh_connected_output()
Replace kfree_skb() used in neigh_resolve_output() and
neigh_connected_output() with kfree_skb_reason().

Following new skb drop reason is added:
/* failed to fill the device hard header */
SKB_DROP_REASON_NEIGH_HH_FILLFAIL

Signed-off-by: Qiu Yutan <qiu.yutan@zte.com.cn>
Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Xu Xin <xu.xin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-26 10:03:13 +01:00
Stanislav Fomichev
384492c48e net: devmem: support single IOV with sendmsg
sendmsg() with a single iov becomes ITER_UBUF, sendmsg() with multiple
iovs becomes ITER_IOVEC. iter_iov_len does not return correct
value for UBUF, so teach to treat UBUF differently.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Mina Almasry <almasrymina@google.com>
Fixes: bd61848900 ("net: devmem: Implement TX path")
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Acked-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-26 10:00:48 +01:00
Phil Sutter
465b9ee0ee netfilter: nf_tables: Add notifications for hook changes
Notify user space if netdev hooks are updated due to netdev add/remove
events. Send minimal notification messages by introducing
NFT_MSG_NEWDEV/DELDEV message types describing a single device only.

Upon NETDEV_CHANGENAME, the callback has no information about the
interface's old name. To provide a clear message to user space, include
the hook's stored interface name in the notification.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:14 +02:00
Phil Sutter
6d07a28950 netfilter: nf_tables: Support wildcard netdev hook specs
User space may pass non-nul-terminated NFTA_DEVICE_NAME attribute values
to indicate a suffix wildcard.
Expect for multiple devices to match the given prefix in
nft_netdev_hook_alloc() and populate 'ops_list' with them all.
When checking for duplicate hooks, compare the shortest prefix so a
device may never match more than a single hook spec.
Finally respect the stored prefix length when hooking into new devices
from event handlers.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:14 +02:00
Phil Sutter
6f670935b4 netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc()
No point in having err_hook_alloc, just call return directly. Also
rename err_hook_dev - it's not about the hook's device but freeing the
hook itself.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:13 +02:00
Phil Sutter
7b4856493d netfilter: nf_tables: Handle NETDEV_CHANGENAME events
For the sake of simplicity, treat them like consecutive NETDEV_REGISTER
and NETDEV_UNREGISTER events. If the new name matches a hook spec and
registration fails, escalate the error and keep things as they are.

To avoid unregistering the newly registered hook again during the
following fake NETDEV_UNREGISTER event, leave hooks alone if their
interface spec matches the new name.

Note how this patch also skips for NETDEV_REGISTER if the device is
already registered. This is not yet possible as the new name would have
to match the old one. This will change with wildcard interface specs,
though.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:13 +02:00
Phil Sutter
9669c1105b netfilter: nf_tables: Wrap netdev notifiers
Handling NETDEV_CHANGENAME events has to traverse all chains/flowtables
twice, prepare for this. No functional change intended.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:13 +02:00
Phil Sutter
a331b78a55 netfilter: nf_tables: Respect NETDEV_REGISTER events
Hook into new devices if their name matches the hook spec.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:13 +02:00
Phil Sutter
104031ac89 netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events
Put NETDEV_UNREGISTER handling code into a switch, no functional change
intended as the function is only called for that event yet.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:13 +02:00
Phil Sutter
73319a8ee1 netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook
Supporting a 1:n relationship between nft_hook and nf_hook_ops is
convenient since a chain's or flowtable's nft_hooks may remain in place
despite matching interfaces disappearing. This stabilizes ruleset dumps
in that regard and opens the possibility to claim newly added interfaces
which match the spec. Also it prepares for wildcard interface specs
since these will potentially match multiple interfaces.

All spots dealing with hook registration are updated to handle a list of
multiple nf_hook_ops, but nft_netdev_hook_alloc() only adds a single
item for now to retain the old behaviour. The only expected functional
change here is how vanishing interfaces are handled: Instead of dropping
the respective nft_hook, only the matching nf_hook_ops are dropped.

To safely remove individual ops from the list in netdev handlers, an
rcu_head is added to struct nf_hook_ops so kfree_rcu() may be used.
There is at least nft_flowtable_find_dev() which may be iterating
through the list at the same time.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:13 +02:00
Phil Sutter
91a089d056 netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook()
The function accesses only the hook's ops field, pass it directly. This
prepares for nft_hooks holding a list of nf_hook_ops in future.

While at it, make use of the function in
__nft_unregister_flowtable_net_hooks() as well.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:13 +02:00
Phil Sutter
21aa0a03eb netfilter: nf_tables: Introduce nft_register_flowtable_ops()
Facilitate binding and registering of a flowtable hook via a single
function call.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:13 +02:00
Phil Sutter
e225376d78 netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}()
Also a pretty dull wrapper around the hook->ops.dev comparison for now.
Will search the embedded nf_hook_ops list in future. The ugly cast to
eliminate the const qualifier will vanish then, too.

Since this future list will be RCU-protected, also introduce an _rcu()
variant here.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Phil Sutter
75e20bcdce netfilter: nf_tables: Introduce functions freeing nft_hook objects
Pointless wrappers around kfree() for now, prep work for an embedded
list of nf_hook_ops.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Florian Westphal
7e5c6aa67e netfilter: nf_tables: add packets conntrack state to debug trace info
Add the minimal relevant info needed for userspace ("nftables monitor
trace") to provide the conntrack view of the packet:

- state (new, related, established)
- direction (original, reply)
- status (e.g., if connection is subject to dnat)
- id (allows to query ctnetlink for remaining conntrack state info)

Example:
trace id a62 inet filter PRE_RAW packet: iif "enp0s3" ether [..]
  [..]
trace id a62 inet filter PRE_MANGLE conntrack: ct direction original ct state new ct id 32
trace id a62 inet filter PRE_MANGLE packet: [..]
 [..]
trace id a62 inet filter IN conntrack: ct direction original ct state new ct status dnat-done ct id 32
 [..]

In this case one can see that while NAT is active, the new connection
isn't subject to a translation.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Florian Westphal
90869f43d0 netfilter: conntrack: make nf_conntrack_id callable without a module dependency
While nf_conntrack_id() doesn't need any functionaliy from conntrack, it
does reside in nf_conntrack_core.c -- callers add a module
dependency on conntrack.

Followup patch will need to compute the conntrack id from nf_tables_trace.c
to include it in nf_trace messages emitted to userspace via netlink.

I don't want to introduce a module dependency between nf_tables and
conntrack for this.

Since trace is slowpath, the added indirection is ok.

One alternative is to move nf_conntrack_id to the netfilter/core.c,
but I don't see a compelling reason so far.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Sebastian Andrzej Siewior
f37ad91270 netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit
nf_dup_skb_recursion is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Move nf_dup_skb_recursion to struct netdev_xmit, provide wrappers.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Sebastian Andrzej Siewior
ba36fada9a netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx
nft_pcpu_tun_ctx is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Make a struct with a nft_inner_tun_ctx member (original
nft_pcpu_tun_ctx) and a local_lock_t and use local_lock_nested_bh() for
locking. This change adds only lockdep coverage and does not alter the
functional behaviour for !PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Sebastian Andrzej Siewior
a1f1acb9c5 netfilter: nf_dup{4, 6}: Move duplication check to task_struct
nf_skb_duplicated is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Due to the recursion involved, the simplest change is to make it a
per-task variable.

Move the per-CPU variable nf_skb_duplicated to task_struct and name it
in_nf_duplicate. Add it to the existing bitfield so it doesn't use
additional memory.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Fernando Fernandez Mancera
22a9613de4 netfilter: nft_tunnel: fix geneve_opt dump
When dumping a nft_tunnel with more than one geneve_opt configured the
netlink attribute hierarchy should be as follow:

 NFTA_TUNNEL_KEY_OPTS
 |
 |--NFTA_TUNNEL_KEY_OPTS_GENEVE
 |  |
 |  |--NFTA_TUNNEL_KEY_GENEVE_CLASS
 |  |--NFTA_TUNNEL_KEY_GENEVE_TYPE
 |  |--NFTA_TUNNEL_KEY_GENEVE_DATA
 |
 |--NFTA_TUNNEL_KEY_OPTS_GENEVE
 |  |
 |  |--NFTA_TUNNEL_KEY_GENEVE_CLASS
 |  |--NFTA_TUNNEL_KEY_GENEVE_TYPE
 |  |--NFTA_TUNNEL_KEY_GENEVE_DATA
 |
 |--NFTA_TUNNEL_KEY_OPTS_GENEVE
 ...

Otherwise, userspace tools won't be able to fetch the geneve options
configured correctly.

Fixes: 925d844696 ("netfilter: nft_tunnel: add support for geneve opts")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:12 +02:00
Florian Westphal
9a119669fb netfilter: nf_tables: nft_fib: consistent l3mdev handling
fib has two modes:
1. Obtain output device according to source or destination address
2. Obtain the type of the address, e.g. local, unicast, multicast.

'fib daddr type' should return 'local' if the address is configured
in this netns or unicast otherwise.

'fib daddr . iif type' should return 'local' if the address is configured
on the input interface or unicast otherwise, i.e. more restrictive.

However, if the interface is part of a VRF, then 'fib daddr type'
returns unicast even if the address is configured on the incoming
interface.

This is broken for both ipv4 and ipv6.

In the ipv4 case, inet_dev_addr_type must only be used if the
'iif' or 'oif' (strict mode) was requested.

Else inet_addr_type_dev_table() needs to be used and the correct
dev argument must be passed as well so the correct fib (vrf) table
is used.

In the ipv6 case, the bug is similar, without strict mode, dev is NULL
so .flowi6_l3mdev will be set to 0.

Add a new 'nft_fib_l3mdev_master_ifindex_rcu()' helper and use that
to init the .l3mdev structure member.

For ipv6, use it from nft_fib6_flowi_init() which gets called from
both the 'type' and the 'route' mode eval functions.

This provides consistent behaviour for all modes for both ipv4 and ipv6:
If strict matching is requested, the input respectively output device
of the netfilter hooks is used.

Otherwise, use skb->dev to obtain the l3mdev ifindex.

Without this, most type checks in updated nft_fib.sh selftest fail:

  FAIL: did not find veth0 . 10.9.9.1 . local in fibtype4
  FAIL: did not find veth0 . dead:1::1 . local in fibtype6
  FAIL: did not find veth0 . dead:9::1 . local in fibtype6
  FAIL: did not find tvrf . 10.0.1.1 . local in fibtype4
  FAIL: did not find tvrf . 10.9.9.1 . local in fibtype4
  FAIL: did not find tvrf . dead:1::1 . local in fibtype6
  FAIL: did not find tvrf . dead:9::1 . local in fibtype6
  FAIL: fib expression address types match (iif in vrf)

(fib errounously returns 'unicast' for all of them, even
 though all of these addresses are local to the vrf).

Fixes: f6d0cbcf09 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23 13:57:09 +02:00
Kuniyuki Iwashima
77cbe1a6d8 af_unix: Introduce SO_PASSRIGHTS.
As long as recvmsg() or recvmmsg() is used with cmsg, it is not
possible to avoid receiving file descriptors via SCM_RIGHTS.

This behaviour has occasionally been flagged as problematic, as
it can be (ab)used to trigger DoS during close(), for example, by
passing a FUSE-controlled fd or a hung NFS fd.

For instance, as noted on the uAPI Group page [0], an untrusted peer
could send a file descriptor pointing to a hung NFS mount and then
close it.  Once the receiver calls recvmsg() with msg_control, the
descriptor is automatically installed, and then the responsibility
for the final close() now falls on the receiver, which may result
in blocking the process for a long time.

Regarding this, systemd calls cmsg_close_all() [1] after each
recvmsg() to close() unwanted file descriptors sent via SCM_RIGHTS.

However, this cannot work around the issue at all, because the final
fput() may still occur on the receiver's side once sendmsg() with
SCM_RIGHTS succeeds.  Also, even filtering by LSM at recvmsg() does
not work for the same reason.

Thus, we need a better way to refuse SCM_RIGHTS at sendmsg().

Let's introduce SO_PASSRIGHTS to disable SCM_RIGHTS.

Note that this option is enabled by default for backward
compatibility.

Link: https://uapi-group.org/kernel-features/#disabling-reception-of-scm_rights-for-af_unix-sockets #[0]
Link: https://github.com/systemd/systemd/blob/v257.5/src/basic/fd-util.c#L612-L628 #[1]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima
3f84d577b7 af_unix: Inherit sk_flags at connect().
For SOCK_STREAM embryo sockets, the SO_PASS{CRED,PIDFD,SEC} options
are inherited from the parent listen()ing socket.

Currently, this inheritance happens at accept(), because these
attributes were stored in sk->sk_socket->flags and the struct socket
is not allocated until accept().

This leads to unintentional behaviour.

When a peer sends data to an embryo socket in the accept() queue,
unix_maybe_add_creds() embeds credentials into the skb, even if
neither the peer nor the listener has enabled these options.

If the option is enabled, the embryo socket receives the ancillary
data after accept().  If not, the data is silently discarded.

This conservative approach works for SO_PASS{CRED,PIDFD,SEC}, but
would not for SO_PASSRIGHTS; once an SCM_RIGHTS with a hung file
descriptor was sent, it'd be game over.

To avoid this, we will need to preserve SOCK_PASSRIGHTS even on embryo
sockets.

Commit aed6ecef55 ("af_unix: Save listener for embryo socket.")
made it possible to access the parent's flags in sendmsg() via
unix_sk(other)->listener->sk->sk_socket->flags, but this introduces
an unnecessary condition that is irrelevant for most sockets,
accept()ed sockets and clients.

Therefore, we moved SOCK_PASSXXX into struct sock.

Let’s inherit sk->sk_scm_recv_flags at connect() to avoid receiving
SCM_RIGHTS on embryo sockets created from a parent with SO_PASSRIGHTS=0.

Note that the parent socket is locked in connect() so we don't need
READ_ONCE() for sk_scm_recv_flags.

Now, we can remove !other->sk_socket check in unix_maybe_add_creds()
to avoid slow SOCK_PASS{CRED,PIDFD} handling for embryo sockets
created from a parent with SO_PASS{CRED,PIDFD}=0.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima
0e81cfd971 af_unix: Move SOCK_PASS{CRED,PIDFD,SEC} to struct sock.
As explained in the next patch, SO_PASSRIGHTS would have a problem
if we assigned a corresponding bit to socket->flags, so it must be
managed in struct sock.

Mixing socket->flags and sk->sk_flags for similar options will look
confusing, and sk->sk_flags does not have enough space on 32bit system.

Also, as mentioned in commit 16e5726269 ("af_unix: dont send
SCM_CREDENTIALS by default"), SOCK_PASSCRED and SOCK_PASSPID handling
is known to be slow, and managing the flags in struct socket cannot
avoid that for embryo sockets.

Let's move SOCK_PASS{CRED,PIDFD,SEC} to struct sock.

While at it, other SOCK_XXX flags in net.h are grouped as enum.

Note that assign_bit() was atomic, so the writer side is moved down
after lock_sock() in setsockopt(), but the bit is only read once
in sendmsg() and recvmsg(), so lock_sock() is not needed there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima
7d8d93fdde net: Restrict SO_PASS{CRED,PIDFD,SEC} to AF_{UNIX,NETLINK,BLUETOOTH}.
SCM_CREDENTIALS and SCM_SECURITY can be recv()ed by calling
scm_recv() or scm_recv_unix(), and SCM_PIDFD is only used by
scm_recv_unix().

scm_recv() is called from AF_NETLINK and AF_BLUETOOTH.

scm_recv_unix() is literally called from AF_UNIX.

Let's restrict SO_PASSCRED and SO_PASSSEC to such sockets and
SO_PASSPIDFD to AF_UNIX only.

Later, SOCK_PASS{CRED,PIDFD,SEC} will be moved to struct sock
and united with another field.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima
ae4f2f59e1 tcp: Restrict SO_TXREHASH to TCP socket.
sk->sk_txrehash is only used for TCP.

Let's restrict SO_TXREHASH to TCP to reflect this.

Later, we will make sk_txrehash a part of the union for other
protocol families.

Note that we need to modify BPF selftest not to get/set
SO_TEREHASH for non-TCP sockets.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima
38b95d588f scm: Move scm_recv() from scm.h to scm.c.
scm_recv() has been placed in scm.h since the pre-git era for no
particular reason (I think), which makes the file really fragile.

For example, when you move SOCK_PASSCRED from include/linux/net.h to
enum sock_flags in include/net/sock.h, you will see weird build failure
due to terrible dependency.

To avoid the build failure in the future, let's move scm_recv(_unix())?
and its callees to scm.c.

Note that only scm_recv() needs to be exported for Bluetooth.

scm_send() should be moved to scm.c too, but I'll revisit later.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima
3041bbbeb4 af_unix: Don't pass struct socket to maybe_add_creds().
We will move SOCK_PASS{CRED,PIDFD,SEC} from struct socket.flags
to struct sock for better handling with SOCK_PASSRIGHTS.

Then, we don't need to access struct socket in maybe_add_creds().

Let's pass struct sock to maybe_add_creds() and its caller
queue_oob().

While at it, we append the unix_ prefix and fix double spaces
around the pid assignment.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 10:24:18 +01:00
Kuniyuki Iwashima
350d454629 af_unix: Factorise test_bit() for SOCK_PASSCRED and SOCK_PASSPIDFD.
Currently, the same checks for SOCK_PASSCRED and SOCK_PASSPIDFD
are scattered across many places.

Let's centralise the bit tests to make the following changes cleaner.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23 10:24:18 +01:00
Jiayuan Chen
8259eb0e06 bpf, sockmap: Avoid using sk_socket after free when sending
The sk->sk_socket is not locked or referenced in backlog thread, and
during the call to skb_send_sock(), there is a race condition with
the release of sk_socket. All types of sockets(tcp/udp/unix/vsock)
will be affected.

Race conditions:
'''
CPU0                               CPU1

backlog::skb_send_sock
  sendmsg_unlocked
    sock_sendmsg
      sock_sendmsg_nosec
                                   close(fd):
                                     ...
                                     ops->release() -> sock_map_close()
                                     sk_socket->ops = NULL
                                     free(socket)
      sock->ops->sendmsg
            ^
            panic here
'''

The ref of psock become 0 after sock_map_close() executed.
'''
void sock_map_close()
{
    ...
    if (likely(psock)) {
    ...
    // !! here we remove psock and the ref of psock become 0
    sock_map_remove_links(sk, psock)
    psock = sk_psock_get(sk);
    if (unlikely(!psock))
        goto no_psock; <=== Control jumps here via goto
        ...
        cancel_delayed_work_sync(&psock->work); <=== not executed
        sk_psock_put(sk, psock);
        ...
}
'''

Based on the fact that we already wait for the workqueue to finish in
sock_map_close() if psock is held, we simply increase the psock
reference count to avoid race conditions.

With this patch, if the backlog thread is running, sock_map_close() will
wait for the backlog thread to complete and cancel all pending work.

If no backlog running, any pending work that hasn't started by then will
fail when invoked by sk_psock_get(), as the psock reference count have
been zeroed, and sk_psock_drop() will cancel all jobs via
cancel_delayed_work_sync().

In summary, we require synchronization to coordinate the backlog thread
and close() thread.

The panic I catched:
'''
Workqueue: events sk_psock_backlog
RIP: 0010:sock_sendmsg+0x21d/0x440
RAX: 0000000000000000 RBX: ffffc9000521fad8 RCX: 0000000000000001
...
Call Trace:
 <TASK>
 ? die_addr+0x40/0xa0
 ? exc_general_protection+0x14c/0x230
 ? asm_exc_general_protection+0x26/0x30
 ? sock_sendmsg+0x21d/0x440
 ? sock_sendmsg+0x3e0/0x440
 ? __pfx_sock_sendmsg+0x10/0x10
 __skb_send_sock+0x543/0xb70
 sk_psock_backlog+0x247/0xb80
...
'''

Fixes: 4b4647add7 ("sock_map: avoid race between sock_map_close and sk_psock_put")
Reported-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20250516141713.291150-1-jiayuan.chen@linux.dev
2025-05-22 16:16:37 -07:00
Jakub Kicinski
ea15e04626 Lots of new things, notably:
* ath12k: monitor mode for WCN7850, better 6 GHz regulatory
  * brcmfmac: SAE for some Cypress devices
  * iwlwifi: rework device configuration
  * mac80211: scan improvements with MLO
  * mt76: EHT improvements, new device IDs
  * rtw88: throughput improvements
  * rtw89: MLO, STA/P2P concurrency improvements, SAR
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmgvVpAACgkQ10qiO8sP
 aADeJg//dShJQPKKUw7s4qY9y0lr1kimFw7cKE1vhHAq0eyQE8VP/05sj7XkeLdO
 2MDFCmWmTRZW1Av925xhicEhdiggxdOaT3n3RQ82y+Vjx7+6BsqqRE0YVRmK28vM
 MhUQocSzbd+Gh75wd4ti8G8dDPRJ9sbLTlZhIqPXMth2Ljl9EklMNzOlhzfo8N8+
 TgZ8oJx0EZ2n+sObtI5US27rNiPzLCtAM10Nl03F5yxfSk7gh3UpLHFhmu7384Nx
 56qqMwsmHHQSaRudg1ls8p30ztwve8/zHkOM6UeVksbb7CS2GHoPoVFtJUWBYmn9
 Ckd/XNItniRmIbsABgOyybawJV7EKZAWHclffeICQc526VMZWxeD9xukxQZSykiu
 3YXbHbPUkaCi3MlC3arc8SNQpW2l/BQvrC0SHqds4r/h/j4yUbA0wLs0OwqNXXwh
 NFoXnPTlkhMjNcX0W0t/A+EzXt/EsGKjBasiWC/tVZG9gHpMWKO3G8kLKmDyN/i9
 NsUh7E7zJTBjYjS2Bhm4xGmSy3DdgKSkBV2d4qCG/0LoKAW2eIRw97DGyn0pfVlA
 BAmio94xDJ3AM565WeySmWi/ZqvuDgQy2rd3J1ji0F/QDhdwqUdIHXy9C3VbN9zB
 TegAIgDjnqpOLzCn2P2FzWZlXwcXsxG13XMvqr2DfhBZtNUmRsw=
 =NXiI
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-05-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Lots of new things, notably:
 * ath12k: monitor mode for WCN7850, better 6 GHz regulatory
 * brcmfmac: SAE for some Cypress devices
 * iwlwifi: rework device configuration
 * mac80211: scan improvements with MLO
 * mt76: EHT improvements, new device IDs
 * rtw88: throughput improvements
 * rtw89: MLO, STA/P2P concurrency improvements, SAR

* tag 'wireless-next-2025-05-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (389 commits)
  wifi: mt76: mt7925: add rfkill_poll for hardware rfkill
  wifi: mt76: support power delta calculation for 5 TX paths
  wifi: mt76: fix available_antennas setting
  wifi: mt76: mt7996: fix RX buffer size of MCU event
  wifi: mt76: mt7996: change max beacon size
  wifi: mt76: mt7996: fix invalid NSS setting when TX path differs from NSS
  wifi: mt76: mt7996: drop fragments with multicast or broadcast RA
  wifi: mt76: mt7996: set EHT max ampdu length capability
  wifi: mt76: mt7996: fix beamformee SS field
  wifi: mt76: remove capability of partial bandwidth UL MU-MIMO
  wifi: mt76: mt7925: add test mode support
  wifi: mt76: mt7925: extend MCU support for testmode
  wifi: mt76: mt7925: ensure all MCU commands wait for response
  wifi: mt76: mt7925: refine the sniffer commnad
  wifi: mt76: mt7925: prevent multiple scan commands
  wifi: mt76: mt7915: Fix null-ptr-deref in mt7915_mmio_wed_init()
  wifi: mt76: mt7996: Fix null-ptr-deref in mt7996_mmio_wed_init()
  wifi: mt76: mt7925: add RNR scan support for 6GHz
  wifi: mt76: add mt76_connac_mcu_build_rnr_scan_param routine
  wifi: mt76: scan: Fix 'mlink' dereferenced before IS_ERR_OR_NULL check
  ...
====================

Link: https://patch.msgid.link/20250522165501.189958-50-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22 14:05:18 -07:00
Jakub Kicinski
43a1ce8f42 bluetooth-next pull request for net-next:
core:
 
  - Add support for SIOCETHTOOL ETHTOOL_GET_TS_INFO
  - Separate CIS_LINK and BIS_LINK link types
  - Introduce HCI Driver protocol
 
 drivers:
 
  - btintel_pcie: Do not generate coredump for diagnostic events
  - btusb: Add HCI Drv commands for configuring altsetting
  - btusb: Add RTL8851BE device 0x0bda:0xb850
  - btusb: Add new VID/PID 13d3/3584 for MT7922
  - btusb: Add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
  - btnxpuart: Implement host-wakeup feature
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmgvWiwZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKezlD/4uKp4yrCPAO/tO0FFvh752
 7oVmBzqe6GDunl+Isz6/GSWc5sD0OVdhMg7QL+zhi3hjluyGh9N3rUE9Qw/Q3h8Q
 JkMXWAVNHq+Dr88RqLVro335D2XP8mgiTLEKwSDh5Fdip3xOz+itoQZI5wYqriQg
 exNU1l04ZzrwMWicAJULvFFPz9q/556cUq0x9k7OJ6GaHOmUQ0Y7BPMFAQ0/uHAA
 8Y9qiXlJQKzeYDz9rUvAf6Gd+21k0cAU4QSYt+ZDLGBAuH0iK4zgu56uiHadVLRb
 bm5hlO/lrUD7Hw/swSJ2wZYMKpPINPP6Cr2kpC66kmXZYWx7YJaQQCN8GEtwbEVh
 t3q9Y7zQXjppQ/tIG/WJuWlZ84DiWsm5na3k/q61LfihQ5VPL96RtlJKXD492Dxz
 vFXRFN5F6lMcDP5Ujji6S8O0H5P1bDz9XbITcGHxEDjAbOnThBND7g+10mmZ1MRw
 GWQTnnsrYaU+gaUdj9Nr5o7kPp2KSXvGkG8F407RDvF2fjbbwTNEQgkt7vF9CbPN
 KkJAwnPM+JhSuxGaIVcKpoKJ2gZA/fNXjr9d6hD6v/U+SksNsMovxtdZMaL4Mx/n
 gV8W7RwhUNeJ8NvneHJ12bRhtb7x/IYJsQ6ARgDTenNlSxd56uc3Zt7nmDdIvUfF
 BuEJDucjnJLnCsrUj6MNuw==
 =ZrGM
 -----END PGP SIGNATURE-----

Merge tag 'for-net-next-2025-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

core:
 - Add support for SIOCETHTOOL ETHTOOL_GET_TS_INFO
 - Separate CIS_LINK and BIS_LINK link types
 - Introduce HCI Driver protocol

drivers:
 - btintel_pcie: Do not generate coredump for diagnostic events
 - btusb: Add HCI Drv commands for configuring altsetting
 - btusb: Add RTL8851BE device 0x0bda:0xb850
 - btusb: Add new VID/PID 13d3/3584 for MT7922
 - btusb: Add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
 - btnxpuart: Implement host-wakeup feature

* tag 'for-net-next-2025-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (23 commits)
  Bluetooth: btintel: Check dsbr size from EFI variable
  Bluetooth: MGMT: iterate over mesh commands in mgmt_mesh_foreach()
  Bluetooth: btusb: Add new VID/PID 13d3/3584 for MT7922
  Bluetooth: btusb: use skb_pull to avoid unsafe access in QCA dump handling
  Bluetooth: L2CAP: Fix not checking l2cap_chan security level
  Bluetooth: separate CIS_LINK and BIS_LINK link types
  Bluetooth: btusb: Add new VID/PID 13d3/3630 for MT7925
  Bluetooth: add support for SIOCETHTOOL ETHTOOL_GET_TS_INFO
  Bluetooth: btintel_pcie: Dump debug registers on error
  Bluetooth: ISO: Fix getpeername not returning sockaddr_iso_bc fields
  Bluetooth: ISO: Fix not using SID from adv report
  Revert "Bluetooth: btusb: add sysfs attribute to control USB alt setting"
  Revert "Bluetooth: btusb: Configure altsetting for HCI_USER_CHANNEL"
  Bluetooth: btusb: Add HCI Drv commands for configuring altsetting
  Bluetooth: Introduce HCI Driver protocol
  Bluetooth: btnxpuart: Implement host-wakeup feature
  dt-bindings: net: bluetooth: nxp: Add support for host-wakeup
  Bluetooth: btusb: Add RTL8851BE device 0x0bda:0xb850
  Bluetooth: hci_uart: Remove unnecessary NULL check before release_firmware()
  Bluetooth: btmtksdio: Fix wakeup source leaks on device unbind
  ...
====================

Link: https://patch.msgid.link/20250522171048.3307873-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22 13:46:13 -07:00
Dmitry Antipov
3bb88524b7 Bluetooth: MGMT: iterate over mesh commands in mgmt_mesh_foreach()
In 'mgmt_mesh_foreach()', iterate over mesh commands
rather than generic mgmt ones. Compile tested only.

Fixes: b338d91703 ("Bluetooth: Implement support for Mesh")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-22 13:06:11 -04:00
Luiz Augusto von Dentz
631c8682c3 Bluetooth: L2CAP: Fix not checking l2cap_chan security level
l2cap_check_enc_key_size shall check the security level of the
l2cap_chan rather than the hci_conn since for incoming connection
request that may be different as hci_conn may already been
encrypted using a different security level.

Fixes: 522e9ed157 ("Bluetooth: l2cap: Check encryption key size on incoming connection")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-22 13:04:39 -04:00
Jakub Kicinski
33e1b1b399 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc8).

Conflicts:
  80f2ab46c2 ("irdma: free iwdev->rf after removing MSI-X")
  4bcc063939 ("ice, irdma: fix an off by one in error handling code")
  c24a65b6a2 ("iidc/ice/irdma: Update IDC to support multiple consumers")
https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au

No extra adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22 09:42:41 -07:00
Florian Westphal
8b53f46eb4 netfilter: nf_tables: nft_fib_ipv6: fix VRF ipv4/ipv6 result discrepancy
With a VRF, ipv4 and ipv6 FIB expression behave differently.

   fib daddr . iif oif

Will return the input interface name for ipv4, but the real device
for ipv6.  Example:

If VRF device name is tvrf and real (incoming) device is veth0.
First round is ok, both ipv4 and ipv6 will yield 'veth0'.

But in the second round (incoming device will be set to "tvrf"), ipv4
will yield "tvrf" whereas ipv6 returns "veth0" for the second round too.

This makes ipv6 behave like ipv4.

A followup patch will add a test case for this, without this change
it will fail with:
  get element inet t fibif6iif { tvrf . dead:1::99 . tvrf }
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  FAIL: did not find tvrf . dead:1::99 . tvrf in fibif6iif

Alternatively we could either not do anything at all or change
ipv4 to also return the lower/real device, however, nft (userspace)
doc says "iif: if fib lookup provides a route then check its output
interface is identical to the packets input interface." which is what
the nft fib ipv4 behaviour is.

Fixes: f6d0cbcf09 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22 17:16:02 +02:00
Florian Westphal
c38eb2973c netfilter: xtables: support arpt_mark and ipv6 optstrip for iptables-nft only builds
Its now possible to build a kernel that has no support for the classic
xtables get/setsockopt interfaces and builtin tables.

In this case, we have CONFIG_IP6_NF_MANGLE=n and
CONFIG_IP_NF_ARPTABLES=n.

For optstript, the ipv6 code is so small that we can enable it if
netfilter ipv6 support exists. For mark, check if either classic
arptables or NFT_ARP_COMPAT is set.

Fixes: a9525c7f62 ("netfilter: xtables: allow xtables-nft only builds")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22 17:16:02 +02:00
Kory Maincent
4ff4d86f6c net: Add support for providing the PTP hardware source in tsinfo
Multi-PTP source support within a network topology has been merged,
but the hardware timestamp source is not yet exposed to users.
Currently, users only see the PTP index, which does not indicate
whether the timestamp comes from a PHY or a MAC.

Add support for reporting the hwtstamp source using a
hwtstamp-source field, alongside hwtstamp-phyindex, to describe
the origin of the hardware timestamp.

Remove HWTSTAMP_SOURCE_UNSPEC enum value as it is not used at all.

Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250519-feature_ptp_source-v4-1-5d10e19a0265@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-22 15:32:00 +02:00
Paolo Abeni
bd2ec34d00 ipsec-2025-05-21
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmgtYgEACgkQrB3Eaf9P
 W7e1ag//da84UIRyJwMfDO4Y3MXDNPslNSDuq0HuvwRtdLIBLFtwitSzU1uhKsxY
 yn5v7RSsxvp6lXW2RT+Ycor2qZ/mGHJsHcVfG7m0YjxH6unw7yzjqn5LNNzRbYN4
 NcD8P0skuX6d80EFPUB3Hsnmdj1VKR62OsWyk3rAPb4CLBVKJt9OsseVfN4bn1R0
 TaZSIkdh5EDGYXTBKb49jc8LFfQo7+uVg/AjtZ/2ZsWt+Qgw3XevTIcwLokH00rt
 GzXcLjC1g+b6TeVncOuD1oiNJUtQVGYV23t2yQlk9k2HFzCdNnq0YM9pzawwiI+l
 icBV2X/QFjhdCRkvJRF4dkXq/4tnnEmYoY/1vSOoWR9VmY2u8Lr3VRiDD/h0gYJT
 KXd8YPMtZLDnLgmH+DwWbv4vdLtHvQTmB8XFzb/4VN6Ikucenry3loJsUsLnS+Je
 t1/7unLrg9yyJC6UPzweqjAx+6VgZvem/M5kejIVxHpk+Wg2dXGZ2jz4fsVuZYPB
 dMLj1h1MLn4gOt2b/bdI2do0C+p2R1axrTNw+RiqwCrb1h5Ey+7RAhWyXyaHUEs3
 1brMAgOcvdbaaeSIpoHJ8eJx/PgRxDrxRnUC3HjCGPNApYQXC3FM3POk7wwJ9C0i
 odlHrq+yOdzLCZyU+YKdR1q3kPq9AWpUSmc4Olg359OQ9IxDGQw=
 =bgyq
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-2025-05-21' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2025-05-21

1) Fix some missing kfree_skb in the error paths of espintcp.
   From Sabrina Dubroca.

2) Fix a reference leak in espintcp.
   From Sabrina Dubroca.

3) Fix UDP GRO handling for ESPINUDP.
   From Tobias Brunner.

4) Fix ipcomp truesize computation on the receive path.
   From Sabrina Dubroca.

5) Sanitize marks before policy/state insertation.
   From Paul Chaignon.

* tag 'ipsec-2025-05-21' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: Sanitize marks before insert
  xfrm: ipcomp: fix truesize computation on receive
  xfrm: Fix UDP GRO handling for some corner cases
  espintcp: remove encap socket caching to avoid reference leak
  espintcp: fix skb leaks
====================

Link: https://patch.msgid.link/20250521054348.4057269-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-22 11:49:53 +02:00
Wang Liang
e279024617 net/tipc: fix slab-use-after-free Read in tipc_aead_encrypt_done
Syzbot reported a slab-use-after-free with the following call trace:

  ==================================================================
  BUG: KASAN: slab-use-after-free in tipc_aead_encrypt_done+0x4bd/0x510 net/tipc/crypto.c:840
  Read of size 8 at addr ffff88807a733000 by task kworker/1:0/25

  Call Trace:
   kasan_report+0xd9/0x110 mm/kasan/report.c:601
   tipc_aead_encrypt_done+0x4bd/0x510 net/tipc/crypto.c:840
   crypto_request_complete include/crypto/algapi.h:266
   aead_request_complete include/crypto/internal/aead.h:85
   cryptd_aead_crypt+0x3b8/0x750 crypto/cryptd.c:772
   crypto_request_complete include/crypto/algapi.h:266
   cryptd_queue_worker+0x131/0x200 crypto/cryptd.c:181
   process_one_work+0x9fb/0x1b60 kernel/workqueue.c:3231

  Allocated by task 8355:
   kzalloc_noprof include/linux/slab.h:778
   tipc_crypto_start+0xcc/0x9e0 net/tipc/crypto.c:1466
   tipc_init_net+0x2dd/0x430 net/tipc/core.c:72
   ops_init+0xb9/0x650 net/core/net_namespace.c:139
   setup_net+0x435/0xb40 net/core/net_namespace.c:343
   copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
   create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
   unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
   ksys_unshare+0x419/0x970 kernel/fork.c:3323
   __do_sys_unshare kernel/fork.c:3394

  Freed by task 63:
   kfree+0x12a/0x3b0 mm/slub.c:4557
   tipc_crypto_stop+0x23c/0x500 net/tipc/crypto.c:1539
   tipc_exit_net+0x8c/0x110 net/tipc/core.c:119
   ops_exit_list+0xb0/0x180 net/core/net_namespace.c:173
   cleanup_net+0x5b7/0xbf0 net/core/net_namespace.c:640
   process_one_work+0x9fb/0x1b60 kernel/workqueue.c:3231

After freed the tipc_crypto tx by delete namespace, tipc_aead_encrypt_done
may still visit it in cryptd_queue_worker workqueue.

I reproduce this issue by:
  ip netns add ns1
  ip link add veth1 type veth peer name veth2
  ip link set veth1 netns ns1
  ip netns exec ns1 tipc bearer enable media eth dev veth1
  ip netns exec ns1 tipc node set key this_is_a_master_key master
  ip netns exec ns1 tipc bearer disable media eth dev veth1
  ip netns del ns1

The key of reproduction is that, simd_aead_encrypt is interrupted, leading
to crypto_simd_usable() return false. Thus, the cryptd_queue_worker is
triggered, and the tipc_crypto tx will be visited.

  tipc_disc_timeout
    tipc_bearer_xmit_skb
      tipc_crypto_xmit
        tipc_aead_encrypt
          crypto_aead_encrypt
            // encrypt()
            simd_aead_encrypt
              // crypto_simd_usable() is false
              child = &ctx->cryptd_tfm->base;

  simd_aead_encrypt
    crypto_aead_encrypt
      // encrypt()
      cryptd_aead_encrypt_enqueue
        cryptd_aead_enqueue
          cryptd_enqueue_request
            // trigger cryptd_queue_worker
            queue_work_on(smp_processor_id(), cryptd_wq, &cpu_queue->work)

Fix this by holding net reference count before encrypt.

Reported-by: syzbot+55c12726619ff85ce1f6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=55c12726619ff85ce1f6
Fixes: fc1b6d6de2 ("tipc: introduce TIPC encryption & authentication")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Link: https://patch.msgid.link/20250520101404.1341730-1-wangliang74@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-22 11:33:12 +02:00
Cong Wang
3f98113810 sch_hfsc: Fix qlen accounting bug when using peek in hfsc_enqueue()
When enqueuing the first packet to an HFSC class, hfsc_enqueue() calls the
child qdisc's peek() operation before incrementing sch->q.qlen and
sch->qstats.backlog. If the child qdisc uses qdisc_peek_dequeued(), this may
trigger an immediate dequeue and potential packet drop. In such cases,
qdisc_tree_reduce_backlog() is called, but the HFSC qdisc's qlen and backlog
have not yet been updated, leading to inconsistent queue accounting. This
can leave an empty HFSC class in the active list, causing further
consequences like use-after-free.

This patch fixes the bug by moving the increment of sch->q.qlen and
sch->qstats.backlog before the call to the child qdisc's peek() operation.
This ensures that queue length and backlog are always accurate when packet
drops or dequeues are triggered during the peek.

Fixes: 12d0ad3be9 ("net/sched/sch_hfsc.c: handle corner cases where head may change invalidating calculated deadline")
Reported-by: Mingi Cho <mincho@theori.io>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250518222038.58538-2-xiyou.wangcong@gmail.com
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-22 11:16:44 +02:00
Eric Dumazet
945301db34 net: add debug checks in ____napi_schedule() and napi_poll()
While tracking an IDPF bug, I found that idpf_vport_splitq_napi_poll()
was not following NAPI rules.

It can indeed return @budget after napi_complete() has been called.

Add two debug conditions in networking core to hopefully catch
this kind of bugs sooner.

IDPF bug will be fixed in a separate patch.

[   72.441242] repoll requested for device eth1 idpf_vport_splitq_napi_poll [idpf] but napi is not scheduled.
[   72.446291] list_del corruption. next->prev should be ff31783d93b14040, but was ff31783d93b10080. (next=ff31783d93b10080)
[   72.446659] kernel BUG at lib/list_debug.c:67!
[   72.446816] Oops: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC NOPTI
[   72.447031] CPU: 156 UID: 0 PID: 16258 Comm: ip Tainted: G        W           6.15.0-dbg-DEV #1944 NONE
[   72.447340] Tainted: [W]=WARN
[   72.447702] RIP: 0010:__list_del_entry_valid_or_report (lib/list_debug.c:65)
[   72.450630] Call Trace:
[   72.450720]  <IRQ>
[   72.450797] net_rx_action (include/linux/list.h:215 include/linux/list.h:287 net/core/dev.c:7385 net/core/dev.c:7516)
[   72.450928] ? lock_release (kernel/locking/lockdep.c:?)
[   72.451059] ? clockevents_program_event (kernel/time/clockevents.c:?)
[   72.451222] handle_softirqs (kernel/softirq.c:579)
[   72.451356] ? do_softirq (kernel/softirq.c:480)
[   72.451480] ? idpf_vc_xn_exec (drivers/net/ethernet/intel/idpf/idpf_virtchnl.c:462) idpf
[   72.451635] do_softirq (kernel/softirq.c:480)
[   72.451750]  </IRQ>
[   72.451828]  <TASK>
[   72.451905] __local_bh_enable_ip (kernel/softirq.c:?)
[   72.452051] idpf_vc_xn_exec (drivers/net/ethernet/intel/idpf/idpf_virtchnl.c:462) idpf
[   72.452210] idpf_send_delete_queues_msg (drivers/net/ethernet/intel/idpf/idpf_virtchnl.c:2083) idpf
[   72.452390] idpf_vport_stop (drivers/net/ethernet/intel/idpf/idpf_lib.c:837 drivers/net/ethernet/intel/idpf/idpf_lib.c:868) idpf
[   72.452541] ? idpf_vport_stop (include/linux/bottom_half.h:? include/linux/netdevice.h:4762 drivers/net/ethernet/intel/idpf/idpf_lib.c:855) idpf
[   72.452695] idpf_initiate_soft_reset (drivers/net/ethernet/intel/idpf/idpf_lib.c:?) idpf
[   72.452867] idpf_change_mtu (drivers/net/ethernet/intel/idpf/idpf_lib.c:2189) idpf
[   72.453015] netif_set_mtu_ext (net/core/dev.c:9437)
[   72.453157] ? packet_notifier (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/packet/af_packet.c:4240)
[   72.453292] netif_set_mtu (net/core/dev.c:9515)
[   72.453416] dev_set_mtu (net/core/dev_api.c:?)
[   72.453534] bond_change_mtu (drivers/net/bonding/bond_main.c:4833)
[   72.453666] netif_set_mtu_ext (net/core/dev.c:9437)
[   72.453803] do_setlink (net/core/rtnetlink.c:3116)
[   72.453925] ? rtnl_newlink (net/core/rtnetlink.c:3901)
[   72.454055] ? rtnl_newlink (net/core/rtnetlink.c:3901)
[   72.454185] ? rtnl_newlink (net/core/rtnetlink.c:3901)
[   72.454314] ? trace_contention_end (include/trace/events/lock.h:122)
[   72.454467] ? __mutex_lock (arch/x86/include/asm/preempt.h:85 kernel/locking/mutex.c:611 kernel/locking/mutex.c:746)
[   72.454597] ? cap_capable (include/trace/events/capability.h:26)
[   72.454721] ? security_capable (security/security.c:?)
[   72.454857] rtnl_newlink (net/core/rtnetlink.c:?)
[   72.454982] ? lock_is_held_type (kernel/locking/lockdep.c:5599 kernel/locking/lockdep.c:5938)
[   72.455121] ? __lock_acquire (kernel/locking/lockdep.c:?)
[   72.455256] ? __change_page_attr_set_clr (arch/x86/mm/pat/set_memory.c:685)
[   72.455438] ? __lock_acquire (kernel/locking/lockdep.c:?)
[   72.455582] ? rtnetlink_rcv_msg (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/core/rtnetlink.c:6885)
[   72.455721] ? lock_acquire (kernel/locking/lockdep.c:5866)
[   72.455848] ? rtnetlink_rcv_msg (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/core/rtnetlink.c:6885)
[   72.455987] ? lock_release (kernel/locking/lockdep.c:?)
[   72.456117] ? rcu_read_unlock (include/linux/rcupdate.h:341 include/linux/rcupdate.h:871)
[   72.456249] ? __pfx_rtnl_newlink (net/core/rtnetlink.c:3956)
[   72.456388] rtnetlink_rcv_msg (net/core/rtnetlink.c:6955)
[   72.456526] ? rtnetlink_rcv_msg (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 net/core/rtnetlink.c:6885)
[   72.456671] ? lock_acquire (kernel/locking/lockdep.c:5866)
[   72.456802] ? net_generic (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 include/net/netns/generic.h:45)
[   72.456929] ? __pfx_rtnetlink_rcv_msg (net/core/rtnetlink.c:6858)
[   72.457082] netlink_rcv_skb (net/netlink/af_netlink.c:2534)
[   72.457212] netlink_unicast (net/netlink/af_netlink.c:1313)
[   72.457344] netlink_sendmsg (net/netlink/af_netlink.c:1883)
[   72.457476] __sock_sendmsg (net/socket.c:712)
[   72.457602] ____sys_sendmsg (net/socket.c:?)
[   72.457735] ? _copy_from_user (arch/x86/include/asm/uaccess_64.h:126 arch/x86/include/asm/uaccess_64.h:134 arch/x86/include/asm/uaccess_64.h:141 include/linux/uaccess.h:178 lib/usercopy.c:18)
[   72.457875] ___sys_sendmsg (net/socket.c:2620)
[   72.458042] ? __call_rcu_common (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 kernel/rcu/tree.c:3107)
[   72.458185] ? mntput_no_expire (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 fs/namespace.c:1457)
[   72.458324] ? lock_acquire (kernel/locking/lockdep.c:5866)
[   72.458451] ? mntput_no_expire (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 fs/namespace.c:1457)
[   72.458588] ? lock_release (kernel/locking/lockdep.c:?)
[   72.458718] ? mntput_no_expire (include/linux/rcupdate.h:331 include/linux/rcupdate.h:841 fs/namespace.c:1457)
[   72.458856] __x64_sys_sendmsg (net/socket.c:2652)
[   72.458997] ? do_syscall_64 (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 include/linux/entry-common.h:198 arch/x86/entry/syscall_64.c:90)
[   72.459136] do_syscall_64 (arch/x86/entry/syscall_64.c:?)
[   72.459259] ? exc_page_fault (arch/x86/mm/fault.c:1542)
[   72.459387] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[   72.459555] RIP: 0033:0x7fd15f17cbd0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250520121908.1805732-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-21 20:44:11 -07:00
Eric Biggers
c93f75b2d7 net: remove skb_copy_and_hash_datagram_iter()
Now that skb_copy_and_hash_datagram_iter() is no longer used, remove it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-11-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-21 15:40:17 -07:00
Eric Biggers
ea6342d989 net: add skb_copy_and_crc32c_datagram_iter()
Since skb_copy_and_hash_datagram_iter() is used only with CRC32C, the
crypto_ahash abstraction provides no value.  Add
skb_copy_and_crc32c_datagram_iter() which just calls crc32c() directly.

This is faster and simpler.  It also doesn't have the weird dependency
issue where skb_copy_and_hash_datagram_iter() depends on
CONFIG_CRYPTO_HASH=y without that being expressed explicitly in the
kconfig (presumably because it was too heavyweight for NET to select).
The new function is conditional on the hidden boolean symbol NET_CRC32C,
which selects CRC32.  So it gets compiled only when something that
actually needs CRC32C packet checksums is enabled, it has no implicit
dependency, and it doesn't depend on the heavyweight crypto layer.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-9-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-21 15:40:17 -07:00
Eric Biggers
70c96c7cb9 net: fold __skb_checksum() into skb_checksum()
Now that the only remaining caller of __skb_checksum() is
skb_checksum(), fold __skb_checksum() into skb_checksum().  This makes
struct skb_checksum_ops unnecessary, so remove that too and simply do
the "regular" net checksum.  It also makes the wrapper functions
csum_partial_ext() and csum_block_add_ext() unnecessary, so remove those
too and just use the underlying functions.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-7-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-21 15:40:16 -07:00
Eric Biggers
99de9d4022 sctp: use skb_crc32c() instead of __skb_checksum()
Make sctp_compute_cksum() just use the new function skb_crc32c(),
instead of calling __skb_checksum() with a skb_checksum_ops struct that
does CRC32C.  This is faster and simpler.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-6-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-21 15:40:16 -07:00
Eric Biggers
86edc94da1 net: use skb_crc32c() in skb_crc32c_csum_help()
Instead of calling __skb_checksum() with a skb_checksum_ops struct that
does CRC32C, just call the new function skb_crc32c().  This is faster
and simpler.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-4-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-21 15:39:59 -07:00
Eric Biggers
a5bd029c73 net: add skb_crc32c()
Add skb_crc32c(), which calculates the CRC32C of a sk_buff.  It will
replace __skb_checksum(), which unnecessarily supports arbitrary
checksums.  Compared to __skb_checksum(), skb_crc32c():

   - Uses the correct type for CRC32C values (u32, not __wsum).

   - Does not require the caller to provide a skb_checksum_ops struct.

   - Is faster because it does not use indirect calls and does not use
     the very slow crc32c_combine().

According to commit 2817a336d4 ("net: skb_checksum: allow custom
update/combine for walking skb") which added __skb_checksum(), the
original motivation for the abstraction layer was to avoid code
duplication for CRC32C and other checksums in the future.  However:

   - No additional checksums showed up after CRC32C.  __skb_checksum()
     is only used with the "regular" net checksum and CRC32C.

   - Indirect calls are expensive.  Commit 2544af0344 ("net: avoid
     indirect calls in L4 checksum calculation") worked around this
     using the INDIRECT_CALL_1 macro. But that only avoided the indirect
     call for the net checksum, and at the cost of an extra branch.

   - The checksums use different types (__wsum and u32), causing casts
     to be needed.

   - It made the checksums of fragments be combined (rather than
     chained) for both checksums, despite this being highly
     counterproductive for CRC32C due to how slow crc32c_combine() is.
     This can clearly be seen in commit 4c2f245496 ("sctp: linearize
     early if it's not GSO") which tried to work around this performance
     bug.  With a dedicated function for each checksum, we can instead
     just use the proper strategy for each checksum.

As shown by the following tables, the new function skb_crc32c() is
faster than __skb_checksum(), with the improvement varying greatly from
5% to 2500% depending on the case.  The largest improvements come from
fragmented packets, mainly due to eliminating the inefficient
crc32c_combine().  But linear packets are improved too, especially
shorter ones, mainly due to eliminating indirect calls.  These
benchmarks were done on AMD Zen 5.  On that CPU, Linux uses IBRS instead
of retpoline; an even greater improvement might be seen with retpoline:

    Linear sk_buffs

        Length in bytes    __skb_checksum cycles    skb_crc32c cycles
        ===============    =====================    =================
                     64                       43                   18
                    256                       94                   77
                   1420                      204                  161
                  16384                     1735                 1642

    Nonlinear sk_buffs (even split between head and one fragment)

        Length in bytes    __skb_checksum cycles    skb_crc32c cycles
        ===============    =====================    =================
                     64                      579                   22
                    256                      829                   77
                   1420                     1506                  194
                  16384                     4365                 1682

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-3-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-21 15:39:58 -07:00
Eric Biggers
55d22ee035 net: introduce CONFIG_NET_CRC32C
Add a hidden kconfig symbol NET_CRC32C that will group together the
functions that calculate CRC32C checksums of packets, so that these
don't have to be built into NET-enabled kernels that don't need them.

Make skb_crc32c_csum_help() (which is called only when IP_SCTP is
enabled) conditional on this symbol, and make IP_SCTP select it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-2-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-21 15:39:58 -07:00
Pauli Virtanen
23205562ff Bluetooth: separate CIS_LINK and BIS_LINK link types
Use separate link type id for unicast and broadcast ISO connections.
These connection types are handled with separate HCI commands, socket
API is different, and hci_conn has union fields that are different in
the two cases, so they shall not be mixed up.

Currently in most places it is attempted to distinguish ucast by
bacmp(&c->dst, BDADDR_ANY) but it is wrong as dst is set for bcast sink
hci_conn in iso_conn_ready(). Additionally checking sync_handle might be
OK, but depends on details of bcast conn configuration flow.

To avoid complicating it, use separate link types.

Fixes: f764a6c2c1 ("Bluetooth: ISO: Add broadcast support")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-21 10:29:28 -04:00
Pauli Virtanen
dd0ccf8580 Bluetooth: add support for SIOCETHTOOL ETHTOOL_GET_TS_INFO
Bluetooth needs some way for user to get supported so_timestamping flags
for the different socket types.

Use SIOCETHTOOL API for this purpose. As hci_dev is not associated with
struct net_device, the existing implementation can't be reused, so we
add a small one here.

Add support (only) for ETHTOOL_GET_TS_INFO command. The API differs
slightly from netdev in that the result depends also on socket type.

Signed-off-by: Pauli Virtanen <pav@iki.fi>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-21 10:28:51 -04:00
Luiz Augusto von Dentz
0a766a0aff Bluetooth: ISO: Fix getpeername not returning sockaddr_iso_bc fields
If the socket is a broadcast receiver fields from sockaddr_iso_bc shall
be part of the values returned to getpeername since some of these fields
are updated while doing the PA and BIG sync procedures.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-21 10:28:08 -04:00
Luiz Augusto von Dentz
e2d471b780 Bluetooth: ISO: Fix not using SID from adv report
Up until now it has been assumed that the application would be able to
enter the advertising SID in sockaddr_iso_bc.bc_sid, but userspace has
no access to SID since the likes of MGMT_EV_DEVICE_FOUND cannot carry
it, so it was left unset (0x00) which means it would be unable to
synchronize if the broadcast source is using a different SID e.g. 0x04:

> HCI Event: LE Meta Event (0x3e) plen 57
      LE Extended Advertising Report (0x0d)
        Num reports: 1
        Entry 0
          Event type: 0x0000
            Props: 0x0000
            Data status: Complete
          Address type: Random (0x01)
          Address: 0B:82:E8:50:6D:C8 (Non-Resolvable)
          Primary PHY: LE 1M
          Secondary PHY: LE 2M
          SID: 0x04
          TX power: 127 dBm
          RSSI: -55 dBm (0xc9)
          Periodic advertising interval: 180.00 msec (0x0090)
          Direct address type: Public (0x00)
          Direct address: 00:00:00:00:00:00 (OUI 00-00-00)
          Data length: 0x1f
        06 16 52 18 5b 0b e1 05 16 56 18 04 00 11 30 4c  ..R.[....V....0L
        75 69 7a 27 73 20 53 32 33 20 55 6c 74 72 61     uiz's S23 Ultra
        Service Data: Broadcast Audio Announcement (0x1852)
        Broadcast ID: 14748507 (0xe10b5b)
        Service Data: Public Broadcast Announcement (0x1856)
          Data[2]: 0400
        Unknown EIR field 0x30[16]: 4c75697a27732053323320556c747261
< HCI Command: LE Periodic Advertising Create Sync (0x08|0x0044) plen 14
        Options: 0x0000
        Use advertising SID, Advertiser Address Type and address
        Reporting initially enabled
        SID: 0x00 (<- Invalid)
        Adv address type: Random (0x01)
        Adv address: 0B:82:E8:50:6D:C8 (Non-Resolvable)
        Skip: 0x0000
        Sync timeout: 20000 msec (0x07d0)
        Sync CTE type: 0x0000

So instead this changes now allow application to set HCI_SID_INVALID
which will make hci_le_pa_create_sync to wait for a report, update the
conn->sid using the report SID and only then issue PA create sync
command:

< HCI Command: LE Periodic Advertising Create Sync
        Options: 0x0000
        Use advertising SID, Advertiser Address Type and address
        Reporting initially enabled
        SID: 0x04
        Adv address type: Random (0x01)
        Adv address: 0B:82:E8:50:6D:C8 (Non-Resolvable)
        Skip: 0x0000
        Sync timeout: 20000 msec (0x07d0)
        Sync CTE type: 0x0000
> HCI Event: LE Meta Event (0x3e) plen 16
      LE Periodic Advertising Sync Established (0x0e)
        Status: Success (0x00)
        Sync handle: 64
        Advertising SID: 0x04
        Advertiser address type: Random (0x01)
        Advertiser address: 0B:82:E8:50:6D:C8 (Non-Resolvable)
        Advertiser PHY: LE 2M (0x02)
        Periodic advertising interval: 180.00 msec (0x0090)
        Advertiser clock accuracy: 0x05

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-21 10:28:08 -04:00
Hsin-chen Chuang
04425292a6 Bluetooth: Introduce HCI Driver protocol
Although commit 75ddcd5ad4 ("Bluetooth: btusb: Configure altsetting
for HCI_USER_CHANNEL") has enabled the HCI_USER_CHANNEL user to send out
SCO data through USB Bluetooth chips, it's observed that with the patch
HFP is flaky on most of the existing USB Bluetooth controllers: Intel
chips sometimes send out no packet for Transparent codec; MTK chips may
generate SCO data with a wrong handle for CVSD codec; RTK could split
the data with a wrong packet size for Transparent codec; ... etc.

To address the issue above one needs to reset the altsetting back to
zero when there is no active SCO connection, which is the same as the
BlueZ behavior, and another benefit is the bus doesn't need to reserve
bandwidth when no SCO connection.

This patch adds the infrastructure that allow the user space program to
talk to Bluetooth drivers directly:
- Define the new packet type HCI_DRV_PKT which is specifically used for
  communication between the user space program and the Bluetooth drviers
- hci_send_frame intercepts the packets and invokes drivers' HCI Drv
  callbacks (so far only defined for btusb)
- 2 kinds of events to user space: Command Status and Command Complete,
  the former simply returns the status while the later may contain
  additional response data.

Cc: chromeos-bluetooth-upstreaming@chromium.org
Fixes: b16b327edb ("Bluetooth: btusb: add sysfs attribute to control USB alt setting")
Signed-off-by: Hsin-chen Chuang <chharry@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-21 10:28:07 -04:00
David Howells
20d72b00ca
netfs: Fix the request's work item to not require a ref
When the netfs_io_request struct's work item is queued, it must be supplied
with a ref to the work item struct to prevent it being deallocated whilst
on the queue or whilst it is being processed.  This is tricky to manage as
we have to get a ref before we try and queue it and then we may find it's
already queued and is thus already holding a ref - in which case we have to
try and get rid of the ref again.

The problem comes if we're in BH or IRQ context and need to drop the ref:
if netfs_put_request() reduces the count to 0, we have to do the cleanup -
but the cleanup may need to wait.

Fix this by adding a new work item to the request, ->cleanup_work, and
dispatching that when the refcount hits zero.  That can then synchronously
cancel any outstanding work on the main work item before doing the cleanup.

Adding a new work item also deals with another problem upstream where it's
sometimes changing the work func in the put function and requeuing it -
which has occasionally in the past caused the cleanup to happen
incorrectly.

As a bonus, this allows us to get rid of the 'was_async' parameter from a
bunch of functions.  This indicated whether the put function might not be
permitted to sleep.

Fixes: 3d3c950467 ("netfs: Provide readahead and readpage netfs helpers")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250519090707.2848510-4-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <stfrench@microsoft.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21 14:35:20 +02:00
Christian Brauner
a9194f8878
coredump: add coredump socket
Coredumping currently supports two modes:

(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
    spawned as a child of the system_unbound_wq or kthreadd.

For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries.

The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:

        |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.

In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There's various conceptual consequences of this
(non-exhaustive list):

- systemd-coredump is spawned with file descriptor number 0 (stdin)
  connected to the read-end of the pipe. All other file descriptors are
  closed. That specifically includes 1 (stdout) and 2 (stderr). This has
  already caused bugs because userspace assumed that this cannot happen
  (Whether or not this is a sane assumption is irrelevant.).

- systemd-coredump will be spawned as a child of system_unbound_wq. So
  it is not a child of any userspace process and specifically not a
  child of PID 1. It cannot be waited upon and is in a weird hybrid
  upcall which are difficult for userspace to control correctly.

- systemd-coredump is spawned with full kernel privileges. This
  necessitates all kinds of weird privilege dropping excercises in
  userspace to make this safe.

- A new usermode helper has to be spawned for each crashing process.

This series adds a new mode:

(3) Dumping into an AF_UNIX socket.

Userspace can set /proc/sys/kernel/core_pattern to:

        @/path/to/coredump.socket

The "@" at the beginning indicates to the kernel that an AF_UNIX
coredump socket will be used to process coredumps.

The coredump socket must be located in the initial mount namespace.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket.

- The coredump server uses SO_PEERPIDFD to get a stable handle on the
  connected crashing task. The retrieved pidfd will provide a stable
  reference even if the crashing task gets SIGKILLed while generating
  the coredump.

- By setting core_pipe_limit non-zero userspace can guarantee that the
  crashing task cannot be reaped behind it's back and thus process all
  necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
  detect whether /proc/<pid> still refers to the same process.

  The core_pipe_limit isn't used to rate-limit connections to the
  socket. This can simply be done via AF_UNIX sockets directly.

- The pidfd for the crashing task will grow new information how the task
  coredumps.

- The coredump server should mark itself as non-dumpable.

- A container coredump server in a separate network namespace can simply
  bind to another well-know address and systemd-coredump fowards
  coredumps to the container.

- Coredumps could in the future also be handled via per-user/session
  coredump servers that run only with that users privileges.

  The coredump server listens on the coredump socket and accepts a
  new coredump connection. It then retrieves SO_PEERPIDFD for the
  client, inspects uid/gid and hands the accepted client to the users
  own coredump handler which runs with the users privileges only
  (It must of coure pay close attention to not forward crashing suid
  binaries.).

The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safer way to
handle them instead of relying on super privileged coredumping helpers
that have and continue to cause significant CVEs.

This will also be significantly more lightweight since no fork()+exec()
for the usermodehelper is required for each crashing process. The
coredump server in userspace can e.g., just keep a worker pool.

Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-4-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21 13:59:11 +02:00
Samiullah Khawaja
b95ed55173 xsk: Bring back busy polling support in XDP_COPY
Commit 5ef44b3cb4 ("xsk: Bring back busy polling support") fixed the
busy polling support in xsk for XDP_ZEROCOPY after it was broken in
commit 86e25f40aa ("net: napi: Add napi_config"). The busy polling
support with XDP_COPY remained broken since the napi_id setup in
xsk_rcv_check was removed.

Bring back the setup of napi_id for XDP_COPY so socket level SO_BUSYPOLL
can be used to poll the underlying napi.

Do the setup of napi_id for XDP_COPY in xsk_bind, as it is done
currently for XDP_ZEROCOPY. The setup of napi_id for XDP_COPY in
xsk_bind is safe because xsk_rcv_check checks that the rx queue at which
the packet arrives is equal to the queue_id that was supplied in bind.
This is done for both XDP_COPY and XDP_ZEROCOPY mode.

Tested using AF_XDP support in virtio-net by running the xsk_rr AF_XDP
benchmarking tool shared here:
https://lore.kernel.org/all/20250320163523.3501305-1-skhawaja@google.com/T/

Enabled socket busy polling using following commands in qemu,

```
sudo ethtool -L eth0 combined 1
echo 400 | sudo tee /proc/sys/net/core/busy_read
echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
```

Fixes: 5ef44b3cb4 ("xsk: Bring back busy polling support")
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-21 10:28:23 +01:00
Aditya Kumar Singh
0b0ff976af wifi: mac80211: accept probe response on link address as well
If a random MAC address is not requested during scan request, unicast probe
response frames are only accepted if the destination address matches the
interface address. This works fine for non-ML interfaces. However, with
MLO, the same interface can have multiple links, and a scan on a link would
be requested with the link address. In such cases, the probe response frame
gets dropped which is incorrect.

Therefore, add logic to check if any of the link addresses match the
destination address if the interface address does not match.

Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>
Link: https://patch.msgid.link/20250516-bug_fix_mlo_scan-v2-2-12e59d9110ac@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-21 09:26:28 +02:00
Aditya Kumar Singh
78a7a126dc wifi: mac80211: validate SCAN_FLAG_AP in scan request during MLO
When an AP interface is already beaconing, a subsequent scan is not allowed
unless the user space explicitly sets the flag NL80211_SCAN_FLAG_AP in the
scan request. If this flag is not set, the scan request will be returned
with the error code -EOPNOTSUPP. However, this restriction currently
applies only to non-ML interfaces. For ML interfaces, scans are allowed
without this flag being explicitly set by the user space which is wrong.
This is because the beaconing check currently uses only the deflink, which
does not get set during MLO.

Hence to fix this, during MLO, use the existing helper
ieee80211_num_beaconing_links() to know if any of the link is beaconing.

Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>
Link: https://patch.msgid.link/20250516-bug_fix_mlo_scan-v2-1-12e59d9110ac@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-21 09:26:28 +02:00
Bert Karwatzki
d7500fbfb1 wifi: check if socket flags are valid
Checking the SOCK_WIFI_STATUS flag bit in sk_flags may give wrong results
since sk_flags are part of a union and the union is used otherwise. Add
sk_requests_wifi_status() which checks if sk is non-NULL, sk is a full
socket (so flags are valid) and checks the flag bit.

Fixes: 76a853f86c ("wifi: free SKBTX_WIFI_STATUS skb tx_flags flag")
Suggested-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Bert Karwatzki <spasswolf@web.de>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250520223430.6875-1-spasswolf@web.de
[edit commit message, fix indentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-21 09:26:22 +02:00
Kuniyuki Iwashima
002dba13c8 ipv6: Revert two per-cpu var allocation for RTM_NEWROUTE.
These two commits preallocated two per-cpu variables in
ip6_route_info_create() as fib_nh_common_init() and fib6_nh_init()
were expected to be called under RCU.

  * commit d27b9c40db ("ipv6: Preallocate nhc_pcpu_rth_output in
    ip6_route_info_create().")
  * commit 5720a328c3 ("ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in
    ip6_route_info_create().")

Now these functions can be called without RCU and can use GFP_KERNEL.

Let's revert the commits.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 19:18:24 -07:00
Kuniyuki Iwashima
d465bd07d1 ipv6: Pass gfp_flags down to ip6_route_info_create_nh().
Since commit c4837b9853 ("ipv6: Split ip6_route_info_create()."),
ip6_route_info_create_nh() uses GFP_ATOMIC as it was expected to be
called under RCU.

Now, we can call it without RCU and use GFP_KERNEL.

Let's pass gfp_flags to ip6_route_info_create_nh().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 19:18:24 -07:00
Kuniyuki Iwashima
5e4a8cc7be Revert "ipv6: Factorise ip6_route_multipath_add()."
Commit 71c0efb6d1 ("ipv6: Factorise ip6_route_multipath_add().") split
a loop in ip6_route_multipath_add() so that we can put rcu_read_lock()
between ip6_route_info_create() and ip6_route_info_create_nh().

We no longer need to do so as ip6_route_info_create_nh() does not require
RCU now.

Let's revert the commit to simplify the code.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 19:18:24 -07:00
Kuniyuki Iwashima
cefe6e131c Revert "ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT setup"
The previous patch fixed the same issue mentioned in
commit 14a0087e72 ("ipv6: sr: switch to GFP_ATOMIC
flag to allocate memory during seg6local LWT setup").

Let's revert it.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20250516022759.44392-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 19:18:24 -07:00
Kuniyuki Iwashima
8e5f1bb812 ipv6: Narrow down RCU critical section in inet6_rtm_newroute().
Commit 169fd62799 ("ipv6: Get rid of RTNL for SIOCADDRT and
RTM_NEWROUTE.") added rcu_read_lock() covering
ip6_route_info_create_nh() and __ip6_ins_rt() to guarantee that
nexthop and netdev will not go away.

However, as reported by syzkaller [0], ip_tun_build_state() calls
dst_cache_init() with GFP_KERNEL during the RCU critical section.

ip6_route_info_create_nh() fetches nexthop or netdev depending on
whether RTA_NH_ID is set, and struct fib6_info holds a refcount
of either of them by nexthop_get() or netdev_get_by_index().

netdev_get_by_index() looks up a dev and calls dev_hold() under RCU.

So, we need RCU only around nexthop_find_by_id() and nexthop_get()
( and a few more nexthop code).

Let's add rcu_read_lock() there and remove rcu_read_lock() in
ip6_route_add() and ip6_route_multipath_add().

Now these functions called from fib6_add() need RCU:

  - inet6_rt_notify()
  - fib6_drop_pcpu_from() (via fib6_purge_rt())
  - rt6_flush_exceptions() (via fib6_purge_rt())
  - ip6_ignore_linkdown() (via rt6_multipath_rebalance())

All callers of inet6_rt_notify() need RCU, so rcu_read_lock() is
added there.

[0]:
[ BUG: Invalid wait context ]
6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 Tainted: G W
 ----------------------------
syz-executor234/5832 is trying to lock:
ffffffff8e021688 (pcpu_alloc_mutex){+.+.}-{4:4}, at:
pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
other info that might help us debug this:
context-{5:5}
1 lock held by syz-executor234/5832:
 0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire
include/linux/rcupdate.h:331 [inline]
 0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock
include/linux/rcupdate.h:841 [inline]
 0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at:
ip6_route_add+0x4d/0x2f0 net/ipv6/route.c:3913
stack backtrace:
CPU: 0 UID: 0 PID: 5832 Comm: syz-executor234 Tainted: G W
6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 PREEMPT(full)
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 04/29/2025
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_lock_invalid_wait_context kernel/locking/lockdep.c:4831 [inline]
 check_wait_context kernel/locking/lockdep.c:4903 [inline]
 __lock_acquire+0xbcf/0xd20 kernel/locking/lockdep.c:5185
 lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5866
 __mutex_lock_common kernel/locking/mutex.c:601 [inline]
 __mutex_lock+0x182/0xe80 kernel/locking/mutex.c:746
 pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
 dst_cache_init+0x37/0xc0 net/core/dst_cache.c:145
 ip_tun_build_state+0x193/0x6b0 net/ipv4/ip_tunnel_core.c:687
 lwtunnel_build_state+0x381/0x4c0 net/core/lwtunnel.c:137
 fib_nh_common_init+0x129/0x460 net/ipv4/fib_semantics.c:635
 fib6_nh_init+0x15e4/0x2030 net/ipv6/route.c:3669
 ip6_route_info_create_nh+0x139/0x870 net/ipv6/route.c:3866
 ip6_route_add+0xf6/0x2f0 net/ipv6/route.c:3915
 inet6_rtm_newroute+0x284/0x1c50 net/ipv6/route.c:5732
 rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6955
 netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534
 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
 netlink_unicast+0x758/0x8d0 net/netlink/af_netlink.c:1339
 netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883
 sock_sendmsg_nosec net/socket.c:712 [inline]
 __sock_sendmsg+0x219/0x270 net/socket.c:727
 ____sys_sendmsg+0x505/0x830 net/socket.c:2566
 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2620
 __sys_sendmsg net/socket.c:2652 [inline]
 __do_sys_sendmsg net/socket.c:2657 [inline]
 __se_sys_sendmsg net/socket.c:2655 [inline]
 __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2655
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94

Fixes: 169fd62799 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.")
Reported-by: syzbot+bcc12d6799364500fbec@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bcc12d6799364500fbec
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89i+r1cGacVC_6n3-A-WSkAa_Nr+pmxJ7Gt+oP-P9by2aGw@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 19:18:24 -07:00
Kuniyuki Iwashima
f0a56c17e6 inet: Remove rtnl_is_held arg of lwtunnel_valid_encap_type(_attr)?().
Commit f130a0cc1b ("inet: fix lwtunnel_valid_encap_type() lock
imbalance") added the rtnl_is_held argument as a temporary fix while
I'm converting nexthop and IPv6 routing table to per-netns RTNL or RCU.

Now all callers of lwtunnel_valid_encap_type() do not hold RTNL.

Let's remove the argument.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 19:18:24 -07:00
Kuniyuki Iwashima
f1a8d107d9 ipv6: Remove rcu_read_lock() in fib6_get_table().
Once allocated, the IPv6 routing table is not freed until
netns is dismantled.

fib6_get_table() uses rcu_read_lock() while iterating
net->ipv6.fib_table_hash[], but it's not needed and
rather confusing.

Because some callers have this pattern,

  table = fib6_get_table();

  rcu_read_lock();
  /* ... use table here ... */
  rcu_read_unlock();

  [ See: addrconf_get_prefix_route(), ip6_route_del(),
         rt6_get_route_info(), rt6_get_dflt_router() ]

and this looks illegal but is actually safe.

Let's remove rcu_read_lock() in fib6_get_table() and pass true
to the last argument of hlist_for_each_entry_rcu() to bypass
the RCU check.

Note that protection is not needed but RCU helper is used to
avoid data-race.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250516022759.44392-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 19:18:23 -07:00
Stanislav Fomichev
f792709e0b selftests: net: validate team flags propagation
Cover three recent cases:
1. missing ops locking for the lowers during netdev_sync_lower_features
2. missing locking for dev_set_promiscuity (plus netdev_ops_assert_locked
   with a comment on why/when it's needed)
3. rcu lock during team_change_rx_flags

Verified that each one triggers when the respective fix is reverted.
Not sure about the placement, but since it all relies on teaming,
added to the teaming directory.

One ugly bit is that I add NETIF_F_LRO to netdevsim; there is no way
to trigger netdev_sync_lower_features without it.

Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250516232205.539266-1-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 18:12:58 -07:00
Petr Malat
af295892a7 sctp: Do not wake readers in __sctp_write_space()
Function __sctp_write_space() doesn't set poll key, which leads to
ep_poll_callback() waking up all waiters, not only these waiting
for the socket being writable. Set the key properly using
wake_up_interruptible_poll(), which is preferred over the sync
variant, as writers are not woken up before at least half of the
queue is available. Also, TCP does the same.

Signed-off-by: Petr Malat <oss@malat.biz>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20250516081727.1361451-1-oss@malat.biz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20 18:10:40 -07:00
Zilin Guan
e7a37c9e42 xfrm: use kfree_sensitive() for SA secret zeroization
High-level copy_to_user_* APIs already redact SA secret fields when
redaction is enabled, but the state teardown path still freed aead,
aalg and ealg structs with plain kfree(), which does not clear memory
before deallocation. This can leave SA keys and other confidential
data in memory, risking exposure via post-free vulnerabilities.

Since this path is outside the packet fast path, the cost of zeroization
is acceptable and prevents any residual key material. This patch
replaces those kfree() calls unconditionally with kfree_sensitive(),
which zeroizes the entire buffer before freeing.

Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-05-20 07:55:00 +02:00
Oliver Hartkopp
dac5e62491 can: bcm: add missing rcu read protection for procfs content
When the procfs content is generated for a bcm_op which is in the process
to be removed the procfs output might show unreliable data (UAF).

As the removal of bcm_op's is already implemented with rcu handling this
patch adds the missing rcu_read_lock() and makes sure the list entries
are properly removed under rcu protection.

Fixes: f1b4e32aca ("can: bcm: use call_rcu() instead of costly synchronize_rcu()")
Reported-by: Anderson Nascimento <anderson@allelesecurity.com>
Suggested-by: Anderson Nascimento <anderson@allelesecurity.com>
Tested-by: Anderson Nascimento <anderson@allelesecurity.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20250519125027.11900-2-socketcan@hartkopp.net
Cc: stable@vger.kernel.org # >= 5.4
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-05-19 16:58:19 +02:00
Oliver Hartkopp
c2aba69d0c can: bcm: add locking for bcm_op runtime updates
The CAN broadcast manager (CAN BCM) can send a sequence of CAN frames via
hrtimer. The content and also the length of the sequence can be changed
resp reduced at runtime where the 'currframe' counter is then set to zero.

Although this appeared to be a safe operation the updates of 'currframe'
can be triggered from user space and hrtimer context in bcm_can_tx().
Anderson Nascimento created a proof of concept that triggered a KASAN
slab-out-of-bounds read access which can be prevented with a spin_lock_bh.

At the rework of bcm_can_tx() the 'count' variable has been moved into
the protected section as this variable can be modified from both contexts
too.

Fixes: ffd980f976 ("[CAN]: Add broadcast manager (bcm) protocol")
Reported-by: Anderson Nascimento <anderson@allelesecurity.com>
Tested-by: Anderson Nascimento <anderson@allelesecurity.com>
Reviewed-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20250519125027.11900-1-socketcan@hartkopp.net
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-05-19 16:58:12 +02:00
Nikhil Jha
fadc0f3bb2 sunrpc: don't immediately retransmit on seqno miss
RFC2203 requires that retransmitted messages use a new gss sequence
number, but the same XID. This means that if the server is just slow
(e.x. overloaded), the client might receive a response using an older
seqno than the one it has recorded.

Currently, Linux's client immediately retransmits in this case. However,
this leads to a lot of wasted retransmits until the server eventually
responds faster than the client can resend.

Client -> SEQ 1 -> Server
Client -> SEQ 2 -> Server
Client <- SEQ 1 <- Server (misses, expecting seqno = 2)
Client -> SEQ 3 -> Server (immediate retransmission on miss)
Client <- SEQ 2 <- Server (misses, expecting seqno = 3)
Client -> SEQ 4 -> Server (immediate retransmission on miss)
... and so on ...

This commit makes it so that we ignore messages with bad checksums
due to seqnum mismatch, and rely on the usual timeout behavior for
retransmission instead of doing so immediately.

Signed-off-by: Nikhil Jha <njha@janestreet.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-19 10:14:29 -04:00
Nikhil Jha
08d6ee6d8a sunrpc: implement rfc2203 rpcsec_gss seqnum cache
This implements a sequence number cache of the last three (right now
hardcoded) sent sequence numbers for a given XID, as suggested by the
RFC.

From RFC2203 5.3.3.1:

"Note that the sequence number algorithm requires that the client
increment the sequence number even if it is retrying a request with
the same RPC transaction identifier.  It is not infrequent for
clients to get into a situation where they send two or more attempts
and a slow server sends the reply for the first attempt. With
RPCSEC_GSS, each request and reply will have a unique sequence
number. If the client wishes to improve turn around time on the RPC
call, it can cache the RPCSEC_GSS sequence number of each request it
sends. Then when it receives a response with a matching RPC
transaction identifier, it can compute the checksum of each sequence
number in the cache to try to match the checksum in the reply's
verifier."

Signed-off-by: Nikhil Jha <njha@janestreet.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-19 10:14:29 -04:00
Ilia Gavrilov
239af1970b llc: fix data loss when reading from a socket in llc_ui_recvmsg()
For SOCK_STREAM sockets, if user buffer size (len) is less
than skb size (skb->len), the remaining data from skb
will be lost after calling kfree_skb().

To fix this, move the statement for partial reading
above skb deletion.

Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org)

Fixes: 30a584d944 ("[LLX]: SOCK_DGRAM interface fixes")
Cc: stable@vger.kernel.org
Signed-off-by: Ilia Gavrilov <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-19 12:12:54 +01:00
Paolo Abeni
c46286fdd6 mr: consolidate the ipmr_can_free_table() checks.
Guoyu Yin reported a splat in the ipmr netns cleanup path:

WARNING: CPU: 2 PID: 14564 at net/ipv4/ipmr.c:440 ipmr_free_table net/ipv4/ipmr.c:440 [inline]
WARNING: CPU: 2 PID: 14564 at net/ipv4/ipmr.c:440 ipmr_rules_exit+0x135/0x1c0 net/ipv4/ipmr.c:361
Modules linked in:
CPU: 2 UID: 0 PID: 14564 Comm: syz.4.838 Not tainted 6.14.0 #1
Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:ipmr_free_table net/ipv4/ipmr.c:440 [inline]
RIP: 0010:ipmr_rules_exit+0x135/0x1c0 net/ipv4/ipmr.c:361
Code: ff df 48 c1 ea 03 80 3c 02 00 75 7d 48 c7 83 60 05 00 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e e9 71 67 7f 00 e8 4c 2d 8a fd 90 <0f> 0b 90 eb 93 e8 41 2d 8a fd 0f b6 2d 80 54 ea 01 31 ff 89 ee e8
RSP: 0018:ffff888109547c58 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff888108c12dc0 RCX: ffffffff83e09868
RDX: ffff8881022b3300 RSI: ffffffff83e098d4 RDI: 0000000000000005
RBP: ffff888104288000 R08: 0000000000000000 R09: ffffed10211825c9
R10: 0000000000000001 R11: ffff88801816c4a0 R12: 0000000000000001
R13: ffff888108c13320 R14: ffff888108c12dc0 R15: fffffbfff0b74058
FS:  00007f84f39316c0(0000) GS:ffff88811b100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f84f3930f98 CR3: 0000000113b56000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 ipmr_net_exit_batch+0x50/0x90 net/ipv4/ipmr.c:3160
 ops_exit_list+0x10c/0x160 net/core/net_namespace.c:177
 setup_net+0x47d/0x8e0 net/core/net_namespace.c:394
 copy_net_ns+0x25d/0x410 net/core/net_namespace.c:516
 create_new_namespaces+0x3f6/0xaf0 kernel/nsproxy.c:110
 unshare_nsproxy_namespaces+0xc3/0x180 kernel/nsproxy.c:228
 ksys_unshare+0x78d/0x9a0 kernel/fork.c:3342
 __do_sys_unshare kernel/fork.c:3413 [inline]
 __se_sys_unshare kernel/fork.c:3411 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:3411
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xa6/0x1a0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f84f532cc29
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f84f3931038 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00007f84f5615fa0 RCX: 00007f84f532cc29
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000400
RBP: 00007f84f53fba18 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f84f5615fa0 R15: 00007fff51c5f328
 </TASK>

The running kernel has CONFIG_IP_MROUTE_MULTIPLE_TABLES disabled, and
the sanity check for such build is still too loose.

Address the issue consolidating the relevant sanity check in a single
helper regardless of the kernel configuration. Also share it between
the ipv4 and ipv6 code.

Reported-by: Guoyu Yin <y04609127@gmail.com>
Fixes: 50b9420444 ("ipmr: tune the ipmr_can_free_table() checks.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/372dc261e1bf12742276e1b984fc5a071b7fc5a8.1747321903.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 17:53:48 -07:00
Eric Dumazet
9cd5ef0b8c net: rfs: add sock_rps_delete_flow() helper
RFS can exhibit lower performance for workloads using short-lived
flows and a small set of 4-tuple.

This is often the case for load-testers, using a pair of hosts,
if the server has a single listener port.

Typical use case :

Server : tcp_crr -T128 -F1000 -6 -U -l30 -R 14250
Client : tcp_crr -T128 -F1000 -6 -U -l30 -c -H server | grep local_throughput

This is because RFS global hash table contains stale information,
when the same RSS key is recycled for another socket and another cpu.

Make sure to undo the changes and go back to initial state when
a flow is disconnected.

Performance of the above test is increased by 22 %,
going from 372604 transactions per second to 457773.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Octavian Purdila <tavip@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250515100354.3339920-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:03:48 -07:00
Ido Schimmel
91b6dbced0 bridge: netfilter: Fix forwarding of fragmented packets
When netfilter defrag hooks are loaded (due to the presence of conntrack
rules, for example), fragmented packets entering the bridge will be
defragged by the bridge's pre-routing hook (br_nf_pre_routing() ->
ipv4_conntrack_defrag()).

Later on, in the bridge's post-routing hook, the defragged packet will
be fragmented again. If the size of the largest fragment is larger than
what the kernel has determined as the destination MTU (using
ip_skb_dst_mtu()), the defragged packet will be dropped.

Before commit ac6627a28d ("net: ipv4: Consolidate ipv4_mtu and
ip_dst_mtu_maybe_forward"), ip_skb_dst_mtu() would return dst_mtu() as
the destination MTU. Assuming the dst entry attached to the packet is
the bridge's fake rtable one, this would simply be the bridge's MTU (see
fake_mtu()).

However, after above mentioned commit, ip_skb_dst_mtu() ends up
returning the route's MTU stored in the dst entry's metrics. Ideally, in
case the dst entry is the bridge's fake rtable one, this should be the
bridge's MTU as the bridge takes care of updating this metric when its
MTU changes (see br_change_mtu()).

Unfortunately, the last operation is a no-op given the metrics attached
to the fake rtable entry are marked as read-only. Therefore,
ip_skb_dst_mtu() ends up returning 1500 (the initial MTU value) and
defragged packets are dropped during fragmentation when dealing with
large fragments and high MTU (e.g., 9k).

Fix by moving the fake rtable entry's metrics to be per-bridge (in a
similar fashion to the fake rtable entry itself) and marking them as
writable, thereby allowing MTU changes to be reflected.

Fixes: 62fa8a846d ("net: Implement read-only protection and COW'ing of metrics.")
Fixes: 33eb9873a2 ("bridge: initialize fake_rtable metrics")
Reported-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Closes: https://lore.kernel.org/netdev/PH0PR10MB4504888284FF4CBA648197D0ACB82@PH0PR10MB4504.namprd10.prod.outlook.com/
Tested-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250515084848.727706-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:02:06 -07:00
Jakob Unterwurzacher
ba54bce747 net: dsa: microchip: linearize skb for tail-tagging switches
The pointer arithmentic for accessing the tail tag only works
for linear skbs.

For nonlinear skbs, it reads uninitialized memory inside the
skb headroom, essentially randomizing the tag. I have observed
it gets set to 6 most of the time.

Example where ksz9477_rcv thinks that the packet from port 1 comes from port 6
(which does not exist for the ksz9896 that's in use), dropping the packet.
Debug prints added by me (not included in this patch):

	[  256.645337] ksz9477_rcv:323 tag0=6
	[  256.645349] skb len=47 headroom=78 headlen=0 tailroom=0
	               mac=(64,14) mac_len=14 net=(78,0) trans=78
	               shinfo(txflags=0 nr_frags=1 gso(size=0 type=0 segs=0))
	               csum(0x0 start=0 offset=0 ip_summed=0 complete_sw=0 valid=0 level=0)
	               hash(0x0 sw=0 l4=0) proto=0x00f8 pkttype=1 iif=3
	               priority=0x0 mark=0x0 alloc_cpu=0 vlan_all=0x0
	               encapsulation=0 inner(proto=0x0000, mac=0, net=0, trans=0)
	[  256.645377] dev name=end1 feat=0x0002e10200114bb3
	[  256.645386] skb headroom: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	[  256.645395] skb headroom: 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	[  256.645403] skb headroom: 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	[  256.645411] skb headroom: 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	[  256.645420] skb headroom: 00000040: ff ff ff ff ff ff 00 1c 19 f2 e2 db 08 06
	[  256.645428] skb frag:     00000000: 00 01 08 00 06 04 00 01 00 1c 19 f2 e2 db 0a 02
	[  256.645436] skb frag:     00000010: 00 83 00 00 00 00 00 00 0a 02 a0 2f 00 00 00 00
	[  256.645444] skb frag:     00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01
	[  256.645452] ksz_common_rcv:92 dsa_conduit_find_user returned NULL

Call skb_linearize before trying to access the tag.

This patch fixes ksz9477_rcv which is used by the ksz9896 I have at
hand, and also applies the same fix to ksz8795_rcv which seems to have
the same problem.

Signed-off-by: Jakob Unterwurzacher <jakob.unterwurzacher@cherry.de>
CC: stable@vger.kernel.org
Fixes: 016e43a26b ("net: dsa: ksz: Add KSZ8795 tag code")
Fixes: 8b8010fb78 ("dsa: add support for Microchip KSZ tail tagging")
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/20250515072920.2313014-1-jakob.unterwurzacher@cherry.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16 16:00:17 -07:00
Aditya Kumar Singh
68b44b05f4 wifi: mac80211: handle non-MLO mode as well in ieee80211_num_beaconing_links()
Currently, ieee80211_num_beaconing_links() returns 0 when the interface
operates in non-ML mode. However, non-MLO mode is equivalent to having a
single link. Therefore, the function can handle the non-MLO case as well.
This adjustment will also eliminate the need for deflink usage in certain
scenarios.

Hence, implement changes to handle the non-MLO case as well. There is
no change in functionality, and no existing user-visible bug is getting
fixed. This update simply makes the function generic to handle all cases.

Suggested-by: Johannes Berg <johannes@sipsolutions.net>
Link: https://lore.kernel.org/linux-wireless/16499ad8e4b060ee04c8a8b3615fe8952aa7b07b.camel@sipsolutions.net/
Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>
Link: https://patch.msgid.link/20250515-fix_num_beaconing_links-v1-1-4a39e2704314@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-16 10:51:44 +02:00
Chuck Lever
56ab43f50d svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages
Allow allocation of more entries in the sc_pages[] array when the
maximum size of an RPC message is increased.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 16:16:26 -04:00
Chuck Lever
81381d1a90 svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages
Allow allocation of more entries in the rc_pages[] array when the
maximum size of an RPC message is increased.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 16:16:26 -04:00
Chuck Lever
f4126823c1 sunrpc: Adjust size of socket's receive page array dynamically
As a step towards making NFSD's maximum rsize and wsize variable at
run-time, make sk_pages a flexible array.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 16:16:25 -04:00
Chuck Lever
b406c6b781 SUNRPC: Remove svc_fill_write_vector()
Clean up: This API is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 16:16:25 -04:00
Chuck Lever
62bf165c04 SUNRPC: Export xdr_buf_to_bvec()
Prepare xdr_buf_to_bvec() to be invoked from upper layer protocol
code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 16:16:24 -04:00
Chuck Lever
f2e597353d NFSD: De-duplicate the svc_fill_write_vector() call sites
All three call sites do the same thing.

I'm struggling with this a bit, however. struct xdr_buf is an XDR
layer object and unmarshaling a WRITE payload is clearly a task
intended to be done by the proc and xdr functions, not by VFS. This
feels vaguely like a layering violation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 16:16:23 -04:00
Chuck Lever
59cf734654 sunrpc: Replace the rq_bvec array with dynamically-allocated memory
As a step towards making NFSD's maximum rsize and wsize variable at
run-time, replace the fixed-size rq_bvec[] array in struct svc_rqst
with a chunk of dynamically-allocated memory.

The rq_bvec[] array contains enough bio_vecs to handle each page in
a maximum size RPC message.

On a system with 8-byte pointers and 4KB pages, pahole reports that
the rq_bvec[] array is 4144 bytes. This patch replaces that array
with a single 8-byte pointer field.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 16:16:23 -04:00
Chuck Lever
ed603bcf4f sunrpc: Replace the rq_pages array with dynamically-allocated memory
As a step towards making NFSD's maximum rsize and wsize variable at
run-time, replace the fixed-size rq_vec[] array in struct svc_rqst
with a chunk of dynamically-allocated memory.

On a system with 8-byte pointers and 4KB pages, pahole reports that
the rq_pages[] array is 2080 bytes. This patch replaces that with
a single 8-byte pointer field.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 16:16:22 -04:00
Chuck Lever
1a58791292 sunrpc: Remove backchannel check in svc_init_buffer()
The server's backchannel uses struct svc_rqst, but does not use the
pages in svc_rqst::rq_pages. It's rq_arg::pages and rq_res::pages
comes from the RPC client's page allocator. Currently,
svc_init_buffer() skips allocating pages in rq_pages for that
reason.

Except that, svc_rqst::rq_pages is filled anyway when a backchannel
svc_rqst is passed to svc_recv() -> and then to svc_alloc_arg().

This isn't really a problem at the moment, except that these pages
are allocated but then never used, as far as I can tell.

The problem is that later in this series, in addition to populating
the entries of rq_pages[], svc_init_buffer() will also allocate the
memory underlying the rq_pages[] array itself. If that allocation is
skipped, then svc_alloc_args() chases a NULL pointer for ingress
backchannel requests.

This approach avoids introducing extra conditional logic in
svc_alloc_args(), which is a hot path.

Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 16:16:22 -04:00
Chuck Lever
5924331589 svcrdma: Reduce the number of rdma_rw contexts per-QP
There is an upper bound on the number of rdma_rw contexts that can
be created per QP.

This invisible upper bound is because rdma_create_qp() adds one or
more additional SQEs for each ctxt that the ULP requests via
qp_attr.cap.max_rdma_ctxs. The QP's actual Send Queue length is on
the order of the sum of qp_attr.cap.max_send_wr and a factor times
qp_attr.cap.max_rdma_ctxs. The factor can be up to three, depending
on whether MR operations are required before RDMA Reads.

This limit is not visible to RDMA consumers via dev->attrs. When the
limit is surpassed, QP creation fails with -ENOMEM. For example:

svcrdma's estimate of the number of rdma_rw contexts it needs is
three times the number of pages in RPCSVC_MAXPAGES. When MAXPAGES
is about 260, the internally-computed SQ length should be:

64 credits + 10 backlog + 3 * (3 * 260) = 2414

Which is well below the advertised qp_max_wr of 32768.

If RPCSVC_MAXPAGES is increased to 4MB, that's 1040 pages:

64 credits + 10 backlog + 3 * (3 * 1040) = 9434

However, QP creation fails. Dynamic printk for mlx5 shows:

calc_sq_size:618:(pid 1514): send queue size (9326 * 256 / 64 -> 65536) exceeds limits(32768)

Although 9326 is still far below qp_max_wr, QP creation still
fails.

Because the total SQ length calculation is opaque to RDMA consumers,
there doesn't seem to be much that can be done about this except for
consumers to try to keep the requested rdma_rw ctxt count low.

Fixes: 2da0f610e7 ("svcrdma: Increase the per-transport rw_ctx count")
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-15 16:16:21 -04:00
Eric Dumazet
572be9bf9d tcp: increase tcp_rmem[2] to 32 MB
Last change to tcp_rmem[2] happened in 2012, in commit b49960a05e
("tcp: change tcp_adv_win_scale and tcp_rmem[2]")

TCP performance on WAN is mostly limited by tcp_rmem[2] for receivers.

After this series improvements, it is time to increase the default.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:30:09 -07:00
Eric Dumazet
c4221a8cc3 tcp: always use tcp_limit_output_bytes limitation
This partially reverts commit c73e5807e4 ("tcp: tsq: no longer use
limit_output_bytes for paced flows")

Overriding the tcp_limit_output_bytes sysctl value
for FQ enabled flows has the following problem:

It allows TCP to queue around 2 ms worth of data per flow,
defeating tcp_rcv_rtt_update() accuracy on the receiver,
forcing it to increase sk->sk_rcvbuf even if the real
RTT is around 100 us.

After this change, we keep enough packets in flight to fill
the pipe, and let receive queues small enough to get
good cache behavior (cpu caches and/or NIC driver page pools).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:30:09 -07:00
Eric Dumazet
9ea3bfa61b tcp: increase tcp_limit_output_bytes default value to 4MB
Last change happened in 2018 with commit c73e5807e4
("tcp: tsq: no longer use limit_output_bytes for paced flows")

Modern NIC speeds got a 4x increase since then.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:30:09 -07:00
Eric Dumazet
a00f135cd9 tcp: skip big rtt sample if receive queue is not empty
tcp_rcv_rtt_update() role is to keep an estimation
of RTT (tp->rcv_rtt_est.rtt_us) for receivers.

If an application is too slow to drain the TCP receive
queue, it is better to leave the RTT estimation small,
so that tcp_rcv_space_adjust() does not inflate
tp->rcvq_space.space and sk->sk_rcvbuf.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:30:09 -07:00
Eric Dumazet
b879dcb1ae tcp: always seek for minimal rtt in tcp_rcv_rtt_update()
tcp_rcv_rtt_update() goal is to maintain an estimation of the RTT
in tp->rcv_rtt_est.rtt_us, used by tcp_rcv_space_adjust()

When TCP TS are enabled, tcp_rcv_rtt_update() is using
EWMA to smooth the samples.

Change this to immediately latch the incoming value if it
is lower than tp->rcv_rtt_est.rtt_us, so that tcp_rcv_space_adjust()
does not overshoot tp->rcvq_space.space and sk->sk_rcvbuf.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:30:08 -07:00
Eric Dumazet
cd171461b9 tcp: fix initial tp->rcvq_space.space value for passive TS enabled flows
tcp_rcv_state_process() must tweak tp->advmss for TS enabled flows
before the call to tcp_init_transfer() / tcp_init_buffer_space().

Otherwise tp->rcvq_space.space is off by 120 bytes
(TCP_INIT_CWND * TCPOLEN_TSTAMP_ALIGNED).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:30:08 -07:00
Eric Dumazet
d59fc95be9 tcp: remove zero TCP TS samples for autotuning
For TCP flows using ms RFC 7323 timestamp granularity
tcp_rcv_rtt_update() can be fed with 1 ms samples, breaking
TCP autotuning for data center flows with sub ms RTT.

Instead, rely on the window based samples, fed by tcp_rcv_rtt_measure()

tcp_rcvbuf_grow() for a 10 second TCP_STREAM sesssion now looks saner.
We can see rcvbuf is kept at a reasonable value.

  222.234976: tcp:tcp_rcvbuf_grow: time=348 rtt_us=330 copied=110592 inq=0 space=40960 ooo=0 scaling_ratio=230 rcvbuf=131072 ...
  222.235276: tcp:tcp_rcvbuf_grow: time=300 rtt_us=288 copied=126976 inq=0 space=110592 ooo=0 scaling_ratio=230 rcvbuf=246187 ...
  222.235569: tcp:tcp_rcvbuf_grow: time=294 rtt_us=288 copied=184320 inq=0 space=126976 ooo=0 scaling_ratio=230 rcvbuf=282659 ...
  222.235833: tcp:tcp_rcvbuf_grow: time=264 rtt_us=244 copied=373760 inq=0 space=184320 ooo=0 scaling_ratio=230 rcvbuf=410312 ...
  222.236142: tcp:tcp_rcvbuf_grow: time=308 rtt_us=219 copied=424960 inq=20480 space=373760 ooo=0 scaling_ratio=230 rcvbuf=832022 ...
  222.236378: tcp:tcp_rcvbuf_grow: time=236 rtt_us=219 copied=692224 inq=49152 space=404480 ooo=0 scaling_ratio=230 rcvbuf=900407 ...
  222.236602: tcp:tcp_rcvbuf_grow: time=225 rtt_us=219 copied=730112 inq=49152 space=643072 ooo=0 scaling_ratio=230 rcvbuf=1431534 ...
  222.237050: tcp:tcp_rcvbuf_grow: time=229 rtt_us=219 copied=1160192 inq=49152 space=680960 ooo=0 scaling_ratio=230 rcvbuf=1515876 ...
  222.237618: tcp:tcp_rcvbuf_grow: time=305 rtt_us=218 copied=2228224 inq=49152 space=1111040 ooo=0 scaling_ratio=230 rcvbuf=2473271 ...
  222.238591: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=3063808 inq=360448 space=2179072 ooo=0 scaling_ratio=230 rcvbuf=4850803 ...
  222.240647: tcp:tcp_rcvbuf_grow: time=260 rtt_us=218 copied=2752512 inq=0 space=2703360 ooo=0 scaling_ratio=230 rcvbuf=6017914 ...
  222.243535: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=2834432 inq=49152 space=2752512 ooo=0 scaling_ratio=230 rcvbuf=6127331 ...
  222.245108: tcp:tcp_rcvbuf_grow: time=240 rtt_us=218 copied=2883584 inq=49152 space=2785280 ooo=0 scaling_ratio=230 rcvbuf=6200275 ...
  222.245333: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=2859008 inq=0 space=2834432 ooo=0 scaling_ratio=230 rcvbuf=6309692 ...
  222.301021: tcp:tcp_rcvbuf_grow: time=222 rtt_us=218 copied=2883584 inq=0 space=2859008 ooo=0 scaling_ratio=230 rcvbuf=6364400 ...
  222.989242: tcp:tcp_rcvbuf_grow: time=225 rtt_us=218 copied=2899968 inq=0 space=2883584 ooo=0 scaling_ratio=230 rcvbuf=6419108 ...
  224.139553: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=3014656 inq=65536 space=2899968 ooo=0 scaling_ratio=230 rcvbuf=6455580 ...
  224.584608: tcp:tcp_rcvbuf_grow: time=232 rtt_us=218 copied=3014656 inq=49152 space=2949120 ooo=0 scaling_ratio=230 rcvbuf=6564997 ...
  230.145560: tcp:tcp_rcvbuf_grow: time=223 rtt_us=218 copied=2981888 inq=0 space=2965504 ooo=0 scaling_ratio=230 rcvbuf=6601469 ...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:30:08 -07:00
Eric Dumazet
ea33537d82 tcp: add receive queue awareness in tcp_rcv_space_adjust()
If the application can not drain fast enough a TCP socket queue,
tcp_rcv_space_adjust() can overestimate tp->rcvq_space.space.

Then sk->sk_rcvbuf can grow and hit tcp_rmem[2] for no good reason.

Fix this by taking into acount the number of available bytes.

Keeping sk->sk_rcvbuf at the right size allows better cache efficiency.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:30:08 -07:00
Eric Dumazet
63ad7dfedf tcp: adjust rcvbuf in presence of reorders
This patch takes care of the needed provisioning
when incoming packets are stored in the out of order queue.

This part was not implemented in the correct way, we need
to decouple it from tcp_rcv_space_adjust() logic.

Without it, stalls in the pipe could happen.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:30:08 -07:00
Eric Dumazet
65c5287892 tcp: fix sk_rcvbuf overshoot
Current autosizing in tcp_rcv_space_adjust() is too aggressive.

Instead of betting on possible losses and over estimate BDP,
it is better to only account for slow start.

The following patch is then adding a more precise tuning
in the events of packet losses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:30:08 -07:00
Eric Dumazet
c1269d3d12 tcp: add tcp_rcvbuf_grow() tracepoint
Provide a new tracepoint to better understand
tcp_rcv_space_adjust() (currently broken) behavior.

Call it only when tcp_rcv_space_adjust() has a chance
to make a change.

I chose to leave trace_tcp_rcv_space_adjust() as is,
because commit 6163849d28 ("net: introduce a new tracepoint
for tcp_rcv_space_adjust") intent was to get it called after
each data delivery to user space.

Tested:

Pair of hosts in the same rack. Ideally, sk->sk_rcvbuf should be kept small.

echo "4096 131072 33554432" >/proc/sys/net/ipv4/tcp_rmem
./netserver
perf record -C10 -e tcp:tcp_rcvbuf_grow sleep 30

<launch from client : netperf -H server -T,10>

Trace for a TS enabled TCP flow (with standard ms granularity)

perf script // We can see that sk_rcvbuf is growing very fast to tcp_mem[2]
  260.500397: tcp:tcp_rcvbuf_grow: time=291 rtt_us=274 copied=110592 inq=0 space=41080 ooo=0 scaling_ratio=230 rcvbuf=131072 ...
  260.501333: tcp:tcp_rcvbuf_grow: time=555 rtt_us=364 copied=333824 inq=0 space=110592 ooo=0 scaling_ratio=230 rcvbuf=1399144 ...
  260.501664: tcp:tcp_rcvbuf_grow: time=331 rtt_us=330 copied=798720 inq=0 space=333824 ooo=0 scaling_ratio=230 rcvbuf=4110551 ...
  260.502003: tcp:tcp_rcvbuf_grow: time=340 rtt_us=330 copied=1040384 inq=49152 space=798720 ooo=0 scaling_ratio=230 rcvbuf=7006410 ...
  260.502483: tcp:tcp_rcvbuf_grow: time=479 rtt_us=330 copied=2658304 inq=49152 space=1040384 ooo=0 scaling_ratio=230 rcvbuf=7006410 ...
  260.502899: tcp:tcp_rcvbuf_grow: time=416 rtt_us=413 copied=4026368 inq=147456 space=2658304 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.504233: tcp:tcp_rcvbuf_grow: time=493 rtt_us=487 copied=4800512 inq=196608 space=4026368 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.504792: tcp:tcp_rcvbuf_grow: time=559 rtt_us=551 copied=5672960 inq=49152 space=4800512 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.506614: tcp:tcp_rcvbuf_grow: time=610 rtt_us=607 copied=6688768 inq=180224 space=5672960 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.507280: tcp:tcp_rcvbuf_grow: time=666 rtt_us=656 copied=6868992 inq=49152 space=6688768 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.507979: tcp:tcp_rcvbuf_grow: time=699 rtt_us=699 copied=7000064 inq=0 space=6868992 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.508681: tcp:tcp_rcvbuf_grow: time=703 rtt_us=699 copied=7208960 inq=0 space=7000064 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.509426: tcp:tcp_rcvbuf_grow: time=744 rtt_us=737 copied=7569408 inq=0 space=7208960 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.510213: tcp:tcp_rcvbuf_grow: time=787 rtt_us=770 copied=7880704 inq=49152 space=7569408 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.511013: tcp:tcp_rcvbuf_grow: time=801 rtt_us=798 copied=8339456 inq=0 space=7880704 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.511860: tcp:tcp_rcvbuf_grow: time=847 rtt_us=824 copied=8601600 inq=49152 space=8339456 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.512710: tcp:tcp_rcvbuf_grow: time=850 rtt_us=846 copied=8814592 inq=65536 space=8601600 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.514428: tcp:tcp_rcvbuf_grow: time=871 rtt_us=865 copied=8855552 inq=49152 space=8814592 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.515333: tcp:tcp_rcvbuf_grow: time=905 rtt_us=882 copied=9228288 inq=49152 space=8855552 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.516237: tcp:tcp_rcvbuf_grow: time=905 rtt_us=896 copied=9371648 inq=49152 space=9228288 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.517149: tcp:tcp_rcvbuf_grow: time=911 rtt_us=909 copied=9543680 inq=49152 space=9371648 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.518070: tcp:tcp_rcvbuf_grow: time=921 rtt_us=921 copied=9793536 inq=0 space=9543680 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.520895: tcp:tcp_rcvbuf_grow: time=948 rtt_us=947 copied=10203136 inq=114688 space=9793536 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.521853: tcp:tcp_rcvbuf_grow: time=959 rtt_us=954 copied=10293248 inq=57344 space=10203136 ooo=0 scaling_ratio=230 rcvbuf=24691992 ...
  260.522818: tcp:tcp_rcvbuf_grow: time=964 rtt_us=959 copied=10330112 inq=0 space=10293248 ooo=0 scaling_ratio=230 rcvbuf=24691992 ...
  260.524760: tcp:tcp_rcvbuf_grow: time=979 rtt_us=969 copied=10633216 inq=49152 space=10330112 ooo=0 scaling_ratio=230 rcvbuf=24691992 ...
  260.526709: tcp:tcp_rcvbuf_grow: time=975 rtt_us=973 copied=12013568 inq=163840 space=10633216 ooo=0 scaling_ratio=230 rcvbuf=25136755 ...
  260.527694: tcp:tcp_rcvbuf_grow: time=985 rtt_us=976 copied=12025856 inq=32768 space=12013568 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.530655: tcp:tcp_rcvbuf_grow: time=991 rtt_us=986 copied=12050432 inq=98304 space=12025856 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.533626: tcp:tcp_rcvbuf_grow: time=993 rtt_us=989 copied=12124160 inq=0 space=12050432 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.538606: tcp:tcp_rcvbuf_grow: time=1000 rtt_us=994 copied=12222464 inq=49152 space=12124160 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.545605: tcp:tcp_rcvbuf_grow: time=1005 rtt_us=998 copied=12263424 inq=81920 space=12222464 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.553626: tcp:tcp_rcvbuf_grow: time=1005 rtt_us=999 copied=12320768 inq=12288 space=12263424 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.589749: tcp:tcp_rcvbuf_grow: time=1001 rtt_us=1000 copied=12398592 inq=16384 space=12320768 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.806577: tcp:tcp_rcvbuf_grow: time=1010 rtt_us=1000 copied=12402688 inq=32768 space=12398592 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  261.002386: tcp:tcp_rcvbuf_grow: time=1002 rtt_us=1000 copied=12419072 inq=98304 space=12402688 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  261.803432: tcp:tcp_rcvbuf_grow: time=1013 rtt_us=1000 copied=12468224 inq=49152 space=12419072 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  261.829533: tcp:tcp_rcvbuf_grow: time=1004 rtt_us=1000 copied=12615680 inq=0 space=12468224 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  265.505435: tcp:tcp_rcvbuf_grow: time=1007 rtt_us=1000 copied=12632064 inq=32768 space=12615680 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...

We also see rtt_us going gradually to 1000 usec, causing massive overshoot.

Trace for a usec TS enabled TCP flow (us granularity)

perf script // We can see that sk_rcvbuf is growing to a smaller value,
               thanks to tight rtt_us values.
 1509.273955: tcp:tcp_rcvbuf_grow: time=396 rtt_us=377 copied=110592 inq=0 space=41080 ooo=0 scaling_ratio=230 rcvbuf=131072 ...
 1509.274366: tcp:tcp_rcvbuf_grow: time=412 rtt_us=365 copied=129024 inq=0 space=110592 ooo=0 scaling_ratio=230 rcvbuf=1399144 ...
 1509.274738: tcp:tcp_rcvbuf_grow: time=372 rtt_us=355 copied=194560 inq=0 space=129024 ooo=0 scaling_ratio=230 rcvbuf=1399144 ...
 1509.275020: tcp:tcp_rcvbuf_grow: time=282 rtt_us=257 copied=401408 inq=0 space=194560 ooo=0 scaling_ratio=230 rcvbuf=1399144 ...
 1509.275190: tcp:tcp_rcvbuf_grow: time=170 rtt_us=144 copied=741376 inq=229376 space=401408 ooo=0 scaling_ratio=230 rcvbuf=3021625 ...
 1509.275300: tcp:tcp_rcvbuf_grow: time=110 rtt_us=110 copied=1146880 inq=65536 space=741376 ooo=0 scaling_ratio=230 rcvbuf=4642390 ...
 1509.275449: tcp:tcp_rcvbuf_grow: time=149 rtt_us=106 copied=1310720 inq=737280 space=1146880 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
 1509.275560: tcp:tcp_rcvbuf_grow: time=111 rtt_us=107 copied=1388544 inq=430080 space=1310720 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
 1509.275674: tcp:tcp_rcvbuf_grow: time=114 rtt_us=113 copied=1495040 inq=421888 space=1388544 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
 1509.275800: tcp:tcp_rcvbuf_grow: time=126 rtt_us=126 copied=1572864 inq=77824 space=1495040 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
 1509.275968: tcp:tcp_rcvbuf_grow: time=168 rtt_us=161 copied=1863680 inq=172032 space=1572864 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
 1509.276129: tcp:tcp_rcvbuf_grow: time=161 rtt_us=161 copied=1941504 inq=204800 space=1863680 ooo=0 scaling_ratio=230 rcvbuf=5782790 ...
 1509.276288: tcp:tcp_rcvbuf_grow: time=159 rtt_us=158 copied=1990656 inq=131072 space=1941504 ooo=0 scaling_ratio=230 rcvbuf=5782790 ...
 1509.276900: tcp:tcp_rcvbuf_grow: time=228 rtt_us=226 copied=2883584 inq=266240 space=1990656 ooo=0 scaling_ratio=230 rcvbuf=5782790 ...
 1509.277819: tcp:tcp_rcvbuf_grow: time=242 rtt_us=236 copied=3022848 inq=0 space=2883584 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.278072: tcp:tcp_rcvbuf_grow: time=253 rtt_us=247 copied=3055616 inq=49152 space=3022848 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.279560: tcp:tcp_rcvbuf_grow: time=268 rtt_us=264 copied=3133440 inq=180224 space=3055616 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.279833: tcp:tcp_rcvbuf_grow: time=274 rtt_us=270 copied=3424256 inq=0 space=3133440 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.282187: tcp:tcp_rcvbuf_grow: time=277 rtt_us=273 copied=3465216 inq=180224 space=3424256 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.284685: tcp:tcp_rcvbuf_grow: time=292 rtt_us=292 copied=3481600 inq=147456 space=3465216 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.284983: tcp:tcp_rcvbuf_grow: time=297 rtt_us=295 copied=3702784 inq=45056 space=3481600 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.285596: tcp:tcp_rcvbuf_grow: time=311 rtt_us=310 copied=3723264 inq=40960 space=3702784 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.285909: tcp:tcp_rcvbuf_grow: time=313 rtt_us=304 copied=3846144 inq=196608 space=3723264 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.291654: tcp:tcp_rcvbuf_grow: time=322 rtt_us=311 copied=3960832 inq=49152 space=3846144 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.291986: tcp:tcp_rcvbuf_grow: time=333 rtt_us=330 copied=4075520 inq=360448 space=3960832 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.292319: tcp:tcp_rcvbuf_grow: time=332 rtt_us=332 copied=4079616 inq=65536 space=4075520 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.292666: tcp:tcp_rcvbuf_grow: time=348 rtt_us=347 copied=4177920 inq=212992 space=4079616 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.293015: tcp:tcp_rcvbuf_grow: time=349 rtt_us=345 copied=4276224 inq=262144 space=4177920 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.293371: tcp:tcp_rcvbuf_grow: time=356 rtt_us=346 copied=4415488 inq=49152 space=4276224 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.515798: tcp:tcp_rcvbuf_grow: time=424 rtt_us=411 copied=4833280 inq=81920 space=4415488 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250513193919.1089692-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:30:08 -07:00
Jakub Kicinski
bebd7b2626 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc7).

Conflicts:

tools/testing/selftests/drivers/net/hw/ncdevmem.c
  97c4e094a4 ("tests/ncdevmem: Fix double-free of queue array")
  2f1a805f32 ("selftests: ncdevmem: Implement devmem TCP TX")
https://lore.kernel.org/20250514122900.1e77d62d@canb.auug.org.au

Adjacent changes:

net/core/devmem.c
net/core/devmem.h
  0afc44d8cd ("net: devmem: fix kernel panic when netlink socket close after module unload")
  bd61848900 ("net: devmem: Implement TX path")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:28:30 -07:00
Luiz Augusto von Dentz
7af8479d9e Bluetooth: L2CAP: Fix not checking l2cap_chan security level
l2cap_check_enc_key_size shall check the security level of the
l2cap_chan rather than the hci_conn since for incoming connection
request that may be different as hci_conn may already been
encrypted using a different security level.

Fixes: 522e9ed157 ("Bluetooth: l2cap: Check encryption key size on incoming connection")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-15 13:09:46 -04:00
Taehee Yoo
0afc44d8cd net: devmem: fix kernel panic when netlink socket close after module unload
Kernel panic occurs when a devmem TCP socket is closed after NIC module
is unloaded.

This is Devmem TCP unregistration scenarios. number is an order.
(a)netlink socket close    (b)pp destroy    (c)uninstall    result
1                          2                3               OK
1                          3                2               (d)Impossible
2                          1                3               OK
3                          1                2               (e)Kernel panic
2                          3                1               (d)Impossible
3                          2                1               (d)Impossible

(a) netdev_nl_sock_priv_destroy() is called when devmem TCP socket is
    closed.
(b) page_pool_destroy() is called when the interface is down.
(c) mp_ops->uninstall() is called when an interface is unregistered.
(d) There is no scenario in mp_ops->uninstall() is called before
    page_pool_destroy().
    Because unregister_netdevice_many_notify() closes interfaces first
    and then calls mp_ops->uninstall().
(e) netdev_nl_sock_priv_destroy() accesses struct net_device to acquire
    netdev_lock().
    But if the interface module has already been removed, net_device
    pointer is invalid, so it causes kernel panic.

In summary, there are only 3 possible scenarios.
 A. sk close -> pp destroy -> uninstall.
 B. pp destroy -> sk close -> uninstall.
 C. pp destroy -> uninstall -> sk close.

Case C is a kernel panic scenario.

In order to fix this problem, It makes mp_dmabuf_devmem_uninstall() set
binding->dev to NULL.
It indicates an bound net_device was unregistered.

It makes netdev_nl_sock_priv_destroy() do not acquire netdev_lock()
if binding->dev is NULL.

A new binding->lock is added to protect a dev of a binding.
So, lock ordering is like below.
 priv->lock
 netdev_lock(dev)
 binding->lock

Tests:
Scenario A:
    ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
        -v 7 -t 1 -q 1 &
    pid=$!
    sleep 10
    kill $pid
    ip link set $interface down
    modprobe -rv $module

Scenario B:
    ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
        -v 7 -t 1 -q 1 &
    pid=$!
    sleep 10
    ip link set $interface down
    kill $pid
    modprobe -rv $module

Scenario C:
    ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
        -v 7 -t 1 -q 1 &
    pid=$!
    sleep 10
    modprobe -rv $module
    sleep 5
    kill $pid

Splat looks like:
Oops: general protection fault, probably for non-canonical address 0xdffffc001fffa9f7: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
KASAN: probably user-memory-access in range [0x00000000fffd4fb8-0x00000000fffd4fbf]
CPU: 0 UID: 0 PID: 2041 Comm: ncdevmem Tainted: G    B   W           6.15.0-rc1+ #2 PREEMPT(undef)  0947ec89efa0fd68838b78e36aa1617e97ff5d7f
Tainted: [B]=BAD_PAGE, [W]=WARN
RIP: 0010:__mutex_lock (./include/linux/sched.h:2244 kernel/locking/mutex.c:400 kernel/locking/mutex.c:443 kernel/locking/mutex.c:605 kernel/locking/mutex.c:746)
Code: ea 03 80 3c 02 00 0f 85 4f 13 00 00 49 8b 1e 48 83 e3 f8 74 6a 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 34 48 89 fa 48 c1 ea 03 <0f> b6 f
RSP: 0018:ffff88826f7ef730 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: 00000000fffd4f88 RCX: ffffffffaa9bc811
RDX: 000000001fffa9f7 RSI: 0000000000000008 RDI: 00000000fffd4fbc
RBP: ffff88826f7ef8b0 R08: 0000000000000000 R09: ffffed103e6aa1a4
R10: 0000000000000007 R11: ffff88826f7ef442 R12: fffffbfff669f65e
R13: ffff88812a830040 R14: ffff8881f3550d20 R15: 00000000fffd4f88
FS:  0000000000000000(0000) GS:ffff888866c05000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000563bed0cb288 CR3: 00000001a7c98000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
<TASK>
 ...
 netdev_nl_sock_priv_destroy (net/core/netdev-genl.c:953 (discriminator 3))
 genl_release (net/netlink/genetlink.c:653 net/netlink/genetlink.c:694 net/netlink/genetlink.c:705)
 ...
 netlink_release (net/netlink/af_netlink.c:737)
 ...
 __sock_release (net/socket.c:647)
 sock_close (net/socket.c:1393)

Fixes: 1d22d3060b ("net: drop rtnl_lock for queue_mgmt operations")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250514154028.1062909-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 08:05:32 -07:00
Pengtao He
491deb9b8c net/tls: fix kernel panic when alloc_page failed
We cannot set frag_list to NULL pointer when alloc_page failed.
It will be used in tls_strp_check_queue_ok when the next time
tls_strp_read_sock is called.

This is because we don't reset full_len in tls_strp_flush_anchor_copy()
so the recv path will try to continue handling the partial record
on the next call but we dettached the rcvq from the frag list.
Alternative fix would be to reset full_len.

Unable to handle kernel NULL pointer dereference
at virtual address 0000000000000028
 Call trace:
 tls_strp_check_rcv+0x128/0x27c
 tls_strp_data_ready+0x34/0x44
 tls_data_ready+0x3c/0x1f0
 tcp_data_ready+0x9c/0xe4
 tcp_data_queue+0xf6c/0x12d0
 tcp_rcv_established+0x52c/0x798

Fixes: 84c61fe1a7 ("tls: rx: do not use the standard strparser")
Signed-off-by: Pengtao He <hept.hept.hept@gmail.com>
Link: https://patch.msgid.link/20250514132013.17274-1-hept.hept.hept@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 07:40:51 -07:00
Jakub Kicinski
3933536c87 Couple of stragglers:
- mac80211: fix syzbot/ubsan in scan counted-by
  - mt76: fix NAPI handling on driver remove
  - mt67: fix multicast/ipv6 receive
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmgl2xIACgkQ10qiO8sP
 aACj5g/+Lz4v2yTuBSX20yp99mjWO2uGuUn5zBf8dUlsoq6s8/OvqBPBqv/ryA5n
 3RCeTY+JzZHgBbxQ9h9eOwXPA7LUlJ1Lt9Uv+xokFTqO7Yxbu0d1Aqum9Xb+3HP4
 Sxf0Ib8tc4BcKsXAZIDF9PwaaY8yFqjM2oH3Z7cEUZjJgniao+Zl2Vr5naYcxX64
 OLv3ebtdp8sR0KcRW9upp6pXJjTJj1j0Th+s51aVlkhs5e0vb+z3qB8y3w+kEw2q
 d0VVI5z9may7WGwZ3IHRXRN8/XfT59L/UAW/bf82IKRIxe6gCohaIcRCif8LVJ9P
 jegEhz5e4GFDy2Hwrxs4BAJEOx7UluluL91Wd6Z1wYp8vUyoRVZQGLBQMH25FAh9
 Pj662XTaXdkirFJiP8FhB2D6S9Zy36d9EEDyoN4fbWrU3IJSWhAw5akLK1UdyIAa
 7CtnE7TyG4bBmzwwBdVwHKdoXskWKnb4XpEkCMK9GGl+d2e0hhwpBJmpuBgiGxcu
 k5s7MSvNIdej2MH2iEwGPQGdzkXrt4jLjSk9BGSuagS6794qcJGnneWR9/ZvhZU1
 qIiyJA9RQJj0tDWUHPo7PSKpPFqVyKDgS9RKzWblCgD0c6JvGrbdSF+CQLhojs9X
 jSjrh584D/oEg+mbEbbDCaTily4zBSxkBbBHJFJn07tUcEs5ZZU=
 =DZLy
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2025-05-15' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Couple of stragglers:
 - mac80211: fix syzbot/ubsan in scan counted-by
 - mt76: fix NAPI handling on driver remove
 - mt67: fix multicast/ipv6 receive

* tag 'wireless-2025-05-15' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: mac80211: Set n_channels after allocating struct cfg80211_scan_request
  wifi: mt76: mt7925: fix missing hdr_trans_tlv command for broadcast wtbl
  wifi: mt76: disable napi on driver removal
====================

Link: https://patch.msgid.link/20250515121749.61912-4-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 07:19:50 -07:00
Sebastian Andrzej Siewior
c50d295c37 rds: Use nested-BH locking for rds_page_remainder
rds_page_remainder is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use
local_lock_nested_bh() for locking. This change adds only lockdep
coverage and does not alter the functional behaviour for !PREEMPT_RT.

Cc: Allison Henderson <allison.henderson@oracle.com>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-16-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:31 +02:00
Sebastian Andrzej Siewior
0af5928f35 rds: Acquire per-CPU pointer within BH disabled section
rds_page_remainder_alloc() obtains the current CPU with get_cpu() while
disabling preemption. Then the CPU number is used to access the per-CPU
data structure via per_cpu().

This can be optimized by relying on local_bh_disable() to provide a
stable CPU number/ avoid migration and then using this_cpu_ptr() to
retrieve the data structure.

Cc: Allison Henderson <allison.henderson@oracle.com>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-15-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:31 +02:00
Sebastian Andrzej Siewior
aaaaa6639c rds: Disable only bottom halves in rds_page_remainder_alloc()
rds_page_remainder_alloc() is invoked from a preemptible context or a
tasklet. There is no need to disable interrupts for locking.

Use local_bh_disable() instead of local_irq_save() for locking.

Cc: Allison Henderson <allison.henderson@oracle.com>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-14-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:31 +02:00
Sebastian Andrzej Siewior
82d9e6b9a0 mptcp: Use nested-BH locking for hmac_storage
mptcp_delegated_actions is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data
structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh() for
locking. This change adds only lockdep coverage and does not alter the
functional behaviour for !PREEMPT_RT.

Cc: Matthieu Baerts <matttbe@kernel.org>
Cc: Mat Martineau <martineau@kernel.org>
Cc: Geliang Tang <geliang@kernel.org>
Cc: mptcp@lists.linux.dev
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-13-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:31 +02:00
Sebastian Andrzej Siewior
20d677d389 net/sched: Use nested-BH locking for sch_frag_data_storage
sch_frag_data_storage is a per-CPU variable and relies on disabled BH
for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.

Add local_lock_t to the struct and use local_lock_nested_bh() for locking.
This change adds only lockdep coverage and does not alter the functional
behaviour for !PREEMPT_RT.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-12-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:31 +02:00
Sebastian Andrzej Siewior
7fe70c06a1 net/sched: act_mirred: Move the recursion counter struct netdev_xmit
mirred_nest_level is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Move mirred_nest_level to struct netdev_xmit as u8, provide wrappers.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20250512092736.229935-11-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:31 +02:00
Sebastian Andrzej Siewior
3af4cdd67f openvswitch: Move ovs_frag_data_storage into the struct ovs_pcpu_storage
ovs_frag_data_storage is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Move ovs_frag_data_storage into the struct ovs_pcpu_storage which already
provides locking for the structure.

Cc: Aaron Conole <aconole@redhat.com>
Cc: Eelco Chaudron <echaudro@redhat.com>
Cc: Ilya Maximets <i.maximets@ovn.org>
Cc: dev@openvswitch.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250512092736.229935-10-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:31 +02:00
Sebastian Andrzej Siewior
672318331b openvswitch: Use nested-BH locking for ovs_pcpu_storage
ovs_pcpu_storage is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.
The data structure can be referenced recursive and there is a recursion
counter to avoid too many recursions.

Add a local_lock_t to the data structure and use
local_lock_nested_bh() for locking. Add an owner of the struct which is
the current task and acquire the lock only if the structure is not owned
by the current task.

Cc: Aaron Conole <aconole@redhat.com>
Cc: Eelco Chaudron <echaudro@redhat.com>
Cc: Ilya Maximets <i.maximets@ovn.org>
Cc: dev@openvswitch.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250512092736.229935-9-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:31 +02:00
Sebastian Andrzej Siewior
035fcdc4d2 openvswitch: Merge three per-CPU structures into one
exec_actions_level is a per-CPU integer allocated at compile time.
action_fifos and flow_keys are per-CPU pointer and have their data
allocated at module init time.
There is no gain in splitting it, once the module is allocated, the
structures are allocated.

Merge the three per-CPU variables into ovs_pcpu_storage, adapt callers.

Cc: Aaron Conole <aconole@redhat.com>
Cc: Eelco Chaudron <echaudro@redhat.com>
Cc: Ilya Maximets <i.maximets@ovn.org>
Cc: dev@openvswitch.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250512092736.229935-8-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:31 +02:00
Sebastian Andrzej Siewior
9c607d4b65 xfrm: Use nested-BH locking for nat_keepalive_sk_ipv[46]
nat_keepalive_sk_ipv[46] is a per-CPU variable and relies on disabled BH
for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.

Use sock_bh_locked which has a sock pointer and a local_lock_t. Use
local_lock_nested_bh() for locking. This change adds only lockdep
coverage and does not alter the functional behaviour for !PREEMPT_RT.

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-7-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:31 +02:00
Sebastian Andrzej Siewior
b9eef3391d xdp: Use nested-BH locking for system_page_pool
system_page_pool is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Make a struct with a page_pool member (original system_page_pool) and a
local_lock_t and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.

Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-6-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:31 +02:00
Sebastian Andrzej Siewior
bc57eda646 ipv6: sr: Use nested-BH locking for hmac_storage
hmac_storage is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use
local_lock_nested_bh() for locking. This change adds only lockdep
coverage and does not alter the functional behaviour for !PREEMPT_RT.

Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-5-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:31 +02:00
Sebastian Andrzej Siewior
1c0829788a ipv4/route: Use this_cpu_inc() for stats on PREEMPT_RT
The statistics are incremented with raw_cpu_inc() assuming it always
happens with bottom half disabled. Without per-CPU locking in
local_bh_disable() on PREEMPT_RT this is no longer true.

Use this_cpu_inc() on PREEMPT_RT for the increment to not worry about
preemption.

Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-4-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:30 +02:00
Sebastian Andrzej Siewior
c99dac52ff net: dst_cache: Use nested-BH locking for dst_cache::cache
dst_cache::cache is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.

Add a local_lock_t to the data structure and use
local_lock_nested_bh() for locking. This change adds only lockdep
coverage and does not alter the functional behaviour for !PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-3-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:30 +02:00
Sebastian Andrzej Siewior
32471b2f48 net: page_pool: Don't recycle into cache on PREEMPT_RT
With preemptible softirq and no per-CPU locking in local_bh_disable() on
PREEMPT_RT the consumer can be preempted while a skb is returned.

Avoid the race by disabling the recycle into the cache on PREEMPT_RT.

Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250512092736.229935-2-bigeasy@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-15 15:23:30 +02:00
Kees Cook
82bbe02b25 wifi: mac80211: Set n_channels after allocating struct cfg80211_scan_request
Make sure that n_channels is set after allocating the
struct cfg80211_registered_device::int_scan_req member. Seen with
syzkaller:

UBSAN: array-index-out-of-bounds in net/mac80211/scan.c:1208:5
index 0 is out of range for type 'struct ieee80211_channel *[] __counted_by(n_channels)' (aka 'struct ieee80211_channel *[]')

This was missed in the initial conversions because I failed to locate
the allocation likely due to the "sizeof(void *)" not matching the
"channels" array type.

Reported-by: syzbot+4bcdddd48bb6f0be0da1@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/680fd171.050a0220.2b69d1.045e.GAE@google.com/
Fixes: e3eac9f32e ("wifi: cfg80211: Annotate struct cfg80211_scan_request with __counted_by")
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/20250509184641.work.542-kees@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-15 13:20:33 +02:00
Leon Romanovsky
c82b48b63a xfrm: prevent configuration of interface index when offload is used
Both packet and crypto offloads perform decryption while packet is
arriving to the HW from the wire. It means that there is no possible
way to perform lookup on XFRM if_id as it can't be set to be "before' HW.

So instead of silently ignore this configuration, let's warn users about
misconfiguration.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-05-15 09:56:20 +02:00
Leon Romanovsky
e86212b6b1 xfrm: validate assignment of maximal possible SEQ number
Users can set any seq/seq_hi/oseq/oseq_hi values. The XFRM core code
doesn't prevent from them to set even 0xFFFFFFFF, however this value
will cause for traffic drop.

Is is happening because SEQ numbers here mean that packet with such
number was processed and next number should be sent on the wire. In this
case, the next number will be 0, and it means overflow which causes to
(expected) packet drops.

While it can be considered as misconfiguration and handled by XFRM
datapath in the same manner as any other SEQ number, let's add
validation to easy for packet offloads implementations which need to
configure HW with next SEQ to send and not with current SEQ like it is
done in core code.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-05-15 09:56:19 +02:00
Eelco Chaudron
88906f5595 openvswitch: Stricter validation for the userspace action
This change enhances the robustness of validate_userspace() by ensuring
that all Netlink attributes are fully contained within the parent
attribute. The previous use of nla_parse_nested_deprecated() could
silently skip trailing or malformed attributes, as it stops parsing at
the first invalid entry.

By switching to nla_parse_deprecated_strict(), we make sure only fully
validated attributes are copied for later use.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Link: https://patch.msgid.link/67eb414e2d250e8408bb8afeb982deca2ff2b10b.1747037304.git.echaudro@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14 19:13:34 -07:00
Paul Chaignon
0b91fda3a1 xfrm: Sanitize marks before insert
Prior to this patch, the mark is sanitized (applying the state's mask to
the state's value) only on inserts when checking if a conflicting XFRM
state or policy exists.

We discovered in Cilium that this same sanitization does not occur
in the hot-path __xfrm_state_lookup. In the hot-path, the sk_buff's mark
is simply compared to the state's value:

    if ((mark & x->mark.m) != x->mark.v)
        continue;

Therefore, users can define unsanitized marks (ex. 0xf42/0xf00) which will
never match any packet.

This commit updates __xfrm_state_insert and xfrm_policy_insert to store
the sanitized marks, thus removing this footgun.

This has the side effect of changing the ip output, as the
returned mark will have the mask applied to it when printed.

Fixes: 3d6acfa764 ("xfrm: SA lookups with mark")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Louis DeLosSantos <louis.delos.devel@gmail.com>
Co-developed-by: Louis DeLosSantos <louis.delos.devel@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-05-14 07:18:58 +02:00
Mina Almasry
ae28cb1147 net: check for driver support in netmem TX
We should not enable netmem TX for drivers that don't declare support.

Check for driver netmem TX support during devmem TX binding and fail if
the driver does not have the functionality.

Check for driver support in validate_xmit_skb as well.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-9-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13 11:12:49 +02:00
Mina Almasry
bd61848900 net: devmem: Implement TX path
Augment dmabuf binding to be able to handle TX. Additional to all the RX
binding, we also create tx_vec needed for the TX path.

Provide API for sendmsg to be able to send dmabufs bound to this device:

- Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
- MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.

Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
implementation, while disabling instances where MSG_ZEROCOPY falls back
to copying.

We additionally pipe the binding down to the new
zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
instead of the traditional page netmems.

We also special case skb_frag_dma_map to return the dma-address of these
dmabuf net_iovs instead of attempting to map pages.

The TX path may release the dmabuf in a context where we cannot wait.
This happens when the user unbinds a TX dmabuf while there are still
references to its netmems in the TX path. In that case, the netmems will
be put_netmem'd from a context where we can't unmap the dmabuf, Resolve
this by making __net_devmem_dmabuf_binding_free schedule_work'd.

Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat
of the implementation came from devmem TCP RFC v1[1], which included the
TX path, but Stan did all the rebasing on top of netmem/net_iov.

Cc: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13 11:12:48 +02:00
Stanislav Fomichev
8802087d20 net: devmem: TCP tx netlink api
Add bind-tx netlink call to attach dmabuf for TX; queue is not
required, only ifindex and dmabuf fd for attachment.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250508004830.4100853-4-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13 11:12:48 +02:00
Mina Almasry
e9f3d61db5 net: add get_netmem/put_netmem support
Currently net_iovs support only pp ref counts, and do not support a
page ref equivalent.

This is fine for the RX path as net_iovs are used exclusively with the
pp and only pp refcounting is needed there. The TX path however does not
use pp ref counts, thus, support for get_page/put_page equivalent is
needed for netmem.

Support get_netmem/put_netmem. Check the type of the netmem before
passing it to page or net_iov specific code to obtain a page ref
equivalent.

For dmabuf net_iovs, we obtain a ref on the underlying binding. This
ensures the entire binding doesn't disappear until all the net_iovs have
been put_netmem'ed. We do not need to track the refcount of individual
dmabuf net_iovs as we don't allocate/free them from a pool similar to
what the buddy allocator does for pages.

This code is written to be extensible by other net_iov implementers.
get_netmem/put_netmem will check the type of the netmem and route it to
the correct helper:

pages -> [get|put]_page()
dmabuf net_iovs -> net_devmem_[get|put]_net_iov()
new net_iovs ->	new helpers

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-3-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13 11:12:48 +02:00
Mina Almasry
03e96b8c11 netmem: add niov->type attribute to distinguish different net_iov types
Later patches in the series adds TX net_iovs where there is no pp
associated, so we can't rely on niov->pp->mp_ops to tell what is the
type of the net_iov.

Add a type enum to the net_iov which tells us the net_iov type.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250508004830.4100853-2-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13 11:12:48 +02:00
Jakub Kicinski
a96876057b netlink: fix policy dump for int with validation callback
Recent devlink change added validation of an integer value
via NLA_POLICY_VALIDATE_FN, for sparse enums. Handle this
in policy dump. We can't extract any info out of the callback,
so report only the type.

Fixes: 429ac62114 ("devlink: define enum for attr types of dynamic attributes")
Reported-by: syzbot+01eb26848144516e7f0a@syzkaller.appspotmail.com
Link: https://patch.msgid.link/20250509212751.1905149-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12 18:50:09 -07:00
Cosmin Ratiu
af5f54b0ef net: Lock lower level devices when updating features
__netdev_update_features() expects the netdevice to be ops-locked, but
it gets called recursively on the lower level netdevices to sync their
features, and nothing locks those.

This commit fixes that, with the assumption that it shouldn't be possible
for both higher-level and lover-level netdevices to require the instance
lock, because that would lead to lock dependency warnings.

Without this, playing with higher level (e.g. vxlan) netdevices on top
of netdevices with instance locking enabled can run into issues:

WARNING: CPU: 59 PID: 206496 at ./include/net/netdev_lock.h:17 netif_napi_add_weight_locked+0x753/0xa60
[...]
Call Trace:
 <TASK>
 mlx5e_open_channel+0xc09/0x3740 [mlx5_core]
 mlx5e_open_channels+0x1f0/0x770 [mlx5_core]
 mlx5e_safe_switch_params+0x1b5/0x2e0 [mlx5_core]
 set_feature_lro+0x1c2/0x330 [mlx5_core]
 mlx5e_handle_feature+0xc8/0x140 [mlx5_core]
 mlx5e_set_features+0x233/0x2e0 [mlx5_core]
 __netdev_update_features+0x5be/0x1670
 __netdev_update_features+0x71f/0x1670
 dev_ethtool+0x21c5/0x4aa0
 dev_ioctl+0x438/0xae0
 sock_ioctl+0x2ba/0x690
 __x64_sys_ioctl+0xa78/0x1700
 do_syscall_64+0x6d/0x140
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>

Fixes: 7e4d784f58 ("net: hold netdev instance lock during rtnetlink operations")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250509072850.2002821-1-cratiu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12 18:44:33 -07:00
Chuck Lever
8ac6fcae5d svcrdma: Unregister the device if svc_rdma_accept() fails
To handle device removal, svc_rdma_accept() requests removal
notification for the underlying device when accepting a connection.
However svc_rdma_free() is not invoked if svc_rdma_accept() fails.
There needs to be a matching "unregister" in that case; otherwise
the device cannot be removed.

Fixes: c4de97f7c4 ("svcrdma: Handle device removal outside of the CM event handler")
Cc: stable@vger.kernel.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11 19:48:28 -04:00
Jeff Layton
de08ffb79c sunrpc: allow SOMAXCONN backlogged TCP connections
The connection backlog passed to listen() denotes the number of
connections that are fully established, but that have not yet been
accept()ed. If the amount goes above that level, new connection requests
will be dropped on the floor until the value goes down. If all the knfsd
threads are bogged down in (e.g.) disk I/O, new connection attempts can
stall because of this.

For the same rationale that Trond points out in the userland patch [1],
ensure that svc_xprt sockets created by the kernel allow SOMAXCONN
(4096) backlogged connections instead of the 64 that they do today.

[1]: https://lore.kernel.org/linux-nfs/20240308180223.2965601-1-trond.myklebust@hammerspace.com/

Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11 19:48:28 -04:00
Jeff Layton
18c64378ad sunrpc: add info about xprt queue times to svc_xprt_dequeue tracepoint
I've been looking at a problem where we see increased RPC timeouts in
clients when the nfs_layout_flexfiles dataserver_timeo value is tuned
very low (6s). This is necessary to ensure quick failover to a different
mirror if a server goes down, but it causes a lot more major RPC timeouts.

Ultimately, the problem is server-side however. It's sometimes doesn't
respond to connection attempts. My theory is that the interrupt handler
runs when a connection comes in, the xprt ends up being enqueued, but it
takes a significant amount of time for the nfsd thread to pick it up.

Currently, the svc_xprt_dequeue tracepoint displays "wakeup-us". This is
the time between the wake_up() call, and the thread dequeueing the xprt.
If no thread was woken, or the thread ended up picking up a different
xprt than intended, then this value won't tell us how long the xprt was
waiting.

Add a new xpt_qtime field to struct svc_xprt and set it in
svc_xprt_enqueue(). When the dequeue tracepoint fires, also store the
time that the xprt sat on the queue in total. Display it as "qtime-us".

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11 19:48:26 -04:00
Long Li
2298abcbe1 sunrpc: fix race in cache cleanup causing stale nextcheck time
When cache cleanup runs concurrently with cache entry removal, a race
condition can occur that leads to incorrect nextcheck times. This can
delay cache cleanup for the cache_detail by up to 1800 seconds:

1. cache_clean() sets nextcheck to current time plus 1800 seconds
2. While scanning a non-empty bucket, concurrent cache entry removal can
   empty that bucket
3. cache_clean() finds no cache entries in the now-empty bucket to update
   the nextcheck time
4. This maybe delays the next scan of the cache_detail by up to 1800
   seconds even when it should be scanned earlier based on remaining
   entries

Fix this by moving the hash_lock acquisition earlier in cache_clean().
This ensures bucket emptiness checks and nextcheck updates happen
atomically, preventing the race between cleanup and entry removal.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11 19:48:22 -04:00
Long Li
5ca00634c8 sunrpc: update nextcheck time when adding new cache entries
The cache_detail structure uses a "nextcheck" field to control hash table
scanning intervals. When a table scan begins, nextcheck is set to current
time plus 1800 seconds. During scanning, if cache_detail is not empty and
a cache entry's expiry time is earlier than the current nextcheck, the
nextcheck is updated to that expiry time.

This mechanism ensures that:
1) Empty cache_details are scanned every 1800 seconds to avoid unnecessary
   scans
2) Non-empty cache_details are scanned based on the earliest expiry time
   found

However, when adding a new cache entry to an empty cache_detail, the
nextcheck time was not being updated, remaining at 1800 seconds. This
could delay cache cleanup for up to 1800 seconds, potentially blocking
threads(such as nfsd) that are waiting for cache cleanup.

Fix this by updating the nextcheck time whenever a new cache entry is
added.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11 19:48:22 -04:00
Jiayuan Chen
79f0c39ae7 ktls, sockmap: Fix missing uncharge operation
When we specify apply_bytes, we divide the msg into multiple segments,
each with a length of 'send', and every time we send this part of the data
using tcp_bpf_sendmsg_redir(), we use sk_msg_return_zero() to uncharge the
memory of the specified 'send' size.

However, if the first segment of data fails to send, for example, the
peer's buffer is full, we need to release all of the msg. When releasing
the msg, we haven't uncharged the memory of the subsequent segments.

This modification does not make significant logical changes, but only
fills in the missing uncharge places.

This issue has existed all along, until it was exposed after we added the
apply test in test_sockmap:
commit 3448ad23b3 ("selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap")

Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Closes: https://lore.kernel.org/bpf/aAmIi0vlycHtbXeb@pop-os.localdomain/T/#t
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://lore.kernel.org/r/20250425060015.6968-2-jiayuan.chen@linux.dev
2025-05-09 18:09:59 -07:00
Jakub Kicinski
4d64321c4f Here is a batman-adv bugfix:
- fix duplicate MAC address check, by Matthias Schiffer
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmgdxEYWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoSG7D/9UpiHwuCb3FCT9hZVNbRX6XChV
 DSv8lYedVflwYgIZCqcKlg4/svygmnNNYy4XOJeLw7kjeegGmdQsol0TouZGkqTB
 pNxKnporVTxwA4FDSaRwVxwF1UOlelAzcMRxAk5tpIkGPl1wBFLx7dFzWWuU7wvT
 c0mtkJlV0kag0OZJEZqY5fYmQNHmYde9rsTpdP44ie7SPqGHpby6MYjqfGl0ztxy
 70XCMgp9ByRsCnL45UQYua17gYavKWalCsAYCikQc/+nZGkgymAaIFv5elFrou9N
 Dbw+k9X2ipw3v2atAUP40Bw6dvdIkTRaEpAyO5oDGM1//kqwAfUQazS+IWFYE50v
 eN2KRGSNJH/J2RHY0ODRomcFDotHykR8aESbkePJHc5ZFgM9E6vAzy+BPrFJaROX
 nRO/XmWXeVwyl9CrxoEtA0UusIEr3OQBMi2p5cV7vbcPHSPzQRuA2iKah0ZFsmB0
 pGyaeEh60lO2r84lYsO04bA9q8eyZkNHUxd3pZnCpy4iAzEmgKaD7RAIygwHtRim
 UezzP8BO7m6dlox16/eJ2Bev2EmA3m3w/rkyFU2jrE9xv8tTkjIYs04zTifs8+F7
 hAU+2WM/P0z+lCVEZzTaguEjFK4UrmnmSZxF1B4mh5AVQL5q4R5np4ncTiBdGJCl
 uS34escHTXWW4gHttQ==
 =kqur
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-pullrequest-20250509' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here is a batman-adv bugfix:

 - fix duplicate MAC address check, by Matthias Schiffer

* tag 'batadv-net-pullrequest-20250509' of git://git.open-mesh.org/linux-merge:
  batman-adv: fix duplicate MAC address check
====================

Link: https://patch.msgid.link/20250509090240.107796-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09 17:09:39 -07:00
Jakub Kicinski
6a63b01567 This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - constify and move broadcast addr definition, Matthias Schiffer
 
  - remove start/stop queue function for mesh-iface, by Antonio Quartulli
 
  - switch to crc32 header for crc32c, by Sven Eckelmann
 
  - drop unused net_namespace.h include, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmgdxeMWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoSR2EADDS+gfYJHfGK+9R2N+GFVf+/nu
 avyo9HPi2ZdQEgzeuSDZFQb96yzQNOcW3iTkM+nSAJMANHV7I9n28ZKIufqtXNaR
 E3dKpi592ZjwedaEYp1JFHp8L4bGxEWQ68ie8zqdlyT4QXryUyVVZG061jKF2Pjl
 vns3KAL6kzFv8SlweR+JuoYudQTjn/P0m1iuPylpzf2cCD8yUaY6GsI11D4XDgxQ
 IJDkNpmVulBEJ6oYqUfA9fqQffnnxiAw4/IaA4NOrzIGvNGHKfDfnWEMTqNB/YU7
 xZ7/zD6iKefxd8OQkiFnA1NuNw+dVuii/HOeH3GcWMQ9blApeAHCUZ+o9xp6uGP0
 8f6AQ3TU/STe9/whDCgHvjMHED5V2hpYSVvtWn0w3L9nqhxrZq5eY0Cek8k7vt+G
 tRj1tyzrEcpoPQMb/eL8mS179PgCV72KKe8gwSrFdgKzLaxWVcqnByPVUjSzhHoN
 4yOWY+KUhSH9CUTes0/dMwag2EorK1B/cUXm1forHk2xNosw/FvXD1hYXJB1Mcx5
 tZvN9BalJcag4aAhpx2CCDYI0XMZrAXDB2duS/lrgSbLdbvWN3WDGAr2aI3GXpWS
 97ezGLIMWn8l9oE2BJr1s6a4/2sXJ/HSYteUGkRxIQbaitVUsN9rueuQvRzSvUvW
 i8Y5mZZaX5EUW/2gfw==
 =T+6l
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-pullrequest-20250509' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - constify and move broadcast addr definition, Matthias Schiffer

 - remove start/stop queue function for mesh-iface, by Antonio Quartulli

 - switch to crc32 header for crc32c, by Sven Eckelmann

 - drop unused net_namespace.h include, by Sven Eckelmann

* tag 'batadv-next-pullrequest-20250509' of git://git.open-mesh.org/linux-merge:
  batman-adv: Drop unused net_namespace.h include
  batman-adv: Switch to crc32 header for crc32c
  batman-adv: no need to start/stop queue on mesh-iface
  batman-adv: constify and move broadcast addr definition
  batman-adv: Start new development cycle
====================

Link: https://patch.msgid.link/20250509091041.108416-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09 17:09:38 -07:00
Vladimir Oltean
6c14058edf net: dsa: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
New timestamping API was introduced in commit 66f7223039 ("net: add
NDOs for configuring hardware timestamping") from kernel v6.6. It is
time to convert DSA to the new API, so that the ndo_eth_ioctl() path can
be removed completely.

Move the ds->ops->port_hwtstamp_get() and ds->ops->port_hwtstamp_set()
calls from dsa_user_ioctl() to dsa_user_hwtstamp_get() and
dsa_user_hwtstamp_set().

Due to the fact that the underlying ifreq type changes to
kernel_hwtstamp_config, the drivers and the Ocelot switchdev front-end,
all hooked up directly or indirectly, must also be converted all at once.

The conversion also updates the comment from dsa_port_supports_hwtstamp(),
which is no longer true because kernel_hwtstamp_config is kernel memory
and does not need copy_to_user(). I've deliberated whether it is
necessary to also update "err != -EOPNOTSUPP" to a more general "!err",
but all drivers now either return 0 or -EOPNOTSUPP.

The existing logic from the ocelot_ioctl() function, to avoid
configuring timestamping if the PHY supports the operation, is obsoleted
by more advanced core logic in dev_set_hwtstamp_phylib().

This is only a partial preparation for proper PHY timestamping support.
None of these switch driver currently sets up PTP traps for PHY
timestamping, so setting dev->see_all_hwtstamp_requests is not yet
necessary and the conversion is relatively trivial.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # felix, sja1105, mv88e6xxx
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250508095236.887789-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09 16:34:09 -07:00
Gal Pressman
1b2900db01 ethtool: Block setting of symmetric RSS when non-symmetric rx-flow-hash is requested
Symmetric RSS hash requires that:
* No other fields besides IP src/dst and/or L4 src/dst are set
* If src is set, dst must also be set

This restriction was only enforced when RXNFC was configured after
symmetric hash was enabled. In the opposite order of operations (RXNFC
then symmetric enablement) the check was not performed.

Perform the sanity check on set_rxfh as well, by iterating over all flow
types hash fields and making sure they are all symmetric.

Introduce a function that returns whether a flow type is hashable (not
spec only) and needs to be iterated over. To make sure that no one
forgets to update the list of hashable flow types when adding new flow
types, a static assert is added to draw the developer's attention.

The conversion of uapi #defines to enum is not ideal, but as Jakub
mentioned [1], we have precedent for that.

[1] https://lore.kernel.org/netdev/20250324073509.6571ade3@kernel.org/

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250508103034.885536-1-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09 16:24:28 -07:00
Andrew Jeffery
e4f349bd6e net: mctp: Ensure keys maintain only one ref to corresponding dev
mctp_flow_prepare_output() is called in mctp_route_output(), which
places outbound packets onto a given interface. The packet may represent
a message fragment, in which case we provoke an unbalanced reference
count to the underlying device. This causes trouble if we ever attempt
to remove the interface:

    [   48.702195] usb 1-1: USB disconnect, device number 2
    [   58.883056] unregister_netdevice: waiting for mctpusb0 to become free. Usage count = 2
    [   69.022548] unregister_netdevice: waiting for mctpusb0 to become free. Usage count = 2
    [   79.172568] unregister_netdevice: waiting for mctpusb0 to become free. Usage count = 2
    ...

Predicate the invocation of mctp_dev_set_key() in
mctp_flow_prepare_output() on not already having associated the device
with the key. It's not yet realistic to uphold the property that the key
maintains only one device reference earlier in the transmission sequence
as the route (and therefore the device) may not be known at the time the
key is associated with the socket.

Fixes: 67737c4572 ("mctp: Pass flow data & flow release events to drivers")
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Link: https://patch.msgid.link/20250508-mctp-dev-refcount-v1-1-d4f965c67bb5@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09 16:22:53 -07:00
Matt Johnston
f11cf946c0 net: mctp: Don't access ifa_index when missing
In mctp_dump_addrinfo, ifa_index can be used to filter interfaces, but
only when the struct ifaddrmsg is provided. Otherwise it will be
comparing to uninitialised memory - reproducible in the syzkaller case from
dhcpd, or busybox "ip addr show".

The kernel MCTP implementation has always filtered by ifa_index, so
existing userspace programs expecting to dump MCTP addresses must
already be passing a valid ifa_index value (either 0 or a real index).

BUG: KMSAN: uninit-value in mctp_dump_addrinfo+0x208/0xac0 net/mctp/device.c:128
 mctp_dump_addrinfo+0x208/0xac0 net/mctp/device.c:128
 rtnl_dump_all+0x3ec/0x5b0 net/core/rtnetlink.c:4380
 rtnl_dumpit+0xd5/0x2f0 net/core/rtnetlink.c:6824
 netlink_dump+0x97b/0x1690 net/netlink/af_netlink.c:2309

Fixes: 583be982d9 ("mctp: Add device handling and netlink interface")
Reported-by: syzbot+e76d52dadc089b9d197f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68135815.050a0220.3a872c.000e.GAE@google.com/
Reported-by: syzbot+1065a199625a388fce60@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/681357d6.050a0220.14dd7d.000d.GAE@google.com/
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://patch.msgid.link/20250508-mctp-addr-dump-v2-1-c8a53fd2dd66@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09 15:03:53 -07:00
Feng Yang
ee971630f2 bpf: Allow some trace helpers for all prog types
if it works under NMI and doesn't use any context-dependent things,
should be fine for any program type. The detailed discussion is in [1].

[1] https://lore.kernel.org/all/CAEf4Bza6gK3dsrTosk6k3oZgtHesNDSrDd8sdeQ-GiS6oJixQg@mail.gmail.com/

Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20250506061434.94277-2-yangfeng59949@163.com
2025-05-09 10:37:10 -07:00
Cong Wang
2d3cbfd6d5 net_sched: Flush gso_skb list too during ->change()
Previously, when reducing a qdisc's limit via the ->change() operation, only
the main skb queue was trimmed, potentially leaving packets in the gso_skb
list. This could result in NULL pointer dereference when we only check
sch->limit against sch->q.qlen.

This patch introduces a new helper, qdisc_dequeue_internal(), which ensures
both the gso_skb list and the main queue are properly flushed when trimming
excess packets. All relevant qdiscs (codel, fq, fq_codel, fq_pie, hhf, pie)
are updated to use this helper in their ->change() routines.

Fixes: 76e3cc126b ("codel: Controlled Delay AQM")
Fixes: 4b549a2ef4 ("fq_codel: Fair Queue Codel AQM")
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Fixes: ec97ecf1eb ("net: sched: add Flow Queue PIE packet scheduler")
Fixes: 10239edf86 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc")
Fixes: d4b36210c2 ("net: pkt_sched: PIE AQM scheme")
Reported-by: Will <willsroot@protonmail.com>
Reported-by: Savy <savy@syst3mfailure.io>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-09 12:34:38 +01:00
Jakub Kicinski
ea9a83d7f3 bluetooth pull request for net:
- MGMT: Fix MGMT_OP_ADD_DEVICE invalid device flags
  - hci_event: Fix not using key encryption size when its known
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCgA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmgcyKYZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKWSqEACGdIL6oDrfUmE2mFUxlT2C
 eUSe1RDvXi63pFCxv4a1/JrnAICfIPbuKhvkOhf9g3JoQXjqdtX5REjEPcCndyBg
 sSyGGfxcaoYLsSWQY1nSb2p/bttk1R3LfPT78QhKPDsh63FkvRw77P++8anzG3T1
 wqgjKU9XZb7zYmMElYCqYbnd0K2PymJtD+Oml1srBRdOpz1ex3s6Qj4Cy6cg07Vb
 SlxqWMnWZG0wfnusAFZ48zwYu/7/LQQnSJ6rbHSfrKLQnizNVvFTtsW+xrIfQ2m6
 G7/WOX2LVwtP3XMe8VIDV0kdDtkDFtHq0KuNToAtt4DS582RHw0AVe+Xc2x5FFKA
 rmcukZLvg6tv/DM/PM5zJdvZW4M+r1IOnBSZvFMAdYb4Af04DHjI1k1ow9On6O3f
 oJCeRZ4LoOREljR+xdO/Ewn207za0wGR7IlDXFVpeGEOnLcSAxvqUl6biL3cEQpA
 bb97I7fjqwSPqpAGsMV0uhULTJxT7QPhF05rL3bpSzAzZoDpRFhc070TLBBuQvW1
 zQmf+SCeg72vDtv1vsekyoXWfM+SpwlsNYuStPBztrKZgBGeRKjtZc2/L6dLPKbo
 bATO6Dhidm+outsF59YaqORFteON4yLAXSryLw6uVNIT3ApyUspNICyn4zi0tbYY
 8N4rjTclIk5zV1z8zCJ7Lg==
 =uyVh
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2025-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - MGMT: Fix MGMT_OP_ADD_DEVICE invalid device flags
 - hci_event: Fix not using key encryption size when its known

* tag 'for-net-2025-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: hci_event: Fix not using key encryption size when its known
  Bluetooth: MGMT: Fix MGMT_OP_ADD_DEVICE invalid device flags
====================

Link: https://patch.msgid.link/20250508150927.385675-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-08 18:38:00 -07:00
Mohan Kumar G
22c64f37e1 wifi: mac80211: Update MCS15 support in link_conf
As per IEEE 802.11be-2024 - 9.4.2.321, EHT operation element
contains MCS15 Disable subfield as the sixth bit, which is set when
MCS15 support is not enabled.

Get MCS15 support from EHT operation params and add it in link_conf
so that driver can use this value to know if EHT-MCS 15 reception
is enabled.

Co-developed-by: Dhanavandhana Kannan <quic_dhanavan1@quicinc.com>
Signed-off-by: Dhanavandhana Kannan <quic_dhanavan1@quicinc.com>
Signed-off-by: Mohan Kumar G <quic_mkumarg@quicinc.com>
Link: https://patch.msgid.link/20250505152836.3266829-1-quic_mkumarg@quicinc.com
[remove pointless !! for bool assignment]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-09 00:05:11 +02:00
Benjamin Berg
cf1b684a06 wifi: mac80211: do not offer a mesh path if forwarding is disabled
When processing a PREQ the code would always check whether we have a
mesh path locally and reply accordingly. However, when forwarding is
disabled then we should not reply with this information as we will not
forward data packets down that path.

Move the check for dot11MeshForwarding up in the function and skip the
mesh path lookup in that case. In the else block, set forward to false
so that the rest of the function becomes a no-op and the
dot11MeshForwarding check does not need to be duplicated.

This explains an effect observed in the Freifunk community where mesh
forwarding is disabled. In that case a mesh with three STAs and only bad
links in between them, individual STAs would occionally have indirect
mpath entries. This should not have happened.

Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
Reviewed-by: Rouven Czerwinski <rouven@czerwinskis.de>
Link: https://patch.msgid.link/20250430191042.3287004-1-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-08 23:59:12 +02:00
Ingo Molnar
367ed4e357 treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try()
Move this API to the canonical timer_*() namespace.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250507175338.672442-9-mingo@kernel.org
2025-05-08 19:49:33 +02:00
Jakub Kicinski
6b02fd7799 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc6).

No conflicts.

Adjacent changes:

net/core/dev.c:
  08e9f2d584 ("net: Lock netdevices during dev_shutdown")
  a82dc19db1 ("net: avoid potential race between netdev_get_by_index_lock() and netns switch")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-08 08:59:02 -07:00
Luiz Augusto von Dentz
c82b6357a5 Bluetooth: hci_event: Fix not using key encryption size when its known
This fixes the regression introduced by 50c1241e6a8a ("Bluetooth: l2cap:
Check encryption key size on incoming connection") introduced a check for
l2cap_check_enc_key_size which checks for hcon->enc_key_size which may
not be initialized if HCI_OP_READ_ENC_KEY_SIZE is still pending.

If the key encryption size is known, due previously reading it using
HCI_OP_READ_ENC_KEY_SIZE, then store it as part of link_key/smp_ltk
structures so the next time the encryption is changed their values are
used as conn->enc_key_size thus avoiding the racing against
HCI_OP_READ_ENC_KEY_SIZE.

Now that the enc_size is stored as part of key the information the code
then attempts to check that there is no downgrade of security if
HCI_OP_READ_ENC_KEY_SIZE returns a value smaller than what has been
previously stored.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=220061
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220063
Fixes: 522e9ed157 ("Bluetooth: l2cap: Check encryption key size on incoming connection")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-08 10:24:15 -04:00
Jakub Kicinski
23fa6a23d9 net: export a helper for adding up queue stats
Older drivers and drivers with lower queue counts often have a static
array of queues, rather than allocating structs for each queue on demand.
Add a helper for adding up qstats from a queue range. Expectation is
that driver will pass a queue range [netdev->real_num_*x_queues, MAX).
It was tempting to always use num_*x_queues as the end, but virtio
seems to clamp its queue count after allocating the netdev. And this
way we can trivaly reuse the helper for [0, real_..).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250507003221.823267-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-08 11:56:12 +02:00
Paul Chaignon
c432722994 bpf: Scrub packet on bpf_redirect_peer
When bpf_redirect_peer is used to redirect packets to a device in
another network namespace, the skb isn't scrubbed. That can lead skb
information from one namespace to be "misused" in another namespace.

As one example, this is causing Cilium to drop traffic when using
bpf_redirect_peer to redirect packets that just went through IPsec
decryption to a container namespace. The following pwru trace shows (1)
the packet path from the host's XFRM layer to the container's XFRM
layer where it's dropped and (2) the number of active skb extensions at
each function.

    NETNS       MARK  IFACE  TUPLE                                FUNC
    4026533547  d00   eth0   10.244.3.124:35473->10.244.2.158:53  xfrm_rcv_cb
                             .active_extensions = (__u8)2,
    4026533547  d00   eth0   10.244.3.124:35473->10.244.2.158:53  xfrm4_rcv_cb
                             .active_extensions = (__u8)2,
    4026533547  d00   eth0   10.244.3.124:35473->10.244.2.158:53  gro_cells_receive
                             .active_extensions = (__u8)2,
    [...]
    4026533547  0     eth0   10.244.3.124:35473->10.244.2.158:53  skb_do_redirect
                             .active_extensions = (__u8)2,
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  ip_rcv
                             .active_extensions = (__u8)2,
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  ip_rcv_core
                             .active_extensions = (__u8)2,
    [...]
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  udp_queue_rcv_one_skb
                             .active_extensions = (__u8)2,
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  __xfrm_policy_check
                             .active_extensions = (__u8)2,
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  __xfrm_decode_session
                             .active_extensions = (__u8)2,
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  security_xfrm_decode_session
                             .active_extensions = (__u8)2,
    4026534999  0     eth0   10.244.3.124:35473->10.244.2.158:53  kfree_skb_reason(SKB_DROP_REASON_XFRM_POLICY)
                             .active_extensions = (__u8)2,

In this case, there are no XFRM policies in the container's network
namespace so the drop is unexpected. When we decrypt the IPsec packet,
the XFRM state used for decryption is set in the skb extensions. This
information is preserved across the netns switch. When we reach the
XFRM policy check in the container's netns, __xfrm_policy_check drops
the packet with LINUX_MIB_XFRMINNOPOLS because a (container-side) XFRM
policy can't be found that matches the (host-side) XFRM state used for
decryption.

This patch fixes this by scrubbing the packet when using
bpf_redirect_peer, as is done on typical netns switches via veth
devices except skb->mark and skb->tstamp are not zeroed.

Fixes: 9aa1206e8f ("bpf: Add redirect_peer helper")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/1728ead5e0fe45e7a6542c36bd4e3ca07a73b7d6.1746460653.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-07 18:16:33 -07:00
Jakub Kicinski
dc75a43c07 netfilter pull request 25-05-08
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmgb2D4ACgkQ1w0aZmrP
 KyHB4g//SzExpaI5eTduMIg9OwLmWyUNpsxCX8Qk3E5FX8jyGZDU1CAyTnbYZyuJ
 VAn+5PwhhswEDs85ZgXiHXMjdOXvmPs1l0IKVVlM4DoJqlRDXpAcuKzPH0m0J46Q
 A+nfIIsFfH8roQq+PhLPQ3TEOnIGK1vgNWVICl4oX5sYjkYTIzPnadEaTpNsG26I
 gZlvpqM9EH7KLWqhcX1h2/O+JB/6oNaFQTUK77qc9sGijMVMUlGUBtDN40aZ1IEW
 RP6l6J1Oea3j3wavmrrQxHSDtvGS4rnC0qTQ4Xs3OIeTrQiou4cmsCqp+ZLmMu7Q
 YNBU04o78uEOX9n3ux6Xm4tr2WLizHFNkGBU+m0WQSsEGYx7CWsALXIqnEsZWDLQ
 IK4y/d7M0rWLNB4nYTePIBPjKt+qzBe0gTgVr9jMLFI/X60vz3XYHJDBovaBL3dd
 skiVohmYxjiQj5Y6WPtZaLvubzVEULfjVGHatYLGheLXHsEO/BjXFJeIfYK6bmfO
 eXAx9pPsnqCbtKzh6n+tQzYCeRhlw+Oh9Uw48MDpeIX+cTUNr+eaN1LalQz3C3zX
 KqfnK/ZlWUWpQmyLzhMA6DwSUKPBbJ5nzl/NuWSowUPfnldT++rGAuW4lFOTtMPo
 DsJDnONuE6FPaNU5Ef3rIivVfrAnCAmRK797psL0buHDY9YRGcM=
 =Jwld
 -----END PGP SIGNATURE-----

Merge tag 'nf-25-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contain Netfilter/IPVS fixes for net:

1) Fix KMSAN uninit-value in do_output_route4, reported by syzbot.
   Patch from Julian Anastasov.

2) ipset hashtable set type breaks up the hashtable into regions of
   2^10 buckets. Fix the macro that determines the hashtable lock
   region to protect concurrent updates. From Jozsef Kadlecsik.

* tag 'nf-25-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: ipset: fix region locking in hash types
  ipvs: fix uninit-value for saddr in do_output_route4
====================

Link: https://patch.msgid.link/20250507221952.86505-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-07 17:57:04 -07:00
Eelco Chaudron
6beb6835c1 openvswitch: Fix unsafe attribute parsing in output_userspace()
This patch replaces the manual Netlink attribute iteration in
output_userspace() with nla_for_each_nested(), which ensures that only
well-formed attributes are processed.

Fixes: ccb1352e76 ("net: Add Open vSwitch kernel components.")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/0bd65949df61591d9171c0dc13e42cea8941da10.1746541734.git.echaudro@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-07 16:51:02 -07:00
Jozsef Kadlecsik
8478a729c0 netfilter: ipset: fix region locking in hash types
Region locking introduced in v5.6-rc4 contained three macros to handle
the region locks: ahash_bucket_start(), ahash_bucket_end() which gave
back the start and end hash bucket values belonging to a given region
lock and ahash_region() which should give back the region lock belonging
to a given hash bucket. The latter was incorrect which can lead to a
race condition between the garbage collector and adding new elements
when a hash type of set is defined with timeouts.

Fixes: f66ee0410b ("netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports")
Reported-by: Kota Toda <kota.toda@gmo-cybersecurity.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-07 23:57:31 +02:00
Julian Anastasov
e34090d721 ipvs: fix uninit-value for saddr in do_output_route4
syzbot reports for uninit-value for the saddr argument [1].
commit 4754957f04 ("ipvs: do not use random local source address for
tunnels") already implies that the input value of saddr
should be ignored but the code is still reading it which can prevent
to connect the route. Fix it by changing the argument to ret_saddr.

[1]
BUG: KMSAN: uninit-value in do_output_route4+0x42c/0x4d0 net/netfilter/ipvs/ip_vs_xmit.c:147
 do_output_route4+0x42c/0x4d0 net/netfilter/ipvs/ip_vs_xmit.c:147
 __ip_vs_get_out_rt+0x403/0x21d0 net/netfilter/ipvs/ip_vs_xmit.c:330
 ip_vs_tunnel_xmit+0x205/0x2380 net/netfilter/ipvs/ip_vs_xmit.c:1136
 ip_vs_in_hook+0x1aa5/0x35b0 net/netfilter/ipvs/ip_vs_core.c:2063
 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline]
 nf_hook_slow+0xf7/0x400 net/netfilter/core.c:626
 nf_hook include/linux/netfilter.h:269 [inline]
 __ip_local_out+0x758/0x7e0 net/ipv4/ip_output.c:118
 ip_local_out net/ipv4/ip_output.c:127 [inline]
 ip_send_skb+0x6a/0x3c0 net/ipv4/ip_output.c:1501
 udp_send_skb+0xfda/0x1b70 net/ipv4/udp.c:1195
 udp_sendmsg+0x2fe3/0x33c0 net/ipv4/udp.c:1483
 inet_sendmsg+0x1fc/0x280 net/ipv4/af_inet.c:851
 sock_sendmsg_nosec net/socket.c:712 [inline]
 __sock_sendmsg+0x267/0x380 net/socket.c:727
 ____sys_sendmsg+0x91b/0xda0 net/socket.c:2566
 ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2620
 __sys_sendmmsg+0x41d/0x880 net/socket.c:2702
 __compat_sys_sendmmsg net/compat.c:360 [inline]
 __do_compat_sys_sendmmsg net/compat.c:367 [inline]
 __se_compat_sys_sendmmsg net/compat.c:364 [inline]
 __ia32_compat_sys_sendmmsg+0xc8/0x140 net/compat.c:364
 ia32_sys_call+0x3ffa/0x41f0 arch/x86/include/generated/asm/syscalls_32.h:346
 do_syscall_32_irqs_on arch/x86/entry/syscall_32.c:83 [inline]
 __do_fast_syscall_32+0xb0/0x110 arch/x86/entry/syscall_32.c:306
 do_fast_syscall_32+0x38/0x80 arch/x86/entry/syscall_32.c:331
 do_SYSENTER_32+0x1f/0x30 arch/x86/entry/syscall_32.c:369
 entry_SYSENTER_compat_after_hwframe+0x84/0x8e

Uninit was created at:
 slab_post_alloc_hook mm/slub.c:4167 [inline]
 slab_alloc_node mm/slub.c:4210 [inline]
 __kmalloc_cache_noprof+0x8fa/0xe00 mm/slub.c:4367
 kmalloc_noprof include/linux/slab.h:905 [inline]
 ip_vs_dest_dst_alloc net/netfilter/ipvs/ip_vs_xmit.c:61 [inline]
 __ip_vs_get_out_rt+0x35d/0x21d0 net/netfilter/ipvs/ip_vs_xmit.c:323
 ip_vs_tunnel_xmit+0x205/0x2380 net/netfilter/ipvs/ip_vs_xmit.c:1136
 ip_vs_in_hook+0x1aa5/0x35b0 net/netfilter/ipvs/ip_vs_core.c:2063
 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline]
 nf_hook_slow+0xf7/0x400 net/netfilter/core.c:626
 nf_hook include/linux/netfilter.h:269 [inline]
 __ip_local_out+0x758/0x7e0 net/ipv4/ip_output.c:118
 ip_local_out net/ipv4/ip_output.c:127 [inline]
 ip_send_skb+0x6a/0x3c0 net/ipv4/ip_output.c:1501
 udp_send_skb+0xfda/0x1b70 net/ipv4/udp.c:1195
 udp_sendmsg+0x2fe3/0x33c0 net/ipv4/udp.c:1483
 inet_sendmsg+0x1fc/0x280 net/ipv4/af_inet.c:851
 sock_sendmsg_nosec net/socket.c:712 [inline]
 __sock_sendmsg+0x267/0x380 net/socket.c:727
 ____sys_sendmsg+0x91b/0xda0 net/socket.c:2566
 ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2620
 __sys_sendmmsg+0x41d/0x880 net/socket.c:2702
 __compat_sys_sendmmsg net/compat.c:360 [inline]
 __do_compat_sys_sendmmsg net/compat.c:367 [inline]
 __se_compat_sys_sendmmsg net/compat.c:364 [inline]
 __ia32_compat_sys_sendmmsg+0xc8/0x140 net/compat.c:364
 ia32_sys_call+0x3ffa/0x41f0 arch/x86/include/generated/asm/syscalls_32.h:346
 do_syscall_32_irqs_on arch/x86/entry/syscall_32.c:83 [inline]
 __do_fast_syscall_32+0xb0/0x110 arch/x86/entry/syscall_32.c:306
 do_fast_syscall_32+0x38/0x80 arch/x86/entry/syscall_32.c:331
 do_SYSENTER_32+0x1f/0x30 arch/x86/entry/syscall_32.c:369
 entry_SYSENTER_compat_after_hwframe+0x84/0x8e

CPU: 0 UID: 0 PID: 22408 Comm: syz.4.5165 Not tainted 6.15.0-rc3-syzkaller-00019-gbc3372351d0c #0 PREEMPT(undef)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025

Reported-by: syzbot+04b9a82855c8aed20860@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/68138dfa.050a0220.14dd7d.0017.GAE@google.com/
Fixes: 4754957f04 ("ipvs: do not use random local source address for tunnels")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-07 23:57:24 +02:00
Luiz Augusto von Dentz
1e2e3044c1 Bluetooth: MGMT: Fix MGMT_OP_ADD_DEVICE invalid device flags
Device flags could be updated in the meantime while MGMT_OP_ADD_DEVICE
is pending on hci_update_passive_scan_sync so instead of setting the
current_flags as cmd->user_data just do a lookup using
hci_conn_params_lookup and use the latest stored flags.

Fixes: a182d9c84f ("Bluetooth: MGMT: Fix Add Device to responding before completing")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-05-07 12:47:53 -04:00
Jakub Kicinski
9540984da6 Couple of fixes:
* iwlwifi: add two missing device entries
  * cfg80211: fix a potential out-of-bounds access
  * mac80211: fix format of TID to link mapping action frames
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmgacfwACgkQ10qiO8sP
 aABEuw//X1RW+5zHLmH3EHW+buxDeqev9C/YStL/1+qtiQG4HSQGVrdr9Vv3PqFt
 3MYqPzZEWQrRl4l8GW8ezsNvkGCoYwUBOq92Qa1v7tBrauOQH/zkh3KtgjgT9xql
 0ZhIL0JXVmDmhv8WF7JqogwZDJ/wUy/QH006jbRvwhZ9VtQJpSVECr3hxy+pCP88
 JNNVD4yQMr8al5zY8hdbgPneAvk/2llmz542BiQ65zoqI7TqERQlCwKFWat7kfFl
 7+rTNRDUD0RAPnr5FE1VjIaQfHV7J0OnDpnH70C18mDsRSkU7E0GjHg0SZk3bPgV
 WH4OoxmnwQ1vR/vwLvFQ0Ly8oruet6SRJ4R4l0hcDD1Kkv4jpzQZzxojlf5EqUt8
 N/oYuff0ju8Kb7ELR/i9PmVIl9bC24mDvzWeOn3xwiCRywGc2MF2MITz8iTzlMQk
 lPYe4cBhaSYwn6heyQiT1IpOEYzyfZE1zX83k8vIR3mO+umZOBlh6sv4NvMx5lbW
 f1u8W2sgONjehWFjFVp2h4fmj4UYy6N8NRHPt8gjkZkkfuto55/2+QPMFN9Jpg9H
 3qqFoXg3uZUhVyNQ3V//R+A1eBbYwCzwE43D1Zf3RlEs5V3T1TUp10sxn1/9Jp46
 wiMEY4LUujwKwAckLU2HyQwf3SSSGVKx8/PW76cWZhYjSTwcBLM=
 =Bzvu
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2025-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Couple of fixes:
 * iwlwifi: add two missing device entries
 * cfg80211: fix a potential out-of-bounds access
 * mac80211: fix format of TID to link mapping action frames

* tag 'wireless-2025-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: iwlwifi: add support for Killer on MTL
  wifi: mac80211: fix the type of status_code for negotiated TID to Link Mapping
  wifi: cfg80211: fix out-of-bounds access during multi-link element defragmentation
====================

Link: https://patch.msgid.link/20250506203506.158818-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06 19:06:50 -07:00
Jakub Kicinski
9daaf19786 wireless features, notably
* stack
    - free SKBTX_WIFI_STATUS flag
    - fixes for VLAN multicast in multi-link
    - improve codel parameters (revert some old twiddling)
  * ath12k
    - Enable AHB support for IPQ5332.
    - Add monitor interface support to QCN9274.
    - Add MLO support to WCN7850.
    - Add 802.11d scan offload support to WCN7850.
  * ath11k
    - Restore hibernation support
  * iwlwifi
    - EMLSR on two 5 GHz links
  * mwifiex
    - cleanups/refactoring
 
 along with many other small features/cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmgaSmcACgkQ10qiO8sP
 aAA4hA//f/nfeLTAnhON53mDlqxa55/2bw9XSH7pOIOasVBWxmuYxhWfn5uiZluI
 zOGlBO7vtJYvPrVHEHSuPWMNCQ+ieL2ShRSP5BBfy3KBYdD4gKKAd95WoiXmRVp4
 d13OYtF9msFbVXOZYMyxMHmzIrWlBQIokjOSjqjSBeYRnD8U0GRemiSecugWo/qI
 bE7xcZuTgKBy+gr7242017DcUjWBdWcsp1C6C+COZm/KrSihQ0SQ4PIcOZgPsZjl
 COGextZltbWW56qnlp6QC394V+Vhah+Owcz3Qqz9zZ7hzJnuPo+DnpPMShhRGruL
 /IgqKhzcuye5UUJJl8nD768x6ebClchkcBC+A/hfjk5UYVl/oZxA1Bw5fC2O+5VU
 ycDMHr1Qu/yEE2rbwIJPGNOZ7NisqJFF07CPFuygKjBGNnp7I5S7H6UJsKRi5MuZ
 0CXHiFMhuHgWLmDFauIa66XI1JpqIzQgbZSjqVYFKRqwEz3yRKwihwDzZy1m6pDQ
 NhMFedQznFkMSJ+m020IOxxXYPy98PHcus+TZWL4os0SJTfEycEUWJZnlJ47Bb25
 Z9IF1OER2I4giMKxUVKRoDq0SGStbZODwqxrVCRD261Um9ybbfDdRbDeDOmvY+Gu
 jEmE6tscWbRfDvCS2M+xIg+MAcVHJq5gO4S8RopMSJlFHY/T/uE=
 =s9mS
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2025-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
wireless features, notably

 * stack
   - free SKBTX_WIFI_STATUS flag
   - fixes for VLAN multicast in multi-link
   - improve codel parameters (revert some old twiddling)
 * ath12k
   - Enable AHB support for IPQ5332.
   - Add monitor interface support to QCN9274.
   - Add MLO support to WCN7850.
   - Add 802.11d scan offload support to WCN7850.
 * ath11k
   - Restore hibernation support
 * iwlwifi
   - EMLSR on two 5 GHz links
 * mwifiex
   - cleanups/refactoring

along with many other small features/cleanups

* tag 'wireless-next-2025-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (177 commits)
  Revert "wifi: iwlwifi: clean up config macro"
  wifi: iwlwifi: move phy_filters to fw_runtime
  wifi: iwlwifi: pcie: make sure to lock rxq->read
  wifi: iwlwifi: add definitions for iwl_mac_power_cmd version 2
  wifi: iwlwifi: clean up config macro
  wifi: iwlwifi: mld: simplify iwl_mld_rx_fill_status()
  wifi: iwlwifi: mld: rx: simplify channel handling
  wifi: iwlwifi: clean up band in RX metadata
  wifi: iwlwifi: mld: skip unknown FW channel load values
  wifi: iwlwifi: define API for external FSEQ images
  wifi: iwlwifi: mld: allow EMLSR on separated 5 GHz subbands
  wifi: iwlwifi: mld: use cfg80211_chandef_get_width()
  wifi: iwlwifi: mld: fix iwl_mld_emlsr_disallowed_with_link() return
  wifi: iwlwifi: mld: clarify variable type
  wifi: iwlwifi: pcie: add support for the reset handshake in MSI
  wifi: mac80211_hwsim: Prevent tsf from setting if beacon is disabled
  wifi: mac80211: restructure tx profile retrieval for MLO MBSSID
  wifi: nl80211: add link id of transmitted profile for MLO MBSSID
  wifi: ieee80211: Add helpers to fetch EMLSR delay and timeout values
  wifi: mac80211: update ML STA with EML capabilities
  ...

====================

Link: https://patch.msgid.link/20250506174656.119970-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06 19:04:42 -07:00
Jakub Kicinski
2e6259d821 linux-can-fixes-for-6.15-20250506
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEn/sM2K9nqF/8FWzzDHRl3/mQkZwFAmgaFN8THG1rbEBwZW5n
 dXRyb25peC5kZQAKCRAMdGXf+ZCRnOZVB/0SLNCKblGAISuevXuFKlXAFXcW2oxN
 4g38M6PzZ4zS+XWTHeGYn5XoKuwCTT/9vkD55ADLy4Xo0JDI/4sHxh6y0h9ROGL8
 wbyzSTmhiIAA5JftudvrA51Ac1pD2RWU8O6tb8xg0/+wFX0H/U7k9mZnttZBArgM
 NnsLGZEjAJdeqKZkcXR1GTveK9aE3Hntpz5dacMO92yMWYts5DMpyc99itfSEa0G
 M8HGWI7G9blBgw1BeCIpjnbfdQluitg2B9sbnJlzVZ9+iIHCnoykXo4HpnglksfW
 DTYQmbNVhVh0hUQVf9L4eOEgq/gH0cU5RwoMtsaiQjpUTuNHtHDtbE0U
 =5sTX
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-6.15-20250506' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2025-05-06

The first patch is by Antonios Salios and adds a missing
spin_lock_init() to the m_can driver.

The next 3 patches are by me and fix the unregistration order in the
mcp251xfd, rockchip_canfd and m_can driver.

The last patch is by Oliver Hartkopp and fixes RCU and BH
locking/handling in the CAN gw protocol.

* tag 'linux-can-fixes-for-6.15-20250506' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: gw: fix RCU/BH usage in cgw_create_job()
  can: mcan: m_can_class_unregister(): fix order of unregistration calls
  can: rockchip_canfd: rkcanfd_remove(): fix order of unregistration calls
  can: mcp251xfd: mcp251xfd_remove(): fix order of unregistration calls
  can: mcp251xfd: fix TDC setting for low data bit rates
  can: m_can: m_can_class_allocate_dev(): initialize spin lock on device probe
====================

Link: https://patch.msgid.link/20250506135939.652543-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06 18:56:36 -07:00
Stanislav Fomichev
78cd408356 net: add missing instance lock to dev_set_promiscuity
Accidentally spotted while trying to understand what else needs
to be renamed to netif_ prefix. Most of the calls to dev_set_promiscuity
are adjacent to dev_set_allmulti or dev_disable_lro so it should
be safe to add the lock. Note that new netif_set_promiscuity is
currently unused, the locked paths call __dev_set_promiscuity directly.

Fixes: ad7c7b2172 ("net: hold netdev instance lock during sysfs operations")
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250506011919.2882313-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06 18:52:39 -07:00
Cosmin Ratiu
08e9f2d584 net: Lock netdevices during dev_shutdown
__qdisc_destroy() calls into various qdiscs .destroy() op, which in turn
can call .ndo_setup_tc(), which requires the netdev instance lock.

This commit extends the critical section in
unregister_netdevice_many_notify() to cover dev_shutdown() (and
dev_tcx_uninstall() as a side-effect) and acquires the netdev instance
lock in __dev_change_net_namespace() for the other dev_shutdown() call.

This should now guarantee that for all qdisc ops, the netdev instance
lock is held during .ndo_setup_tc().

Fixes: a0527ee2df ("net: hold netdev instance lock during qdisc ndo_setup_tc")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250505194713.1723399-1-cratiu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06 18:31:32 -07:00
Jiri Pirko
88debb521f devlink: use DEVLINK_VAR_ATTR_TYPE_* instead of NLA_* in fmsg
Use newly introduced DEVLINK_VAR_ATTR_TYPE_* enum values instead of
internal NLA_* in fmsg health reporter code.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250505114513.53370-5-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06 18:21:11 -07:00
Jiri Pirko
f9e78932ea devlink: avoid param type value translations
Assign DEVLINK_PARAM_TYPE_* enum values to DEVLINK_VAR_ATTR_TYPE_* to
ensure the same values are used internally and in UAPI. Benefit from
that by removing the value translations.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250505114513.53370-4-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06 18:21:11 -07:00
Jiri Pirko
429ac62114 devlink: define enum for attr types of dynamic attributes
Devlink param and health reporter fmsg use attributes with dynamic type
which is determined according to a different type. Currently used values
are NLA_*. The problem is, they are not part of UAPI. They may change
which would cause a break.

To make this future safe, introduce a enum that shadows NLA_* values in
it and is part of UAPI.

Also, this allows to possibly carry types that are unrelated to NLA_*
values.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250505114513.53370-3-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06 18:21:11 -07:00
Michael-CY Lee
e12a42f64f wifi: mac80211: fix the type of status_code for negotiated TID to Link Mapping
The status code should be type of __le16.

Fixes: 83e897a961 ("wifi: ieee80211: add definitions for negotiated TID to Link map")
Fixes: 8f500fbc6c ("wifi: mac80211: process and save negotiated TID to Link mapping request")
Signed-off-by: Michael-CY Lee <michael-cy.lee@mediatek.com>
Link: https://patch.msgid.link/20250505081946.3927214-1-michael-cy.lee@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-06 21:27:10 +02:00
Veerendranath Jakkam
023c1f2f06 wifi: cfg80211: fix out-of-bounds access during multi-link element defragmentation
Currently during the multi-link element defragmentation process, the
multi-link element length added to the total IEs length when calculating
the length of remaining IEs after the multi-link element in
cfg80211_defrag_mle(). This could lead to out-of-bounds access if the
multi-link element or its corresponding fragment elements are the last
elements in the IEs buffer.

To address this issue, correctly calculate the remaining IEs length by
deducting the multi-link element end offset from total IEs end offset.

Cc: stable@vger.kernel.org
Fixes: 2481b5da9c ("wifi: cfg80211: handle BSS data contained in ML probe responses")
Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
Link: https://patch.msgid.link/20250424-fix_mle_defragmentation_oob_access-v1-1-84412a1743fa@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-06 21:04:40 +02:00
Oliver Hartkopp
511e64e13d can: gw: fix RCU/BH usage in cgw_create_job()
As reported by Sebastian Andrzej Siewior the use of local_bh_disable()
is only feasible in uni processor systems to update the modification rules.
The usual use-case to update the modification rules is to update the data
of the modifications but not the modification types (AND/OR/XOR/SET) or
the checksum functions itself.

To omit additional memory allocations to maintain fast modification
switching times, the modification description space is doubled at gw-job
creation time so that only the reference to the active modification
description is changed under rcu protection.

Rename cgw_job::mod to cf_mod and make it a RCU pointer. Allocate in
cgw_create_job() and free it together with cgw_job in
cgw_job_free_rcu(). Update all users to dereference cgw_job::cf_mod with
a RCU accessor and if possible once.

[bigeasy: Replace mod1/mod2 from the Oliver's original patch with dynamic
allocation, use RCU annotation and accessor]

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Closes: https://lore.kernel.org/linux-can/20231031112349.y0aLoBrz@linutronix.de/
Fixes: dd895d7f21 ("can: cangw: introduce optional uid to reference created routing jobs")
Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20250429070555.cs-7b_eZ@linutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-05-06 15:55:36 +02:00
Paolo Abeni
5b5f1efb72 netfilter pull request 25-05-06
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmgZSoMACgkQ1w0aZmrP
 KyGs/RAAvRfQAf24xfQbnYVFdI6rVWEuJcEchyZZp1IUzegSNj+6fyHa1b5bblRi
 bbBS5WVwSTDR7y+BCmChz2z7R2viuiX9/Zno7xcbOxWcjs+YqXNtrdQDTb47edbB
 nvBK+3xX1H7vykTqvZqeIjnAKhORao44k1t+Yc0crvfhY0gAvWpwFHee3oXStNF6
 24CgvJHglfmgGO0kvja0X3d9WkRuIxManEUQxB1cjQMOF4s7nqjDmljFbjUORhbF
 eJ2XWrKCfbJnBA2NO0PuTbqmA+qZX5cgRw91tqSgcuX3HyPeJbT+DE2ZbK5q9ZfP
 p17gWk3vLPki2QvMlyrY3ZJwArasi1mSbixtsguD9hRtHeNQuessaTmDeLY/xgah
 nsWHeLedZj/KebrpMK6nbUcBZfs8DN4CF6+5lfWC4V9EFiZlv2FO+VRNmSbn+rgp
 LafXe4OF5eD7IS//MK1utLaOK47SezmRyAP8TxUCFYuqvTgLD87YRBePDqdypvai
 y1gdW5YRCCc6FzjIUKUflTmgWzT2+0MdZURRFfjjF1LP9qGV803SsGrdRO+g2BU9
 HeWq299YNcpvAqpIe+3oFmq7FSHMfL0QlDejcrlaghEYYPyTDmWMCqPIR8QhI8xq
 J8DFLia31ANDDHqDI5giDIOx04HZgF+4UfNrjevmtV7nYx52BQo=
 =TxSF
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-25-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Apparently, nf_conntrack_bridge changes the way in which fragments
   are handled, dealing to packet drop. From Huajian Yang.

2) Add a selftest to stress the conntrack subsystem, from Florian Westphal.

3) nft_quota depletion is off-by-one byte, Zhongqiu Duan.

4) Rewrites the procfs to read the conntrack table to speed it up,
   from Florian Westphal.

5) Two patches to prevent overflow in nft_pipapo lookup table and to
   clamp the maximum bucket size.

6) Update nft_fib selftest to check for loopback packet bypass.
   From Florian Westphal.

netfilter pull request 25-05-06

* tag 'nf-next-25-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  selftests: netfilter: nft_fib.sh: check lo packets bypass fib lookup
  netfilter: nft_set_pipapo: clamp maximum map bucket size to INT_MAX
  netfilter: nft_set_pipapo: prevent overflow in lookup table allocation
  netfilter: nf_conntrack: speed up reads from nf_conntrack proc file
  netfilter: nft_quota: match correctly when the quota just depleted
  selftests: netfilter: add conntrack stress test
  netfilter: bridge: Move specific fragmented packet to slow_path instead of dropping it
====================

Link: https://patch.msgid.link/20250505234151.228057-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-06 13:19:01 +02:00
Ruben Wauters
c2dbda0766 ipv4: ip_tunnel: Replace strcpy use with strscpy
Use of strcpy is decpreated, replaces the use of strcpy with strscpy as
recommended.

strscpy was chosen as it requires a NUL terminated non-padded string,
which is the case here.

I am aware there is an explicit bounds check above the second instance,
however using strscpy protects against buffer overflows in any future
code, and there is no good reason I can see to not use it.

I have also replaced the scrscpy above that had 3 params with the
version using 2 params. These are functionally equivalent, but it is
cleaner to have both using 2 params.

Signed-off-by: Ruben Wauters <rubenru09@aol.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250501202935.46318-1-rubenru09@aol.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05 18:16:11 -07:00
Guillaume Nault
3e6a0243ff gre: Fix again IPv6 link-local address generation.
Use addrconf_addr_gen() to generate IPv6 link-local addresses on GRE
devices in most cases and fall back to using add_v4_addrs() only in
case the GRE configuration is incompatible with addrconf_addr_gen().

GRE used to use addrconf_addr_gen() until commit e5dd729460 ("ip/ip6_gre:
use the same logic as SIT interfaces when computing v6LL address")
restricted this use to gretap and ip6gretap devices, and created
add_v4_addrs() (borrowed from SIT) for non-Ethernet GRE ones.

The original problem came when commit 9af28511be ("addrconf: refuse
isatap eui64 for INADDR_ANY") made __ipv6_isatap_ifid() fail when its
addr parameter was 0. The commit says that this would create an invalid
address, however, I couldn't find any RFC saying that the generated
interface identifier would be wrong. Anyway, since gre over IPv4
devices pass their local tunnel address to __ipv6_isatap_ifid(), that
commit broke their IPv6 link-local address generation when the local
address was unspecified.

Then commit e5dd729460 ("ip/ip6_gre: use the same logic as SIT
interfaces when computing v6LL address") tried to fix that case by
defining add_v4_addrs() and calling it to generate the IPv6 link-local
address instead of using addrconf_addr_gen() (apart for gretap and
ip6gretap devices, which would still use the regular
addrconf_addr_gen(), since they have a MAC address).

That broke several use cases because add_v4_addrs() isn't properly
integrated into the rest of IPv6 Neighbor Discovery code. Several of
these shortcomings have been fixed over time, but add_v4_addrs()
remains broken on several aspects. In particular, it doesn't send any
Router Sollicitations, so the SLAAC process doesn't start until the
interface receives a Router Advertisement. Also, add_v4_addrs() mostly
ignores the address generation mode of the interface
(/proc/sys/net/ipv6/conf/*/addr_gen_mode), thus breaking the
IN6_ADDR_GEN_MODE_RANDOM and IN6_ADDR_GEN_MODE_STABLE_PRIVACY cases.

Fix the situation by using add_v4_addrs() only in the specific scenario
where the normal method would fail. That is, for interfaces that have
all of the following characteristics:

  * run over IPv4,
  * transport IP packets directly, not Ethernet (that is, not gretap
    interfaces),
  * tunnel endpoint is INADDR_ANY (that is, 0),
  * device address generation mode is EUI64.

In all other cases, revert back to the regular addrconf_addr_gen().

Also, remove the special case for ip6gre interfaces in add_v4_addrs(),
since ip6gre devices now always use addrconf_addr_gen() instead.

Note:
  This patch was originally applied as commit 183185a18f ("gre: Fix
  IPv6 link-local address generation."). However, it was then reverted
  by commit fc486c2d06 ("Revert "gre: Fix IPv6 link-local address
  generation."") because it uncovered another bug that ended up
  breaking net/forwarding/ip6gre_custom_multipath_hash.sh. That other
  bug has now been fixed by commit 4d0ab3a688 ("ipv6: Start path
  selection from the first nexthop"). Therefore we can now revive this
  GRE patch (no changes since original commit 183185a18f ("gre: Fix
  IPv6 link-local address generation.").

Fixes: e5dd729460 ("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/a88cc5c4811af36007645d610c95102dccb360a6.1746225214.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05 18:08:14 -07:00
Maxime Chevallier
63fb100bf5 net: ethtool: netlink: Use netdev_hold for dumpit() operations
Move away from dev_hold and use netdev_hold with a local reftracker when
performing a DUMP on each netdev.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250502085242.248645-4-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05 17:17:41 -07:00
Maxime Chevallier
9dd2ad5e92 net: ethtool: phy: Convert the PHY_GET command to generic phy dump
Now that we have an infrastructure in ethnl for perphy DUMPs, we can get
rid of the custom ->doit and ->dumpit to deal with PHY listing commands.

As most of the code was custom, this basically means re-writing how we
deal with PHY listing.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250502085242.248645-3-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05 17:17:40 -07:00
Maxime Chevallier
172265b44c net: ethtool: Introduce per-PHY DUMP operations
ethnl commands that target a phy_device need a DUMP implementation that
will fill the reply for every PHY behind a netdev. We therefore need to
iterate over the dev->topo to list them.

When multiple PHYs are behind the same netdev, it's also useful to
perform DUMP with a filter on a given netdev, to get the capability of
every PHY.

Implement dedicated genl ->start(), ->dumpit() and ->done() operations
for PHY-targetting command, allowing filtered dumps and using a dump
context that keep track of the PHY iteration for multi-message dump.

PSE-PD and PLCA are converted to this new set of ops along the way.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250502085242.248645-2-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05 17:17:40 -07:00
Dr. David Alan Gilbert
ac8f09b921 sctp: Remove unused sctp_assoc_del_peer and sctp_chunk_iif
sctp_assoc_del_peer() last use was removed in 2015 by
commit 73e6742027 ("sctp: Do not try to search for the transport twice")
which now uses rm_peer instead of del_peer.

sctp_chunk_iif() last use was removed in 2016 by
commit 1f45f78f8e ("sctp: allow GSO frags to access the chunk too")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20250501233815.99832-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05 16:51:12 -07:00
Dr. David Alan Gilbert
320a66f840 strparser: Remove unused __strp_unpause
The last use of __strp_unpause() was removed in 2022 by
commit 84c61fe1a7 ("tls: rx: do not use the standard strparser")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250501002402.308843-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05 16:48:12 -07:00
Cong Wang
3769478610 sch_htb: make htb_deactivate() idempotent
Alan reported a NULL pointer dereference in htb_next_rb_node()
after we made htb_qlen_notify() idempotent.

It turns out in the following case it introduced some regression:

htb_dequeue_tree():
  |-> fq_codel_dequeue()
    |-> qdisc_tree_reduce_backlog()
      |-> htb_qlen_notify()
        |-> htb_deactivate()
  |-> htb_next_rb_node()
  |-> htb_deactivate()

For htb_next_rb_node(), after calling the 1st htb_deactivate(), the
clprio[prio]->ptr could be already set to  NULL, which means
htb_next_rb_node() is vulnerable here.

For htb_deactivate(), although we checked qlen before calling it, in
case of qlen==0 after qdisc_tree_reduce_backlog(), we may call it again
which triggers the warning inside.

To fix the issues here, we need to:

1) Make htb_deactivate() idempotent, that is, simply return if we
   already call it before.
2) Make htb_next_rb_node() safe against ptr==NULL.

Many thanks to Alan for testing and for the reproducer.

Fixes: 5ba8b837b5 ("sch_htb: make htb_qlen_notify() idempotent")
Reported-by: Alan J. Wylie <alan@wylie.me.uk>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250428232955.1740419-2-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05 13:51:32 -07:00
Jakub Kicinski
b4cd2ee54c bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCaBVftAAKCRAraS+Z+3y5
 EzJ7AP9dtYuBHmU0tB5upuLzQ9sVxaCXs2linfRKK40A5YJDcgD/fBQqPzhxCqZR
 moHEqelMLgrAUcyro5egdRJZPcaTfQE=
 =6Las
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-05-02

We've added 14 non-merge commits during the last 10 day(s) which contain
a total of 13 files changed, 740 insertions(+), 121 deletions(-).

The main changes are:

1) Avoid skipping or repeating a sk when using a UDP bpf_iter,
   from Jordan Rife.

2) Fixed a crash when a bpf qdisc is set in
   the net.core.default_qdisc, from Amery Hung.

3) A few other fixes in the bpf qdisc, from Amery Hung.
   - Always call qdisc_watchdog_init() in the .init prologue such that
     the .reset/.destroy epilogue can always call qdisc_watchdog_cancel()
     without issue.
   - bpf_qdisc_init_prologue() was incorrectly returning an error
     when the bpf qdisc is set as the default_qdisc and the mq is creating
     the default_qdisc. It is now fixed.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Cleanup bpf qdisc selftests
  selftests/bpf: Test attaching a bpf qdisc with incomplete operators
  bpf: net_sched: Make some Qdisc_ops ops mandatory
  selftests/bpf: Test setting and creating bpf qdisc as default qdisc
  bpf: net_sched: Fix bpf qdisc init prologue when set as default qdisc
  selftests/bpf: Add tests for bucket resume logic in UDP socket iterators
  selftests/bpf: Return socket cookies from sock_iter_batch progs
  bpf: udp: Avoid socket skips and repeats during iteration
  bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items
  bpf: udp: Get rid of st_bucket_done
  bpf: udp: Make sure iter->batch always contains a full bucket snapshot
  bpf: udp: Make mem flags configurable through bpf_iter_udp_realloc_batch
  bpf: net_sched: Fix using bpf qdisc as default qdisc
  selftests/bpf: Fix compilation errors
====================

Link: https://patch.msgid.link/20250503010755.4030524-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05 13:22:58 -07:00
Pablo Neira Ayuso
b85e3367a5 netfilter: nft_set_pipapo: clamp maximum map bucket size to INT_MAX
Otherwise, it is possible to hit WARN_ON_ONCE in __kvmalloc_node_noprof()
when resizing hashtable because __GFP_NOWARN is unset.

Similar to:

  b541ba7d1f ("netfilter: conntrack: clamp maximum hashtable size to INT_MAX")

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-05 13:17:32 +02:00
Pablo Neira Ayuso
4c5c6aa996 netfilter: nft_set_pipapo: prevent overflow in lookup table allocation
When calculating the lookup table size, ensure the following
multiplication does not overflow:

- desc->field_len[] maximum value is U8_MAX multiplied by
  NFT_PIPAPO_GROUPS_PER_BYTE(f) that can be 2, worst case.
- NFT_PIPAPO_BUCKETS(f->bb) is 2^8, worst case.
- sizeof(unsigned long), from sizeof(*f->lt), lt in
  struct nft_pipapo_field.

Then, use check_mul_overflow() to multiply by bucket size and then use
check_add_overflow() to the alignment for avx2 (if needed). Finally, add
lt_size_check_overflow() helper and use it to consolidate this.

While at it, replace leftover allocation using the GFP_KERNEL to
GFP_KERNEL_ACCOUNT for consistency, in pipapo_resize().

Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-05 13:17:32 +02:00
Florian Westphal
5e4d107abd netfilter: nf_conntrack: speed up reads from nf_conntrack proc file
Dumping all conntrack entries via proc interface can take hours due to
linear search to skip entries dumped so far in each cycle.

Apply same strategy used to speed up ipvs proc reading done in
commit 178883fd03 ("ipvs: speed up reads from ip_vs_conn proc file")
to nf_conntrack.

Note that the ctnetlink interface doesn't suffer from this problem, but
many scripts depend on the nf_conntrack proc interface.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-05 13:16:26 +02:00
Zhongqiu Duan
bfe7cfb65c netfilter: nft_quota: match correctly when the quota just depleted
The xt_quota compares skb length with remaining quota, but the nft_quota
compares it with consumed bytes.

The xt_quota can match consumed bytes up to quota at maximum. But the
nft_quota break match when consumed bytes equal to quota.

i.e., nft_quota match consumed bytes in [0, quota - 1], not [0, quota].

Fixes: 795595f68d ("netfilter: nft_quota: dump consumed quota")
Signed-off-by: Zhongqiu Duan <dzq.aishenghu0@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-05 13:15:09 +02:00
Huajian Yang
aa04c6f45b netfilter: bridge: Move specific fragmented packet to slow_path instead of dropping it
The config NF_CONNTRACK_BRIDGE will change the bridge forwarding for
fragmented packets.

The original bridge does not know that it is a fragmented packet and
forwards it directly, after NF_CONNTRACK_BRIDGE is enabled, function
nf_br_ip_fragment and br_ip6_fragment will check the headroom.

In original br_forward, insufficient headroom of skb may indeed exist,
but there's still a way to save the skb in the device driver after
dev_queue_xmit.So droping the skb will change the original bridge
forwarding in some cases.

Fixes: 3c171f496e ("netfilter: bridge: add connection tracking system")
Signed-off-by: Huajian Yang <huajianyang@asrmicro.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-05 13:13:08 +02:00
Ido Schimmel
836b313a14 ipv4: Honor "ignore_routes_with_linkdown" sysctl in nexthop selection
Commit 32607a332c ("ipv4: prefer multipath nexthop that matches source
address") changed IPv4 nexthop selection to prefer a nexthop whose
nexthop device is assigned the specified source address for locally
generated traffic.

While the selection honors the "fib_multipath_use_neigh" sysctl and will
not choose a nexthop with an invalid neighbour, it does not honor the
"ignore_routes_with_linkdown" sysctl and can choose a nexthop without a
carrier:

 $ sysctl net.ipv4.conf.all.ignore_routes_with_linkdown
 net.ipv4.conf.all.ignore_routes_with_linkdown = 1
 $ ip route show 198.51.100.0/24
 198.51.100.0/24
         nexthop via 192.0.2.2 dev dummy1 weight 1
         nexthop via 192.0.2.18 dev dummy2 weight 1 dead linkdown
 $ ip route get 198.51.100.1 from 192.0.2.17
 198.51.100.1 from 192.0.2.17 via 192.0.2.18 dev dummy2 uid 0

Solve this by skipping over nexthops whose assigned hash upper bound is
minus one, which is the value assigned to nexthops that do not have a
carrier when the "ignore_routes_with_linkdown" sysctl is set.

In practice, this probably does not matter a lot as the initial route
lookup for the source address would not choose a nexthop that does not
have a carrier in the first place, but the change does make the code
clearer.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-03 21:52:38 +01:00
Kuniyuki Iwashima
586ceac9ac ipv6: Restore fib6_config validation for SIOCADDRT.
syzkaller reported out-of-bounds read in ipv6_addr_prefix(),
where the prefix length was over 128.

The cited commit accidentally removed some fib6_config
validation from the ioctl path.

Let's restore the validation.

[0]:
BUG: KASAN: slab-out-of-bounds in ip6_route_info_create (./include/net/ipv6.h:616 net/ipv6/route.c:3814)
Read of size 1 at addr ff11000138020ad4 by task repro/261

CPU: 3 UID: 0 PID: 261 Comm: repro Not tainted 6.15.0-rc3-00614-g0d15a26b247d #87 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl (lib/dump_stack.c:123)
 print_report (mm/kasan/report.c:409 mm/kasan/report.c:521)
 kasan_report (mm/kasan/report.c:636)
 ip6_route_info_create (./include/net/ipv6.h:616 net/ipv6/route.c:3814)
 ip6_route_add (net/ipv6/route.c:3902)
 ipv6_route_ioctl (net/ipv6/route.c:4523)
 inet6_ioctl (net/ipv6/af_inet6.c:577)
 sock_do_ioctl (net/socket.c:1190)
 sock_ioctl (net/socket.c:1314)
 __x64_sys_ioctl (fs/ioctl.c:51 fs/ioctl.c:906 fs/ioctl.c:892 fs/ioctl.c:892)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f518fb2de5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007fff14f38d18 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f518fb2de5d
RDX: 00000000200015c0 RSI: 000000000000890b RDI: 0000000000000003
RBP: 00007fff14f38d30 R08: 0000000000000800 R09: 0000000000000800
R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff14f38e48
R13: 0000000000401136 R14: 0000000000403df0 R15: 00007f518fd3c000
 </TASK>

Fixes: fa76c1674f ("ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Reported-by: Yi Lai <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/netdev/aBAcKDEFoN%2FLntBF@ly-workstation/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250501005335.53683-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-02 18:48:48 -07:00
Pedro Falcato
a2f6476ed1 mptcp: Align mptcp_inet6_sk with other protocols
Ever since commit f5f80e32de ("ipv6: remove hard coded limitation on
ipv6_pinfo") that protocols stopped using the old "obj_size -
sizeof(struct ipv6_pinfo)" way of grabbing ipv6_pinfo, that severely
restricted struct layout and caused fun, hard to see issues.

However, mptcp_inet6_sk wasn't fixed (unlike tcp_inet6_sk). Do so.
The non-cloned sockets already do the right thing using
ipv6_pinfo_offset + the generic IPv6 code.

Signed-off-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250430154541.1038561-1-pfalcato@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-02 18:28:21 -07:00
Amery Hung
64d6e3b9df bpf: net_sched: Make some Qdisc_ops ops mandatory
The patch makes all currently supported Qdisc_ops (i.e., .enqueue,
.dequeue, .init, .reset, and .destroy) mandatory.

Make .init, .reset and .destroy mandatory as bpf qdisc relies on prologue
and epilogue to check attach points and correctly initialize/cleanup
resources. The prologue/epilogue will only be generated for an struct_ops
operator only if users implement the operator.

Make .enqueue and .dequeue mandatory as bpf qdisc infra does not provide
a default data path.

Fixes: c824034495 ("bpf: net_sched: Support implementation of Qdisc_ops in bpf")
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-05-02 15:35:37 -07:00
Amery Hung
659b3b2c48 bpf: net_sched: Fix bpf qdisc init prologue when set as default qdisc
Allow .init to proceed if qdisc_lookup() returns NULL as it only happens
when called by qdisc_create_dflt() in mq/mqprio_init and the parent qdisc
has not been added to qdisc_hash yet. In qdisc_create(), the caller,
__tc_modify_qdisc(), would have made sure the parent qdisc already exist.

In addition, call qdisc_watchdog_init() whether .init succeeds or not to
prevent null-pointer dereference. In qdisc_create() and
qdisc_create_dflt(), if .init fails, .destroy will be called. As a
result, the destroy epilogue could call qdisc_watchdog_cancel() with an
uninitialized timer, causing null-pointer deference in hrtimer_cancel().

Fixes: c824034495 ("bpf: net_sched: Support implementation of Qdisc_ops in bpf")
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-05-02 14:50:08 -07:00
Jordan Rife
5668f73f09 bpf: udp: Avoid socket skips and repeats during iteration
Replace the offset-based approach for tracking progress through a bucket
in the UDP table with one based on socket cookies. Remember the cookies
of unprocessed sockets from the last batch and use this list to
pick up where we left off or, in the case that the next socket
disappears between reads, find the first socket after that point that
still exists in the bucket and resume from there.

This approach guarantees that all sockets that existed when iteration
began and continue to exist throughout will be visited exactly once.
Sockets that are added to the table during iteration may or may not be
seen, but if they are they will be seen exactly once.

Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-05-02 12:07:46 -07:00
Jordan Rife
251c6636e0 bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items
Prepare for the next patch that tracks cookies between iterations by
converting struct sock **batch to union bpf_udp_iter_batch_item *batch
inside struct bpf_udp_iter_state.

Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
2025-05-02 11:46:42 -07:00
Jordan Rife
3fae8959cd bpf: udp: Get rid of st_bucket_done
Get rid of the st_bucket_done field to simplify UDP iterator state and
logic. Before, st_bucket_done could be false if bpf_iter_udp_batch
returned a partial batch; however, with the last patch ("bpf: udp: Make
sure iter->batch always contains a full bucket snapshot"),
st_bucket_done == true is equivalent to iter->cur_sk == iter->end_sk.

Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-05-02 10:54:38 -07:00
Jordan Rife
66d454e99d bpf: udp: Make sure iter->batch always contains a full bucket snapshot
Require that iter->batch always contains a full bucket snapshot. This
invariant is important to avoid skipping or repeating sockets during
iteration when combined with the next few patches. Before, there were
two cases where a call to bpf_iter_udp_batch may only capture part of a
bucket:

1. When bpf_iter_udp_realloc_batch() returns -ENOMEM [1].
2. When more sockets are added to the bucket while calling
   bpf_iter_udp_realloc_batch(), making the updated batch size
   insufficient [2].

In cases where the batch size only covers part of a bucket, it is
possible to forget which sockets were already visited, especially if we
have to process a bucket in more than two batches. This forces us to
choose between repeating or skipping sockets, so don't allow this:

1. Stop iteration and propagate -ENOMEM up to userspace if reallocation
   fails instead of continuing with a partial batch.
2. Try bpf_iter_udp_realloc_batch() with GFP_USER just as before, but if
   we still aren't able to capture the full bucket, call
   bpf_iter_udp_realloc_batch() again while holding the bucket lock to
   guarantee the bucket does not change. On the second attempt use
   GFP_NOWAIT since we hold onto the spin lock.

Introduce the udp_portaddr_for_each_entry_from macro and use it instead
of udp_portaddr_for_each_entry to make it possible to continue iteration
from an arbitrary socket. This is required for this patch in the
GFP_NOWAIT case to allow us to fill the rest of a batch starting from
the middle of a bucket and the later patch which skips sockets that were
already seen.

Testing all scenarios directly is a bit difficult, but I did some manual
testing to exercise the code paths where GFP_NOWAIT is used and where
ERR_PTR(err) is returned. I used the realloc test case included later
in this series to trigger a scenario where a realloc happens inside
bpf_iter_udp_batch and made a small code tweak to force the first
realloc attempt to allocate a too-small batch, thus requiring
another attempt with GFP_NOWAIT. Some printks showed both reallocs with
the tests passing:

Apr 25 23:16:24 crow kernel: go again GFP_USER
Apr 25 23:16:24 crow kernel: go again GFP_NOWAIT

With this setup, I also forced each of the bpf_iter_udp_realloc_batch
calls to return -ENOMEM to ensure that iteration ends and that the
read() in userspace fails.

[1]: https://lore.kernel.org/bpf/CABi4-ogUtMrH8-NVB6W8Xg_F_KDLq=yy-yu-tKr2udXE2Mu1Lg@mail.gmail.com/
[2]: https://lore.kernel.org/bpf/7ed28273-a716-4638-912d-f86f965e54bb@linux.dev/

Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-05-02 10:54:37 -07:00
Jordan Rife
3e485e15a1 bpf: udp: Make mem flags configurable through bpf_iter_udp_realloc_batch
Prepare for the next patch which needs to be able to choose either
GFP_USER or GFP_NOWAIT for calls to bpf_iter_udp_realloc_batch.

Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
2025-05-02 10:54:35 -07:00
Andrea Mayer
14a0087e72 ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT setup
Recent updates to the locking mechanism that protects IPv6 routing tables
[1] have affected the SRv6 networking subsystem. Such changes cause
problems with some SRv6 Endpoints behaviors, like End.B6.Encaps and also
impact SRv6 counters.

Starting from commit 169fd62799 ("ipv6: Get rid of RTNL for SIOCADDRT and
RTM_NEWROUTE."), the inet6_rtm_newroute() function no longer needs to
acquire the RTNL lock for creating and configuring IPv6 routes and set up
lwtunnels.
The RTNL lock can be avoided because the ip6_route_add() function
finishes setting up a new route in a section protected by RCU.
This makes sure that no dev/nexthops can disappear during the operation.
Because of this, the steps for setting up lwtunnels - i.e., calling
lwtunnel_build_state() - are now done in a RCU lock section and not
under the RTNL lock anymore.

However, creating and configuring a lwtunnel instance in an
RCU-protected section can be problematic when that tunnel needs to
allocate memory using the GFP_KERNEL flag.
For example, the following trace shows what happens when an SRv6
End.B6.Encaps behavior is instantiated after commit 169fd62799 ("ipv6:
Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE."):

[ 3061.219696] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
[ 3061.226136] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 445, name: ip
[ 3061.232101] preempt_count: 0, expected: 0
[ 3061.235414] RCU nest depth: 1, expected: 0
[ 3061.238622] 1 lock held by ip/445:
[ 3061.241458]  #0: ffffffff83ec64a0 (rcu_read_lock){....}-{1:3}, at: ip6_route_add+0x41/0x1e0
[ 3061.248520] CPU: 1 UID: 0 PID: 445 Comm: ip Not tainted 6.15.0-rc3-micro-vm-dev-00590-ge527e891492d #2058 PREEMPT(full)
[ 3061.248532] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 3061.248549] Call Trace:
[ 3061.248620]  <TASK>
[ 3061.248633]  dump_stack_lvl+0xa9/0xc0
[ 3061.248846]  __might_resched+0x218/0x360
[ 3061.248871]  __kmalloc_node_track_caller_noprof+0x332/0x4e0
[ 3061.248889]  ? rcu_is_watching+0x3a/0x70
[ 3061.248902]  ? parse_nla_srh+0x56/0xa0
[ 3061.248938]  kmemdup_noprof+0x1c/0x40
[ 3061.248952]  parse_nla_srh+0x56/0xa0
[ 3061.248969]  seg6_local_build_state+0x2e0/0x580
[ 3061.248992]  ? __lock_acquire+0xaff/0x1cd0
[ 3061.249013]  ? do_raw_spin_lock+0x111/0x1d0
[ 3061.249027]  ? __pfx_seg6_local_build_state+0x10/0x10
[ 3061.249068]  ? lwtunnel_build_state+0xe1/0x3a0
[ 3061.249274]  lwtunnel_build_state+0x10d/0x3a0
[ 3061.249303]  fib_nh_common_init+0xce/0x1e0
[ 3061.249337]  ? __pfx_fib_nh_common_init+0x10/0x10
[ 3061.249352]  ? in6_dev_get+0xaf/0x1f0
[ 3061.249369]  ? __rcu_read_unlock+0x64/0x2e0
[ 3061.249392]  fib6_nh_init+0x290/0xc30
[ 3061.249422]  ? __pfx_fib6_nh_init+0x10/0x10
[ 3061.249447]  ? __lock_acquire+0xaff/0x1cd0
[ 3061.249459]  ? _raw_spin_unlock_irqrestore+0x22/0x70
[ 3061.249624]  ? ip6_route_info_create+0x423/0x520
[ 3061.249641]  ? rcu_is_watching+0x3a/0x70
[ 3061.249683]  ip6_route_info_create_nh+0x190/0x390
[ 3061.249715]  ip6_route_add+0x71/0x1e0
[ 3061.249730]  ? __pfx_inet6_rtm_newroute+0x10/0x10
[ 3061.249743]  inet6_rtm_newroute+0x426/0xc50
[ 3061.249764]  ? avc_has_perm_noaudit+0x13d/0x360
[ 3061.249853]  ? __pfx_inet6_rtm_newroute+0x10/0x10
[ 3061.249905]  ? __lock_acquire+0xaff/0x1cd0
[ 3061.249962]  ? rtnetlink_rcv_msg+0x52f/0x890
[ 3061.249996]  ? __pfx_inet6_rtm_newroute+0x10/0x10
[ 3061.250012]  rtnetlink_rcv_msg+0x551/0x890
[ 3061.250040]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 3061.250065]  ? __lock_acquire+0xaff/0x1cd0
[ 3061.250092]  netlink_rcv_skb+0xbd/0x1f0
[ 3061.250108]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 3061.250124]  ? __pfx_netlink_rcv_skb+0x10/0x10
[ 3061.250179]  ? netlink_deliver_tap+0x10b/0x700
[ 3061.250210]  netlink_unicast+0x2e7/0x410
[ 3061.250232]  ? __pfx_netlink_unicast+0x10/0x10
[ 3061.250241]  ? __lock_acquire+0xaff/0x1cd0
[ 3061.250280]  netlink_sendmsg+0x366/0x670
[ 3061.250306]  ? __pfx_netlink_sendmsg+0x10/0x10
[ 3061.250313]  ? find_held_lock+0x2d/0xa0
[ 3061.250344]  ? import_ubuf+0xbc/0xf0
[ 3061.250370]  ? __pfx_netlink_sendmsg+0x10/0x10
[ 3061.250381]  __sock_sendmsg+0x13e/0x150
[ 3061.250420]  ____sys_sendmsg+0x33d/0x450
[ 3061.250442]  ? __pfx_____sys_sendmsg+0x10/0x10
[ 3061.250453]  ? __pfx_copy_msghdr_from_user+0x10/0x10
[ 3061.250489]  ? __pfx_slab_free_after_rcu_debug+0x10/0x10
[ 3061.250514]  ___sys_sendmsg+0xe5/0x160
[ 3061.250530]  ? __pfx____sys_sendmsg+0x10/0x10
[ 3061.250568]  ? __lock_acquire+0xaff/0x1cd0
[ 3061.250617]  ? find_held_lock+0x2d/0xa0
[ 3061.250678]  ? __virt_addr_valid+0x199/0x340
[ 3061.250704]  ? preempt_count_sub+0xf/0xc0
[ 3061.250736]  __sys_sendmsg+0xca/0x140
[ 3061.250750]  ? __pfx___sys_sendmsg+0x10/0x10
[ 3061.250786]  ? syscall_exit_to_user_mode+0xa2/0x1e0
[ 3061.250825]  do_syscall_64+0x62/0x140
[ 3061.250844]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 3061.250855] RIP: 0033:0x7f0b042ef914
[ 3061.250868] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 48 8d 05 e9 5d 0c 00 8b 00 85 c0 75 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 41 89 d4 55 48 89 f5 53
[ 3061.250876] RSP: 002b:00007ffc2d113ef8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 3061.250885] RAX: ffffffffffffffda RBX: 00000000680f93fa RCX: 00007f0b042ef914
[ 3061.250891] RDX: 0000000000000000 RSI: 00007ffc2d113f60 RDI: 0000000000000003
[ 3061.250897] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000008
[ 3061.250902] R10: fffffffffffff26d R11: 0000000000000246 R12: 0000000000000001
[ 3061.250907] R13: 000055a961f8a520 R14: 000055a961f63eae R15: 00007ffc2d115270
[ 3061.250952]  </TASK>

To solve this issue, we replace the GFP_KERNEL flag with the GFP_ATOMIC
one in those SRv6 Endpoints that need to allocate memory during the
setup phase. This change makes sure that memory allocations are handled
in a way that works with RCU critical sections.

[1] - https://lore.kernel.org/all/20250418000443.43734-1-kuniyu@amazon.com/

Fixes: 169fd62799 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.")
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250429132453.31605-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-01 18:04:08 -07:00
Jakub Kicinski
337079d31f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc5).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-01 15:11:38 -07:00
Linus Torvalds
ebd297a2af Happy May Day.
Things have calmed down on our end (knock on wood), no outstanding
 investigations. Including fixes from Bluetooth and WiFi.
 
 Current release - fix to a fix:
 
  - igc: fix lock order in igc_ptp_reset
 
 Current release - new code bugs:
 
  - Revert "wifi: iwlwifi: make no_160 more generic", fixes regression
    to Killer line of devices reported by a number of people
 
  - Revert "wifi: iwlwifi: add support for BE213", initial FW is too buggy
 
  - number of fixes for mld, the new Intel WiFi subdriver
 
 Previous releases - regressions:
 
  - wifi: mac80211: restore monitor for outgoing frames
 
  - drv: vmxnet3: fix malformed packet sizing in vmxnet3_process_xdp
 
  - eth: bnxt_en: fix timestamping FIFO getting out of sync on reset,
    delivering stale timestamps
 
  - use sock_gen_put() in the TCP fraglist GRO heuristic, don't assume
    every socket is a full socket
 
 Previous releases - always broken:
 
  - sched: adapt qdiscs for reentrant enqueue cases, fix list corruptions
 
  - xsk: fix race condition in AF_XDP generic RX path, shared UMEM
    can't be protected by a per-socket lock
 
  - eth: mtk-star-emac: fix spinlock recursion issues on rx/tx poll
 
  - btusb: avoid NULL pointer dereference in skb_dequeue()
 
  - dsa: felix: fix broken taprio gate states after clock jump
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmgTmIoACgkQMUZtbf5S
 IruoZhAApd5csY/U9Pl7h5CK6mQek7qk+EbDcLs+/j/0UFXcOayuFKILHT27BIe5
 gEVbQBkcEpzj+QKJblWvuczreBXChTmhfQb1zItRMWJktYmlqYtPsPE4kM2inzGo
 94iDeSTca/LKPfc4dPQ/rBfXMQb7NX7UhWRR9AFBWhOuM/SDjItyZUFPqhmGDctO
 8+7YdftK2qFcwIN18kU4GqbmJEZK9Y4+ePabABR7VBFtw70bpdCptWp5tmRMbq6V
 KphxkWbWg8SP1HUnAfRAS4C3TO4Dn6e7CVaAirroqrHWmawiWgHV/o1awc0ErO8k
 aAgVhPOB5zzVqbV0ZeoEj5PL5/2rus/KmWkMludRAqGGivcQOu6GOuvo+QPwpYCU
 3edxCU95kqRB6XrjsLnZ6HnpwtBr6CUVf9FXMGNt4199cWN/PE+w4qxTz2vxiiH3
 At0su5W41s4mvldktait+ilwk+R3v1rIzTR6IDmNPW8pACI8RlQt8/Mgy7smDZp9
 FEKtWhXHdww3DosB3vHOGZ5u6gRlGvfjPsJlt2C9fGZIdrOJzzHDZ37BXDYDTYnx
 9mejzy/YAYjK826YGA3sqJ3fUDFUoAynZuajDKxErvtIZlRzEDLu+7pG4zAB8Nrx
 tHdz5iR2rDdZ87S4jB8CAP7FGVs+YZkekVvY5fIw4dGs1QMc2wE=
 =6xNz
 -----END PGP SIGNATURE-----

Merge tag 'net-6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Happy May Day.

  Things have calmed down on our end (knock on wood), no outstanding
  investigations. Including fixes from Bluetooth and WiFi.

  Current release - fix to a fix:

   - igc: fix lock order in igc_ptp_reset

  Current release - new code bugs:

   - Revert "wifi: iwlwifi: make no_160 more generic", fixes regression
     to Killer line of devices reported by a number of people

   - Revert "wifi: iwlwifi: add support for BE213", initial FW is too
     buggy

   - number of fixes for mld, the new Intel WiFi subdriver

  Previous releases - regressions:

   - wifi: mac80211: restore monitor for outgoing frames

   - drv: vmxnet3: fix malformed packet sizing in vmxnet3_process_xdp

   - eth: bnxt_en: fix timestamping FIFO getting out of sync on reset,
     delivering stale timestamps

   - use sock_gen_put() in the TCP fraglist GRO heuristic, don't assume
     every socket is a full socket

  Previous releases - always broken:

   - sched: adapt qdiscs for reentrant enqueue cases, fix list
     corruptions

   - xsk: fix race condition in AF_XDP generic RX path, shared UMEM
     can't be protected by a per-socket lock

   - eth: mtk-star-emac: fix spinlock recursion issues on rx/tx poll

   - btusb: avoid NULL pointer dereference in skb_dequeue()

   - dsa: felix: fix broken taprio gate states after clock jump"

* tag 'net-6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
  net: vertexcom: mse102x: Fix RX error handling
  net: vertexcom: mse102x: Add range check for CMD_RTS
  net: vertexcom: mse102x: Fix LEN_MASK
  net: vertexcom: mse102x: Fix possible stuck of SPI interrupt
  net: hns3: defer calling ptp_clock_register()
  net: hns3: fixed debugfs tm_qset size
  net: hns3: fix an interrupt residual problem
  net: hns3: store rx VLAN tag offload state for VF
  octeon_ep: Fix host hang issue during device reboot
  net: fec: ERR007885 Workaround for conventional TX
  net: lan743x: Fix memleak issue when GSO enabled
  ptp: ocp: Fix NULL dereference in Adva board SMA sysfs operations
  net: use sock_gen_put() when sk_state is TCP_TIME_WAIT
  bnxt_en: fix module unload sequence
  bnxt_en: Fix ethtool -d byte order for 32-bit values
  bnxt_en: Fix out-of-bound memcpy() during ethtool -w
  bnxt_en: Fix coredump logic to free allocated buffer
  bnxt_en: delay pci_alloc_irq_vectors() in the AER path
  bnxt_en: call pci_alloc_irq_vectors() after bnxt_reserve_rings()
  bnxt_en: Add missing skb_mark_for_recycle() in bnxt_rx_vlan()
  ...
2025-05-01 10:37:49 -07:00
Jibin Zhang
f920436a44 net: use sock_gen_put() when sk_state is TCP_TIME_WAIT
It is possible for a pointer of type struct inet_timewait_sock to be
returned from the functions __inet_lookup_established() and
__inet6_lookup_established(). This can cause a crash when the
returned pointer is of type struct inet_timewait_sock and
sock_put() is called on it. The following is a crash call stack that
shows sk->sk_wmem_alloc being accessed in sk_free() during the call to
sock_put() on a struct inet_timewait_sock pointer. To avoid this issue,
use sock_gen_put() instead of sock_put() when sk->sk_state
is TCP_TIME_WAIT.

mrdump.ko        ipanic() + 120
vmlinux          notifier_call_chain(nr_to_call=-1, nr_calls=0) + 132
vmlinux          atomic_notifier_call_chain(val=0) + 56
vmlinux          panic() + 344
vmlinux          add_taint() + 164
vmlinux          end_report() + 136
vmlinux          kasan_report(size=0) + 236
vmlinux          report_tag_fault() + 16
vmlinux          do_tag_recovery() + 16
vmlinux          __do_kernel_fault() + 88
vmlinux          do_bad_area() + 28
vmlinux          do_tag_check_fault() + 60
vmlinux          do_mem_abort() + 80
vmlinux          el1_abort() + 56
vmlinux          el1h_64_sync_handler() + 124
vmlinux        > 0xFFFFFFC080011294()
vmlinux          __lse_atomic_fetch_add_release(v=0xF2FFFF82A896087C)
vmlinux          __lse_atomic_fetch_sub_release(v=0xF2FFFF82A896087C)
vmlinux          arch_atomic_fetch_sub_release(i=1, v=0xF2FFFF82A896087C)
+ 8
vmlinux          raw_atomic_fetch_sub_release(i=1, v=0xF2FFFF82A896087C)
+ 8
vmlinux          atomic_fetch_sub_release(i=1, v=0xF2FFFF82A896087C) + 8
vmlinux          __refcount_sub_and_test(i=1, r=0xF2FFFF82A896087C,
oldp=0) + 8
vmlinux          __refcount_dec_and_test(r=0xF2FFFF82A896087C, oldp=0) + 8
vmlinux          refcount_dec_and_test(r=0xF2FFFF82A896087C) + 8
vmlinux          sk_free(sk=0xF2FFFF82A8960700) + 28
vmlinux          sock_put() + 48
vmlinux          tcp6_check_fraglist_gro() + 236
vmlinux          tcp6_gro_receive() + 624
vmlinux          ipv6_gro_receive() + 912
vmlinux          dev_gro_receive() + 1116
vmlinux          napi_gro_receive() + 196
ccmni.ko         ccmni_rx_callback() + 208
ccmni.ko         ccmni_queue_recv_skb() + 388
ccci_dpmaif.ko   dpmaif_rxq_push_thread() + 1088
vmlinux          kthread() + 268
vmlinux          0xFFFFFFC08001F30C()

Fixes: c9d1d23e52 ("net: add heuristic for enabling TCP fraglist GRO")
Signed-off-by: Jibin Zhang <jibin.zhang@mediatek.com>
Signed-off-by: Shiming Cheng <shiming.cheng@mediatek.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250429020412.14163-1-shiming.cheng@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-01 07:00:19 -07:00
Sabrina Dubroca
417fae2c40 xfrm: ipcomp: fix truesize computation on receive
ipcomp_post_acomp currently drops all frags (via pskb_trim_unique(skb,
0)), and then subtracts the old skb->data_len from truesize. This
adjustment has already be done during trimming (in skb_condense), so
we don't need to do it again.

This shows up for example when running fragmented traffic over ipcomp,
we end up hitting the WARN_ON_ONCE in skb_try_coalesce.

Fixes: eb2953d269 ("xfrm: ipcomp: Use crypto_acomp interface")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-30 08:08:16 +02:00
Jakub Kicinski
1f773970a7 netfilter pull request 25-04-29
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmgP++oACgkQ1w0aZmrP
 KyHxVw/5ASwm5WPt7qRgSLKlbIzUbTQ5a2CEC7WS6ova5M0nvxFL6f/JACeK3tTt
 nC+2+Vx89PFYB1fcGE9MfrZBFf/Ccm9/leluKtYJ0l7VnhLAxAByn5EbExhTFPeP
 YA3neewU62x8Cm3JQ6pZrFQYjz7pEGfSgjJ6E2dncIDMcjQzFAnDxEAB8VDqzFde
 yAosgA3yJtvan2otuqAqg33i0Q32FM0rhFplIB0pvJpbq24YVs1vTupSp9xEAeDu
 Ms5dcm2fN1L2HEgt2xhDrhVLEIp3CDZpC0k9hf7eJkonm03nCBlfqUd6gPFwdz0j
 YoHziIDO4zdtefw4Z1Bd31L5BgRyEZkbf6E8qWqCb5hpauKMKNwrtaa2qAMYqWmF
 UQuGXCXkIJVTfwxrCk1zyYKRhc0ECXpah65lB7RwxlAayDWTvEGKdcWKs9Hk5jx3
 hHgxfqO+nRfUjzGap5bWpOt18mfwZn287XSWFK6Ko6m/6UgzORfEEby/YoDUELKL
 xZBWy50+fauSkhFpQXt+IRT5rirmfMYv9hBBweKBwK05txMR4JdTNcI+zZxLMwnK
 sDbRjUsE1Req2ZnJPF33PcbkbvHj/B8m3awPxt2C3/kDK5WpKL3986Bj0hIJ25iM
 12BD5iSfOTGIZJCdRMW7SAoI1h3Z7+cePF4UjxiWS7jCbKE6u2E=
 =tp12
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-25-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next:

1) Replace msecs_to_jiffies() by secs_to_jiffies(), from Easwar Hariharan.

2) Allow to compile xt_cgroup with cgroupsv2 support only,
   from Michal Koutny.

3) Prepare for sock_cgroup_classid() removal by wrapping it around
   ifdef, also from Michal Koutny.

4) Remove redundant pointer fetch on conntrack template, from Xuanqiang Luo.

5) Re-format one block in the tproxy documentation for consistency,
   from Chen Linxuan.

6) Expose set element count and type via netlink attributes,
   from Florian Westphal.

* tag 'nf-next-25-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: export set count and backend name to userspace
  docs: tproxy: fix formatting for nft code block
  netfilter: conntrack: Remove redundant NFCT_ALIGN call
  net: cgroup: Guard users of sock_cgroup_classid()
  netfilter: xt_cgroup: Make it independent from net_cls
  netfilter: xt_IDLETIMER: convert timeouts to secs_to_jiffies()
====================

Link: https://patch.msgid.link/20250428221254.3853-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-29 16:31:10 -07:00
Felix Fietkau
b936a9b8d4 net: ipv6: fix UDPv6 GSO segmentation with NAT
If any address or port is changed, update it in all packets and recalculate
checksum.

Fixes: 9fd1ff5d2a ("udp: Support UDP fraglist GRO/GSO.")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250426153210.14044-1-nbd@nbd.name
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-29 14:46:28 -07:00
Bui Quang Minh
7ead4405e0 xsk: convert xdp_copy_frags_from_zc() to use page_pool_dev_alloc()
This commit makes xdp_copy_frags_from_zc() use page allocation API
page_pool_dev_alloc() instead of page_pool_dev_alloc_netmem() to avoid
possible confusion of the returned value.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20250426081220.40689-3-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-29 14:26:26 -07:00
Bui Quang Minh
ebaebc5eaf xsk: respect the offsets when copying frags
In commit 560d958c6c ("xsk: add generic XSk &xdp_buff -> skb
conversion"), we introduce a helper to convert zerocopy xdp_buff to skb.
However, in the frag copy, we mistakenly ignore the frag's offset. This
commit adds the missing offset when copying frags in
xdp_copy_frags_from_zc(). This function is not used anywhere so no
backport is needed.

Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20250426081220.40689-2-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-29 14:26:26 -07:00
Amery Hung
7625645e69 bpf: net_sched: Fix using bpf qdisc as default qdisc
Use bpf_try_module_get()/bpf_module_put() instead of try_module_get()/
module_put() when handling default qdisc since users can assign a bpf
qdisc to it.

To trigger the bug:
$ bpftool struct_ops register bpf_qdisc_fq.bpf.o /sys/fs/bpf
$ echo bpf_fq > /proc/sys/net/core/default_qdisc

Fixes: c824034495 ("bpf: net_sched: Support implementation of Qdisc_ops in bpf")
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250429192128.3860571-1-ameryhung@gmail.com
2025-04-29 14:25:41 -07:00
Kees Cook
fca6170f5a ipv4: fib: Fix fib_info_hash_alloc() allocation type
In preparation for making the kmalloc family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)

This was allocating many sizeof(struct hlist_head *) when it actually
wanted sizeof(struct hlist_head). Luckily these are the same size.
Adjust the allocation type to match the assignment.

Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250426060529.work.873-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-29 10:44:17 -07:00
Willem de Bruijn
65e9024643 ip: load balance tcp connections to single dst addr and port
Load balance new TCP connections across nexthops also when they
connect to the same service at a single remote address and port.

This affects only port-based multipath hashing:
fib_multipath_hash_policy 1 or 3.

Local connections must choose both a source address and port when
connecting to a remote service, in ip_route_connect. This
"chicken-and-egg problem" (commit 2d7192d6cb ("ipv4: Sanitize and
simplify ip_route_{connect,newports}()")) is resolved by first
selecting a source address, by looking up a route using the zero
wildcard source port and address.

As a result multiple connections to the same destination address and
port have no entropy in fib_multipath_hash.

This is not a problem when forwarding, as skb-based hashing has a
4-tuple. Nor when establishing UDP connections, as autobind there
selects a port before reaching ip_route_connect.

Load balance also TCP, by using a random port in fib_multipath_hash.
Port assignment in inet_hash_connect is not atomic with
ip_route_connect. Thus ports are unpredictable, effectively random.

Implementation details:

Do not actually pass a random fl4_sport, as that affects not only
hashing, but routing more broadly, and can match a source port based
policy route, which existing wildcard port 0 will not. Instead,
define a new wildcard flowi flag that is used only for hashing.

Selecting a random source is equivalent to just selecting a random
hash entirely. But for code clarity, follow the normal 4-tuple hash
process and only update this field.

fib_multipath_hash can be reached with zero sport from other code
paths, so explicitly pass this flowi flag, rather than trying to infer
this case in the function itself.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250424143549.669426-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-29 16:22:25 +02:00
Willem de Bruijn
32607a332c ipv4: prefer multipath nexthop that matches source address
With multipath routes, try to ensure that packets leave on the device
that is associated with the source address.

Avoid the following tcpdump example:

    veth0 Out IP 10.1.0.2.38640 > 10.2.0.3.8000: Flags [S]
    veth1 Out IP 10.1.0.2.38648 > 10.2.0.3.8000: Flags [S]

Which can happen easily with the most straightforward setup:

    ip addr add 10.0.0.1/24 dev veth0
    ip addr add 10.1.0.1/24 dev veth1

    ip route add 10.2.0.3 nexthop via 10.0.0.2 dev veth0 \
    			  nexthop via 10.1.0.2 dev veth1

This is apparently considered WAI, based on the comment in
ip_route_output_key_hash_rcu:

    * 2. Moreover, we are allowed to send packets with saddr
    *    of another iface. --ANK

It may be ok for some uses of multipath, but not all. For instance,
when using two ISPs, a router may drop packets with unknown source.

The behavior occurs because tcp_v4_connect makes three route
lookups when establishing a connection:

1. ip_route_connect calls to select a source address, with saddr zero.
2. ip_route_connect calls again now that saddr and daddr are known.
3. ip_route_newports calls again after a source port is also chosen.

With a route with multiple nexthops, each lookup may make a different
choice depending on available entropy to fib_select_multipath. So it
is possible for 1 to select the saddr from the first entry, but 3 to
select the second entry. Leading to the above situation.

Address this by preferring a match that matches the flowi4 saddr. This
will make 2 and 3 make the same choice as 1. Continue to update the
backup choice until a choice that matches saddr is found.

Do this in fib_select_multipath itself, rather than passing an fl4_oif
constraint, to avoid changing non-multipath route selection. Commit
e6b45241c5 ("ipv4: reset flowi parameters on route connect") shows
how that may cause regressions.

Also read ipv4.sysctl_fib_multipath_use_neigh only once. No need to
refresh in the loop.

This does not happen in IPv6, which performs only one lookup.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250424143549.669426-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-29 16:22:25 +02:00
Victor Nogueira
f139f37dcd net_sched: qfq: Fix double list add in class with netem as child qdisc
As described in Gerrard's report [1], there are use cases where a netem
child qdisc will make the parent qdisc's enqueue callback reentrant.
In the case of qfq, there won't be a UAF, but the code will add the same
classifier to the list twice, which will cause memory corruption.

This patch checks whether the class was already added to the agg->active
list (cl_is_active) before doing the addition to cater for the reentrant
case.

[1] https://lore.kernel.org/netdev/CAHcdcOm+03OD2j6R0=YHKqmy=VgJ8xEOKuP6c7mSgnp-TEJJbw@mail.gmail.com/

Fixes: 37d9cf1a3c ("sched: Fix detection of empty queues in child qdiscs")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20250425220710.3964791-5-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-28 15:55:07 -07:00
Victor Nogueira
1a6d0c00fa net_sched: ets: Fix double list add in class with netem as child qdisc
As described in Gerrard's report [1], there are use cases where a netem
child qdisc will make the parent qdisc's enqueue callback reentrant.
In the case of ets, there won't be a UAF, but the code will add the same
classifier to the list twice, which will cause memory corruption.

In addition to checking for qlen being zero, this patch checks whether
the class was already added to the active_list (cl_is_active) before
doing the addition to cater for the reentrant case.

[1] https://lore.kernel.org/netdev/CAHcdcOm+03OD2j6R0=YHKqmy=VgJ8xEOKuP6c7mSgnp-TEJJbw@mail.gmail.com/

Fixes: 37d9cf1a3c ("sched: Fix detection of empty queues in child qdiscs")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20250425220710.3964791-4-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-28 15:55:06 -07:00
Victor Nogueira
141d34391a net_sched: hfsc: Fix a UAF vulnerability in class with netem as child qdisc
As described in Gerrard's report [1], we have a UAF case when an hfsc class
has a netem child qdisc. The crux of the issue is that hfsc is assuming
that checking for cl->qdisc->q.qlen == 0 guarantees that it hasn't inserted
the class in the vttree or eltree (which is not true for the netem
duplicate case).

This patch checks the n_active class variable to make sure that the code
won't insert the class in the vttree or eltree twice, catering for the
reentrant case.

[1] https://lore.kernel.org/netdev/CAHcdcOm+03OD2j6R0=YHKqmy=VgJ8xEOKuP6c7mSgnp-TEJJbw@mail.gmail.com/

Fixes: 37d9cf1a3c ("sched: Fix detection of empty queues in child qdiscs")
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20250425220710.3964791-3-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-28 15:55:06 -07:00
Victor Nogueira
f99a3fbf02 net_sched: drr: Fix double list add in class with netem as child qdisc
As described in Gerrard's report [1], there are use cases where a netem
child qdisc will make the parent qdisc's enqueue callback reentrant.
In the case of drr, there won't be a UAF, but the code will add the same
classifier to the list twice, which will cause memory corruption.

In addition to checking for qlen being zero, this patch checks whether the
class was already added to the active_list (cl_is_active) before adding
to the list to cover for the reentrant case.

[1] https://lore.kernel.org/netdev/CAHcdcOm+03OD2j6R0=YHKqmy=VgJ8xEOKuP6c7mSgnp-TEJJbw@mail.gmail.com/

Fixes: 37d9cf1a3c ("sched: Fix detection of empty queues in child qdiscs")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20250425220710.3964791-2-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-28 15:55:06 -07:00
Florian Westphal
0014af8021 netfilter: nf_tables: export set count and backend name to userspace
nf_tables picks a suitable set backend implementation (bitmap, hash,
rbtree..) based on the userspace requirements.

Figuring out the chosen backend requires information about the set flags
and the kernel version.  Export this to userspace so nft can include this
information in '--debug=netlink' output.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-04-29 00:00:27 +02:00
Xuanqiang Luo
eaa2b34db0 netfilter: conntrack: Remove redundant NFCT_ALIGN call
The "nf_ct_tmpl_alloc" function had a redundant call to "NFCT_ALIGN" when
aligning the pointer "p". Since "NFCT_ALIGN" always gives the same result
for the same input.

Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-04-29 00:00:26 +02:00
Alexei Starovoitov
224ee86639 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc4
Cross-merge bpf and other fixes after downstream PRs.

No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-28 08:40:45 -07:00
Christian Brauner
20b70e5896
net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
Now that all preconditions are met, allow handing out pidfs for reaped
sk->sk_peer_pids.

Thanks to Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
for pointing out that we need to limit this to AF_UNIX sockets for now.

Link: https://lore.kernel.org/20250425-work-pidfs-net-v2-4-450a19461e75@kernel.org
Reviewed-by: David Rheinsberg <david@readahead.eu>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-28 11:04:43 +02:00
Linus Torvalds
d22aad29de nfsd-6.15 fixes:
- Revert a v6.15 patch due to a report of SELinux test failures
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmgNA7gACgkQM2qzM29m
 f5fL+RAAlzI/+eISq6euOz6f56tP5BcqLgLtjOuaM+7d0Ix+cAihj2LUJpGjhbDH
 CalF2J3RxdAeK/PlHsbzOt5Aywec0pnYMuUG5+4BDuZyYl8D9ZuzO9zNjU8uf/Mp
 sm0O7s/RV4jbNn/8lCrD7keDqqb12Qo5F+4vOD5bE99rs9SMQSXkDSd6lR+Htj/x
 M6NGOP3FvgzjdZKuSBM1mRcJcermM76dE1DTdQ1zGswUoglomYC5v/y72Gdt/QRq
 UpSMxlX6mmoSSQSVIPVitMlgG+PzpQnGMp207WfupRY+N2euWn+9BGgKtmzQu3NJ
 BvctRSqNuc6rqRaH9LcSc5ArNB5UbBxjl385IMHhAs7w4JiLzObP6Shy3xHmDipY
 5/X1IHLg8b+CRzyHy2/a2LsBde3kvlhIpkYMkIkzC7WJqJvEWiQQaMk4vKpOvmrW
 WOR2wOLgX/Fq/2BZe3Ixe8PgdLck+qAjIYjesc/No/9Tb8Iz4CjnqOiv1WhxkG1P
 cFZvO6MP63/AgVwVf6EHUq8L96CwdXo+XJl9P48fcYpUYLRpMn1+K1IrgfF9BC7T
 TuQHn1/l427lGIg2yehHeHQ2hwBqQ4uYohJuJquSbKcprKcl2P1UNDCNoSymSoGH
 PqwdXd45XGjTkNBWYS9JZMmIFJJGLl1Txypc+P9zbAruWJPQoLg=
 =MOAx
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

 - Revert a v6.15 patch due to a report of SELinux test failures

* tag 'nfsd-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  Revert "sunrpc: clean cache_detail immediately when flush is written frequently"
2025-04-26 10:43:03 -07:00
Chuck Lever
831e3f545b Revert "sunrpc: clean cache_detail immediately when flush is written frequently"
Ondrej reports that certain SELinux tests are failing after commit
fc2a169c56 ("sunrpc: clean cache_detail immediately when flush is
written frequently"), merged during the v6.15 merge window.

Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Fixes: fc2a169c56 ("sunrpc: clean cache_detail immediately when flush is written frequently")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-04-26 12:00:43 -04:00
Christian Brauner
fd0a109a0f
net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
SO_PEERPIDFD currently doesn't support handing out pidfds if the
sk->sk_peer_pid thread-group leader has already been reaped. In this
case it currently returns EINVAL. Userspace still wants to get a pidfd
for a reaped process to have a stable handle it can pass on.
This is especially useful now that it is possible to retrieve exit
information through a pidfd via the PIDFD_GET_INFO ioctl()'s
PIDFD_INFO_EXIT flag.

Another summary has been provided by David in [1]:

> A pidfd can outlive the task it refers to, and thus user-space must
> already be prepared that the task underlying a pidfd is gone at the time
> they get their hands on the pidfd. For instance, resolving the pidfd to
> a PID via the fdinfo must be prepared to read `-1`.
>
> Despite user-space knowing that a pidfd might be stale, several kernel
> APIs currently add another layer that checks for this. In particular,
> SO_PEERPIDFD returns `EINVAL` if the peer-task was already reaped,
> but returns a stale pidfd if the task is reaped immediately after the
> respective alive-check.
>
> This has the unfortunate effect that user-space now has two ways to
> check for the exact same scenario: A syscall might return
> EINVAL/ESRCH/... *or* the pidfd might be stale, even though there is no
> particular reason to distinguish both cases. This also propagates
> through user-space APIs, which pass on pidfds. They must be prepared to
> pass on `-1` *or* the pidfd, because there is no guaranteed way to get a
> stale pidfd from the kernel.
> Userspace must already deal with a pidfd referring to a reaped task as
> the task may exit and get reaped at any time will there are still many
> pidfds referring to it.

In order to allow handing out reaped pidfd SO_PEERPIDFD needs to ensure
that PIDFD_INFO_EXIT information is available whenever a pidfd for a
reaped task is created by PIDFD_INFO_EXIT. The uapi promises that reaped
pidfds are only handed out if it is guaranteed that the caller sees the
exit information:

TEST_F(pidfd_info, success_reaped)
{
        struct pidfd_info info = {
                .mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
        };

        /*
         * Process has already been reaped and PIDFD_INFO_EXIT been set.
         * Verify that we can retrieve the exit status of the process.
         */
        ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
        ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
        ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
        ASSERT_TRUE(WIFEXITED(info.exit_code));
        ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
}

To hand out pidfds for reaped processes we thus allocate a pidfs entry
for the relevant sk->sk_peer_pid at the time the sk->sk_peer_pid is
stashed and drop it when the socket is destroyed. This guarantees that
exit information will always be recorded for the sk->sk_peer_pid task
and we can hand out pidfds for reaped processes.

Link: https://lore.kernel.org/lkml/20230807085203.819772-1-david@readahead.eu [1]
Link: https://lore.kernel.org/20250425-work-pidfs-net-v2-2-450a19461e75@kernel.org
Reviewed-by: David Rheinsberg <david@readahead.eu>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-26 08:27:54 +02:00
Linus Torvalds
349b7d77f5 A small CephFS encryption-related fix and a dead code cleanup.
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmgL1ZUTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi9KbCACTw0bAFjWzuG9EIF6UjFfDF1XVaU7f
 teIHfARjveZIhEYC4veM//EZeiHoFxgwdwyN3TP6Q7nwjcxHEG5/z1KC6sPtYzjA
 dZj/F4tXJ5h5TICxIIJAD4HSN5YgwHx/JTAq3ZkayVsAlNpUOslkK6BtilmUpPIE
 T3dXOdnm1e6QNjXyeaqnJoYgWqjHrENOSjA8gF0yuEUwOfmln9OPIImFRNyMrQ3q
 iePJd2tHzC8SKoub5qgMEXvQT6fI8zkELEvjGHk0MYSCfHK5E2LjjcEWlMqTVP4m
 3n9uj2LWqKOyx5Viv+g1Ab/CuW/AnYLMfg4NPlyz7MFLts/Y/BcTElyA
 =pFi+
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.15-rc4' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A small CephFS encryption-related fix and a dead code cleanup"

* tag 'ceph-for-6.15-rc4' of https://github.com/ceph/ceph-client:
  ceph: Fix incorrect flush end position calculation
  ceph: Remove osd_client deadcode
2025-04-25 15:51:28 -07:00
Pauli Virtanen
3908feb1bd Bluetooth: L2CAP: copy RX timestamp to new fragments
Copy timestamp too when allocating new skb for received fragment.
Fixes missing RX timestamps with fragmentation.

Fixes: 4d7ea8ee90 ("Bluetooth: L2CAP: Fix handling fragmented length")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-04-25 15:03:19 -04:00
Luiz Augusto von Dentz
024421cf39 Bluetooth: hci_conn: Fix not setting timeout for BIG Create Sync
BIG Create Sync requires the command to just generates a status so this
makes use of __hci_cmd_sync_status_sk to wait for
HCI_EVT_LE_BIG_SYNC_ESTABLISHED, also because of this chance it is not
longer necessary to use a custom method to serialize the process of
creating the BIG sync since the cmd_work_sync itself ensures only one
command would be pending which now awaits for
HCI_EVT_LE_BIG_SYNC_ESTABLISHED before proceeding to next connection.

Fixes: 42ecf19471 ("Bluetooth: ISO: Do not emit LE BIG Create Sync if previous is pending")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-04-25 15:03:19 -04:00
Luiz Augusto von Dentz
6d0417e4e1 Bluetooth: hci_conn: Fix not setting conn_timeout for Broadcast Receiver
Broadcast Receiver requires creating PA sync but the command just
generates a status so this makes use of __hci_cmd_sync_status_sk to wait
for HCI_EV_LE_PA_SYNC_ESTABLISHED, also because of this chance it is not
longer necessary to use a custom method to serialize the process of
creating the PA sync since the cmd_work_sync itself ensures only one
command would be pending which now awaits for
HCI_EV_LE_PA_SYNC_ESTABLISHED before proceeding to next connection.

Fixes: 4a5e0ba686 ("Bluetooth: ISO: Do not emit LE PA Create Sync if previous is pending")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-04-25 15:03:19 -04:00
Jeremy Harris
2b13042d36 tcp: fastopen: pass TFO child indication through getsockopt
tcp: fastopen: pass TFO child indication through getsockopt

Note that this uses up the last bit of a field in struct tcp_info

Signed-off-by: Jeremy Harris <jgh@exim.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250423124334.4916-3-jgh@exim.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24 18:21:04 -07:00
Jeremy Harris
bc2550b4e1 tcp: fastopen: note that a child socket was created
tcp: fastopen: note that a child socket was created

This uses up the last bit in a field of tcp_sock.

Signed-off-by: Jeremy Harris <jgh@exim.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250423124334.4916-2-jgh@exim.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24 18:21:04 -07:00
Colin Ian King
4134bb726e net: ip_gre: Fix spelling mistake "demultiplexor" -> "demultiplexer"
There is a spelling mistake in a pr_info message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://patch.msgid.link/20250423113719.173539-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24 18:20:40 -07:00
Dan Carpenter
3a4236c379 rxrpc: rxgk: Fix some reference count leaks
These paths should call rxgk_put(gk) but they don't.  In the
rxgk_construct_response() function the "goto error;" will free the
"response" skb as well calling rxgk_put() so that's a bonus.

Fixes: 9d1d2b5934 ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/aAikCbsnnzYtVmIA@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24 18:06:05 -07:00
e.kubanski
a1356ac774 xsk: Fix race condition in AF_XDP generic RX path
Move rx_lock from xsk_socket to xsk_buff_pool.
Fix synchronization for shared umem mode in
generic RX path where multiple sockets share
single xsk_buff_pool.

RX queue is exclusive to xsk_socket, while FILL
queue can be shared between multiple sockets.
This could result in race condition where two
CPU cores access RX path of two different sockets
sharing the same umem.

Protect both queues by acquiring spinlock in shared
xsk_buff_pool.

Lock contention may be minimized in the future by some
per-thread FQ buffering.

It's safe and necessary to move spin_lock_bh(rx_lock)
after xsk_rcv_check():
* xs->pool and spinlock_init is synchronized by
  xsk_bind() -> xsk_is_bound() memory barriers.
* xsk_rcv_check() may return true at the moment
  of xsk_release() or xsk_unbind_dev(),
  however this will not cause any data races or
  race conditions. xsk_unbind_dev() removes xdp
  socket from all maps and waits for completion
  of all outstanding rx operations. Packets in
  RX path will either complete safely or drop.

Signed-off-by: Eryk Kubanski <e.kubanski@partner.samsung.com>
Fixes: bf0bdd1343 ("xdp: fix race on generic receive path")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://patch.msgid.link/20250416101908.10919-1-e.kubanski@partner.samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24 17:11:33 -07:00
Dr. David Alan Gilbert
39144062ea rxrpc: Remove deadcode
Remove three functions that are no longer used.

rxrpc_get_txbuf() last use was removed by 2020's
commit 5e6ef4f101 ("rxrpc: Make the I/O thread take over the call and
local processor work")

rxrpc_kernel_get_epoch() last use was removed by 2020's
commit 44746355cc ("afs: Don't get epoch from a server because it may be
ambiguous")

rxrpc_kernel_set_max_life() last use was removed by 2023's
commit db099c625b ("rxrpc: Fix timeout of a call that hasn't yet been
granted a channel")

Both of the rxrpc_kernel_* functions were documented.  Remove that
documentation as well as the code.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20250422235147.146460-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24 17:03:45 -07:00
Jakub Kicinski
5565acd1e6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc4).

This pull includes wireless and a fix to vxlan which isn't
in Linus's tree just yet. The latter creates with a silent conflict
/ build breakage, so merging it now to avoid causing problems.

drivers/net/vxlan/vxlan_vnifilter.c
  094adad913 ("vxlan: Use a single lock to protect the FDB table")
  087a9eb9e5 ("vxlan: vnifilter: Fix unlocked deletion of default FDB entry")
https://lore.kernel.org/20250423145131.513029-1-idosch@nvidia.com

No "normal" conflicts, or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24 11:20:52 -07:00
Jakub Kicinski
30763f1adf Some more fixes, notably:
* iwlwifi: various regression and iwlmld fixes
  * mac80211: fix TX frames in monitor mode
  * brcmfmac: error handling for firmware load
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmgKKK0ACgkQ10qiO8sP
 aADVRxAAjf7m5Owl/3euI+wQbuYQql0F+82eoukbqWgMUKioYzWf/5N5fjtJrAn8
 0LW2O23/HHlK2Y4dqIJR3oRm+zvAQftCTr9Eqd0bC/hpT/1FBOujqagtdTNkpMSH
 qH+ExoJjpbMfuNeq3Jok2BgI82pM39vHDVL4c/EebaoNoVrwTU4y6B/nmbpOajMF
 V75spwaX1vnmdgmZdWXb+4MbH9r78Bj4CjGJ7PGXBvsepiZnNUtkFDUeqjJJcSCH
 wAQn2cv2iC1HGGDIDYe2fmO8ChnNfwiBJmWcO70CiWRrb2SXfv93nZucaWWRbD27
 5PiJnzII9CkyTEIrR47nV3xS9pw34Zov5MPRvqZi96WfYWchPVsm0fUIDWthGO+P
 RqCWJsl5z5aI2O6OlTRr7oPXCVPiD5Womv76ptjQhxls73AXv60ox/EMoCu2Mw9u
 XfZYJckjhBobOSgLJ+3bNJW22Mm5ZycLZcZo1/zjt4DpB2boAF2WYiJ1TFaQpir6
 9azBHJ8YqqniipZUsT0wDyQh+24SU97peXaF/Z0g+D55EboxcWMsGLOLZaahEhMv
 BZdRo71j1XPr4jluvOyHq6su/s1pspVnWw7+xChtV2PqZT5Ce/oxk+wuUWykNEQ3
 ZSJ2TgVkbkjeY+TWNdAKvtiS21SsVNB4rGlt3ZDNHuWnann7OOs=
 =c/97
 -----END PGP SIGNATURE-----

Merge tag 'wireless-2025-04-24' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Some more fixes, notably:
 * iwlwifi: various regression and iwlmld fixes
 * mac80211: fix TX frames in monitor mode
 * brcmfmac: error handling for firmware load

* tag 'wireless-2025-04-24' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: iwlwifi: restore missing initialization of async_handlers_list
  wifi: brcm80211: fmac: Add error handling for brcmf_usb_dl_writeimage()
  wifi: plfxlc: Remove erroneous assert in plfxlc_mac_release
  wifi: iwlwifi: fix the check for the SCRATCH register upon resume
  wifi: iwlwifi: don't warn if the NIC is gone in resume
  wifi: iwlwifi: mld: fix BAID validity check
  wifi: iwlwifi: back off on continuous errors
  wifi: iwlwifi: mld: only create debugfs symlink if it does not exist
  wifi: iwlwifi: mld: inform trans on init failure
  wifi: iwlwifi: mld: properly handle async notification in op mode start
  Revert "wifi: iwlwifi: make no_160 more generic"
  Revert "wifi: iwlwifi: add support for BE213"
  wifi: mac80211: restore monitor for outgoing frames
====================

Link: https://patch.msgid.link/20250424120535.56499-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24 11:10:57 -07:00
Michal Koutný
0876453147 net: cgroup: Guard users of sock_cgroup_classid()
Exclude code that relies on sock_cgroup_classid() as preparation of
removal of the function.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-04-24 16:04:02 +02:00
Michal Koutný
3ba0032afe netfilter: xt_cgroup: Make it independent from net_cls
The xt_group matching supports the default hierarchy since commit
c38c4597e4 ("netfilter: implement xt_cgroup cgroup2 path match").
The cgroup v1 matching (based on clsid) and cgroup v2 matching (based on
path) are rather independent. Downgrade the Kconfig dependency to
mere CONFIG_SOCK_GROUP_DATA so that xt_group can be built even without
CONFIG_NET_CLS_CGROUP for path matching.
Also add a message for users when they attempt to specify any clsid.

Link: https://lists.opensuse.org/archives/list/kernel@lists.opensuse.org/thread/S23NOILB7MUIRHSKPBOQKJHVSK26GP6X/
Cc: Jan Engelhardt <ej@inai.de>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-04-24 16:04:02 +02:00
Easwar Hariharan
f4293c2baf netfilter: xt_IDLETIMER: convert timeouts to secs_to_jiffies()
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies().  As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication.

This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:

@depends on patch@
expression E;
@@

-msecs_to_jiffies(E * 1000)
+secs_to_jiffies(E)

-msecs_to_jiffies(E * MSEC_PER_SEC)
+secs_to_jiffies(E)

Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-04-24 16:04:01 +02:00
Kuniyuki Iwashima
169fd62799 ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.
Now we are ready to remove RTNL from SIOCADDRT and RTM_NEWROUTE.

The remaining things to do are

  1. pass false to lwtunnel_valid_encap_type_attr()
  2. use rcu_dereference_rtnl() in fib6_check_nexthop()
  3. place rcu_read_lock() before ip6_route_info_create_nh().

Let's complete the RTNL-free conversion.

When each CPU-X adds 100000 routes on table-X in a batch
concurrently on c7a.metal-48xl EC2 instance with 192 CPUs,

without this series:

  $ sudo ./route_test.sh
  ...
  added 19200000 routes (100000 routes * 192 tables).
  time elapsed: 191577 milliseconds.

with this series:

  $ sudo ./route_test.sh
  ...
  added 19200000 routes (100000 routes * 192 tables).
  time elapsed: 62854 milliseconds.

I changed the number of routes in each table (1000 ~ 100000)
and consistently saw it finish 3x faster with this series.

Note that now every caller of lwtunnel_valid_encap_type() passes
false as the last argument, and this can be removed later.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-16-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
081efd1832 ipv6: Protect nh->f6i_list with spinlock and flag.
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

Then, we may be going to add a route tied to a dying nexthop.

The nexthop itself is not freed during the RCU grace period, but
if we link a route after __remove_nexthop_fib() is called for the
nexthop, the route will be leaked.

To avoid the race between IPv6 route addition under RCU vs nexthop
deletion under RTNL, let's add a dead flag and protect it and
nh->f6i_list with a spinlock.

__remove_nexthop_fib() acquires the nexthop's spinlock and sets false
to nh->dead, then calls ip6_del_rt() for the linked route one by one
without the spinlock because fib6_purge_rt() acquires it later.

While adding an IPv6 route, fib6_add() acquires the nexthop lock and
checks the dead flag just before inserting the route.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-15-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
accb46b56b ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add().
The next patch adds per-nexthop spinlock which protects nh->f6i_list.

When rt->nh is not NULL, fib6_add_rt2node() will be called under the lock.
fib6_add_rt2node() could call fib6_purge_rt() for another route, which
could holds another nexthop lock.

Then, deadlock could happen between two nexthops.

Let's defer fib6_purge_rt() after fib6_add_rt2node().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-14-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
834d97843e ipv6: Protect fib6_link_table() with spinlock.
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

If the request specifies a new table ID, fib6_new_table() is
called to create a new routing table.

Two concurrent requests could specify the same table ID, so we
need a lock to protect net->ipv6.fib_table_hash[h].

Let's add a spinlock to protect the hash bucket linkage.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-13-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
71c0efb6d1 ipv6: Factorise ip6_route_multipath_add().
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely
on RCU to guarantee dev and nexthop lifetime.

Then, the RCU section will start before ip6_route_info_create_nh()
in ip6_route_multipath_add(), but ip6_route_info_create() is called
in the same loop and will sleep.

Let's split the loop into ip6_route_mpath_info_create() and
ip6_route_mpath_info_create_nh().

Note that ip6_route_info_append() is now integrated into
ip6_route_mpath_info_create_nh() because we need to call different
free functions for nexthops that passed ip6_route_info_create_nh().

In case of failure, the remaining nexthops that ip6_route_info_create_nh()
has not been called for will be freed by ip6_route_mpath_info_cleanup().

OTOH, if a nexthop passes ip6_route_info_create_nh(), it will be linked
to a local temporary list, which will be spliced back to rt6_nh_list.
In case of failure, these nexthops will be released by fib6_info_release()
in ip6_route_multipath_add().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-12-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
5a1ccff5c6 ipv6: Rename rt6_nh.next to rt6_nh.list.
ip6_route_multipath_add() allocates struct rt6_nh for each config
of multipath routes to link them to a local list rt6_nh_list.

struct rt6_nh.next is the list node of each config, so the name
is quite misleading.

Let's rename it to list.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/netdev/c9bee472-c94e-4878-8cc2-1512b2c54db5@redhat.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-11-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
87d5d921ea ipv6: Don't pass net to ip6_route_info_append().
net is not used in ip6_route_info_append() after commit 36f19d5b4f
("net/ipv6: Remove extra call to ip6_convert_metrics for multipath case").

Let's remove the argument.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-10-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
d27b9c40db ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().
ip6_route_info_create_nh() will be called under RCU.

It calls fib_nh_common_init() and allocates nhc->nhc_pcpu_rth_output.

As with the reason for rt->fib6_nh->rt6i_pcpu, we want to avoid
GFP_ATOMIC allocation for nhc->nhc_pcpu_rth_output under RCU.

Let's preallocate it in ip6_route_info_create().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-9-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
5720a328c3 ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().
ip6_route_info_create_nh() will be called under RCU.

Then, fib6_nh_init() is also under RCU, but per-cpu memory allocation
is very likely to fail with GFP_ATOMIC while bulk-adding IPv6 routes
and we would see a bunch of this message in dmesg.

  percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left
  percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left

Let's preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().

If something fails before the original memory allocation in
fib6_nh_init(), ip6_route_info_create_nh() calls fib6_info_release(),
which releases the preallocated per-cpu memory.

Note that rt->fib6_nh->rt6i_pcpu is not preallocated when called via
ipv6_stub, so we still need alloc_percpu_gfp() in fib6_nh_init().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-8-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
c4837b9853 ipv6: Split ip6_route_info_create().
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely
on RCU to guarantee dev and nexthop lifetime.

Then, we want to allocate as much as possible before entering
the RCU section.

The RCU section will start in the middle of ip6_route_info_create(),
and this is problematic for ip6_route_multipath_add() that calls
ip6_route_info_create() multiple times.

Let's split ip6_route_info_create() into two parts; one for memory
allocation and another for nexthop setup.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-7-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
c9cabe05e4 ipv6: Move nexthop_find_by_id() after fib6_info_alloc().
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

Then, we must perform two lookups for nexthop and dev under RCU
to guarantee their lifetime.

ip6_route_info_create() calls nexthop_find_by_id() first if
RTA_NH_ID is specified, and then allocates struct fib6_info.

nexthop_find_by_id() must be called under RCU, but we do not want
to use GFP_ATOMIC for memory allocation here, which will be likely
to fail in ip6_route_multipath_add().

Let's move nexthop_find_by_id() after the memory allocation so
that we can later split ip6_route_info_create() into two parts:
the sleepable part and the RCU part.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-6-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
e6f497955f ipv6: Check GATEWAY in rtm_to_fib6_multipath_config().
In ip6_route_multipath_add(), we call rt6_qualify_for_ecmp() for each
entry.  If it returns false, the request fails.

rt6_qualify_for_ecmp() returns false if either of the conditions below
is true:

  1. f6i->fib6_flags has RTF_ADDRCONF
  2. f6i->nh is not NULL
  3. f6i->fib6_nh->fib_nh_gw_family is AF_UNSPEC

1 is unnecessary because rtm_to_fib6_config() never sets RTF_ADDRCONF
to cfg->fc_flags.

2. is equivalent with cfg->fc_nh_id.

3. can be replaced by checking RTF_GATEWAY in the base and each multipath
entry because AF_INET6 is set to f6i->fib6_nh->fib_nh_gw_family only when
cfg.fc_is_fdb is true or RTF_GATEWAY is set, but the former is always
false.

These checks do not require RCU and can be done earlier.

Let's perform the equivalent checks in rtm_to_fib6_multipath_config().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-5-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:56 +02:00
Kuniyuki Iwashima
fa76c1674f ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().
ip6_route_info_create() is called from 3 functions:

  * ip6_route_add()
  * ip6_route_multipath_add()
  * addrconf_f6i_alloc()

addrconf_f6i_alloc() does not need validation for struct fib6_config in
ip6_route_info_create().

ip6_route_multipath_add() calls ip6_route_info_create() for multiple
routes with slightly different fib6_config instances, which is copied
from the base config passed from userspace.  So, we need not validate
the same config repeatedly.

Let's move such validation into rtm_to_fib6_config().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-4-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:55 +02:00
Kuniyuki Iwashima
bd11ff421d ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.
Basically, removing an IPv6 route does not require RTNL because
the IPv6 routing tables are protected by per table lock.

inet6_rtm_delroute() calls nexthop_find_by_id() to check if the
nexthop specified by RTA_NH_ID exists.  nexthop uses rbtree and
the top-down walk can be safely performed under RCU.

ip6_route_del() already relies on RCU and the table lock, but we
need to extend the RCU critical section a bit more to cover
__ip6_del_rt().  For example, nexthop_for_each_fib6_nh() and
inet6_rt_notify() needs RCU.

Let's call nexthop_find_by_id() and __ip6_del_rt() under RCU and
get rid of RTNL from inet6_rtm_delroute() and SIOCDELRT.

Even if the nexthop is removed after rcu_read_unlock() in
inet6_rtm_delroute(), __remove_nexthop_fib() cleans up the routes
tied to the nexthop, and ip6_route_del() returns -ESRCH.  So the
request was at least valid as of nexthop_find_by_id(), and it's just
a matter of timing.

Note that we need to pass false to lwtunnel_valid_encap_type_attr().
The following patches also use the newroute bool.

Note also that fib6_get_table() does not require RCU because once
allocated fib6_table is not freed until netns dismantle.  I will
post a follow-up series to convert such callers to RCU-lockless
version.  [0]

Link: https://lore.kernel.org/netdev/20250417174557.65721-1-kuniyu@amazon.com/ #[0]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-3-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:55 +02:00
Kuniyuki Iwashima
4cb4861d8c ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().
We will perform RTM_NEWROUTE and RTM_DELROUTE under RCU, and then
we want to perform some validation out of the RCU scope.

When creating / removing an IPv6 route with RTA_MULTIPATH,
inet6_rtm_newroute() / inet6_rtm_delroute() validates RTA_GATEWAY
in each multipath entry.

Let's do that in rtm_to_fib6_config().

Note that now RTM_DELROUTE returns an error for RTA_MULTIPATH with
0 entries, which was accepted but should result in -EINVAL as
RTM_NEWROUTE.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-2-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24 09:29:55 +02:00
Jakub Kicinski
3fec58f5a4 Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
igc: Add support for Frame Preemption

Faizal Rahim says:

Introduce support for the FPE feature in the IGC driver.

The patches aligns with the upstream FPE API:
https://patchwork.kernel.org/project/netdevbpf/cover/20230220122343.1156614-1-vladimir.oltean@nxp.com/
https://patchwork.kernel.org/project/netdevbpf/cover/20230119122705.73054-1-vladimir.oltean@nxp.com/

It builds upon earlier work:
https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/

The patch series adds the following functionalities to the IGC driver:
a) Configure FPE using `ethtool --set-mm`.
b) Display FPE settings via `ethtool --show-mm`.
c) View FPE statistics using `ethtool --include-statistics --show-mm'.
e) Block setting preemptible tc in taprio since it is not supported yet.
   Existing code already blocks it in mqprio.

Tested:
Enabled CONFIG_PROVE_LOCKING, CONFIG_DEBUG_ATOMIC_SLEEP, CONFIG_DMA_API_DEBUG, and CONFIG_KASAN
1) selftests
2) netdev down/up cycles
3) suspend/resume cycles
4) fpe verification

No bugs or unusual dmesg logs were observed.
Ran 1), 2) and 3) with and without the patch series, compared dmesg and selftest logs - no differences found.

* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  igc: add support to get frame preemption statistics via ethtool
  igc: add support to get MAC Merge data via ethtool
  igc: block setting preemptible traffic class in taprio
  igc: add support to set tx-min-frag-size
  igc: add support for frame preemption verification
  igc: set the RX packet buffer size for TSN mode
  igc: use FIELD_PREP and GENMASK for existing RX packet buffer size
  igc: optimize TX packet buffer utilization for TSN mode
  igc: use FIELD_PREP and GENMASK for existing TX packet buffer size
  igc: rename I225_RXPBSIZE_DEFAULT and I225_TXPBSIZE_DEFAULT
  igc: rename xdp_get_tx_ring() for non-xdp usage
  net: ethtool: mm: reset verification status when link is down
  net: ethtool: mm: extract stmmac verification logic into common library
  net: stmmac: move frag_size handling out of spin_lock
====================

Link: https://patch.msgid.link/20250418163822.3519810-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-23 18:31:55 -07:00
Cong Wang
6ccbda44e2 net_sched: hfsc: Fix a potential UAF in hfsc_dequeue() too
Similarly to the previous patch, we need to safe guard hfsc_dequeue()
too. But for this one, we don't have a reliable reproducer.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20250417184732.943057-3-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-23 17:16:50 -07:00
Cong Wang
3df275ef0a net_sched: hfsc: Fix a UAF vulnerability in class handling
This patch fixes a Use-After-Free vulnerability in the HFSC qdisc class
handling. The issue occurs due to a time-of-check/time-of-use condition
in hfsc_change_class() when working with certain child qdiscs like netem
or codel.

The vulnerability works as follows:
1. hfsc_change_class() checks if a class has packets (q.qlen != 0)
2. It then calls qdisc_peek_len(), which for certain qdiscs (e.g.,
   codel, netem) might drop packets and empty the queue
3. The code continues assuming the queue is still non-empty, adding
   the class to vttree
4. This breaks HFSC scheduler assumptions that only non-empty classes
   are in vttree
5. Later, when the class is destroyed, this can lead to a Use-After-Free

The fix adds a second queue length check after qdisc_peek_len() to verify
the queue wasn't emptied.

Fixes: 21f4d5cc25 ("net_sched/hfsc: fix curve activation in hfsc_change_class()")
Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg>
Reviewed-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20250417184732.943057-2-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-23 17:16:50 -07:00
Mat Martineau
13b4ece33c mptcp: pm: Defer freeing of MPTCP userspace path manager entries
When path manager entries are deleted from the local address list, they
are first unlinked from the address list using list_del_rcu(). The
entries must not be freed until after the RCU grace period, but the
existing code immediately frees the entry.

Use kfree_rcu_mightsleep() and adjust sk_omem_alloc in open code instead
of using the sock_kfree_s() helper. This code path is only called in a
netlink handler, so the "might sleep" function is preferable to adding
a rarely-used rcu_head member to struct mptcp_pm_addr_entry.

Fixes: 88d0973163 ("mptcp: drop free_list for deleting entries")
Cc: stable@vger.kernel.org
Signed-off-by: Mat Martineau <martineau@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250421-net-mptcp-pm-defer-freeing-v1-1-e731dc6e86b9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-23 16:27:58 -07:00
Kuniyuki Iwashima
f0cc3777b2 net: Fix wild-memory-access in __register_pernet_operations() when CONFIG_NET_NS=n.
kernel test robot reported the splat below. [0]

Before commit fed176bf31 ("net: Add ops_undo_single for module
load/unload."), if CONFIG_NET_NS=n, ops was linked to pernet_list
only when init_net had not been initialised, and ops was unlinked
from pernet_list only under the same condition.

Let's say an ops is loaded before the init_net setup but unloaded
after that.  Then, the ops remains in pernet_list, which seems odd.

The cited commit added ops_undo_single(), which calls list_add() for
ops to link it to a temporary list, so a minor change was added to
__register_pernet_operations() and __unregister_pernet_operations()
under CONFIG_NET_NS=n to avoid the pernet_list corruption.

However, the corruption must have been left as is.

When CONFIG_NET_NS=n, pernet_list was used to keep ops registered
before the init_net setup, and after that, pernet_list was not used
at all.

This was because some ops annotated with __net_initdata are cleared
out of memory at some point during boot.

Then, such ops is initialised by POISON_FREE_INITMEM (0xcc), resulting
in that ops->list.{next,prev} suddenly switches from a valid pointer
to a weird value, 0xcccccccccccccccc.

To avoid such wild memory access, let's allow the pernet_list
corruption for CONFIG_NET_NS=n.

[0]:
Oops: general protection fault, probably for non-canonical address 0xf999959999999999: 0000 [#1] SMP KASAN NOPTI
KASAN: maybe wild-memory-access in range [0xccccccccccccccc8-0xcccccccccccccccf]
CPU: 2 UID: 0 PID: 346 Comm: modprobe Not tainted 6.15.0-rc1-00294-ga4cba7e98e35 #85 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:__list_add_valid_or_report (lib/list_debug.c:32)
Code: 48 c1 ea 03 80 3c 02 00 0f 85 5a 01 00 00 49 39 74 24 08 0f 85 83 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 1f 01 00 00 4c 39 26 0f 85 ab 00 00 00 4c 39 ee
RSP: 0018:ff11000135b87830 EFLAGS: 00010a07
RAX: dffffc0000000000 RBX: ffffffffc02223c0 RCX: ffffffff8406fcc2
RDX: 1999999999999999 RSI: cccccccccccccccc RDI: ffffffffc02223c0
RBP: ffffffff86064e40 R08: 0000000000000001 R09: fffffbfff0a9f5b5
R10: ffffffff854fadaf R11: 676552203a54454e R12: ffffffff86064e40
R13: ffffffffc02223c0 R14: ffffffff86064e48 R15: 0000000000000021
FS:  00007f6fb0d9e1c0(0000) GS:ff11000858ea0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6fb0eda580 CR3: 0000000122fec005 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 register_pernet_operations (./include/linux/list.h:150 (discriminator 5) ./include/linux/list.h:183 (discriminator 5) net/core/net_namespace.c:1315 (discriminator 5) net/core/net_namespace.c:1359 (discriminator 5))
 register_pernet_subsys (net/core/net_namespace.c:1401)
 inet6_init (net/ipv6/af_inet6.c:535) ipv6
 do_one_initcall (init/main.c:1257)
 do_init_module (kernel/module/main.c:2942)
 load_module (kernel/module/main.c:3409)
 init_module_from_file (kernel/module/main.c:3599)
 idempotent_init_module (kernel/module/main.c:3611)
 __x64_sys_finit_module (./include/linux/file.h:62 ./include/linux/file.h:83 kernel/module/main.c:3634 kernel/module/main.c:3621 kernel/module/main.c:3621)
 do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
 RIP: 0033:0x7f6fb0df7e5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007fffdc6a8968 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 000055b535721b70 RCX: 00007f6fb0df7e5d
RDX: 0000000000000000 RSI: 000055b51e44aa2a RDI: 0000000000000004
RBP: 0000000000040000 R08: 0000000000000000 R09: 000055b535721b30
R10: 0000000000000004 R11: 0000000000000246 R12: 000055b51e44aa2a
R13: 000055b535721bf0 R14: 000055b5357220b0 R15: 0000000000000000
 </TASK>
Modules linked in: ipv6(+) crc_ccitt

Fixes: fed176bf31 ("net: Add ops_undo_single for module load/unload.")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202504181511.1c3f23e4-lkp@intel.com
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418215025.87871-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-23 16:24:56 -07:00
KaFai Wan
4c0a42c500 selftests/bpf: Add test to access const void pointer argument in tracing program
Adding verifier test for accessing const void pointer argument in
tracing programs.

The test program loads 1st argument of bpf_fentry_test10 function
which is const void pointer and checks that verifier allows that.

Signed-off-by: KaFai Wan <mannkafai@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250423121329.3163461-3-mannkafai@gmail.com
2025-04-23 11:26:22 -07:00
Rameshkumar Sundaram
f600832794 wifi: mac80211: restructure tx profile retrieval for MLO MBSSID
For MBSSID, each vif (struct ieee80211_vif) stores another vif
pointer for the transmitting profile of MBSSID set. This won't
suffice for MLO as there may be multiple links, each of which can
be part of different MBSSID sets. Hence the information needs to
be stored per-link. Additionally, the transmitted profile itself
may be part of an MLD hence storing vif will not suffice either.
Fix MLO by storing an instance of struct ieee80211_bss_conf
for each link.

Modify following operations to reflect the above structure updates:
- channel switch completion
- BSS color change completion
- Removing nontransmitted links in ieee80211_stop_mbssid()
- drivers retrieving the transmitted link for beacon templates.

Signed-off-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com>
Co-developed-by: Muna Sinada <muna.sinada@oss.qualcomm.com>
Signed-off-by: Muna Sinada <muna.sinada@oss.qualcomm.com>
Co-developed-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com>
Signed-off-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com>
Link: https://patch.msgid.link/20250408184501.3715887-3-aloka.dixit@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 18:03:53 +02:00
Rameshkumar Sundaram
37523c3c47 wifi: nl80211: add link id of transmitted profile for MLO MBSSID
During non-transmitted (nontx) profile configuration, interface
index of the transmitted (tx) profile is used to retrieve the
wireless device (wdev) associated with it. With MLO, this 'wdev'
may be part of an MLD with more than one link, hence only
interface index is not sufficient anymore to retrieve the correct
tx profile. Add a new attribute to configure link id of tx profile.

Signed-off-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com>
Co-developed-by: Muna Sinada <muna.sinada@oss.qualcomm.com>
Signed-off-by: Muna Sinada <muna.sinada@oss.qualcomm.com>
Co-developed-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com>
Signed-off-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com>
Link: https://patch.msgid.link/20250408184501.3715887-2-aloka.dixit@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 18:03:30 +02:00
Ramasamy Kaliappan
14e0f59a88 wifi: mac80211: update ML STA with EML capabilities
When an AP and Non-AP MLD operates in EMLSR mode, EML capabilities
advertised during Association contains information such as EMLSR
transition delay, padding delay and transition timeout values.

Save the EML capabilities information that is received during station
addition and capabilities update in ieee80211_sta so that drivers can use
it for triggering EMLSR operation.

Signed-off-by: Ramasamy Kaliappan <quic_rkaliapp@quicinc.com>
Signed-off-by: Rameshkumar Sundaram <quic_ramess@quicinc.com>
Link: https://patch.msgid.link/20250327051320.3253783-3-quic_ramess@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 17:03:05 +02:00
Ramasamy Kaliappan
53160d0edf wifi: cfg80211: Add support to get EMLSR capabilities of non-AP MLD
The Enhanced multi-link single-radio (EMLSR) operation allows a non-AP MLD
with multiple receive chains to listen on one or more EMLSR links when the
corresponding non-AP STA(s) affiliated with the non-AP MLD is (are) in
the awake state. [IEEE 802.11be-2024, (35.3.17 Enhanced multi-link
single-radio (EMLSR) operation)]

An MLD which intends to enable EMLSR operations will set the EML
Capabilities Present subfield to 1 and shall set the EMLSR Support
subfield in the Common Info field of the Basic Multi-Link element to 1 in
all Management frames that include the Basic Multi-Link element except
Authentication frames. EML capabilities contains information such as
EML Transition timeout, Padding delay and Transition delay. These fields
needs to updated to drivers to trigger EMLSR operation and to transmit and
receive initial control frame and data frames.

Add support to receive EML Capabilities subfield that non-AP MLD
advertises during (re)association request and send it to underlying
drivers during ADD/SET station.

Signed-off-by: Ramasamy Kaliappan <quic_rkaliapp@quicinc.com>
Signed-off-by: Rameshkumar Sundaram <quic_ramess@quicinc.com>
Link: https://patch.msgid.link/20250327051320.3253783-2-quic_ramess@quicinc.com
[accept EMLSR capabilities only for unassoc AP STA]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 17:02:44 +02:00
Muna Sinada
1a4a6a2255 wifi: mac80211: VLAN traffic in multicast path
Currently for MLO, sending out multicast frames on each link is handled by
mac80211 only when IEEE80211_HW_MLO_MCAST_MULTI_LINK_TX flag is not set.

Dynamic VLAN multicast traffic utilizes software encryption.
Due to this, mac80211 should handle transmitting multicast frames on
all links for multicast VLAN traffic.

Signed-off-by: Muna Sinada <muna.sinada@oss.qualcomm.com>
Link: https://patch.msgid.link/20250325213125.1509362-4-muna.sinada@oss.qualcomm.com
[remove unnecessary parentheses]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 16:59:02 +02:00
Muna Sinada
90233b0ad2 wifi: mac80211: Create separate links for VLAN interfaces
Currently, MLD links for an AP_VLAN interface type is not fully
supported.

Add allocation of separate links for each VLAN interface and copy
chanctx and chandef of AP bss to VLAN where necessary. Separate
links are created because for Dynamic VLAN each link will have its own
default_multicast_key.

Signed-off-by: Muna Sinada <muna.sinada@oss.qualcomm.com>
Link: https://patch.msgid.link/20250325213125.1509362-3-muna.sinada@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 16:56:15 +02:00
Muna Sinada
f61c7b3d44 wifi: mac80211: Add link iteration macro for link data
Currently before iterating through valid links we are utilizing
open-coding when checking if vif valid_links is a non-zero value.

Add new macro, for_each_link_data(), which iterates through link_id
and checks if it is set on vif valid_links. If it is a valid link then
access link data for that link id.

Signed-off-by: Muna Sinada <muna.sinada@oss.qualcomm.com>
Link: https://patch.msgid.link/20250325213125.1509362-2-muna.sinada@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 16:56:15 +02:00
Julian Vetter
39df75eb38 wifi: mac80211: Replace __get_unaligned_cpu32 in mesh_pathtbl.c
The __get_unaligned_cpu32 function is deprecated. So, replace it with
the more generic get_unaligned and just cast the input parameter.

Signed-off-by: Julian Vetter <julian@outer-limits.org>
Link: https://patch.msgid.link/20250408092220.2267754-1-julian@outer-limits.org
[fix subject]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 15:28:50 +02:00
Toke Høiland-Jørgensen
4876376988 Revert "mac80211: Dynamically set CoDel parameters per station"
This reverts commit 484a54c2e5. The CoDel
parameter change essentially disables CoDel on slow stations, with some
questionable assumptions, as Dave pointed out in [0]. Quoting from
there:

  But here are my pithy comments as to why this part of mac80211 is so
  wrong...

   static void sta_update_codel_params(struct sta_info *sta, u32 thr)
   {
  -       if (thr && thr < STA_SLOW_THRESHOLD * sta->local->num_sta) {

  1) sta->local->num_sta is the number of associated, rather than
  active, stations. "Active" stations in the last 50ms or so, might have
  been a better thing to use, but as most people have far more than that
  associated, we end up with really lousy codel parameters, all the
  time. Mistake numero uno!

  2) The STA_SLOW_THRESHOLD was completely arbitrary in 2016.

  -               sta->cparams.target = MS2TIME(50);

  This, by itself, was probably not too bad. 30ms might have been
  better, at the time, when we were battling powersave etc, but 20ms was
  enough, really, to cover most scenarios, even where we had low rate
  2Ghz multicast to cope with. Even then, codel has a hard time finding
  any sane drop rate at all, with a target this high.

  -               sta->cparams.interval = MS2TIME(300);

  But this was horrible, a total mistake, that is leading to codel being
  completely ineffective in almost any scenario on clients or APS.
  100ms, even 80ms, here, would be vastly better than this insanity. I'm
  seeing 5+seconds of delay accumulated in a bunch of otherwise happily
  fq-ing APs....

  100ms of observed jitter during a flow is enough. Certainly (in 2016)
  there were interactions with powersave that I did not understand, and
  still don't, but if you are transmitting in the first place, powersave
  shouldn't be a problemmmm.....

  -               sta->cparams.ecn = false;

  At the time we were pretty nervous about ecn, I'm kind of sanguine
  about it now, and reliably indicating ecn seems better than turning it
  off for any reason.

  [...]

  In production, on p2p wireless, I've had 8ms and 80ms for target and
  interval for years now, and it works great.

I think Dave's arguments above are basically sound on the face of it,
and various experimentation with tighter CoDel parameters in the OpenWrt
community have show promising results[1]. So I don't think there's any
reason to keep this parameter fiddling; hence this revert.

[0] https://lore.kernel.org/linux-wireless/CAA93jw6NJ2cmLmMauz0xAgC2MGbBq6n0ZiZzAdkK0u4b+O2yXg@mail.gmail.com/
[1] https://forum.openwrt.org/t/reducing-multiplexing-latencies-still-further-in-wifi/133605/130

Suggested-By: Dave Taht <dave.taht@gmail.com>
In-memory-of: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20250403183930.197716-1-toke@toke.dk
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 15:22:11 +02:00
Johannes Berg
996c15bd30 wifi: cfg80211/mac80211: remove more 5/10 MHz code
We still have ieee80211_chandef_rate_flags() and all that,
but all the users seem pretty much broken (deflink, etc.)
Remove all the code. It's been two years since last anyone
even vaguely entertained the notion of looking at this and
fixing it.

Link: https://patch.msgid.link/20250329221419.c31da7ae8c84.I1a3a4b6008134d66ca75a5bdfc004f4594da8145@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 15:21:32 +02:00
Gustavo A. R. Silva
17328a5b6a wifi: mac80211: Avoid -Wflex-array-member-not-at-end warnings
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

Use the `DEFINE_RAW_FLEX()` helper for on-stack definitions of
a flexible structure where the size of the flexible-array member
is known at compile-time, and refactor the rest of the code,
accordingly.

So, with these changes, fix the following warnings:

net/mac80211/spectmgmt.c:151:47: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
net/mac80211/spectmgmt.c:155:48: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/Z-SQdHZljwAgIlp9@kspp
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 15:20:20 +02:00
Johannes Berg
76a853f86c wifi: free SKBTX_WIFI_STATUS skb tx_flags flag
Jason mentioned at netdevconf that we've run out of tx_flags in
the skb_shinfo(). Gain one bit back by removing the wifi bit.

We can do that because the only userspace application for it
(hostapd) doesn't change the setting on the socket, it just
uses different sockets, and normally doesn't even use this any
more, sending the frames over nl80211 instead.

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250313134942.52ff54a140ec.If390bbdc46904cf451256ba989d7a056c457af6e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 15:09:25 +02:00
Johannes Berg
abf078c0a3 wifi: mac80211: restore monitor for outgoing frames
This code was accidentally dropped during the cooked
monitor removal, but really should've been simplified
instead. Add the simple version back.

Fixes: 286e696770 ("wifi: mac80211: Drop cooked monitor support")
Link: https://patch.msgid.link/20250422213251.b3d65fd0f323.Id2a6901583f7af86bbe94deb355968b238f350c6@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-23 14:44:22 +02:00
Yong Wang
6c131043ea net: bridge: mcast: update multicast contex when vlan state is changed
When the vlan STP state is changed, which could be manipulated by
"bridge vlan" commands, similar to port STP state, this also impacts
multicast behaviors such as igmp query. In the scenario of per-VLAN
snooping, there's a need to update the corresponding multicast context
to re-arm the port query timer when vlan state becomes "forwarding" etc.

Update br_vlan_set_state() function to enable vlan multicast context
in such scenario.

Before the patch, the IGMP query does not happen in the last step of the
following test sequence, i.e. no growth for tx counter:
 # ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1
 # bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0
 # ip link add name swp1 up master br1 type dummy
 # sleep 1
 # bridge vlan set vid 1 dev swp1 state 4
 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
 # sleep 1
 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
 # bridge vlan set vid 1 dev swp1 state 3
 # sleep 2
 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1

After the patch, the IGMP query happens in the last step of the test:
 # ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1
 # bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0
 # ip link add name swp1 up master br1 type dummy
 # sleep 1
 # bridge vlan set vid 1 dev swp1 state 4
 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
 # sleep 1
 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
 # bridge vlan set vid 1 dev swp1 state 3
 # sleep 2
 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
3

Signed-off-by: Yong Wang <yongwang@nvidia.com>
Reviewed-by: Andy Roulin <aroulin@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-04-23 13:02:20 +01:00
Yong Wang
4b30ae9adb net: bridge: mcast: re-implement br_multicast_{enable, disable}_port functions
When a bridge port STP state is changed from BLOCKING/DISABLED to
FORWARDING, the port's igmp query timer will NOT re-arm itself if the
bridge has been configured as per-VLAN multicast snooping.

Solve this by choosing the correct multicast context(s) to enable/disable
port multicast based on whether per-VLAN multicast snooping is enabled or
not, i.e. using per-{port, VLAN} context in case of per-VLAN multicast
snooping by re-implementing br_multicast_enable_port() and
br_multicast_disable_port() functions.

Before the patch, the IGMP query does not happen in the last step of the
following test sequence, i.e. no growth for tx counter:
 # ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1
 # bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0
 # ip link add name swp1 up master br1 type dummy
 # bridge link set dev swp1 state 0
 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
 # sleep 1
 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
 # bridge link set dev swp1 state 3
 # sleep 2
 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1

After the patch, the IGMP query happens in the last step of the test:
 # ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1
 # bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0
 # ip link add name swp1 up master br1 type dummy
 # bridge link set dev swp1 state 0
 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
 # sleep 1
 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
 # bridge link set dev swp1 state 3
 # sleep 2
 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
3

Signed-off-by: Yong Wang <yongwang@nvidia.com>
Reviewed-by: Andy Roulin <aroulin@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-04-23 13:02:20 +01:00
Joshua Washington
0e0a7e3719 xdp: create locked/unlocked instances of xdp redirect target setters
Commit 03df156dd3 ("xdp: double protect netdev->xdp_flags with
netdev->lock") introduces the netdev lock to xdp_set_features_flag().
The change includes a _locked version of the method, as it is possible
for a driver to have already acquired the netdev lock before calling
this helper. However, the same applies to
xdp_features_(set|clear)_redirect_flags(), which ends up calling the
unlocked version of xdp_set_features_flags() leading to deadlocks in
GVE, which grabs the netdev lock as part of its suspend, reset, and
shutdown processes:

[  833.265543] WARNING: possible recursive locking detected
[  833.270949] 6.15.0-rc1 #6 Tainted: G            E
[  833.276271] --------------------------------------------
[  833.281681] systemd-shutdow/1 is trying to acquire lock:
[  833.287090] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: xdp_set_features_flag+0x29/0x90
[  833.295470]
[  833.295470] but task is already holding lock:
[  833.301400] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve]
[  833.309508]
[  833.309508] other info that might help us debug this:
[  833.316130]  Possible unsafe locking scenario:
[  833.316130]
[  833.322142]        CPU0
[  833.324681]        ----
[  833.327220]   lock(&dev->lock);
[  833.330455]   lock(&dev->lock);
[  833.333689]
[  833.333689]  *** DEADLOCK ***
[  833.333689]
[  833.339701]  May be due to missing lock nesting notation
[  833.339701]
[  833.346582] 5 locks held by systemd-shutdow/1:
[  833.351205]  #0: ffffffffa9c89130 (system_transition_mutex){+.+.}-{4:4}, at: __se_sys_reboot+0xe6/0x210
[  833.360695]  #1: ffff93b399e5c1b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xb4/0x1f0
[  833.369144]  #2: ffff949d19a471b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xc2/0x1f0
[  833.377603]  #3: ffffffffa9eca050 (rtnl_mutex){+.+.}-{4:4}, at: gve_shutdown+0x33/0x90 [gve]
[  833.386138]  #4: ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve]

Introduce xdp_features_(set|clear)_redirect_target_locked() versions
which assume that the netdev lock has already been acquired before
setting the XDP feature flag and update GVE to use the locked version.

Fixes: 03df156dd3 ("xdp: double protect netdev->xdp_flags with netdev->lock")
Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250422011643.3509287-1-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22 19:57:56 -07:00
Kuniyuki Iwashima
434efd3d0c net: Drop hold_rtnl arg from ops_undo_list().
ops_undo_list() first iterates over ops_list for ->pre_exit().

Let's check if any of the ops has ->exit_rtnl() there and drop
the hold_rtnl argument.

Note that nexthop uses ->exit_rtnl() and is built-in, so hold_rtnl
is always true for setup_net() and cleanup_net() for now.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/netdev/20250414170148.21f3523c@kernel.org/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418003259.48017-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22 19:07:41 -07:00
Tung Nguyen
d63527e109 tipc: fix NULL pointer dereference in tipc_mon_reinit_self()
syzbot reported:

tipc: Node number set to 1055423674
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 3 UID: 0 PID: 6017 Comm: kworker/3:5 Not tainted 6.15.0-rc1-syzkaller-00246-g900241a5cc15 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Workqueue: events tipc_net_finalize_work
RIP: 0010:tipc_mon_reinit_self+0x11c/0x210 net/tipc/monitor.c:719
...
RSP: 0018:ffffc9000356fb68 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000003ee87cba
RDX: 0000000000000000 RSI: ffffffff8dbc56a7 RDI: ffff88804c2cc010
RBP: dffffc0000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007
R13: fffffbfff2111097 R14: ffff88804ead8000 R15: ffff88804ead9010
FS:  0000000000000000(0000) GS:ffff888097ab9000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000f720eb00 CR3: 000000000e182000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 tipc_net_finalize+0x10b/0x180 net/tipc/net.c:140
 process_one_work+0x9cc/0x1b70 kernel/workqueue.c:3238
 process_scheduled_works kernel/workqueue.c:3319 [inline]
 worker_thread+0x6c8/0xf10 kernel/workqueue.c:3400
 kthread+0x3c2/0x780 kernel/kthread.c:464
 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:153
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>
...
RIP: 0010:tipc_mon_reinit_self+0x11c/0x210 net/tipc/monitor.c:719
...
RSP: 0018:ffffc9000356fb68 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000003ee87cba
RDX: 0000000000000000 RSI: ffffffff8dbc56a7 RDI: ffff88804c2cc010
RBP: dffffc0000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007
R13: fffffbfff2111097 R14: ffff88804ead8000 R15: ffff88804ead9010
FS:  0000000000000000(0000) GS:ffff888097ab9000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000f720eb00 CR3: 000000000e182000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

There is a racing condition between workqueue created when enabling
bearer and another thread created when disabling bearer right after
that as follow:

enabling_bearer                          | disabling_bearer
---------------                          | ----------------
tipc_disc_timeout()                      |
{                                        | bearer_disable()
 ...                                     | {
 schedule_work(&tn->work);               |  tipc_mon_delete()
 ...                                     |  {
}                                        |   ...
                                         |   write_lock_bh(&mon->lock);
                                         |   mon->self = NULL;
                                         |   write_unlock_bh(&mon->lock);
                                         |   ...
                                         |  }
tipc_net_finalize_work()                 | }
{                                        |
 ...                                     |
 tipc_net_finalize()                     |
 {                                       |
  ...                                    |
  tipc_mon_reinit_self()                 |
  {                                      |
   ...                                   |
   write_lock_bh(&mon->lock);            |
   mon->self->addr = tipc_own_addr(net); |
   write_unlock_bh(&mon->lock);          |
   ...                                   |
  }                                      |
  ...                                    |
 }                                       |
 ...                                     |
}                                        |

'mon->self' is set to NULL in disabling_bearer thread and dereferenced
later in enabling_bearer thread.

This commit fixes this issue by validating 'mon->self' before assigning
node address to it.

Reported-by: syzbot+ed60da8d686dc709164c@syzkaller.appspotmail.com
Fixes: 46cb01eeeb ("tipc: update mon's self addr when node addr generated")
Signed-off-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250417074826.578115-1-tung.quang.nguyen@est.tech
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22 18:43:57 -07:00
Dr. David Alan Gilbert
45bd443bfd net: 802: Remove unused p8022 code
p8022.c defines two external functions, register_8022_client()
and unregister_8022_client(), the last use of which was removed in
2018 by
commit 7a2e838d28 ("staging: ipx: delete it from the tree")

Remove the p8022.c file, it's corresponding header, and glue
surrounding it.  There was one place the header was included in vlan.c
but it didn't use the functions it declared.

There was a comment in net/802/Makefile about checking
against net/core/Makefile, but that's at least 20 years old and
there's no sign of net/core/Makefile mentioning it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20250418011519.145320-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22 07:04:02 -07:00
Justin Iurman
c03a49f309 net: lwtunnel: disable BHs when required
In lwtunnel_{output|xmit}(), dev_xmit_recursion() may be called in
preemptible scope for PREEMPT kernels. This patch disables BHs before
calling dev_xmit_recursion(). BHs are re-enabled only at the end, since
we must ensure the same CPU is used for both dev_xmit_recursion_inc()
and dev_xmit_recursion_dec() (and any other recursion levels in some
cases) in order to maintain valid per-cpu counters.

Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Closes: https://lore.kernel.org/netdev/CAADnVQJFWn3dBFJtY+ci6oN1pDFL=TzCmNbRgey7MdYxt_AP2g@mail.gmail.com/
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Closes: https://lore.kernel.org/netdev/m2h62qwf34.fsf@gmail.com/
Fixes: 986ffb3a57 ("net: lwtunnel: fix recursion loops")
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250416160716.8823-1-justin.iurman@uliege.be
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22 15:37:01 +02:00
Oleksij Rempel
9e8d1013b0 net: selftests: initialize TCP header and skb payload with zero
Zero-initialize TCP header via memset() to avoid garbage values that
may affect checksum or behavior during test transmission.

Also zero-fill allocated payload and padding regions using memset()
after skb_put(), ensuring deterministic content for all outgoing
test packets.

Fixes: 3e1e58d64c ("net: add generic selftest support")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Cc: stable@vger.kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250416160125.2914724-1-o.rempel@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22 15:30:35 +02:00
Dan Carpenter
b77ad30c42 rxrpc: rxgk: Set error code in rxgk_yfs_decode_ticket()
Propagate the error code if key_alloc() fails.  Don't return
success.

Fixes: 9d1d2b5934 ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/Z_-P_1iLDWksH1ik@stanley.mountain
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22 13:42:27 +02:00
Jakub Kicinski
07e32237ed bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCaAFFvgAKCRAraS+Z+3y5
 E/1OAP9SGmTMgHuHLlF8en+MaYdtwgcHy6uurXgbSQAAV/RwwQEAh2oXZE1D9I7a
 EtxsaJYqbbhD09RPwWa2Rd8iJrJYXQk=
 =qcoU
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-04-17

We've added 12 non-merge commits during the last 9 day(s) which contain
a total of 18 files changed, 1748 insertions(+), 19 deletions(-).

The main changes are:

1) bpf qdisc support, from Amery Hung.
   A qdisc can be implemented in bpf struct_ops programs and
   can be used the same as other existing qdiscs in the
   "tc qdisc" command.

2) Add xsk tail adjustment tests, from Tushar Vyavahare.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Test attaching bpf qdisc to mq and non root
  selftests/bpf: Add a bpf fq qdisc to selftest
  selftests/bpf: Add a basic fifo qdisc test
  libbpf: Support creating and destroying qdisc
  bpf: net_sched: Disable attaching bpf qdisc to non root
  bpf: net_sched: Support updating bstats
  bpf: net_sched: Add a qdisc watchdog timer
  bpf: net_sched: Add basic bpf qdisc kfuncs
  bpf: net_sched: Support implementation of Qdisc_ops in bpf
  bpf: Prepare to reuse get_ctx_arg_idx
  selftests/xsk: Add tail adjustment tests and support check
  selftests/xsk: Add packet stream replacement function
====================

Link: https://patch.msgid.link/20250417184338.3152168-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21 18:51:08 -07:00
Breno Leitao
a451930180 net: Use nlmsg_payload in rtnetlink file
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250417-nlmsg_v3-v1-2-9b09d9d7e61d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21 18:38:00 -07:00
Breno Leitao
9929ba1942 net: Use nlmsg_payload in neighbour file
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250417-nlmsg_v3-v1-1-9b09d9d7e61d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21 18:38:00 -07:00
Jakub Kicinski
d3153c3b42 net: fix the missing unlock for detached devices
The combined condition was left as is when we converted
from __dev_get_by_index() to netdev_get_by_index_lock().
There was no need to undo anything with the former, for
the latter we need an unlock.

Fixes: 1d22d3060b ("net: drop rtnl_lock for queue_mgmt operations")
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250418015317.1954107-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21 17:10:49 -07:00
Alexei Starovoitov
5709be4c35 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3
Cross-merge bpf and other fixes after downstream PRs.

No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-21 08:04:38 -07:00
Faizal Rahim
dda666343c net: ethtool: mm: reset verification status when link is down
When the link partner goes down, "ethtool --show-mm" still displays
"Verification status: SUCCEEDED," reflecting a previous state that is
no longer valid.

Reset the verification status to ensure it reflects the current state.

Reviewed-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18 08:43:10 -07:00
Vladimir Oltean
9ff2aa4206 net: ethtool: mm: extract stmmac verification logic into common library
It appears that stmmac is not the only hardware which requires a
software-driven verification state machine for the MAC Merge layer.

While on the one hand it's good to encourage hardware implementations,
on the other hand it's quite difficult to tolerate multiple drivers
implementing independently fairly non-trivial logic.

Extract the hardware-independent logic from stmmac into library code and
put it in ethtool. Name the state structure "mmsv" for MAC Merge
Software Verification. Let this expose an operations structure for
executing the hardware stuff: sync hardware with the tx_active boolean
(result of verification process), enable/disable the pMAC, send mPackets,
notify library of external events (reception of mPackets), as well as
link state changes.

Note that it is assumed that the external events are received in hardirq
context. If they are not, it is probably a good idea to disable hardirqs
when calling ethtool_mmsv_event_handle(), because the library does not
do so.

Also, the MM software verification process has no business with the
tx_min_frag_size, that is all the driver's to handle.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Co-developed-by: Choong Yong Liang <yong.liang.choong@linux.intel.com>
Signed-off-by: Choong Yong Liang <yong.liang.choong@linux.intel.com>
Tested-by: Choong Yong Liang <yong.liang.choong@linux.intel.com>
Tested-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18 08:43:09 -07:00
Jakub Kicinski
22cbc1ee26 netdev: fix the locking for netdev notifications
Kuniyuki reports that the assert for netdev lock fires when
there are netdev event listeners (otherwise we skip the netlink
event generation).

Correct the locking when coming from the notifier.

The NETDEV_XDP_FEAT_CHANGE notifier is already fully locked,
it's the documentation that's incorrect.

Fixes: 99e44f39a8 ("netdev: depend on netdev->lock for xdp features")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Reported-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/20250410171019.62128-1-kuniyu@amazon.com
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250416030447.1077551-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-17 18:55:14 -07:00
Jakub Kicinski
240ce924d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc3).

No conflicts. Adjacent changes:

tools/net/ynl/pyynl/ynl_gen_c.py
  4d07bbf2d4 ("tools: ynl-gen: don't declare loop iterator in place")
  7e8ba0c7de ("tools: ynl: don't use genlmsghdr in classic netlink")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-17 12:26:50 -07:00
Linus Torvalds
b5c6891b2c Including fixes from Bluetooth, CAN and Netfilter.
Current release - regressions:
 
  - 2 fixes for the netdev per-instance locking
 
  - batman-adv: fix double-hold of meshif when getting enabled
 
 Current release - new code bugs:
 
  - Bluetooth: increment TX timestamping tskey always for stream sockets
 
  - wifi: static analysis and build fixes for the new Intel sub-driver
 
 Previous releases - regressions:
 
  - net: fib_rules: fix iif / oif matching on L3 master (VRF) device
 
  - ipv6: add exception routes to GC list in rt6_insert_exception()
 
  - netfilter: conntrack: fix erroneous removal of offload bit
 
  - Bluetooth:
   - fix sending MGMT_EV_DEVICE_FOUND for invalid address
   - l2cap: process valid commands in too long frame
   - btnxpuart: Revert baudrate change in nxp_shutdown
 
 Previous releases - always broken:
 
  - ethtool: fix memory corruption during SFP FW flashing
 
  - eth: hibmcge: fixes for link and MTU handling, pause frames etc.
 
  - eth: igc: fixes for PTM (PCIe timestamping)
 
  - dsa: b53: enable BPDU reception for management port
 
 Misc:
 
  - fixes for Netlink protocol schemas
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmgBMWAACgkQMUZtbf5S
 Irv3kw//enBPRBMhTdXop2g9Rlw8zQEwuMinq9ytgiNbWxshd7wXHNDYKICaX8qK
 KkN4InxF4ieESHGNUy9/UvZLAEHYV8BqugdUGurwCnEg9QJlThsDmRCYTxI37Sdx
 8uUYoSYWvPcHUsw3s/qBzunAZ8R/5hbP99L1M8W+3DqnJUm6qjUYjWCmm+8sn6FW
 abZ8JiazJhGKfTtIGxxN3oOJBzKcdeTK+z29ujQBWegnau8uj92QKkuoTZTtHaYL
 MU5pRyDQS5vo/wLBizTGQQ7PrcIpKgM1mi5Oayl5WbvTUF1pJ8pvT1iT7jnGOt4c
 lWPImtdQwsNwAwFNU7/S+YkBvrYBsCUPDErQ82HtxRsNXcCzKvuM4SqkvzYS5SZd
 Qz0ZuKqXbxxSgRgasvikWP6qAI8vz4zT8pjXu/a3eE5xnsSqhwXVbEwez8KqRF79
 Cxm/iYz/m/NhYW59YSobssNHjGM4QgOu/dFr37EPCgf8BK//RLFckwQVA1qBogO/
 HEwkWhQFYNJY0BDXuzO8rCPFQbs/NBiuNU0xRbKqPGkGcqKRwTd/KLOeTe/NJvtE
 mQx5R+VNVEyNdlQtXh5muEJHhN2nLYzQUTM5gkt6jwIzcLt7HVE+ZF3nkAKf1EQB
 1nYN8gxCO1convTNT84Eldpp2At8zazz+ltnC9gWIEZhdTct46Q=
 =EObk
 -----END PGP SIGNATURE-----

Merge tag 'net-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from Bluetooth, CAN and Netfilter.

  Current release - regressions:

   - two fixes for the netdev per-instance locking

   - batman-adv: fix double-hold of meshif when getting enabled

  Current release - new code bugs:

   - Bluetooth: increment TX timestamping tskey always for stream
     sockets

   - wifi: static analysis and build fixes for the new Intel sub-driver

  Previous releases - regressions:

   - net: fib_rules: fix iif / oif matching on L3 master (VRF) device

   - ipv6: add exception routes to GC list in rt6_insert_exception()

   - netfilter: conntrack: fix erroneous removal of offload bit

   - Bluetooth:
       - fix sending MGMT_EV_DEVICE_FOUND for invalid address
       - l2cap: process valid commands in too long frame
       - btnxpuart: Revert baudrate change in nxp_shutdown

  Previous releases - always broken:

   - ethtool: fix memory corruption during SFP FW flashing

   - eth:
       - hibmcge: fixes for link and MTU handling, pause frames etc
       - igc: fixes for PTM (PCIe timestamping)

   - dsa: b53: enable BPDU reception for management port

  Misc:

   - fixes for Netlink protocol schemas"

* tag 'net-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  net: ethernet: mtk_eth_soc: revise QDMA packet scheduler settings
  net: ethernet: mtk_eth_soc: correct the max weight of the queue limit for 100Mbps
  net: ethernet: mtk_eth_soc: reapply mdc divider on reset
  net: ti: icss-iep: Fix possible NULL pointer dereference for perout request
  net: ti: icssg-prueth: Fix possible NULL pointer dereference inside emac_xmit_xdp_frame()
  net: ti: icssg-prueth: Fix kernel warning while bringing down network interface
  netfilter: conntrack: fix erronous removal of offload bit
  net: don't try to ops lock uninitialized devs
  ptp: ocp: fix start time alignment in ptp_ocp_signal_set
  net: dsa: avoid refcount warnings when ds->ops->tag_8021q_vlan_del() fails
  net: dsa: free routing table on probe failure
  net: dsa: clean up FDB, MDB, VLAN entries on unbind
  net: dsa: mv88e6xxx: fix -ENOENT when deleting VLANs and MST is unsupported
  net: dsa: mv88e6xxx: avoid unregistering devlink regions which were never registered
  net: txgbe: fix memory leak in txgbe_probe() error path
  net: bridge: switchdev: do not notify new brentries as changed
  net: b53: enable BPDU reception for management port
  netlink: specs: rt-neigh: prefix struct nfmsg members with ndm
  netlink: specs: rt-link: adjust mctp attribute naming
  netlink: specs: rtnetlink: attribute naming corrections
  ...
2025-04-17 11:45:30 -07:00
Amery Hung
e582778f02 bpf: net_sched: Disable attaching bpf qdisc to non root
Do not allow users to attach bpf qdiscs to classful qdiscs. This is to
prevent accidentally breaking existings classful qdiscs if they rely on
some data in the child qdisc. This restriction can potentially be lifted
in the future. Note that, we still allow bpf qdisc to be attached to mq.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-7-ameryhung@gmail.com
2025-04-17 10:54:40 -07:00
Amery Hung
544e0a1f1e bpf: net_sched: Support updating bstats
Add a kfunc to update Qdisc bstats when an skb is dequeued. The kfunc is
only available in .dequeue programs.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-6-ameryhung@gmail.com
2025-04-17 10:54:40 -07:00
Amery Hung
7a2dafda95 bpf: net_sched: Add a qdisc watchdog timer
Add a watchdog timer to bpf qdisc. The watchdog can be used to schedule
the execution of qdisc through kfunc, bpf_qdisc_schedule(). It can be
useful for building traffic shaping scheduling algorithm, where the time
the next packet will be dequeued is known.

The implementation relies on struct_ops gen_prologue/epilogue to patch bpf
programs provided by users. Operator specific prologue/epilogue kfuncs
are introduced instead of watchdog kfuncs so that it is easier to extend
prologue/epilogue in the future (writing C vs BPF bytecode).

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-5-ameryhung@gmail.com
2025-04-17 10:54:40 -07:00
Amery Hung
870c28588a bpf: net_sched: Add basic bpf qdisc kfuncs
Add basic kfuncs for working on skb in qdisc.

Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
in .enqueue where a to_free skb list is available from kernel to defer
the release. bpf_kfree_skb() should be used elsewhere. It is also used
in bpf_obj_free_fields() when cleaning up skb in maps and collections.

bpf_skb_get_hash() returns the flow hash of an skb, which can be used
to build flow-based queueing algorithms.

Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-4-ameryhung@gmail.com
2025-04-17 10:54:40 -07:00
Amery Hung
c824034495 bpf: net_sched: Support implementation of Qdisc_ops in bpf
The recent advancement in bpf such as allocated objects, bpf list and bpf
rbtree has provided powerful and flexible building blocks to realize
sophisticated packet scheduling algorithms. As struct_ops now supports
core operators in Qdisc_ops, start allowing qdisc to be implemented using
bpf struct_ops with this patch. Users can implement Qdisc_ops.{enqueue,
dequeue, init, reset, destroy} in bpf and register the qdisc dynamically
into the kernel.

Co-developed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-3-ameryhung@gmail.com
2025-04-17 10:54:33 -07:00
Paolo Abeni
8e57ce3c32 netfilter pull request 25-04-17
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmgA1LAACgkQ1w0aZmrP
 KyHLThAAw/aRaQmwWB7yAYuCR+VzdB+SzafLVR4PlZpH0aGG1dNucIf0y3eI016M
 ZixJsOithlpqNEFMKi6WfuKxX2h835ck56VMhYuuCW2thaOeXiyut+P6jAaTWVKV
 u9tjfYICtI2KTGWGBekhKOSxmATWzDqBRCa1uO3Qsjz+J/ti4sW8YsSSs4BE91Uu
 zecK00lSkoNTfU5HuUOpTTeStpiQcXxfO1Arx3HLtdf6IqZLtbKk4hkgJw5DtkNB
 SS4nbvnhWx8CbngdMT6EJas47TK8sJPvgWTBHPeSlYjcPV9p2QanQ3RNMASf2AAE
 OL2pPKWICUVLKTpZ4UlsNDm+0a1D3Bz7ET4CtJAmpa6sennl0cDZYsDkRCXV/Dz7
 putS+BuX3SSk/G1XqRh3VZOUf/G0Cyko2Viyj6gWP8A+TeGzAm0EtHWURP2TNodS
 1DTr0BB3/tf+uJ2WCKBng30XtyPfQz0soDPJ0LrvFQwkWotVPyOWL37cVrEs/yyF
 FjiwJkeEUBZNc8HRM3KADm69Q5LK7YLN/lNrgOCNiJeQI52rhNd7rdEheuvSCiwc
 IEer5EXgj2dubh0MeEKbUvrwXqZGrdquec7/KftTrfqo+vec7+9ycNlnjhq/k1Qb
 Ni5k7Svor5+avW/H39rUO6b6DHJoG0zaAppgrRrmniDQ330Sjok=
 =VG/D
 -----END PGP SIGNATURE-----

Merge tag 'nf-25-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fix for net

The following batch contains one Netfilter fix for net:

1) conntrack offload bit is erroneously unset in a race scenario,
   from Florian Westphal.

netfilter pull request 25-04-17

* tag 'nf-25-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: conntrack: fix erronous removal of offload bit
====================

Link: https://patch.msgid.link/20250417102847.16640-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17 15:20:41 +02:00
Paolo Abeni
a43ae7cf55 bluetooth pull request for net:
- l2cap: Process valid commands in too long frame
  - vhci: Avoid needless snprintf() calls
 -----BEGIN PGP SIGNATURE-----
 
 iQJMBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmgAGkwZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKZOmD/UbedvbIrN+fV9mUcTfWsOE
 UDo+L3GOaowjHDmRu/YR2XD65HWxJGM93u3rksZXfW+wuLWrm6gUfDGphEHT8Xee
 Oms+qWgBg/qqNX4gm0vmxGY9HeO9o+Ove8BN1cGudJAkZ5P4kz3vQ2Ytcb732tb5
 zKgh+JaiUMa3eZnXmYKVpFiZ0V1c1iJwjsp2O+pXrBvM6uoheIZSnTKtFPFF199c
 I09l30nYeKsZsRNgyCgYU3mOvNtEcHPrhH1Y8sXh5Oy88ffM/G/29FHWb3fd/wCd
 Mo4G8Ciy70OY4xR1oeS/Mnt8mNIvja3AXlquRtKHvNgcY3e6DgKQZAWXLcMMRmOO
 hdjy9pIUJLG09OGv92+HEMlnXMRvK9OYJ5GjbBVFxSHNk2b8FaIW7NdMm4xjVClZ
 qcG5pl2a3Gji5sdsrtoP3jBO6ErHDwI0S5rrBYGiSqvAqBBRvveo9cl2GgPHviEF
 IT0E6FzzyaR6xu9qpObw+LN13d4egEA2mi9TSHdUn/Vs1TCdk1zjKGOljKpcQWrc
 It0B4cwDKJPiw5Be8AXhb+OgdYHmzjUsFz9FtXJ8X92nzJLC1ufWybvR2fPnQm47
 NScQMRC3TjuWxcwuqZHYcaixO2efq4Kl58f2SF/SdBKlhlfpgOASML2PKp1AOweO
 lSZNdBpVOgQIpFw2gkS7
 =8ull
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2025-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - l2cap: Process valid commands in too long frame
 - vhci: Avoid needless snprintf() calls

* tag 'for-net-2025-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: vhci: Avoid needless snprintf() calls
  Bluetooth: l2cap: Process valid commands in too long frame
====================

Link: https://patch.msgid.link/20250416210126.2034212-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17 13:08:42 +02:00
Peter Seiderer
422cf22aa3 net: pktgen: fix code style (WARNING: Prefer strscpy over strcpy)
Fix checkpatch code style warnings:

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #1423: FILE: net/core/pktgen.c:1423:
  +                       strcpy(pkt_dev->dst_min, buf);

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #1444: FILE: net/core/pktgen.c:1444:
  +                       strcpy(pkt_dev->dst_max, buf);

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #1554: FILE: net/core/pktgen.c:1554:
  +                       strcpy(pkt_dev->src_min, buf);

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #1575: FILE: net/core/pktgen.c:1575:
  +                       strcpy(pkt_dev->src_max, buf);

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #3231: FILE: net/core/pktgen.c:3231:
  +                       strcpy(pkt_dev->result, "Starting");

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #3235: FILE: net/core/pktgen.c:3235:
  +                       strcpy(pkt_dev->result, "Error starting");

  WARNING: Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88
  #3849: FILE: net/core/pktgen.c:3849:
  +       strcpy(pkt_dev->odevname, ifname);

While at it squash memset/strcpy pattern into single strscpy_pad call.

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250415112916.113455-4-ps.report@gmx.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17 13:02:41 +02:00
Peter Seiderer
65f5b9cb54 net: pktgen: fix code style (WARNING: please, no space before tabs)
Fix checkpatch code style warnings:

  WARNING: please, no space before tabs
  #230: FILE: net/core/pktgen.c:230:
  +#define M_NETIF_RECEIVE ^I1^I/* Inject packets into stack */$

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250415112916.113455-3-ps.report@gmx.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17 13:02:41 +02:00
Peter Seiderer
3bc1ca7e17 net: pktgen: fix code style (ERROR: else should follow close brace '}')
Fix checkpatch code style errors:

  ERROR: else should follow close brace '}'
  #1317: FILE: net/core/pktgen.c:1317:
  +               }
  +               else

And checkpatch follow up code style check:

  CHECK: Unbalanced braces around else statement
  #1316: FILE: net/core/pktgen.c:1316:
  +               } else

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250415112916.113455-2-ps.report@gmx.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17 13:02:41 +02:00
Justin Iurman
47ce7c8545 net: ipv6: ioam6: fix double reallocation
If the dst_entry is the same post transformation (which is a valid use
case for IOAM), we don't add it to the cache to avoid a reference loop.
Instead, we use a "fake" dst_entry and add it to the cache as a signal.
When we read the cache, we compare it with our "fake" dst_entry and
therefore detect if we're in the special case.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Link: https://patch.msgid.link/20250415112554.23823-3-justin.iurman@uliege.be
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17 12:52:34 +02:00
Justin Iurman
d55acb9732 net: ipv6: ioam6: use consistent dst names
Be consistent and use the same terminology as other lwt users: orig_dst
is the dst_entry before the transformation, while dst is either the
dst_entry in the cache or the dst_entry after the transformation

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Link: https://patch.msgid.link/20250415112554.23823-2-justin.iurman@uliege.be
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17 12:52:34 +02:00
Antonio Quartulli
17240749f2 skb: implement skb_send_sock_locked_with_flags()
When sending an skb over a socket using skb_send_sock_locked(),
it is currently not possible to specify any flag to be set in
msghdr->msg_flags.

However, we may want to pass flags the user may have specified,
like MSG_NOSIGNAL.

Extend __skb_send_sock() with a new argument 'flags' and add a
new interface named skb_send_sock_locked_with_flags().

Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-12-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17 12:30:03 +02:00
Antonio Quartulli
11851cbd60 ovpn: implement TCP transport
With this change ovpn is allowed to communicate to peers also via TCP.
Parsing of incoming messages is implemented through the strparser API.

Note that ovpn redefines sk_prot and sk_socket->ops for the TCP socket
used to communicate with the peer.
For this reason it needs to access inet6_stream_ops, which is declared
as extern in the IPv6 module, but it is not fully exported.

Therefore this patch is also adding EXPORT_SYMBOL_GPL(inet6_stream_ops)
to net/ipv6/af_inet6.c.

Cc: David Ahern <dsahern@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-11-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17 12:30:03 +02:00
Florian Westphal
d2d31ea8cd netfilter: conntrack: fix erronous removal of offload bit
The blamed commit exposes a possible issue with flow_offload_teardown():
We might remove the offload bit of a conntrack entry that has been
offloaded again.

1. conntrack entry c1 is offloaded via flow f1 (f1->ct == c1).
2. f1 times out and is pushed back to slowpath, c1 offload bit is
   removed.  Due to bug, f1 is not unlinked from rhashtable right away.
3. a new packet arrives for the flow and re-offload is triggered, i.e.
   f2->ct == c1.  This is because lookup in flowtable skip entries with
   teardown bit set.
4. Next flowtable gc cycle finds f1 again
5. flow_offload_teardown() is called again for f1 and c1 offload bit is
   removed again, even though we have f2 referencing the same entry.

This is harmless, but clearly not correct.
Fix the bug that exposes this: set 'teardown = true' to have the gc
callback unlink the flowtable entry from the table right away instead of
the unintentional defer to the next round.

Also prevent flow_offload_teardown() from fixing up the ct state more than
once: We could also be called from the data path or a notifier, not only
from the flowtable gc callback.

NF_FLOW_TEARDOWN can never be unset, so we can use it as synchronization
point: if we observe did not see a 0 -> 1 transition, then another CPU
is already doing the ct state fixups for us.

Fixes: 03428ca5ce ("netfilter: conntrack: rework offload nf_conn timeout extension logic")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-04-17 11:14:22 +02:00
Tobias Brunner
e3fd057776 xfrm: Fix UDP GRO handling for some corner cases
This fixes an issue that's caused if there is a mismatch between the data
offset in the GRO header and the length fields in the regular sk_buff due
to the pskb_pull()/skb_push() calls.  That's because the UDP GRO layer
stripped off the UDP header via skb_gro_pull() already while the UDP
header was explicitly not pulled/pushed in this function.

For example, an IKE packet that triggered this had len=data_len=1268 and
the data_offset in the GRO header was 28 (IPv4 + UDP).  So pskb_pull()
was called with an offset of 28-8=20, which reduced len to 1248 and via
pskb_may_pull() and __pskb_pull_tail() it also set data_len to 1248.
As the ESP offload module was not loaded, the function bailed out and
called skb_push(), which restored len to 1268, however, data_len remained
at 1248.

So while skb_headlen() was 0 before, it was now 20.  The latter caused a
difference of 8 instead of 28 (or 0 if pskb_pull()/skb_push() was called
with the complete GRO data_offset) in gro_try_pull_from_frag0() that
triggered a call to gro_pull_from_frag0() that corrupted the packet.

This change uses a more GRO-like approach seen in other GRO receivers
via skb_gro_header() to just read the actual data we are interested in
and does not try to "restore" the UDP header at this point to call the
existing function.  If the offload module is not loaded, it immediately
bails out, otherwise, it only does a quick check to see if the packet
is an IKE or keepalive packet instead of calling the existing function.

Fixes: 172bf009c1 ("xfrm: Support GRO for IPv4 ESP in UDP encapsulation")
Fixes: 221ddb723d ("xfrm: Support GRO for IPv6 ESP in UDP encapsulation")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-17 11:08:16 +02:00
Chiachang Wang
e8961c50ee xfrm: Refactor migration setup during the cloning process
Previously, migration related setup, such as updating family,
destination address, and source address, was performed after
the clone was created in `xfrm_state_migrate`. This change
moves this setup into the cloning function itself, improving
code locality and reducing redundancy.

The `xfrm_state_clone_and_setup` function now conditionally
applies the migration parameters from struct xfrm_migrate
if it is provided. This allows the function to be used both
for simple cloning and for cloning with migration setup.

Test: Tested with kernel test in the Android tree located
      in https://android.googlesource.com/kernel/tests/
      The xfrm_tunnel_test.py under the tests folder in
      particular.
Signed-off-by: Chiachang Wang <chiachangwang@google.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-17 11:00:37 +02:00
Chiachang Wang
ab244a394c xfrm: Migrate offload configuration
Add hardware offload configuration to XFRM_MSG_MIGRATE
using an option netlink attribute XFRMA_OFFLOAD_DEV.

In the existing xfrm_state_migrate(), the xfrm_init_state()
is called assuming no hardware offload by default. Even the
original xfrm_state is configured with offload, the setting will
be reset. If the device is configured with hardware offload,
it's reasonable to allow the device to maintain its hardware
offload mode. But the device will end up with offload disabled
after receiving a migration event when the device migrates the
connection from one netdev to another one.

The devices that support migration may work with different
underlying networks, such as mobile devices. The hardware setting
should be forwarded to the different netdev based on the
migration configuration. This change provides the capability
for user space to migrate from one netdev to another.

Test: Tested with kernel test in the Android tree located
      in https://android.googlesource.com/kernel/tests/
      The xfrm_tunnel_test.py under the tests folder in
      particular.
Signed-off-by: Chiachang Wang <chiachangwang@google.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-17 11:00:03 +02:00
Christian Brauner
b590c928cc
net, pidfd: report EINVAL for ESRCH
dbus-broker relies -EINVAL being returned to indicate ESRCH in [1].
This causes issues for some workloads as reported in [2].
Paper over it until this is fixed in userspace.

Link: https://lore.kernel.org/20250416-gegriffen-tiefbau-70cfecb80ac8@brauner
Link: 5d34d91b13/src/util/sockopt.c (L241) [1]
Link: https://lore.kernel.org/20250415223454.GA1852104@ax162 [2]
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-17 09:43:26 +02:00
Jakub Kicinski
4e34a84061 Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
ixgbe: Add basic devlink support

Jedrzej Jagielski says:

Create devlink specific directory for more convenient future feature
development.

Flashing and reloading are supported only by E610 devices.

Introduce basic FW/NVM validation since devlink reload introduces
possibility of runtime NVM update. Check FW API version, FW recovery
mode and FW rollback mode. Introduce minimal recovery probe to let
user to reload the faulty FW when recovery mode is detected.

* '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ixgbe: add support for FW rollback mode
  ixgbe: add E610 implementation of FW recovery mode
  ixgbe: add FW API version check
  ixgbe: add support for devlink reload
  ixgbe: add device flash update via devlink
  ixgbe: extend .info_get() with stored versions
  ixgbe: add E610 functions getting PBA and FW ver info
  ixgbe: add .info_get extension specific for E610 devices
  ixgbe: read the netlist version information
  ixgbe: read the OROM version information
  ixgbe: add E610 functions for acquiring flash data
  ixgbe: add handler for devlink .info_get()
  ixgbe: add initial devlink support
  ixgbe: wrap netdev_priv() usage
  devlink: add value check to devlink_info_version_put()
====================

Link: https://patch.msgid.link/20250415221301.1633933-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 19:05:21 -07:00
Breno Leitao
04e00a849e ipv4: Use nlmsg_payload in ipmr file
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-7-a1c75d493fd7@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 18:33:03 -07:00
Breno Leitao
d5ce0ed528 ipv4: Use nlmsg_payload in route file
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-6-a1c75d493fd7@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 18:33:02 -07:00
Breno Leitao
b411638fb9 ipv4: Use nlmsg_payload in fib_frontend file
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-5-a1c75d493fd7@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 18:33:02 -07:00
Breno Leitao
7d82cc229c ipv4: Use nlmsg_payload in devinet file
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-4-a1c75d493fd7@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 18:33:02 -07:00
Breno Leitao
bc05add844 ipv6: Use nlmsg_payload in route file
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-3-a1c75d493fd7@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 18:33:02 -07:00
Breno Leitao
6c454270a8 ipv6: Use nlmsg_payload in addrconf file
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-2-a1c75d493fd7@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 18:33:02 -07:00
Breno Leitao
5ef4097ed1 ipv6: Use nlmsg_payload in addrlabel file
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

This changes function ip6addrlbl_valid_get_req() and
ip6addrlbl_valid_dump_req().

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-1-a1c75d493fd7@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 18:33:02 -07:00
Jakub Kicinski
4798cfa209 net: don't try to ops lock uninitialized devs
We need to be careful when operating on dev while in rtnl_create_link().
Some devices (vxlan) initialize netdev_ops in ->newlink, so later on.
Avoid using netdev_lock_ops(), the device isn't registered so we
cannot legally call its ops or generate any notifications for it.

netdev_ops_assert_locked_or_invisible() is safe to use, it checks
registration status first.

Reported-by: syzbot+de1c7d68a10e3f123bdd@syzkaller.appspotmail.com
Fixes: 04efcee6ef ("net: hold instance lock during NETDEV_CHANGE")
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250415151552.768373-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 18:28:11 -07:00
Vladimir Oltean
514eff7b0a net: dsa: avoid refcount warnings when ds->ops->tag_8021q_vlan_del() fails
This is very similar to the problem and solution from commit
232deb3f95 ("net: dsa: avoid refcount warnings when
->port_{fdb,mdb}_del returns error"), except for the
dsa_port_do_tag_8021q_vlan_del() operation.

Fixes: c64b9c0504 ("net: dsa: tag_8021q: add proper cross-chip notifier support")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250414213020.2959021-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 18:14:44 -07:00
Vladimir Oltean
8bf108d716 net: dsa: free routing table on probe failure
If complete = true in dsa_tree_setup(), it means that we are the last
switch of the tree which is successfully probing, and we should be
setting up all switches from our probe path.

After "complete" becomes true, dsa_tree_setup_cpu_ports() or any
subsequent function may fail. If that happens, the entire tree setup is
in limbo: the first N-1 switches have successfully finished probing
(doing nothing but having allocated persistent memory in the tree's
dst->ports, and maybe dst->rtable), and switch N failed to probe, ending
the tree setup process before anything is tangible from the user's PoV.

If switch N fails to probe, its memory (ports) will be freed and removed
from dst->ports. However, the dst->rtable elements pointing to its ports,
as created by dsa_link_touch(), will remain there, and will lead to
use-after-free if dereferenced.

If dsa_tree_setup_switches() returns -EPROBE_DEFER, which is entirely
possible because that is where ds->ops->setup() is, we get a kasan
report like this:

==================================================================
BUG: KASAN: slab-use-after-free in mv88e6xxx_setup_upstream_port+0x240/0x568
Read of size 8 at addr ffff000004f56020 by task kworker/u8:3/42

Call trace:
 __asan_report_load8_noabort+0x20/0x30
 mv88e6xxx_setup_upstream_port+0x240/0x568
 mv88e6xxx_setup+0xebc/0x1eb0
 dsa_register_switch+0x1af4/0x2ae0
 mv88e6xxx_register_switch+0x1b8/0x2a8
 mv88e6xxx_probe+0xc4c/0xf60
 mdio_probe+0x78/0xb8
 really_probe+0x2b8/0x5a8
 __driver_probe_device+0x164/0x298
 driver_probe_device+0x78/0x258
 __device_attach_driver+0x274/0x350

Allocated by task 42:
 __kasan_kmalloc+0x84/0xa0
 __kmalloc_cache_noprof+0x298/0x490
 dsa_switch_touch_ports+0x174/0x3d8
 dsa_register_switch+0x800/0x2ae0
 mv88e6xxx_register_switch+0x1b8/0x2a8
 mv88e6xxx_probe+0xc4c/0xf60
 mdio_probe+0x78/0xb8
 really_probe+0x2b8/0x5a8
 __driver_probe_device+0x164/0x298
 driver_probe_device+0x78/0x258
 __device_attach_driver+0x274/0x350

Freed by task 42:
 __kasan_slab_free+0x48/0x68
 kfree+0x138/0x418
 dsa_register_switch+0x2694/0x2ae0
 mv88e6xxx_register_switch+0x1b8/0x2a8
 mv88e6xxx_probe+0xc4c/0xf60
 mdio_probe+0x78/0xb8
 really_probe+0x2b8/0x5a8
 __driver_probe_device+0x164/0x298
 driver_probe_device+0x78/0x258
 __device_attach_driver+0x274/0x350

The simplest way to fix the bug is to delete the routing table in its
entirety. dsa_tree_setup_routing_table() has no problem in regenerating
it even if we deleted links between ports other than those of switch N,
because dsa_link_touch() first checks whether the port pair already
exists in dst->rtable, allocating if not.

The deletion of the routing table in its entirety already exists in
dsa_tree_teardown(), so refactor that into a function that can also be
called from the tree setup error path.

In my analysis of the commit to blame, it is the one which added
dsa_link elements to dst->rtable. Prior to that, each switch had its own
ds->rtable which is freed when the switch fails to probe. But the tree
is potentially persistent memory.

Fixes: c5f51765a1 ("net: dsa: list DSA links in the fabric")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250414213001.2957964-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 18:14:44 -07:00
Vladimir Oltean
7afb5fb42d net: dsa: clean up FDB, MDB, VLAN entries on unbind
As explained in many places such as commit b117e1e8a8 ("net: dsa:
delete dsa_legacy_fdb_add and dsa_legacy_fdb_del"), DSA is written given
the assumption that higher layers have balanced additions/deletions.
As such, it only makes sense to be extremely vocal when those
assumptions are violated and the driver unbinds with entries still
present.

But Ido Schimmel points out a very simple situation where that is wrong:
https://lore.kernel.org/netdev/ZDazSM5UsPPjQuKr@shredder/
(also briefly discussed by me in the aforementioned commit).

Basically, while the bridge bypass operations are not something that DSA
explicitly documents, and for the majority of DSA drivers this API
simply causes them to go to promiscuous mode, that isn't the case for
all drivers. Some have the necessary requirements for bridge bypass
operations to do something useful - see dsa_switch_supports_uc_filtering().

Although in tools/testing/selftests/net/forwarding/local_termination.sh,
we made an effort to popularize better mechanisms to manage address
filters on DSA interfaces from user space - namely macvlan for unicast,
and setsockopt(IP_ADD_MEMBERSHIP) - through mtools - for multicast, the
fact is that 'bridge fdb add ... self static local' also exists as
kernel UAPI, and might be useful to someone, even if only for a quick
hack.

It seems counter-productive to block that path by implementing shim
.ndo_fdb_add and .ndo_fdb_del operations which just return -EOPNOTSUPP
in order to prevent the ndo_dflt_fdb_add() and ndo_dflt_fdb_del() from
running, although we could do that.

Accepting that cleanup is necessary seems to be the only option.
Especially since we appear to be coming back at this from a different
angle as well. Russell King is noticing that the WARN_ON() triggers even
for VLANs:
https://lore.kernel.org/netdev/Z_li8Bj8bD4-BYKQ@shell.armlinux.org.uk/

What happens in the bug report above is that dsa_port_do_vlan_del() fails,
then the VLAN entry lingers on, and then we warn on unbind and leak it.

This is not a straight revert of the blamed commit, but we now add an
informational print to the kernel log (to still have a way to see
that bugs exist), and some extra comments gathered from past years'
experience, to justify the logic.

Fixes: 0832cd9f1f ("net: dsa: warn if port lists aren't empty in dsa_port_teardown")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250414212930.2956310-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 18:14:43 -07:00
Jonas Gorski
eb25de13bd net: bridge: switchdev: do not notify new brentries as changed
When adding a bridge vlan that is pvid or untagged after the vlan has
already been added to any other switchdev backed port, the vlan change
will be propagated as changed, since the flags change.

This causes the vlan to not be added to the hardware for DSA switches,
since the DSA handler ignores any vlans for the CPU or DSA ports that
are changed.

E.g. the following order of operations would work:

$ ip link add swbridge type bridge vlan_filtering 1 vlan_default_pvid 0
$ ip link set lan1 master swbridge
$ bridge vlan add dev swbridge vid 1 pvid untagged self
$ bridge vlan add dev lan1 vid 1 pvid untagged

but this order would break:

$ ip link add swbridge type bridge vlan_filtering 1 vlan_default_pvid 0
$ ip link set lan1 master swbridge
$ bridge vlan add dev lan1 vid 1 pvid untagged
$ bridge vlan add dev swbridge vid 1 pvid untagged self

Additionally, the vlan on the bridge itself would become undeletable:

$ bridge vlan
port              vlan-id
lan1              1 PVID Egress Untagged
swbridge          1 PVID Egress Untagged
$ bridge vlan del dev swbridge vid 1 self
$ bridge vlan
port              vlan-id
lan1              1 PVID Egress Untagged
swbridge          1 Egress Untagged

since the vlan was never added to DSA's vlan list, so deleting it will
cause an error, causing the bridge code to not remove it.

Fix this by checking if flags changed only for vlans that are already
brentry and pass changed as false for those that become brentries, as
these are a new vlan (member) from the switchdev point of view.

Since *changed is set to true for becomes_brentry = true regardless of
would_change's value, this will not change any rtnetlink notification
delivery, just the value passed on to switchdev in vlan->changed.

Fixes: 8d23a54f5b ("net: bridge: switchdev: differentiate new VLANs from changed ones")
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250414200020.192715-1-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 18:11:39 -07:00
Frédéric Danis
e2e49e2141 Bluetooth: l2cap: Process valid commands in too long frame
This is required for passing PTS test cases:
- L2CAP/COS/CED/BI-14-C
  Multiple Signaling Command in one PDU, Data Truncated, BR/EDR,
  Connection Request
- L2CAP/COS/CED/BI-15-C
  Multiple Signaling Command in one PDU, Data Truncated, BR/EDR,
  Disconnection Request

The test procedure defined in L2CAP.TS.p39 for both tests is:
1. The Lower Tester sends a C-frame to the IUT with PDU Length set
   to 8 and Channel ID set to the correct signaling channel for the
   logical link. The Information payload contains one L2CAP_ECHO_REQ
   packet with Data Length set to 0 with 0 octets of echo data and
   one command packet and Data Length set as specified in Table 4.6
   and the correct command data.
2. The IUT sends an L2CAP_ECHO_RSP PDU to the Lower Tester.
3. Perform alternative 3A, 3B, 3C, or 3D depending on the IUT’s
   response.
   Alternative 3A (IUT terminates the link):
     3A.1 The IUT terminates the link.
     3A.2 The test ends with a Pass verdict.
   Alternative 3B (IUT discards the frame):
     3B.1 The IUT does not send a reply to the Lower Tester.
   Alternative 3C (IUT rejects PDU):
     3C.1 The IUT sends an L2CAP_COMMAND_REJECT_RSP PDU to the
          Lower Tester.
   Alternative 3D (Any other IUT response):
     3D.1 The Upper Tester issues a warning and the test ends.
4. The Lower Tester sends a C-frame to the IUT with PDU Length set
   to 4 and Channel ID set to the correct signaling channel for the
   logical link. The Information payload contains Data Length set to
   0 with an L2CAP_ECHO_REQ packet with 0 octets of echo data.
5. The IUT sends an L2CAP_ECHO_RSP PDU to the Lower Tester.

With expected outcome:
  In Steps 2 and 5, the IUT responds with an L2CAP_ECHO_RSP.
  In Step 3A.1, the IUT terminates the link.
  In Step 3B.1, the IUT does not send a reply to the Lower Tester.
  In Step 3C.1, the IUT rejects the PDU.
  In Step 3D.1, the IUT sends any valid response.

Currently PTS fails with the following logs:
  Failed to receive ECHO RESPONSE.

And HCI logs:
> ACL Data RX: Handle 11 flags 0x02 dlen 20
      L2CAP: Information Response (0x0b) ident 2 len 12
        Type: Fixed channels supported (0x0003)
        Result: Success (0x0000)
        Channels: 0x000000000000002e
          L2CAP Signaling (BR/EDR)
          Connectionless reception
          AMP Manager Protocol
          L2CAP Signaling (LE)
> ACL Data RX: Handle 11 flags 0x02 dlen 13
        frame too long
        08 01 00 00 08 02 01 00 aa                       .........

Cc: stable@vger.kernel.org
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-04-16 16:50:25 -04:00
Matthias Schiffer
8772cc49e0 batman-adv: fix duplicate MAC address check
batadv_check_known_mac_addr() is both too lenient and too strict:

- It is called from batadv_hardif_add_interface(), which means that it
  checked interfaces that are not used for batman-adv at all. Move it
  to batadv_hardif_enable_interface(). Also, restrict it to hardifs of
  the same mesh interface; different mesh interfaces should not interact
  at all. The batadv_check_known_mac_addr() argument is changed from
  `struct net_device` to `struct batadv_hard_iface` to achieve this.
- The check only cares about hardifs in BATADV_IF_ACTIVE and
  BATADV_IF_TO_BE_ACTIVATED states, but interfaces in BATADV_IF_INACTIVE
  state should be checked as well, or the following steps will not
  result in a warning then they should:

  - Add two interfaces in down state with different MAC addresses to
    a mesh as hardifs
  - Change the MAC addresses so they conflict
  - Set interfaces to up state

  Now there will be two active hardifs with the same MAC address, but no
  warning. Fix by only ignoring hardifs in BATADV_IF_NOT_IN_USE state.

The RCU lock can be dropped, as we're holding RTNL anyways when the
function is called.

Fixes: c6c8fea297 ("net: Add batman-adv meshing protocol")
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2025-04-16 20:52:33 +02:00
Cosmin Ratiu
43eca05b6a xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}
Previously, device driver IPSec offload implementations would fall into
two categories:
1. Those that used xso.dev to determine the offload device.
2. Those that used xso.real_dev to determine the offload device.

The first category didn't work with bonding while the second did.
In a non-bonding setup the two pointers are the same.

This commit adds explicit pointers for the offload netdevice to
.xdo_dev_state_add() / .xdo_dev_state_delete() / .xdo_dev_state_free()
which eliminates the confusion and allows drivers from the first
category to work with bonding.

xso.real_dev now becomes a private pointer managed by the bonding
driver.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-16 11:01:41 +02:00
Cosmin Ratiu
d53dda291b xfrm: Remove unneeded device check from validate_xmit_xfrm
validate_xmit_xfrm checks whether a packet already passed through it on
the master device (xso.dev) and skips processing the skb again on the
slave device (xso.real_dev).

This check was added in commit [1] to avoid tx packets on a bond device
pass through xfrm twice and get two sets of headers, but the check was
soon obsoleted by commit [2], which was added around the same time to
fix a similar but unrelated problem. Commit [3] set XFRM_XMIT only when
packets are hw offloaded.

xso.dev is usually equal to xso.real_dev, unless bonding is used, in
which case the bonding driver uses xso.real_dev to manage offloaded xfrm
states.

Since commit [3], the check added in commit [1] is unused on all cases,
since packets going through validate_xmit_xfrm twice bail out on the
check added in commit [2]. Here's a breakdown of relevant scenarios:

1. ESP offload off: validate_xmit_xfrm returns early on !xo.
2. ESP offload on, no bond: skb->dev == xso.real_dev == xso.dev.
3. ESP offload on, bond, xs on bond dev: 1st pass adds XFRM_XMIT, 2nd
   pass returns early on XFRM_XMIT.
3. ESP offload on, bond, xs on slave dev: 1st pass returns early on
   !xo, 2nd pass adds XFRM_XMIT.
4. ESP offload on, bond, xs on both bond AND slave dev: only 1 offload
   possible in secpath. Either 1st pass adds XFRM_XMIT and 2nd pass returns
   early on XFRM_XMIT, or 1st pass is sw and returns early on !xo.
6. ESP offload on, crypto fallback triggered in esp_xmit/esp6_xmit: 1st
   pass does sw crypto & secpath_reset, 2nd pass returns on !xo.

This commit removes the unnecessary check, so xso.real_dev becomes what
it is in practice: a private field managed by bonding driver.
The check immediately below that can be simplified as well.

[1] commit 272c2330ad ("xfrm: bail early on slave pass over skb")
[2] commit 94579ac3f6 ("xfrm: Fix double ESP trailer insertion in
IPsec crypto offload.")
[3] commit c7dbf4c088 ("xfrm: Provide private skb extensions for
segmented and hw offloaded ESP packets")

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-16 11:01:32 +02:00
Cosmin Ratiu
25ac138f58 xfrm: Use xdo.dev instead of xdo.real_dev
The policy offload struct was reused from the state offload and
real_dev was copied from dev, but it was never set to anything else.
Simplify the code by always using xdo.dev for policies.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-16 11:01:19 +02:00
Jakub Kicinski
adf6b730fc linux-can-fixes-for-6.15-20250415
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEn/sM2K9nqF/8FWzzDHRl3/mQkZwFAmf+NN4THG1rbEBwZW5n
 dXRyb25peC5kZQAKCRAMdGXf+ZCRnAZkCACNYH8G1O60542NfplJq91lN4y4hHcA
 YRtAwUqUDj9PZxS2EG4q+AsNjl7EYxPgZjCx0II0Qe30EvbAt4FQ/A0M5ItvtBcH
 yKcKaUWJ5YSsnaT3Fim/RbJ1+tzR8b6n7g9S8IJ+1LIxtOL0MmFhrcqCYPY/BT+q
 /oByZ0U9dLHueOJcGdtcQFSjgEfAdXNP7OmmC1FWM23+l1Nr/VfFf+FjWRfN+ugQ
 MVsy252YpcakIRhehmrraQ1i5EtXad0IIe5mjLOLfwDkhDxLH3tGtwVQJMXYiUDz
 REqdfIKp5xClz7JTQGjXrQZvwjsqcGyJKnVQLomNQQngg7kfRxf4GJ6k
 =Ghw+
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-6.15-20250415' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2025-04-15

The first patch is by Davide Caratti and fixes the missing derement in
the protocol inuse counter for the J1939 CAN protocol.

The last patch is by Weizhao Ouyang and fixes a broken quirks check in
the rockchip CAN-FD driver.

* tag 'linux-can-fixes-for-6.15-20250415' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: rockchip_canfd: fix broken quirks checks
  can: fix missing decrement of j1939_proto.inuse_idx
====================

Link: https://patch.msgid.link/20250415103401.445981-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 20:05:55 -07:00
Sven Eckelmann
10a7796576 batman-adv: Fix double-hold of meshif when getting enabled
It was originally meant to replace the dev_hold with netdev_hold. But this
was missed in batadv_hardif_enable_interface(). As result, there was an
imbalance and a hang when trying to remove the mesh-interface with
(previously) active hard-interfaces:

  unregister_netdevice: waiting for batadv0 to become free. Usage count = 3

Fixes: 00b3553081 ("batman-adv: adopt netdev_hold() / netdev_put()")
Suggested-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+ff3aa851d46ab82953a3@syzkaller.appspotmail.com
Reported-by: syzbot+4036165fc595a74b09b2@syzkaller.appspotmail.com
Reported-by: syzbot+c35d73ce910d86c0026e@syzkaller.appspotmail.com
Reported-by: syzbot+48c14f61594bdfadb086@syzkaller.appspotmail.com
Reported-by: syzbot+f37372d86207b3bb2941@syzkaller.appspotmail.com
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250414-double_hold_fix-v5-1-10e056324cde@narfation.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 17:56:47 -07:00
Ido Schimmel
2d300ce0b7 net: fib_rules: Fix iif / oif matching on L3 master device
Before commit 40867d74c3 ("net: Add l3mdev index to flow struct and
avoid oif reset for port devices") it was possible to use FIB rules to
match on a L3 domain. This was done by having a FIB rule match on iif /
oif being a L3 master device. It worked because prior to the FIB rule
lookup the iif / oif fields in the flow structure were reset to the
index of the L3 master device to which the input / output device was
enslaved to.

The above scheme made it impossible to match on the original input /
output device. Therefore, cited commit stopped overwriting the iif / oif
fields in the flow structure and instead stored the index of the
enslaving L3 master device in a new field ('flowi_l3mdev') in the flow
structure.

While the change enabled new use cases, it broke the original use case
of matching on a L3 domain. Fix this by interpreting the iif / oif
matching on a L3 master device as a match against the L3 domain. In
other words, if the iif / oif in the FIB rule points to a L3 master
device, compare the provided index against 'flowi_l3mdev' rather than
'flowi_{i,o}if'.

Before cited commit, a FIB rule that matched on 'iif vrf1' would only
match incoming traffic from devices enslaved to 'vrf1'. With the
proposed change (i.e., comparing against 'flowi_l3mdev'), the rule would
also match traffic originating from a socket bound to 'vrf1'. Avoid that
by adding a new flow flag ('FLOWI_FLAG_L3MDEV_OIF') that indicates if
the L3 domain was derived from the output interface or the input
interface (when not set) and take this flag into account when evaluating
the FIB rule against the flow structure.

Avoid unnecessary checks in the data path by detecting that a rule
matches on a L3 master device when the rule is installed and marking it
as such.

Tested using the following script [1].

Output before 40867d74c3 (v5.4.291):

default dev dummy1 table 100 scope link
default dev dummy1 table 200 scope link

Output after 40867d74c3:

default dev dummy1 table 300 scope link
default dev dummy1 table 300 scope link

Output with this patch:

default dev dummy1 table 100 scope link
default dev dummy1 table 200 scope link

[1]
 #!/bin/bash

 ip link add name vrf1 up type vrf table 10
 ip link add name dummy1 up master vrf1 type dummy

 sysctl -wq net.ipv4.conf.all.forwarding=1
 sysctl -wq net.ipv4.conf.all.rp_filter=0

 ip route add table 100 default dev dummy1
 ip route add table 200 default dev dummy1
 ip route add table 300 default dev dummy1

 ip rule add prio 0 oif vrf1 table 100
 ip rule add prio 1 iif vrf1 table 200
 ip rule add prio 2 table 300

 ip route get 192.0.2.1 oif dummy1 fibmatch
 ip route get 192.0.2.1 iif dummy1 from 198.51.100.1 fibmatch

Fixes: 40867d74c3 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices")
Reported-by: hanhuihui <hanhuihui5@huawei.com>
Closes: https://lore.kernel.org/netdev/ec671c4f821a4d63904d0da15d604b75@huawei.com/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250414172022.242991-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 17:54:56 -07:00
Shengyu Qu
a496d2f0fd net: bridge: locally receive all multicast packets if IFF_ALLMULTI is set
If multicast snooping is enabled, multicast packets may not always end up
on the local bridge interface, if the host is not a member of the multicast
group. Similar to how IFF_PROMISC allows all packets to be received
locally, let IFF_ALLMULTI allow all multicast packets to be received.

OpenWrt uses a user space daemon for DHCPv6/RA/NDP handling, and in relay
mode it sets the ALLMULTI flag in order to receive all relevant queries on
the network.

This works for normal network interfaces and non-snooping bridges, but not
snooping bridges (unless multicast routing is enabled).

Reported-by: Felix Fietkau <nbd@nbd.name>
Closes: https://github.com/openwrt/openwrt/issues/15857#issuecomment-2662851243
Signed-off-by: Shengyu Qu <wiagn233@outlook.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/OSZPR01MB8434308370ACAFA90A22980798B32@OSZPR01MB8434.jpnprd01.prod.outlook.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 17:25:52 -07:00
Toke Høiland-Jørgensen
5f5f92912b tc: Return an error if filters try to attach too many actions
While developing the fix for the buffer sizing issue in [0], I noticed
that the kernel will happily accept a long list of actions for a filter,
and then just silently truncate that list down to a maximum of 32
actions.

That seems less than ideal, so this patch changes the action parsing to
return an error message and refuse to create the filter in this case.
This results in an error like:

 # ip link add type veth
 # tc qdisc replace dev veth0 root handle 1: fq_codel
 # tc -echo filter add dev veth0 parent 1: u32 match u32 0 0 $(for i in $(seq 33); do echo action pedit munge ip dport set 22; done)
Error: Only 32 actions supported per filter.
We have an error talking to the kernel

Instead of just creating a filter with 32 actions and dropping the last
one.

This is obviously a change in UAPI. But seeing as creating more than 32
filters has never actually *worked*, it seems that returning an explicit
error is better, and any use cases that get broken by this were already
broken just in more subtle ways.

[0] https://lore.kernel.org/r/20250407105542.16601-1-toke@redhat.com

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409145523.164506-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 17:13:11 -07:00
Breno Leitao
8ff9530361 net: fib_rules: Use nlmsg_payload in fib_{new,del}rule()
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250414-nlmsg-v2-10-3d90cb42c6af@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:28:55 -07:00
Breno Leitao
4c113c803f net: fib_rules: Use nlmsg_payload in fib_valid_dumprule_req
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250414-nlmsg-v2-9-3d90cb42c6af@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:28:55 -07:00
Breno Leitao
69a1ecfe47 mpls: Use nlmsg_payload in mpls_valid_getroute_req
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250414-nlmsg-v2-8-3d90cb42c6af@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:28:54 -07:00
Breno Leitao
8cf1e30907 ipv6: Use nlmsg_payload in inet6_rtm_valid_getaddr_req
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250414-nlmsg-v2-7-3d90cb42c6af@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:28:54 -07:00
Breno Leitao
e87187dfbb ipv6: Use nlmsg_payload in inet6_valid_dump_ifaddr_req
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250414-nlmsg-v2-6-3d90cb42c6af@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:28:54 -07:00
Breno Leitao
72be72bea9 mpls: Use nlmsg_payload in mpls_valid_fib_dump_req
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250414-nlmsg-v2-5-3d90cb42c6af@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:28:54 -07:00
Breno Leitao
77d0229036 rtnetlink: Use nlmsg_payload in valid_fdb_dump_strict
Leverage the new nlmsg_payload() helper to avoid checking for message
size and then reading the nlmsg data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250414-nlmsg-v2-4-3d90cb42c6af@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:28:54 -07:00
Breno Leitao
2d1f827f06 neighbour: Use nlmsg_payload in neigh_valid_get_req
Update neigh_valid_get_req function to utilize the new nlmsg_payload()
helper function.

This change improves code clarity and safety by ensuring that the
Netlink message payload is properly validated before accessing its data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250414-nlmsg-v2-3-3d90cb42c6af@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:28:54 -07:00
Breno Leitao
7527efe8a4 neighbour: Use nlmsg_payload in neightbl_valid_dump_info
Update neightbl_valid_dump_info function to utilize the new
nlmsg_payload() helper function.

This change improves code clarity and safety by ensuring that the
Netlink message payload is properly validated before accessing its data.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250414-nlmsg-v2-2-3d90cb42c6af@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:28:54 -07:00
Matthieu Baerts (NGI0)
4ce7fb8de5 mptcp: add MPJoinRejected MIB counter
This counter is useful to understand why some paths are rejected, and
not created as expected.

It is incremented when receiving a connection request, if the PM didn't
allow the creation of new subflows.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250413-net-next-mptcp-sched-mib-sft-misc-v2-5-0f83a4350150@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:21:47 -07:00
Matthieu Baerts (NGI0)
60cbf31585 mptcp: pass right struct to subflow_hmac_valid
subflow_hmac_valid() needs to access the MPTCP socket and the subflow
request, but not the request sock that is passed in argument.

Instead, the subflow request can be directly passed to avoid getting it
via an additional cast.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250413-net-next-mptcp-sched-mib-sft-misc-v2-4-0f83a4350150@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:21:47 -07:00
Thorsten Blum
def9d0958b mptcp: pm: Return local variable instead of freed pointer
Commit e4c28e3d5c ("mptcp: pm: move generic PM helpers to pm.c")
removed an unnecessary if-check, which resulted in returning a freed
pointer.

This still works due to the implicit boolean conversion when returning
the freed pointer from mptcp_remove_anno_list_by_saddr(), but it can be
confusing and potentially error-prone. To improve clarity, add a local
variable to explicitly return a boolean value instead.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250413-net-next-mptcp-sched-mib-sft-misc-v2-3-0f83a4350150@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:21:46 -07:00
Geliang Tang
760ff07669 mptcp: sched: split validation part
A new interface .validate has been added in struct bpf_struct_ops
recently. This patch prepares a future struct_ops support by
implementing it as a new helper mptcp_validate_scheduler() for struct
mptcp_sched_ops.

In this helper, check whether the required ops "get_subflow" of struct
mptcp_sched_ops has been implemented.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250413-net-next-mptcp-sched-mib-sft-misc-v2-2-0f83a4350150@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:21:46 -07:00
Matthieu Baerts (NGI0)
6e83166dd8 mptcp: sched: remove mptcp_sched_data
This is a follow-up of commit b68b106b0f ("mptcp: sched: reduce size
for unused data"), now removing the mptcp_sched_data structure.

Now is a good time to do that, because the previously mentioned WIP work
has been updated, no longer depending on this structure.

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250413-net-next-mptcp-sched-mib-sft-misc-v2-1-0f83a4350150@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 08:21:46 -07:00
Jedrzej Jagielski
8982fc03fd devlink: add value check to devlink_info_version_put()
Prevent from proceeding if there's nothing to print.

Suggested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:31 -07:00
Hari Kalavakunta
e8a1bd8344 net: ncsi: Fix GCPS 64-bit member variables
Correct Get Controller Packet Statistics (GCPS) 64-bit wide member
variables, as per DSP0222 v1.0.0 and forward specs. The Driver currently
collects these stats, but they are yet to be exposed to the user.
Therefore, no user impact.

Statistics fixes:
Total Bytes Received (byte range 28..35)
Total Bytes Transmitted (byte range 36..43)
Total Unicast Packets Received (byte range 44..51)
Total Multicast Packets Received (byte range 52..59)
Total Broadcast Packets Received (byte range 60..67)
Total Unicast Packets Transmitted (byte range 68..75)
Total Multicast Packets Transmitted (byte range 76..83)
Total Broadcast Packets Transmitted (byte range 84..91)
Valid Bytes Received (byte range 204..11)

Signed-off-by: Hari Kalavakunta <kalavakunta.hari.prasad@gmail.com>
Reviewed-by: Paul Fertser <fercerpav@gmail.com>
Link: https://patch.msgid.link/20250410012309.1343-1-kalavakunta.hari.prasad@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-15 15:19:55 +02:00
Kevin Paul Reddy Janagari
8c9b406ff4 tipc: Removing deprecated strncpy()
This patch suggests the replacement of strncpy with strscpy
as per Documentation/process/deprecated.
The strncpy() fails to guarantee NULL termination,
The function adds zero pads which isn't really convenient for short strings
as it may cause performance issues.

strscpy() is a preferred replacement because
it overcomes the limitations of strncpy mentioned above.

Compile Tested

Signed-off-by: Kevin Paul Reddy Janagari <kevinpaul468@gmail.com>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Tested-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20250411085010.6249-1-kevinpaul468@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-15 13:37:14 +02:00
Davide Caratti
8b18794914 can: fix missing decrement of j1939_proto.inuse_idx
Like other protocols on top of AF_CAN family, also j1939_proto.inuse_idx
needs to be decremented on socket dismantle.

Fixes: 6bffe88452 ("can: add protocol counter for AF_CAN sockets")
Reported-by: Oliver Hartkopp <socketcan@hartkopp.net>
Closes: https://lore.kernel.org/linux-can/7e35b13f-bbc4-491e-9081-fb939e1b8df0@hartkopp.net/
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/09ce71f281b9e27d1e3d1104430bf3fceb8c7321.1742292636.git.dcaratti@redhat.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-04-15 12:18:07 +02:00
David Howells
aa2199088a rxrpc: rxperf: Add test RxGK server keys
Add RxGK server keys of bytes containing { 0, 1, 2, 3, 4, ... } to the
server keyring for the rxperf test server.  This allows the rxperf test
client to connect to it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-15-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:42 -07:00
David Howells
fba6995798 rxrpc: Add more CHALLENGE/RESPONSE packet tracing
Add more tracing for CHALLENGE and RESPONSE packets.  Currently, rxrpc only
has client-relevant tracepoints (rx_challenge and tx_response), but add the
server-side ones too.

Further, record the service ID in the rx_challenge tracepoint as well.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-14-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:42 -07:00
David Howells
d03539d5c2 rxrpc: Display security params in the afs_cb_call tracepoint
Make the afs_cb_call tracepoint display some security parameters to make
debugging easier.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-12-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:42 -07:00
David Howells
b794dc17cd rxrpc: Allow the app to store private data on peer structs
Provide a way for the application (e.g. the afs filesystem) to store
private data on the rxrpc_peer structs for later retrieval via the call
object.

This will allow afs to store a pointer to the afs_server object on the
rxrpc_peer struct, thereby obviating the need for afs to keep lookup tables
by which it can associate an incoming call with server that transmitted it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-11-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:42 -07:00
David Howells
7a7513a308 rxrpc: rxgk: Implement connection rekeying
Implement rekeying of connections with the RxGK security class.  This
involves regenerating the keys with a different key number as part of the
input data after a certain amount of time or a certain amount of bytes
encrypted.  Rekeying may be triggered by either end.

The LSW of the key number is inserted into the security-specific field in
the RX header, and we try and expand it to 32-bits to make it last longer.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-10-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:42 -07:00
David Howells
9d1d2b5934 rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)
Implement the basic parts of the yfs-rxgk security class (security index 6)
to support GSSAPI-negotiated security.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-9-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:42 -07:00
David Howells
c86f9b963d rxrpc: rxgk: Provide infrastructure and key derivation
Provide some infrastructure for implementing the RxGK transport security
class:

 (1) A definition of an encoding type, including:

	- Relevant crypto-layer names
	- Lengths of the crypto keys and checksums involved
	- Crypto functions specific to the encoding type
	- Crypto scheme used for that type

 (2) A definition of a crypto scheme, including:

	- Underlying crypto handlers
	- The pseudo-random function, PRF, used in base key derivation
	- Functions for deriving usage keys Kc, Ke and Ki
	- Functions for en/decrypting parts of an sk_buff

 (3) A key context, with the usage keys required for a derivative of a
     transport key for a specific key number.  This includes keys for
     securing packets for transmission, extracting received packets and
     dealing with response packets.

 (3) A function to look up an encoding type by number.

 (4) A function to set up a key context and derive the keys.

 (5) A function to set up the keys required to extract the ticket obtained
     from the GSS negotiation in the server.

 (6) Miscellaneous functions for context handling.

The keys and key derivation functions are described in:

	tools.ietf.org/html/draft-wilkinson-afs3-rxgk-11

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-8-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:41 -07:00
David Howells
0ca100ff4d rxrpc: Add YFS RxGK (GSSAPI) security class
Add support for the YFS-variant RxGK security class to support
GSSAPI-derived authentication.  This also allows the use of better crypto
over the rxkad security class.

The key payload is XDR encoded of the form:

    typedef int64_t opr_time;

    const AFSTOKEN_RK_TIX_MAX = 12000; 	/* Matches entry in rxkad.h */

    struct token_rxkad {
	afs_int32 viceid;
	afs_int32 kvno;
	afs_int64 key;
	afs_int32 begintime;
	afs_int32 endtime;
	afs_int32 primary_flag;
	opaque ticket<AFSTOKEN_RK_TIX_MAX>;
    };

    struct token_rxgk {
	opr_time begintime;
	opr_time endtime;
	afs_int64 level;
	afs_int64 lifetime;
	afs_int64 bytelife;
	afs_int64 enctype;
	opaque key<>;
	opaque ticket<>;
    };

    const AFSTOKEN_UNION_NOAUTH = 0;
    const AFSTOKEN_UNION_KAD = 2;
    const AFSTOKEN_UNION_YFSGK = 6;

    union ktc_tokenUnion switch (afs_int32 type) {
	case AFSTOKEN_UNION_KAD:
	    token_rxkad kad;
	case AFSTOKEN_UNION_YFSGK:
	    token_rxgk  gk;
    };

    const AFSTOKEN_LENGTH_MAX = 16384;
    typedef opaque token_opaque<AFSTOKEN_LENGTH_MAX>;

    const AFSTOKEN_MAX = 8;
    const AFSTOKEN_CELL_MAX = 64;

    struct ktc_setTokenData {
	afs_int32 flags;
	string cell<AFSTOKEN_CELL_MAX>;
	token_opaque tokens<AFSTOKEN_MAX>;
    };

The parser for the basic token struct is already present, as is the rxkad
token type.  This adds a parser for the rxgk token type.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-7-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:41 -07:00
David Howells
5800b1cf3f rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE
Allow the app to request that CHALLENGEs be passed to it through an
out-of-band queue that allows recvmsg() to pick it up so that the app can
add data to it with sendmsg().

This will allow the application (AFS or userspace) to interact with the
process if it wants to and put values into user-defined fields.  This will
be used by AFS when talking to a fileserver to supply that fileserver with
a crypto key by which callback RPCs can be encrypted (ie. notifications
from the fileserver to the client).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-5-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:41 -07:00
David Howells
019c8433eb rxrpc: Remove some socket lock acquire/release annotations
Remove some socket lock acquire/release annotations as lock_sock() and
release_sock() don't have them and so the checker gets confused.  Removing
all of them, however, causes warnings about "context imbalance" and "wrong
count at exit" to occur instead.

Probably lock_sock() and release_sock() should have annotations on
indicating their taking of sk_lock - there is a dep_map in socket_lock_t,
but I don't know if that matters to the static checker.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-4-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:41 -07:00
David Howells
23738cc804 rxrpc: Pull out certain app callback funcs into an ops table
A number of functions separately furnish an AF_RXRPC socket with callback
function pointers into a kernel app (such as the AFS filesystem) that is
using it.  Replace most of these with an ops table for the entire socket.
This makes it easier to add more callback functions.

Note that the call incoming data processing callback is retaind as that
gets set to different things, depending on the type of op.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-3-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:41 -07:00
David Howells
28a79fc9b0 rxrpc: kdoc: Update function descriptions and add link from rxrpc.rst
Update the kerneldoc function descriptions to add "Return:" sections for
AF_RXRPC exported functions that have return values to stop the kdoc
builder from throwing warnings.

Also add links from the rxrpc.rst API doc to add a function API reference
at the end.  (Note that the API doc really needs updating, but that's
beyond this patchset).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20250411095303.2316168-2-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:36:40 -07:00
Kuniyuki Iwashima
c57a9c5035 net: Remove ->exit_batch_rtnl().
There are no ->exit_batch_rtnl() users remaining.

Let's remove the hook.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-15-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:08:45 -07:00
Kuniyuki Iwashima
b7924f50be bridge: Convert br_net_exit_batch_rtnl() to ->exit_rtnl().
br_net_exit_batch_rtnl() iterates the dying netns list and
performs the same operation for each.

Let's use ->exit_rtnl().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-10-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:08:43 -07:00
Kuniyuki Iwashima
9571ab5a98 xfrm: Convert xfrmi_exit_batch_rtnl() to ->exit_rtnl().
xfrmi_exit_batch_rtnl() iterates the dying netns list and
performs the same operations for each.

Let's use ->exit_rtnl().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:08:43 -07:00
Kuniyuki Iwashima
f76758f18f ipv6: Convert tunnel devices' ->exit_batch_rtnl() to ->exit_rtnl().
The following functions iterates the dying netns list and performs
the same operations for each netns.

  * ip6gre_exit_batch_rtnl()
  * ip6_tnl_exit_batch_rtnl()
  * vti6_exit_batch_rtnl()
  * sit_exit_batch_rtnl()

Let's use ->exit_rtnl().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:08:42 -07:00
Kuniyuki Iwashima
a967e01e2a ipv4: ip_tunnel: Convert ip_tunnel_delete_nets() callers to ->exit_rtnl().
ip_tunnel_delete_nets() iterates the dying netns list and performs the
same operations for each.

Let's export ip_tunnel_destroy() as ip_tunnel_delete_net() and call it
from ->exit_rtnl().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:08:42 -07:00
Kuniyuki Iwashima
cf701038d1 nexthop: Convert nexthop_net_exit_batch_rtnl() to ->exit_rtnl().
nexthop_net_exit_batch_rtnl() iterates the dying netns list and
performs the same operation for each.

Let's use ->exit_rtnl().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:08:41 -07:00
Kuniyuki Iwashima
7a60d91c69 net: Add ->exit_rtnl() hook to struct pernet_operations.
struct pernet_operations provides two batching hooks; ->exit_batch()
and ->exit_batch_rtnl().

The batching variant is beneficial if ->exit() meets any of the
following conditions:

  1) ->exit() repeatedly acquires a global lock for each netns

  2) ->exit() has a time-consuming operation that can be factored
     out (e.g. synchronize_rcu(), smp_mb(), etc)

  3) ->exit() does not need to repeat the same iterations for each
     netns (e.g. inet_twsk_purge())

Currently, none of the ->exit_batch_rtnl() functions satisfy any of
the above conditions because RTNL is factored out and held by the
caller and all of these functions iterate over the dying netns list.

Also, we want to hold per-netns RTNL there but avoid spreading
__rtnl_net_lock() across multiple locations.

Let's add ->exit_rtnl() hook and run it under __rtnl_net_lock().

The following patches will convert all ->exit_batch_rtnl() users
to ->exit_rtnl().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:08:41 -07:00
Kuniyuki Iwashima
fed176bf31 net: Add ops_undo_single for module load/unload.
If ops_init() fails while loading a module or we unload the
module, free_exit_list() rolls back the changes.

The rollback sequence is the same as ops_undo_list().

The ops is already removed from pernet_list before calling
free_exit_list().  If we link the ops to a temporary list,
we can reuse ops_undo_list().

Let's add a wrapper of ops_undo_list() and use it instead
of free_exit_list().

Now, we have the central place to roll back ops_init().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:08:40 -07:00
Kuniyuki Iwashima
e333b1c3cf net: Factorise setup_net() and cleanup_net().
When we roll back the changes made by struct pernet_operations.init(),
we execute mostly identical sequences in three places.

  * setup_net()
  * cleanup_net()
  * free_exit_list()

The only difference between the first two is which list and RCU helpers
to use.

In setup_net(), an ops could fail on the way, so we need to perform a
reverse walk from its previous ops in pernet_list.  OTOH, in cleanup_net(),
we iterate the full list from tail to head.

The former passes the failed ops to list_for_each_entry_continue_reverse().
It's tricky, but we can reuse it for the latter if we pass list_entry() of
the head node.

Also, synchronize_rcu() and synchronize_rcu_expedited() can be easily
switched by an argument.

Let's factorise the rollback part in setup_net() and cleanup_net().

In the next patch, ops_undo_list() will be reused for free_exit_list(),
and then two arguments (ops_list and hold_rtnl) will differ.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 17:08:40 -07:00
Toke Høiland-Jørgensen
ee62ce7a1d page_pool: Track DMA-mapped pages and unmap them when destroying the pool
When enabling DMA mapping in page_pool, pages are kept DMA mapped until
they are released from the pool, to avoid the overhead of re-mapping the
pages every time they are used. This causes resource leaks and/or
crashes when there are pages still outstanding while the device is torn
down, because page_pool will attempt an unmap through a non-existent DMA
device on the subsequent page return.

To fix this, implement a simple tracking of outstanding DMA-mapped pages
in page pool using an xarray. This was first suggested by Mina[0], and
turns out to be fairly straight forward: We simply store pointers to
pages directly in the xarray with xa_alloc() when they are first DMA
mapped, and remove them from the array on unmap. Then, when a page pool
is torn down, it can simply walk the xarray and unmap all pages still
present there before returning, which also allows us to get rid of the
get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
synchronisation is needed, as a page will only ever be unmapped once.

To avoid having to walk the entire xarray on unmap to find the page
reference, we stash the ID assigned by xa_alloc() into the page
structure itself, using the upper bits of the pp_magic field. This
requires a couple of defines to avoid conflicting with the
POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
so does not affect run-time performance. The bitmap calculations in this
patch gives the following number of bits for different architectures:

- 23 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures

Stashing a value into the unused bits of pp_magic does have the effect
that it can make the value stored there lie outside the unmappable
range (as governed by the mmap_min_addr sysctl), for architectures that
don't define ILLEGAL_POINTER_VALUE. This means that if one of the
pointers that is aliased to the pp_magic field (such as page->lru.next)
is dereferenced while the page is owned by page_pool, that could lead to
a dereference into userspace, which is a security concern. The risk of
this is mitigated by the fact that (a) we always clear pp_magic before
releasing a page from page_pool, and (b) this would need a
use-after-free bug for struct page, which can have many other risks
since page->lru.next is used as a generic list pointer in multiple
places in the kernel. As such, with this patch we take the position that
this risk is negligible in practice. For more discussion, see[1].

Since all the tracking added in this patch is performed on DMA
map/unmap, no additional code is needed in the fast path, meaning the
performance overhead of this tracking is negligible there. A
micro-benchmark shows that the total overhead of the tracking itself is
about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]).
Since this cost is only paid on DMA map and unmap, it seems like an
acceptable cost to fix the late unmap issue. Further optimisation can
narrow the cases where this cost is paid (for instance by eliding the
tracking when DMA map/unmap is a no-op).

The extra memory needed to track the pages is neatly encapsulated inside
xarray, which uses the 'struct xa_node' structure to track items. This
structure is 576 bytes long, with slots for 64 items, meaning that a
full node occurs only 9 bytes of overhead per slot it tracks (in
practice, it probably won't be this efficient, but in any case it should
be an acceptable overhead).

[0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
[1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com
[2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com

Reported-by: Yonglong Liu <liuyonglong@huawei.com>
Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com
Fixes: ff7d6b27f8 ("page_pool: refurbish version of page_pool code")
Suggested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Qiuling Ren <qren@redhat.com>
Tested-by: Yuying Ma <yuma@redhat.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-2-6a9ef2e0cba8@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 16:30:29 -07:00
Toke Høiland-Jørgensen
cd3c93167d page_pool: Move pp_magic check into helper functions
Since we are about to stash some more information into the pp_magic
field, let's move the magic signature checks into a pair of helper
functions so it can be changed in one place.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-1-6a9ef2e0cba8@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 16:30:29 -07:00
Ilya Maximets
65d91192aa net: openvswitch: fix nested key length validation in the set() action
It's not safe to access nla_len(ovs_key) if the data is smaller than
the netlink header.  Check that the attribute is OK first.

Fixes: ccb1352e76 ("net: Add Open vSwitch kernel components.")
Reported-by: syzbot+b07a9da40df1576b8048@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b07a9da40df1576b8048
Tested-by: syzbot+b07a9da40df1576b8048@syzkaller.appspotmail.com
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250412104052.2073688-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 16:15:38 -07:00
Joseph Huang
c428d43d4f net: bridge: mcast: Notify on mdb offload failure
Notify user space on mdb offload failure if
mdb_offload_fail_notification is enabled.

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250411150323.1117797-4-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 15:56:43 -07:00
Joseph Huang
9fbe1e3e61 net: bridge: Add offload_fail_notification bopt
Add BR_BOOLOPT_MDB_OFFLOAD_FAIL_NOTIFICATION bool option.

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250411150323.1117797-3-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 15:56:42 -07:00
Joseph Huang
e846fb5e7c net: bridge: mcast: Add offload failed mdb flag
Add MDB_FLAGS_OFFLOAD_FAILED and MDB_PG_FLAGS_OFFLOAD_FAILED to indicate
that an attempt to offload the MDB entry to switchdev has failed.

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250411150323.1117797-2-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 15:56:42 -07:00
Peter Seiderer
08fcb1f242 net: pktgen: fix code style (WARNING: quoted string split across lines)
Fix checkpatch code style warnings:

  WARNING: quoted string split across lines
  #480: FILE: net/core/pktgen.c:480:
  +       "Packet Generator for packet performance testing. "
  +       "Version: " VERSION "\n";

  WARNING: quoted string split across lines
  #632: FILE: net/core/pktgen.c:632:
  +                  "     udp_src_min: %d  udp_src_max: %d"
  +                  "  udp_dst_min: %d  udp_dst_max: %d\n",

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 14:51:32 -07:00
Peter Seiderer
dceae3e82f net: pktgen: fix code style (WARNING: macros should not use a trailing semicolon)
Fix checkpatch code style warnings:

  WARNING: macros should not use a trailing semicolon
  #180: FILE: net/core/pktgen.c:180:
  +#define func_enter() pr_debug("entering %s\n", __func__);

  WARNING: macros should not use a trailing semicolon
  #234: FILE: net/core/pktgen.c:234:
  +#define   if_lock(t)           mutex_lock(&(t->if_lock));

  CHECK: Unnecessary parentheses around t->if_lock
  #235: FILE: net/core/pktgen.c:235:
  +#define   if_unlock(t)           mutex_unlock(&(t->if_lock));

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 14:51:32 -07:00
Peter Seiderer
ca8ee66521 net: pktgen: fix code style (WARNING: Missing a blank line after declarations)
Fix checkpatch code style warnings:

  WARNING: Missing a blank line after declarations
  #761: FILE: net/core/pktgen.c:761:
  +               char c;
  +               if (get_user(c, &user_buffer[i]))

  WARNING: Missing a blank line after declarations
  #780: FILE: net/core/pktgen.c:780:
  +               char c;
  +               if (get_user(c, &user_buffer[i]))

  WARNING: Missing a blank line after declarations
  #806: FILE: net/core/pktgen.c:806:
  +               char c;
  +               if (get_user(c, &user_buffer[i]))

  WARNING: Missing a blank line after declarations
  #823: FILE: net/core/pktgen.c:823:
  +               char c;
  +               if (get_user(c, &user_buffer[i]))

  WARNING: Missing a blank line after declarations
  #1968: FILE: net/core/pktgen.c:1968:
  +               char f[32];
  +               memset(f, 0, 32);

  WARNING: Missing a blank line after declarations
  #2410: FILE: net/core/pktgen.c:2410:
  +       struct pktgen_net *pn = net_generic(dev_net(pkt_dev->odev), pg_net_id);
  +       if (!x) {

  WARNING: Missing a blank line after declarations
  #2442: FILE: net/core/pktgen.c:2442:
  +               __u16 t;
  +               if (pkt_dev->flags & F_QUEUE_MAP_RND) {

  WARNING: Missing a blank line after declarations
  #2523: FILE: net/core/pktgen.c:2523:
  +               unsigned int i;
  +               for (i = 0; i < pkt_dev->nr_labels; i++)

  WARNING: Missing a blank line after declarations
  #2567: FILE: net/core/pktgen.c:2567:
  +                       __u32 t;
  +                       if (pkt_dev->flags & F_IPSRC_RND)

  WARNING: Missing a blank line after declarations
  #2587: FILE: net/core/pktgen.c:2587:
  +                               __be32 s;
  +                               if (pkt_dev->flags & F_IPDST_RND) {

  WARNING: Missing a blank line after declarations
  #2634: FILE: net/core/pktgen.c:2634:
  +               __u32 t;
  +               if (pkt_dev->flags & F_TXSIZE_RND) {

  WARNING: Missing a blank line after declarations
  #2736: FILE: net/core/pktgen.c:2736:
  +               int i;
  +               for (i = 0; i < pkt_dev->cflows; i++) {

  WARNING: Missing a blank line after declarations
  #2738: FILE: net/core/pktgen.c:2738:
  +                       struct xfrm_state *x = pkt_dev->flows[i].x;
  +                       if (x) {

  WARNING: Missing a blank line after declarations
  #2752: FILE: net/core/pktgen.c:2752:
  +               int nhead = 0;
  +               if (x) {

  WARNING: Missing a blank line after declarations
  #2795: FILE: net/core/pktgen.c:2795:
  +       unsigned int i;
  +       for (i = 0; i < pkt_dev->nr_labels; i++)

  WARNING: Missing a blank line after declarations
  #3480: FILE: net/core/pktgen.c:3480:
  +       ktime_t idle_start = ktime_get();
  +       schedule();

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 14:51:32 -07:00
Peter Seiderer
870b856cb4 net: pktgen: fix code style (WARNING: Block comments)
Fix checkpatch code style warnings:

  WARNING: Block comments use a trailing */ on a separate line
  +                                * removal by worker thread */

  WARNING: Block comments use * on subsequent lines
  +       __u8 tos;            /* six MSB of (former) IPv4 TOS
  +                               are for dscp codepoint */

  WARNING: Block comments use a trailing */ on a separate line
  +                               are for dscp codepoint */

  WARNING: Block comments use * on subsequent lines
  +       __u8 traffic_class;  /* ditto for the (former) Traffic Class in IPv6
  +                               (see RFC 3260, sec. 4) */

  WARNING: Block comments use a trailing */ on a separate line
  +                               (see RFC 3260, sec. 4) */

  WARNING: Block comments use * on subsequent lines
  +       /* = {
  +          0x00, 0x80, 0xC8, 0x79, 0xB3, 0xCB,

  WARNING: Block comments use * on subsequent lines
  +       /* Field for thread to receive "posted" events terminate,
  +          stop ifs etc. */

  WARNING: Block comments use a trailing */ on a separate line
  +          stop ifs etc. */

  WARNING: Block comments should align the * on each line
  + * we go look for it ...
  +*/

  WARNING: Block comments use a trailing */ on a separate line
  +        * we resolve the dst issue */

  WARNING: Block comments use a trailing */ on a separate line
  +        * with proc_create_data() */

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 14:51:31 -07:00
Peter Seiderer
1d8f07bf4a net: pktgen: fix code style (WARNING: suspect code indent for conditional statements)
Fix checkpatch code style warnings:

  WARNING: suspect code indent for conditional statements (8, 17)
  #2901: FILE: net/core/pktgen.c:2901:
  +       } else {
  +                skb = __netdev_alloc_skb(dev, size, GFP_NOWAIT);

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14 14:51:31 -07:00