kernel test robot reported warnings when building the bonding module with
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/net/bonding/:
from ../drivers/net/bonding/bond_main.c:35:
In function ‘fortify_memcpy_chk’,
inlined from ‘iph_to_flow_copy_v4addrs’ at ../include/net/ip.h:566:2,
inlined from ‘bond_flow_ip’ at ../drivers/net/bonding/bond_main.c:3984:3:
../include/linux/fortify-string.h:413:25: warning: call to ‘__read_overflow2_field’ declared with attribute warning: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Wattribute-warning]
413 | __read_overflow2_field(q_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘fortify_memcpy_chk’,
inlined from ‘iph_to_flow_copy_v6addrs’ at ../include/net/ipv6.h:900:2,
inlined from ‘bond_flow_ip’ at ../drivers/net/bonding/bond_main.c:3994:3:
../include/linux/fortify-string.h:413:25: warning: call to ‘__read_overflow2_field’ declared with attribute warning: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Wattribute-warning]
413 | __read_overflow2_field(q_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is because we try to copy the whole ip/ip6 address to the flow_key,
while we only point to the ip/ip6 saddr. Note that since these are UAPI
headers, __struct_group() is used to avoid the compiler warnings.
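A minimal sketch of the struct_group() approach described above (field and
variable names are illustrative, not the exact upstream diff):

	/* UAPI header: group the two addresses without changing the layout. */
	struct iphdr {
		/* ... preceding fields unchanged ... */
		__struct_group(/* no tag */, addrs, /* no attrs */,
			__be32	saddr;
			__be32	daddr;
		);
	};

	/* The flow dissector helper can then copy both addresses at once,
	 * and FORTIFY_SOURCE sees the copy as bounded by one member:
	 */
	memcpy(&flow_addrs->v4addrs, &iph->addrs, sizeof(flow_addrs->v4addrs));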
Reported-by: kernel test robot <lkp@intel.com>
Fixes: c3f8324188 ("net: Add full IPv6 addresses to flow_keys")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20221115142400.1204786-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
sk->sk_user_data has multiple users, which are not compatible with each
other. Writers must synchronize by grabbing the sk->sk_callback_lock.
l2tp currently fails to grab the lock when modifying the underlying tunnel
socket fields. Fix it by adding appropriate locking.
We err on the side of safety and grab the sk_callback_lock also inside the
sk_destruct callback overridden by l2tp, even though there should be no
refs allowing access to the sock at the time when sk_destruct gets called.
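A minimal sketch of the write-side locking pattern described above
(illustrative only, not the exact l2tp diff):

	/* Writers to sk_user_data serialize on sk_callback_lock. */
	write_lock_bh(&sk->sk_callback_lock);
	rcu_assign_sk_user_data(sk, tunnel);
	write_unlock_bh(&sk->sk_callback_lock);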
v4:
- serialize write to sk_user_data in l2tp sk_destruct
v3:
- switch from sock lock to sk_callback_lock
- document write-protection for sk_user_data
v2:
- update Fixes to point to origin of the bug
- use real names in Reported/Tested-by tags
Cc: Tom Parkin <tparkin@katalix.com>
Fixes: 3557baabf2 ("[L2TP]: PPP over L2TP driver core")
Reported-by: Haowei Yan <g1042620637@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Long standing KCSAN issues are caused by data-race around
some dev->stats changes.
Most performance critical paths already use per-cpu
variables, or per-queue ones.
It is reasonable (and more correct) to use atomic operations
for the slow paths.
This patch adds a union for each field of net_device_stats,
so that we can convert paths that are not yet protected
by a spinlock or a mutex.
netdev_stats_to_stats64() no longer needs an #if BITS_PER_LONG==64 special case.
Note that the memcpy() we were using on 64bit arches
had no provision to avoid load-tearing,
while atomic_long_read() is providing the needed protection
at no cost.
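A minimal sketch of the union layout and its use on a slow path (macro and
helper names are illustrative, not necessarily the exact upstream ones):

	#define NET_DEV_STAT(FIELD)			\
		union {					\
			unsigned long FIELD;		\
			atomic_long_t __##FIELD;	\
		}

	struct net_device_stats {
		NET_DEV_STAT(rx_packets);
		NET_DEV_STAT(rx_dropped);
		/* ... one union per existing field ... */
	};

	/* A slow path not covered by a lock can now update a counter safely: */
	atomic_long_inc(&dev->stats.__rx_dropped);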
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The maximum hash table size is 64K due to the nature of the protocol. [0]
It's smaller than TCP, and fewer sockets can cause a performance drop.
On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in
different netns, creating 32Mi sockets without data transfer in the root
netns causes regression for the iperf3's connection.
uhash_entries  sockets  length  Gbps
          64K        1       1  5.69
          64K      1Mi      16  5.27
          64K      2Mi      32  4.90
          64K      4Mi      64  4.09
          64K      8Mi     128  2.96
          64K     16Mi     256  2.06
          64K     32Mi     512  1.12
The per-netns hash table breaks the lengthy lists into shorter ones. It is
useful on a multi-tenant system with thousands of netns. With smaller hash
tables, we can look up sockets faster, isolate noisy neighbours, and reduce
lock contention.
The max size of the per-netns table is 64K as well. This is because the
possible hash range by udp_hashfn() always fits in 64K within the same
netns, so we cannot make full use of a table with more than 64K buckets.
/* 0 < num < 64K -> X < hash < X + 64K */
(num + net_hash_mix(net)) & mask;
Also, the min size is 128. We use a bitmap to search for an available
port in udp_lib_get_port(). To keep the bitmap on the stack and not
fire the CONFIG_FRAME_WARN error at build time, we round up the table
size to 128.
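A minimal sketch of the sizing rules above (not the exact upstream code):

	/* Per-netns table: power of two, clamped to [128, 64K] buckets. */
	entries = roundup_pow_of_two(entries);
	entries = clamp_t(unsigned int, entries, 128U, 65536U);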
The sysctl usage is the same as with TCP:
$ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
# sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 65536 # can be changed by uhash_entries
# sysctl net.ipv4.udp_child_hash_entries
net.ipv4.udp_child_hash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = -65536 # share the global table
# sysctl -w net.ipv4.udp_child_hash_entries=100
net.ipv4.udp_child_hash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 128 # own a per-netns table with 2^n buckets
We could optimise the hash table lookup/iteration further by removing
the netns comparison for the per-netns one in the future. Also, we
could optimise the sparse udp_hslot layout by putting it in udp_table.
[0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We will soon introduce an optional per-netns hash table
for UDP.
This means we cannot use the global sk->sk_prot->h.udp_table
to fetch a UDP hash table.
Instead, set sk->sk_prot->h.udp_table to NULL for UDP and get
a proper table from net->ipv4.udp_table.
Note that we still need sk->sk_prot->h.udp_table for UDP LITE.
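A minimal sketch of the resulting table selection (helper name is
hypothetical):

	static inline struct udp_table *udp_get_table(struct sock *sk,
						      struct net *net)
	{
		/* UDP-Lite still sets h.udp_table; plain UDP leaves it NULL
		 * and falls back to the per-netns table.
		 */
		return sk->sk_prot->h.udp_table ? : net->ipv4.udp_table;
	}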
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When built with Control Flow Integrity, function prototypes between
caller and function declaration must match. These mismatches are visible
at compile time with the new -Wcast-function-type-strict in Clang[1].
Fix a total of 73 warnings like these:
drivers/net/wireless/intersil/orinoco/wext.c:1379:27: warning: cast from 'int (*)(struct net_device *, struct iw_request_info *, struct iw_param *, char *)' to 'iw_handler' (aka 'int (*)(struct net_device *, struct iw_request_info *, union iwreq_data *, char *)') converts to incompatible function type [-Wcast-function-type-strict]
IW_HANDLER(SIOCGIWPOWER, (iw_handler)orinoco_ioctl_getpower),
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../net/wireless/wext-compat.c:1607:33: warning: cast from 'int (*)(struct net_device *, struct iw_request_info *, struct iw_point *, char *)' to 'iw_handler' (aka 'int (*)(struct net_device *, struct iw_request_info *, union iwreq_data *, char *)') converts to incompatible function type [-Wcast-function-type-strict]
[IW_IOCTL_IDX(SIOCSIWGENIE)] = (iw_handler) cfg80211_wext_siwgenie,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/wireless/intersil/orinoco/wext.c:1390:27: error: incompatible function pointer types initializing 'const iw_handler' (aka 'int (*const)(struct net_device *, struct iw_request_info *, union iwreq_data *, char *)') with an expression of type 'int (struct net_device *, struct iw_request_info *, struct iw_param *, char *)' [-Wincompatible-function-pointer-types]
IW_HANDLER(SIOCGIWRETRY, cfg80211_wext_giwretry),
^~~~~~~~~~~~~~~~~~~~~~
The cfg80211 Wireless Extension handler callbacks (iw_handler) use a
union for the data argument. Actually use the union and perform explicit
member selection in the function body instead of having a function
prototype mismatch. There are no resulting binary differences
before/after changes.
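A minimal before/after sketch of the conversion (handler name is
hypothetical):

	/* Before: prototype mismatch hidden behind a cast. */
	static int foo_siwgenie(struct net_device *dev,
				struct iw_request_info *info,
				struct iw_point *data, char *extra);

	[IW_IOCTL_IDX(SIOCSIWGENIE)] = (iw_handler)foo_siwgenie,

	/* After: take union iwreq_data and select the member explicitly. */
	static int foo_siwgenie(struct net_device *dev,
				struct iw_request_info *info,
				union iwreq_data *wrqu, char *extra)
	{
		struct iw_point *data = &wrqu->data;
		/* ... body unchanged, using "data" ... */
		return 0;
	}

	[IW_IOCTL_IDX(SIOCSIWGENIE)] = foo_siwgenie,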
These changes were made partly manually and partly with the help of
Coccinelle.
Link: https://github.com/KSPP/linux/issues/234
Link: https://reviews.llvm.org/D134831 [1]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/a68822bf8dd587988131bb6a295280cb4293f05d.1667934775.git.gustavoars@kernel.org
As of now, no DSA driver uses a custom link mode validation procedure
anymore. So remove this DSA operation and let phylink determine what is
supported based on config->mac_capabilities (if provided by the driver).
Leave a comment explaining why the remaining code was kept, and that there
is more work to do.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
1) Fix sparse warning in the new nft_inner expression, reported
by Jakub Kicinski.
2) Incorrect vlan header check in nft_inner, from Peng Wu.
3) Two patches to pass reset boolean to expression dump operation,
in preparation for allowing to reset stateful expressions in rules.
This adds a new NFT_MSG_GETRULE_RESET command. From Phil Sutter.
4) Inconsistent indentation in nft_fib, from Jiapeng Chong.
5) Speed up siphash calculation in conntrack, from Florian Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: conntrack: use siphash_4u64
netfilter: rpfilter/fib: clean up some inconsistent indenting
netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET
netfilter: nf_tables: Extend nft_expr_ops::dump callback parameters
netfilter: nft_inner: fix return value check in nft_inner_parse_l2l3()
netfilter: nft_payload: use __be16 to store gre version
====================
Link: https://lore.kernel.org/r/20221115095922.139954-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Analogous to NFT_MSG_GETOBJ_RESET, but for rules: Reset stateful
expressions like counters or quotas. The latter two are the only
consumers, adjust their 'dump' callbacks to respect the parameter
introduced earlier.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a 'reset' flag just like with nft_object_ops::dump. This will be
useful to reset "anonymous stateful objects", e.g. simple rule counters.
No functional change intended.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This adds a new flow_rule_match_arp function that allows drivers
to dissect ARP frames.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a RDMA VF driver for Microsoft Azure Network Adapter (MANA).
Co-developed-by: Ajay Sharma <sharmaajay@microsoft.com>
Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-13-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Long Li says:
====================
Introduce Microsoft Azure Network Adapter (MANA) RDMA driver [netdev prep]
The first 11 patches modify the MANA Ethernet driver to support the
RDMA driver.
* 'mana-shared-6.2' of https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
net: mana: Define data structures for protection domain and memory registration
net: mana: Define data structures for allocating doorbell page from GDMA
net: mana: Define and process GDMA response code GDMA_STATUS_MORE_ENTRIES
net: mana: Define max values for SGL entries
net: mana: Move header files to a common location
net: mana: Record port number in netdev
net: mana: Export Work Queue functions for use by RDMA driver
net: mana: Set the DMA device max segment size
net: mana: Handle vport sharing between devices
net: mana: Record the physical address for doorbell page region
net: mana: Add support for auxiliary device
====================
Link: https://lore.kernel.org/all/1667502990-2559-1-git-send-email-longli@linuxonhyperv.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The MANA hardware supports protection domains and memory registration for use
in an RDMA environment. Add those definitions and expose them for use by the
RDMA driver.
Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-12-git-send-email-longli@linuxonhyperv.com
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
The RDMA device needs to allocate doorbell pages for each user context.
Define the GDMA data structures for use by the RDMA driver.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-11-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
When doing memory registration, the PF may respond with
GDMA_STATUS_MORE_ENTRIES to indicate a follow-up request is needed. This is
not an error and should be processed as expected.
Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-10-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
The maximum number of SGL entries should be computed from the maximum
WQE size for the intended queue type and the corresponding OOB data
size. This guarantees the hardware queue can successfully queue requests
up to the queue depth exposed to the upper layer.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-9-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
In preparation to add MANA RDMA driver, move all the required header files
to a common location for use by both Ethernet and RDMA drivers.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-8-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Add packet traps for 802.1X operation. The "eapol" control trap is used
to trap EAPOL packets and is required for the correct operation of the
control plane. The "locked_port" drop trap can be enabled to gain
visibility into packets that were dropped by the device due to the
locked bridge port check.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the bridge is offloaded to hardware, FDB entries are learned and
aged-out by the hardware. Some device drivers synchronize the hardware
and software FDBs by generating switchdev events towards the bridge.
When a port is locked, the hardware must not learn autonomously, as
otherwise any host will blindly gain authorization. Instead, the
hardware should generate events regarding hosts that are trying to gain
authorization and their MAC addresses should be notified by the device
driver as locked FDB entries towards the bridge driver.
Allow device drivers to notify the bridge driver about such entries by
extending the 'switchdev_notifier_fdb_info' structure with the 'locked'
bit. The bit can only be set by device drivers and not by the bridge
driver.
Prevent a locked entry from being installed if MAB is not enabled on the
bridge port.
If an entry already exists in the bridge driver, reject the locked entry
if the current entry does not have the "locked" flag set or if it points
to a different port. The same semantics are implemented in the software
data path.
Signed-off-by: Hans J. Schultz <netdev@kapio-technology.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'rxrpc-next-20221108' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
rxrpc changes
David Howells says:
====================
rxrpc: Increasing SACK size and moving away from softirq, part 1
AF_RXRPC has some issues that need addressing:
(1) The SACK table has a maximum capacity of 255, but for modern networks
that isn't sufficient. This is hard to increase in the upstream code
because of the way the application thread is coupled to the softirq
and retransmission side through a ring buffer. Adjustments to the rx
protocol allow a capacity of up to 8192, and having a ring
sufficiently large to accommodate that would use an excessive amount
of memory as this is per-call.
(2) Processing ACKs in softirq mode causes the ACKs to get conflated, with
only the most recent being considered. Whilst this has the upside
that the retransmission algorithm only needs to deal with the most
recent ACK, it causes DATA transmission for a call to be very bursty
because DATA packets cannot be transmitted in softirq mode. Rather
transmission must be delegated to either the application thread or a
workqueue, so there tend to be sudden bursts of traffic for any
particular call due to scheduling delays.
(3) All crypto in a single call is done in series; however, each DATA
packet is individually encrypted so encryption and decryption of large
calls could be parallelised if spare CPU resources are available.
This is the first of a number of sets of patches that try and address them.
The overall aims of these changes include:
(1) To get rid of the TxRx ring and instead pass the packets round in
queues (eg. sk_buff_head). On the Tx side, each ACK packet comes with
a SACK table that can be parsed as-is, so there's no particular need
to maintain our own; we just have to refer to the ACK.
On the Rx side, we do need to maintain a SACK table with one bit per
entry - but only if packets go missing - and we don't want to have to
perform a complex transformation to get the information into an ACK
packet.
(2) To try and move almost all processing of received packets out of the
softirq handler and into a high-priority kernel I/O thread. Only the
transferral of packets would be left there. I would still use the
encap_rcv hook to receive packets as there's a noticeable performance
drop from letting the UDP socket put the packets into its own queue
and then getting them out of there.
(3) To make the I/O thread also do all the transmission. The app thread
would be responsible for packaging the data into packets and then
buffering them for the I/O thread to transmit. This would make it
easier for the app thread to run ahead of the I/O thread, and would
mean the I/O thread is less likely to have to wait around for a new
packet to come available for transmission.
(4) To logically partition the socket/UAPI/KAPI side of things from the
I/O side of things. The local endpoint, connection, peer and call
objects would belong to the I/O side. The socket side would not then
touch the private internals of calls and suchlike and would not change
their states. It would only look at the send queue, receive queue and
a way to pass a message to cause an abort.
(5) To remove as much locking, synchronisation, barriering and atomic ops
as possible from the I/O side. Exclusion would be achieved by
limiting modification of state to the I/O thread only. Locks would
still need to be used in communication with the UDP socket and the
AF_RXRPC socket API.
(6) To provide crypto offload kernel threads that, when there's slack in
the system, can see packets that need crypting and provide
parallelisation in dealing with them.
(7) To remove the use of system timers. Since each timer would then send
a poke to the I/O thread, which would then deal with it when it had
the opportunity, there seems no point in using system timers if,
instead, a list of timeouts can be sensibly consulted. An I/O thread
only then needs to schedule with a timeout when it is idle.
(8) To use zero-copy sendmsg to send packets. This would make use of the
I/O thread being the sole transmitter on the socket to manage the
dead-reckoning sequencing of the completion notifications. There is a
problem with zero-copy, though: the UDP socket doesn't handle running
out of option memory very gracefully.
With regard to this first patchset, the changes made include:
(1) Some fixes, including a fallback for proc_create_net_single_write(),
setting ack.bufferSize to 0 in ACK packets and a fix for rxrpc
congestion management, which shouldn't be saving the cwnd value
between calls.
(2) Improvements in rxrpc tracepoints, including splitting the timer
tracepoint into a set-timer and a timer-expired trace.
(3) Addition of a new proc file to display some stats.
(4) Some code cleanups, including removing some unused bits and
unnecessary header inclusions.
(5) A change to the recently added UDP encap_err_rcv hook so that it has
the same signature as {ip,ipv6}_icmp_error(), and then just have rxrpc
point its UDP socket's hook directly at those.
(6) Definition of a new struct, rxrpc_txbuf, that is used to hold
transmissible packets of DATA and ACK type in a single 2KiB block
rather than using an sk_buff. This allows the buffer to be on a
number of queues simultaneously more easily, and also guarantees that
the entire block is in a single unit for zerocopy purposes and that
the data payload is aligned for in-place crypto purposes.
(7) ACK txbufs are allocated at proposal and queued for later transmission
rather than being stored in a single place in the rxrpc_call struct,
which means only a single ACK can be pending transmission at a time.
The queue is then drained at various points. This allows the ACK
generation code to be simplified.
(8) The Rx ring buffer is removed. When a jumbo packet is received (which
comprises a number of ordinary DATA packets glued together), it used
to be pointed to by the ring multiple times, with an annotation in a
side ring indicating which subpacket was in that slot - but this is no
longer possible. Instead, the packet is cloned once for each
subpacket, barring the last, and the range of data is set in the skb
private area. This makes it easier for the subpackets in a jumbo
packet to be decrypted in parallel.
(9) The Tx ring buffer is removed. The side annotation ring that held the
SACK information is also removed. Instead, in the event of packet
loss, the SACK data attached to an ACK packet is parsed.
(10) Allocate an skcipher request when needed in the rxkad security class
rather than caching one in the rxrpc_call struct. This deals with a
race between externally-driven call disconnection getting rid of the
skcipher request and sendmsg/recvmsg trying to use it because they
haven't seen the completion yet. This is also needed to support
parallelisation as the skcipher request cannot be used by two or more
threads simultaneously.
(11) Call udp_sendmsg() and udpv6_sendmsg() directly rather than going
through kernel_sendmsg() so that we can provide our own iterator
(zerocopy explicitly doesn't work with a KVEC iterator). This also
lets us avoid the overhead of the security hook.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the udp encap_err_rcv signature to match ip_icmp_error() and
ipv6_icmp_error() so that those can be used from the called function and
export them.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
This patch adds helper support in act_ct for OVS actions=ct(alg=xxx)
offloading, which corresponds to commit cae3a26275 ("openvswitch:
Allow attaching helpers to ct action") in the OVS kernel part.
The difference is that when adding TC actions, the family and proto cannot
be derived from the filter/match, so besides the helper name in
tb[TCA_CT_HELPER_NAME], we also need to send the family in
tb[TCA_CT_HELPER_FAMILY] and the proto in tb[TCA_CT_HELPER_PROTO] to the
kernel.
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Move ovs_ct_add_helper from openvswitch to nf_conntrack_helper and
rename as nf_ct_add_helper, so that it can be used in TC act_ct in
the next patch.
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Move ovs_ct_helper from openvswitch to nf_conntrack_helper and rename
as nf_ct_helper so that it can be used in TC act_ct in the next patch.
Note that it also adds checks for the family and proto, since in TC
act_ct packets with the correct family and proto are not guaranteed.
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Let families hook in the new split ops.
They are more flexible and should not be much larger than
full ops. Each split op is 40B while full op is 48B.
Devlink for example has 54 dos and 19 dumps, 2 of the dumps
do not have a do -> 56 full commands = 2688B.
Split ops would have taken 2920B, so 9% more space while
allowing individual pre/post doit and per-type policies.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently have two forms of operations - small ops and "full" ops
(or just ops). The former does not have pointers for some of the less
commonly used features (namely dump start/done and policy).
The "full" ops, however, still don't contain all the necessary
information. In particular the policy is per command ID, while
do and dump often accept different attributes. It's also not
possible to define different pre_doit and post_doit callbacks
for different commands within the family.
At the same time a lot of commands do not support dumping and
therefore all the dump-related information is wasted space.
Create a new command representation which can hold info about
a do implementation or a dump implementation, but not both at
the same time.
Use this new representation on the command execution path
(genl_family_rcv_msg) as we either run a do or a dump and
don't have to create a "full" op there.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the private fields down to form a "private section".
Use the kdoc "private:" label comment thing to hide them
from the main kdoc comment.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expose devlink port handle related to netdev over rtnetlink. Introduce a
new nested IFLA attribute to carry the info. Call into devlink code to
fill-up the nest with existing devlink attributes that are used over
devlink netlink.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To avoid the need to take the RTNL mutex in the port_fill() function, benefit
from the recently introduced infrastructure that tracks netdevice notifier events.
Store the ifindex and ifname upon register and change name events.
Remove the rtnl_held bool propagated down to port_fill() function as it
is no longer needed.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since devlink_port_type_eth_set() should no longer be called by any
driver with a netdev pointer, as drivers should rather use
SET_NETDEV_DEVLINK_PORT, remove the netdev arg. Also add a warning to
type_clear().
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of storing type_dev as a void pointer, convert it to a union and
use it to store either a struct net_device or struct ib_device pointer.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add new apptrust extension attributes to the 8021Qaz APP managed object.
Two new attributes, DCB_ATTR_DCB_APP_TRUST_TABLE and
DCB_ATTR_DCB_APP_TRUST, have been added. Trusted selectors are passed in
the nested attribute DCB_ATTR_DCB_APP_TRUST, in order of precedence.
The new attributes are meant to allow drivers, whose hw supports the
notion of trust, to be able to set whether a particular app selector is
trusted - and in which order.
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Both bond_alb_xmit() and bond_tlb_xmit() produce a valid warning with
gcc-13:
drivers/net/bonding/bond_alb.c:1409:13: error: conflicting types for 'bond_tlb_xmit' due to enum/integer mismatch; have 'netdev_tx_t(struct sk_buff *, struct net_device *)' ...
include/net/bond_alb.h:160:5: note: previous declaration of 'bond_tlb_xmit' with type 'int(struct sk_buff *, struct net_device *)'
drivers/net/bonding/bond_alb.c:1523:13: error: conflicting types for 'bond_alb_xmit' due to enum/integer mismatch; have 'netdev_tx_t(struct sk_buff *, struct net_device *)' ...
include/net/bond_alb.h:159:5: note: previous declaration of 'bond_alb_xmit' with type 'int(struct sk_buff *, struct net_device *)'
I.e. the return type of the declaration is int, while the definitions
spell netdev_tx_t. Synchronize both of them to the latter.
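In other words, both declarations in include/net/bond_alb.h become
(parameter names illustrative):

	netdev_tx_t bond_alb_xmit(struct sk_buff *skb, struct net_device *dev);
	netdev_tx_t bond_tlb_xmit(struct sk_buff *skb, struct net_device *dev);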
Cc: Martin Liska <mliska@suse.cz>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20221031114409.10417-1-jirislaby@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub reported that the addition of the "network_byte_order"
member in struct nla_policy increases the structure size on 32bit platforms.
Instead of scraping the bit from elsewhere Johannes suggested
to add explicit NLA_BE types instead, so do this here.
NLA_POLICY_MAX_BE() macro is removed again, there is no need
for it: NLA_POLICY_MAX(NLA_BE.., ..) will do the right thing.
NLA_BE64 can be added later.
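A minimal usage sketch (attribute and policy names hypothetical):

	static const struct nla_policy foo_policy[FOO_ATTR_MAX + 1] = {
		[FOO_ATTR_PORT]	= NLA_POLICY_MAX(NLA_BE16, 1023),
	};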
Fixes: 08724ef699 ("netlink: introduce NLA_POLICY_MAX_BE")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Suggested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20221031123407.9158-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPv4 reassembly unit can decide to drop frags based on
/proc/sys/net/ipv4/ipfrag_max_dist sysctl.
Add a specific drop reason to track this specific
and weird case.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Used to track skbs freed after a timeout happened
in a reassembly unit.
Passing a @reason argument to inet_frag_rbtree_purge()
allows using the correct consumed status for frags
that have been successfully re-assembled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is used to track when a duplicate segment received by various
reassembly units is dropped.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This will allow to simply use in the future:
kfree_skb_reason(skb, reason);
Instead of repeating sequences like:
if (dropped)
kfree_skb_reason(skb, reason);
else
consume_skb(skb);
For instance, following patch in the series is adding
@reason to skb_release_data() and skb_release_all(),
so that we can propagate a meaningful @reason whenever
consume_skb()/kfree_skb() have to take care of a potential frag_list.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch uses the new helper unregister_netdevice_many_notify() for
rtnl_delete_link(), so that the kernel can reply with a unicast notification
when userspace sets the NLM_F_ECHO flag to request the link info.
At the same time, the parameters of rtnl_delete_link() need to be updated
since we need the nlmsghdr and portid info.
Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch passes the netlink message header and portid to
rtnl_configure_link(). All the functions in this call chain need to take
the new parameters so we can use them in the final call to rtnl_notify(),
and notify userspace about the new link info if the NLM_F_ECHO flag is set.
- rtnl_configure_link()
- __dev_notify_flags()
- rtmsg_ifinfo()
- rtmsg_ifinfo_event()
- rtmsg_ifinfo_build_skb()
- rtmsg_ifinfo_send()
- rtnl_notify()
Also move __dev_notify_flags() declaration to net/core/dev.h, as Jakub
suggested.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
New compilers don't like a flexible array of flexible structs:
include/net/geneve.h:62:34: warning: array of flexible structures
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean up the use of unsafe_memcpy() by adding a flexible array
at the end of netlink message header and splitting up the header
and data copies.
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sockmap replaces ->sk_prot with its own callbacks, so we should remove
SOCK_SUPPORT_ZC as the new proto doesn't support msghdr::ubuf_info.
Cc: <stable@vger.kernel.org> # 6.0
Reported-by: Jakub Kicinski <kuba@kernel.org>
Fixes: e993ffe3da ("net: flag sockets supporting msghdr originated zerocopy")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mark the validation fields as private, users shouldn't set
them directly and they are too complicated to explain in
a more succinct way (there's already a long explanation
in the comment above).
The strict_start_type field is set directly and has a dedicated
comment so move that above the "private" section.
Link: https://lore.kernel.org/r/20221027212107.2639255-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
First set of patches for v6.2. mac80211 refactoring continues for Wi-Fi 7.
All mac80211 drivers are now converted to use internal TX queues; this
might cause some regressions so we wanted to do this early in the
cycle.
Note: wireless tree was merged[1] to wireless-next to avoid some
conflicts with mac80211 patches between the trees. Unfortunately there
are still two smaller conflicts in net/mac80211/util.c which Stephen
also reported[2]. In the first conflict initialise scratch_len to
"params->scratch_len ?: 3 * params->len" (note number 3, not 2!) and
in the second conflict take the version which uses elems->scratch_pos.
Git diff output should look like this:
--- a/net/mac80211/util.c
+++ b/net/mac80211/util.c
@@@ -1506,7 -1648,7 +1650,7 @@@ ieee802_11_parse_elems_full(struct ieee
const struct element *non_inherit = NULL;
u8 *nontransmitted_profile;
int nontransmitted_profile_len = 0;
- size_t scratch_len = params->len;
- size_t scratch_len = params->scratch_len ?: 2 * params->len;
++ size_t scratch_len = params->scratch_len ?: 3 * params->len;
elems = kzalloc(sizeof(*elems) + scratch_len, GFP_ATOMIC);
if (!elems)
[1] https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git/commit/?id=dfd2d876b3fda1790bc0239ba4c6967e25d16e91
[2] https://lore.kernel.org/all/20221020032340.5cf101c0@canb.auug.org.au/
Major changes:
mac80211
* preparation for Wi-Fi 7 Multi-Link Operation (MLO) continues
* add API to show the link STAs in debugfs
* all mac80211 drivers are now using mac80211 internal TX queues (iTXQs)
rtw89
* support 8852BE
rtl8xxxu
* support RTL8188FU
brmfmac
* support two station interfaces concurrently
bcma
* support SPROM rev 11
Kalle Valo says:
====================
pull-request: wireless-next-2022-10-28
First set of patches for v6.2. mac80211 refactoring continues for Wi-Fi 7.
All mac80211 drivers are now converted to use internal TX queues; this
might cause some regressions so we wanted to do this early in the
cycle.
Note: wireless tree was merged[1] to wireless-next to avoid some
conflicts with mac80211 patches between the trees. Unfortunately there
are still two smaller conflicts in net/mac80211/util.c which Stephen
also reported[2]. In the first conflict initialise scratch_len to
"params->scratch_len ?: 3 * params->len" (note number 3, not 2!) and
in the second conflict take the version which uses elems->scratch_pos.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git/commit/?id=dfd2d876b3fda1790bc0239ba4c6967e25d16e91
[2] https://lore.kernel.org/all/20221020032340.5cf101c0@canb.auug.org.au/
mac80211
- preparation for Wi-Fi 7 Multi-Link Operation (MLO) continues
- add API to show the link STAs in debugfs
- all mac80211 drivers are now using mac80211 internal TX queues (iTXQs)
rtw89
- support 8852BE
rtl8xxxu
- support RTL8188FU
brmfmac
- support two station interfaces concurrently
bcma
- support SPROM rev 11
====================
Link: https://lore.kernel.org/r/20221028132943.304ECC433B5@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Congestion control algorithms track PLB state and cause the connection
to trigger a path change when either of the 2 conditions is satisfied:
- No packets are in flight and (# consecutive congested rounds >=
sysctl_tcp_plb_idle_rehash_rounds)
- (# consecutive congested rounds >= sysctl_tcp_plb_rehash_rounds)
A round (RTT) is marked as congested when congestion signal
(ECN ce_ratio) over an RTT is greater than sysctl_tcp_plb_cong_thresh.
In the event of RTO, PLB (via tcp_write_timeout()) triggers a path
change and disables congestion-triggered path changes for a random time
between (sysctl_tcp_plb_suspend_rto_sec, 2*sysctl_tcp_plb_suspend_rto_sec)
to avoid hopping onto the "connectivity blackhole". RTO-triggered
path changes can still happen during this cool-off period.
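A minimal pseudocode sketch of the two triggers described above (not the
exact upstream code):

	/* Once per round trip, with ce_ratio = fraction of CE-marked bytes: */
	if (ce_ratio > sysctl_tcp_plb_cong_thresh)
		plb->consec_cong_rounds++;
	else
		plb->consec_cong_rounds = 0;

	if (plb->consec_cong_rounds >= sysctl_tcp_plb_rehash_rounds ||
	    (!packets_in_flight &&
	     plb->consec_cong_rounds >= sysctl_tcp_plb_idle_rehash_rounds))
		sk_rethink_txhash(sk);	/* pick a new path (new flow label) */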
Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PLB (Protective Load Balancing) is a host based mechanism for load
balancing across switch links. It leverages congestion signals (e.g. ECN)
from the transport layer to randomly change the path of the connection
experiencing congestion. PLB changes the path of the connection by
changing the outgoing IPv6 flow label for IPv6 connections (implemented
in Linux by calling sk_rethink_txhash()). Because of this implementation
mechanism, PLB can currently only work for IPv6 traffic. For more
information, see the SIGCOMM 2022 paper:
https://doi.org/10.1145/3544216.3544226
This commit adds new sysctl knobs and sets their default values for
TCP PLB.
Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
1) Move struct nft_payload_set definition to .c file where it is
only used.
2) Shrink transport and inner header offset fields in the nft_pktinfo
structure to 16-bits, from Florian Westphal.
3) Get rid of nft_objref Kbuild toggle, make it built-in into
nf_tables. This expression is used to instantiate conntrack helpers
in nftables. After the conntrack helper auto-assignment toggle was
removed, this feature became more important, so move it to the nf_tables
core module. Also from Florian.
4) Extend the existing function to calculate payload inner header offset
to deal with the GRE and IPIP transport protocols.
6) Add inner expression support for nf_tables. This new expression
provides a packet parser for tunneled packets which uses a userspace
description of the expected inner headers. The inner expression
invokes the payload expression (via direct call) to match on the
inner header protocol fields using the inner link, network and
transport header offsets.
An example of the bytecode generated from userspace to match on
IP source encapsulated in a VxLAN packet:
# nft --debug=netlink add rule netdev x y udp dport 4789 vxlan ip saddr 1.2.3.4
netdev x y
[ meta load l4proto => reg 1 ]
[ cmp eq reg 1 0x00000011 ]
[ payload load 2b @ transport header + 2 => reg 1 ]
[ cmp eq reg 1 0x0000b512 ]
[ inner type vxlan hdrsize 8 flags f [ meta load protocol => reg 1 ] ]
[ cmp eq reg 1 0x00000008 ]
[ inner type vxlan hdrsize 8 flags f [ payload load 4b @ network header + 12 => reg 1 ] ]
[ cmp eq reg 1 0x04030201 ]
7) Store inner link, network and transport header offsets in percpu
area to parse inner packet header once only. Matching on a different
tunnel type invalidates existing offsets in the percpu area and it
invokes the inner tunnel parser again.
8) Add support for inner meta matching. This adds support for
NFTA_META_PROTOCOL, which specifies the inner ethertype, and
NFT_META_L4PROTO, which specifies the inner transport protocol.
9) Extend nft_inner to parse GENEVE optional fields to calculate the
link layer offset.
10) Update inner expression so tunnel offset points to GRE header
to normalize tunnel header handling. This also allows to perform
different interpretations of the GRE header from userspace.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nft_inner: set tunnel offset to GRE header offset
netfilter: nft_inner: add geneve support
netfilter: nft_meta: add inner match support
netfilter: nft_inner: add percpu inner context
netfilter: nft_inner: support for inner tunnel header matching
netfilter: nft_payload: access ipip payload for inner offset
netfilter: nft_payload: access GRE payload via inner offset
netfilter: nft_objref: make it builtin
netfilter: nf_tables: reduce nft_pktinfo by 8 bytes
netfilter: nft_payload: move struct nft_payload_set definition where it belongs
====================
Link: https://lore.kernel.org/r/20221026132227.3287-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c
2871edb32f ("can: kvaser_usb: Fix possible completions during init_completion")
abb8670938 ("can: kvaser_usb_leaf: Ignore stale bus-off after start")
8d21f5927a ("can: kvaser_usb_leaf: Fix improved state not being reported")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bond agnostically interacts with TLS device-offload requests via the
.ndo_sk_get_lower_dev operation. Return value is true iff bond
guarantees fixed mapping between the TLS connection and a lower netdev.
Due to this nature, the bond TLS device offload features are not
explicitly controllable in the bond layer. As of today, these are
read-only values based on the evaluation of bond_sk_check(). However,
this indication might be incorrect and misleading, when the feature bits
are "fixed" by some dependency features. For example,
NETIF_F_HW_TLS_TX/RX are forcefully cleared in case the corresponding
checksum offload is disabled. But in fact the bond's ability to still
offload TLS connections to the lower device is not hurt.
This means that these bits can not be trusted, and hence better become
unused.
This patch revives some old discussion [1] and proposes a much simpler
solution: Clear the bond's TLS features bits. Everyone should stop
reading them.
[1] https://lore.kernel.org/netdev/20210526095747.22446-1-tariqt@nvidia.com/
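A minimal sketch of the proposal (clearing, rather than computing, the bits):

	/* Stop advertising TLS device-offload bits on the bond device. */
	bond_dev->hw_features &= ~(NETIF_F_HW_TLS_TX | NETIF_F_HW_TLS_RX);
	bond_dev->features &= ~(NETIF_F_HW_TLS_TX | NETIF_F_HW_TLS_RX);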
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20221025105300.4718-1-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Stefan Schmidt says:
====================
==
One of the biggest cycles for ieee802154 in a long time. We are landing the
first pieces of a big enhancement in managing PANs. We might have another pull
request ready for this cycle later on, but I want to get this one out first.
Miquel Raynal added support for sending frames synchronously as a dependency
to handle MLME commands. He also introduced more filtering levels to match
the needs of a device when scanning or operating as a PAN coordinator.
To support development and testing, the hwsim driver for ieee802154 was also
enhanced for the new filtering levels and to update the PIB attributes.
Alexander Aring fixed quite a few bugs spotted while reviewing changes. He
also added support for TRAC in the atusb driver to have better failure
handling if the firmware provides the needed information.
Jilin Yuan fixed a comment with a repeated word in it.
==================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for inner meta matching on:
- NFT_META_PROTOCOL: to match on the ethertype. This can be used even if
the tunnel protocol provides no link layer header; in that case
nft_inner sets the ethertype based on the IP header version field.
- NFT_META_L4PROTO: to match on the layer 4 protocol.
These meta expressions are usually autogenerated as dependencies by
userspace nftables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add NFT_PKTINFO_INNER_FULL flag to annotate that inner offsets are
available. Store nft_inner_tun_ctx object in percpu area to cache
existing inner offsets for this skbuff.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This new expression allows you to match on the inner headers that are
encapsulated by any of the existing tunneling protocols.
This expression parses the inner packet to set the link, network and
transport offsets, so the existing expressions (with a few updates) can
be reused to match on the inner headers.
The inner expression supports different tunnel combinations such as:
- ethernet frame over IPv4/IPv6 packet, eg. VxLAN.
- IPv4/IPv6 packet over IPv4/IPv6 packet, eg. IPIP.
- IPv4/IPv6 packet over IPv4/IPv6 + transport header, eg. GRE.
- transport header (ESP or SCTP) over transport header (usually UDP)
The following fields are used to describe the tunnel protocol:
- flags, which describe how to parse the inner headers:
NFT_PAYLOAD_CTX_INNER_TUN, the tunnel provides its own header.
NFT_PAYLOAD_CTX_INNER_ETHER, the ethernet frame is available as inner header.
NFT_PAYLOAD_CTX_INNER_NH, the network header is available as inner header.
NFT_PAYLOAD_CTX_INNER_TH, the transport header is available as inner header.
For example, VxLAN sets all of these flags, while GRE only sets
NFT_PAYLOAD_CTX_INNER_NH and NFT_PAYLOAD_CTX_INNER_TH, and ESP over
UDP only sets NFT_PAYLOAD_CTX_INNER_TH.
The tunnel description is composed of the following attributes:
- header size: in case the tunnel comes with its own header, eg. VxLAN.
- type: this provides a hint to userspace on how to delinearize the rule.
This is useful for VxLAN and Geneve since they run over UDP, since
transport does not provide a hint. This is also useful in case hardware
offload is ever supported. The type is not currently interpreted by the
kernel.
- expression: currently only payload supported. Follow up patch adds
also inner meta support which is required by autogenerated
dependencies. The exthdr expression should be supported too
at some point. There is a new inner_ops operation that needs to be
set to allow using an existing expression from the inner expression.
This patch adds a new NFT_PAYLOAD_TUN_HEADER base which allows to match
on the tunnel header fields, eg. vxlan vni.
The payload expression is embedded into nft_inner private area and this
private data area is passed to the payload inner eval function via
direct call.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nft_objref is needed to reference named objects; it makes
no sense to disable it.
Before:
text data bss dec filename
4014 424 0 4438 nft_objref.o
4174 1128 0 5302 nft_objref.ko
359351 15276 864 375491 nf_tables.ko
After:
text data bss dec filename
3815 408 0 4223 nft_objref.o
363161 15692 864 379717 nf_tables.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The nft_pktinfo structure is reduced from 32 to 24 bytes. While at it,
also check that iphdrlen is sane; this is guaranteed for NFPROTO_IPV4 but not
for ingress or bridge, so add checks for this.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Not required to expose this header in nf_tables_core.h, move it to where
it is used, ie. nft_payload.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work
with setsockopt(SO_REUSEPORT) since v4.6.
With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could
build a highly efficient server application.
setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener
or UDP socket, and then incoming packets processed on the CPU will
likely be distributed to the socket. Technically, a socket could
even receive packets handled on another CPU if no sockets in the
reuseport group have the same CPU receiving the flow.
The logic exists in compute_score() so that a socket will get a higher
score if it has the same CPU with the flow. However, the score gets
ignored after the blamed two commits, which introduced a faster socket
selection algorithm for SO_REUSEPORT.
This patch introduces a counter of sockets with SO_INCOMING_CPU in
a reuseport group to check if we should iterate all sockets to find
a proper one. We increment the counter when
* calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT
* enabling SO_INCOMING_CPU if the socket is in a reuseport group
Also, we decrement it when
* detaching a socket out of the group to apply SO_INCOMING_CPU to
migrated TCP requests
* disabling SO_INCOMING_CPU if the socket is in a reuseport group
When the counter reaches 0, we can get back to the O(1) selection
algorithm.
The overall changes are negligible for the non-SO_INCOMING_CPU case,
and the only notable thing is that we have to update sk_incoming_cpu
under reuseport_lock. Otherwise, the race prevents transitioning to
the O(n) algorithm and results in the wrong socket selection.
cpu1 (setsockopt) cpu2 (listen)
+-----------------+ +-------------+
lock_sock(sk1) lock_sock(sk2)
reuseport_update_incoming_cpu(sk1, val)
.
| /* set CPU as 0 */
|- WRITE_ONCE(sk1->incoming_cpu, val)
|
| spin_lock_bh(&reuseport_lock)
| reuseport_grow(sk2, reuse)
| .
| |- more_socks_size = reuse->max_socks * 2U;
| |- if (more_socks_size > U16_MAX &&
| | reuse->num_closed_socks)
| | .
| | |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL);
| | `- __reuseport_detach_closed_sock(sk1, reuse)
| | .
| | `- reuseport_put_incoming_cpu(sk1, reuse)
| | .
| | | /* Read shutdown()ed sk1's sk_incoming_cpu
| | | * without lock_sock().
| | | */
| | `- if (sk1->sk_incoming_cpu >= 0)
| | .
| | | /* decrement not-yet-incremented
| | | * count, which is never incremented.
| | | */
| | `- __reuseport_put_incoming_cpu(reuse);
| |
| `- spin_lock_bh(&reuseport_lock)
|
|- spin_lock_bh(&reuseport_lock)
|
|- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...)
|- if (!reuse)
| .
| | /* Cannot increment reuse->incoming_cpu. */
| `- goto out;
|
`- spin_unlock_bh(&reuseport_lock)
Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Fixes: c125e80b88 ("soreuseport: fast reuseport TCP socket selection")
Reported-by: Kazuho Oku <kazuhooku@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add support for skbedit queue mapping action on receive
side. This is supported only in hardware, so the skip_sw
flag is enforced. This enables offloading filters for
receive queue selection in the hardware using the
skbedit action. Traffic arrives on the Rx queue requested
in the skbedit action parameter. A new tc action flag
TCA_ACT_FLAGS_AT_INGRESS is introduced to identify the
traffic direction the action queue_mapping is requested
on during filter addition. This is used to disallow
offloading the skbedit queue mapping action on transmit
side.
Example:
$tc filter add dev $IFACE ingress protocol ip flower dst_ip $DST_IP\
action skbedit queue_mapping $rxq_id skip_sw
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
To keep backward compatibility we used to leave attribute parsing
to the family if no policy is specified. This becomes tedious as
we move to more strict validation. Families must define reject-all
policies if they don't want any attributes accepted.
Piggy back on the resv_start_op field as the switchover point.
AFAICT only ethtool has added new commands since the resv_start_op
was defined, and it has per-op policies so this should be a no-op.
Nonetheless the patch should still go into v6.1 for consistency.
Link: https://lore.kernel.org/all/20221019125745.3f2e7659@kernel.org/
Link: https://lore.kernel.org/r/20221021193532.1511293-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As Shakeel explains, the commit under Fixes had the unintended
side-effect of no longer pre-loading the cached memory allowance.
Even though we previously dropped the first packet received when
over memory limit - the consecutive ones would get thru by using
the cache. The charging was happening in batches of 128kB, so
we'd let in 128kB (truesize) worth of packets per one drop.
After the change we no longer force charge, there will be no
cache filling side effects. This causes significant drops and
connection stalls for workloads which use a lot of page cache,
since we can't reclaim page cache under GFP_NOWAIT.
Some of the latency can be recovered by improving SACK reneg
handling but nowhere near enough to get back to the pre-5.15
performance (the application I'm experimenting with still
sees 5-10x worse latency).
Apply the suggested workaround of using GFP_ATOMIC. We will now
be more permissive than previously as we'll drop _no_ packets
in softirq when under pressure. But I can't think of any good
and simple way to address that within networking.
Link: https://lore.kernel.org/all/20221012163300.795e7b86@kernel.org/
Suggested-by: Shakeel Butt <shakeelb@google.com>
Fixes: 4b1327be9f ("net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()")
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20221021160304.1362511-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The parameter 'msg' has never been used by __sock_cmsg_send, so we can remove it
safely.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Zhang Yunkai <zhang.yunkai@zte.com.cn>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit ffa84b5ffb ("net: add netns refcount tracker to struct sock")
added a tracker to sockets, but did not track kernel sockets.
We still have syzbot reports hinting about netns being destroyed
while some kernel TCP sockets had not been dismantled.
This patch tracks kernel sockets, and adds a ref_tracker_dir_print()
call to net_free() right before the netns is freed.
Normally, each layer is responsible for properly releasing its
kernel sockets before last call to net_free().
This debugging facility is enabled with CONFIG_NET_NS_REFCNT_TRACKER=y
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the receiver process and the BH run on different cores,
udp_rmem_release() experiences a cache miss while accessing sk_rcvbuf,
as the latter shares the same cacheline with sk_forward_alloc, written
by the BH.
With this patch, UDP tracks the rcvbuf value and its update via custom
SOL_SOCKET socket options, and copies the forward memory threshold value
used by udp_rmem_release() in a different cacheline, already accessed by
the above function and uncontended.
Since the UDP socket init operation has grown a bit, factor out the common
code between v4 and v6 into a shared helper.
Overall the above gives a 10% peak throughput increase under UDP flood.
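A minimal sketch of the idea (field name and placement are
illustrative, not the exact layout): the threshold consumed by
udp_rmem_release() is cached in udp_sock, away from sk_forward_alloc's
cacheline, and refreshed when SO_RCVBUF/SO_RCVBUFFORCE update the
socket's rcvbuf:

  struct udp_sock {
      /* ... existing fields ... */
      /* cached forward-memory release threshold, read by
       * udp_rmem_release() without touching sk_rcvbuf
       */
      int forward_threshold;
  };

  /* refreshed on SOL_SOCKET rcvbuf updates (hypothetical helper name) */
  static void udp_update_fwd_threshold(struct sock *sk)
  {
      udp_sk(sk)->forward_threshold = sk->sk_rcvbuf >> 2;
  }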
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The last user of inet6_destroy_sock() is its wrapper inet6_cleanup_sock().
Let's rename inet6_destroy_sock() to inet6_cleanup_sock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'&asoc->ulpq' is passed to sctp_ulpq_init() as the first argument;
sctp_ulpq_init() initializes it and eventually returns the
address of the struct member back. Therefore, in this case, the
return pointer cannot be NULL.
Moreover, it seems sctp_ulpq_init() has always been used only in
sctp_association_init(), so there's really no need to return ulpq
anymore.
Detected using the static analysis tool - Svace.
Signed-off-by: Alexey Kodanev <aleksei.kodanev@bell-sw.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/20221019180735.161388-1-aleksei.kodanev@bell-sw.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Address a bunch of kdoc warnings:
include/net/genetlink.h:81: warning: Function parameter or member 'module' not described in 'genl_family'
include/net/genetlink.h:243: warning: expecting prototype for struct genl_info. Prototype was for struct genl_dumpit_info instead
include/net/genetlink.h:419: warning: Function parameter or member 'net' not described in 'genlmsg_unicast'
include/net/genetlink.h:438: warning: expecting prototype for gennlmsg_data(). Prototype was for genlmsg_data() instead
include/net/genetlink.h:244: warning: Function parameter or member 'op' not described in 'genl_dumpit_info'
Link: https://lore.kernel.org/r/20221018231310.1040482-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When we call connect() for a UDP socket in a reuseport group, we have
to update sk->sk_reuseport_cb->has_conns to 1. Otherwise, the kernel
could wrongly select an unconnected socket for packets sent to the
connected socket.
However, the current way to set has_conns is illegal and can
trigger that problem. reuseport_has_conns() changes has_conns under
rcu_read_lock(), which upgrades the RCU reader to the updater. It
must then do the update under the updater's lock, reuseport_lock, but
it currently doesn't.
For this reason, there is a race below where we fail to set has_conns
resulting in the wrong socket selection. To avoid the race, let's split
the reader and updater with proper locking.
cpu1 cpu2
+----+ +----+
__ip[46]_datagram_connect() reuseport_grow()
. .
|- reuseport_has_conns(sk, true) |- more_reuse = __reuseport_alloc(more_socks_size)
| . |
| |- rcu_read_lock()
| |- reuse = rcu_dereference(sk->sk_reuseport_cb)
| |
| | | /* reuse->has_conns == 0 here */
| | |- more_reuse->has_conns = reuse->has_conns
| |- reuse->has_conns = 1 | /* more_reuse->has_conns SHOULD BE 1 HERE */
| | |
| | |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
| | | more_reuse)
| `- rcu_read_unlock() `- kfree_rcu(reuse, rcu)
|
|- sk->sk_state = TCP_ESTABLISHED
Note the likely(reuse) in reuseport_has_conns_set() is always true,
but we put the test there for ease of review. [0]
For the record, usually, sk_reuseport_cb is changed under lock_sock().
The only exception is reuseport_grow() & TCP reqsk migration case.
1) shutdown() TCP listener, which is moved into the latter part of
reuse->socks[] to migrate reqsk.
2) New listen() overflows reuse->socks[] and call reuseport_grow().
3) reuse->max_socks overflows u16 with the new listener.
4) reuseport_grow() pops the old shutdown()ed listener from the array
and updates its sk->sk_reuseport_cb to NULL without lock_sock().
A shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(),
but reuseport_has_conns_set() is called only for UDP under lock_sock(),
so likely(reuse) is never false in reuseport_has_conns_set().
[0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/
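The updater side then looks roughly like the sketch below; the write is
serialized against reuseport_grow() by reuseport_lock:

  void reuseport_has_conns_set(struct sock *sk)
  {
      struct sock_reuseport *reuse;

      if (!rcu_access_pointer(sk->sk_reuseport_cb))
          return;

      spin_lock_bh(&reuseport_lock);
      reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
                                        lockdep_is_held(&reuseport_lock));
      /* NULL only for a shutdown()ed TCP listener, which never gets
       * here, hence the likely()
       */
      if (likely(reuse))
          reuse->has_conns = 1;
      spin_unlock_bh(&reuseport_lock);
  }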
Fixes: acdcecc612 ("udp: correct reuseport selection with connected sockets")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20221014182625.89913-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Merge tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull more random number generator updates from Jason Donenfeld:
"This time with some large scale treewide cleanups.
The intent of this pull is to clean up the way callers fetch random
integers. The current rules for doing this right are:
- If you want a secure or an insecure random u64, use get_random_u64()
- If you want a secure or an insecure random u32, use get_random_u32()
The old function prandom_u32() has been deprecated for a while
now and is just a wrapper around get_random_u32(). Same for
get_random_int().
- If you want a secure or an insecure random u16, use get_random_u16()
- If you want a secure or an insecure random u8, use get_random_u8()
- If you want secure or insecure random bytes, use get_random_bytes().
The old function prandom_bytes() has been deprecated for a while
now and has long been a wrapper around get_random_bytes()
- If you want a non-uniform random u32, u16, or u8 bounded by a
certain open interval maximum, use prandom_u32_max()
I say "non-uniform", because it doesn't do any rejection sampling
or divisions. Hence, it stays within the prandom_*() namespace, not
the get_random_*() namespace.
I'm currently investigating a "uniform" function for 6.2. We'll see
what comes of that.
By applying these rules uniformly, we get several benefits:
- By using prandom_u32_max() with an upper-bound that the compiler
can prove at compile-time is ≤65536 or ≤256, internally
get_random_u16() or get_random_u8() is used, which wastes fewer
batched random bytes, and hence has higher throughput.
- By using prandom_u32_max() instead of %, when the upper-bound is
not a constant, division is still avoided, because
prandom_u32_max() uses a faster multiplication-based trick instead.
- By using get_random_u16() or get_random_u8() in cases where the
return value is intended to indeed be a u16 or a u8, we waste fewer
batched random bytes, and hence have higher throughput.
This series was originally done by hand while I was on an airplane
without Internet. Later, Kees and I worked on retroactively figuring
out what could be done with Coccinelle and what had to be done
manually, and then we split things up based on that.
So while this touches a lot of files, the actual amount of code that's
hand fiddled is comfortably small"
* tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
prandom: remove unused functions
treewide: use get_random_bytes() when possible
treewide: use get_random_u32() when possible
treewide: use get_random_{u8,u16}() when possible, part 2
treewide: use get_random_{u8,u16}() when possible, part 1
treewide: use prandom_u32_max() when possible, part 2
treewide: use prandom_u32_max() when possible, part 1
Merge tag 'net-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, and wifi.
Current release - regressions:
- Revert "net/sched: taprio: make qdisc_leaf() see the
per-netdev-queue pfifo child qdiscs", it may cause crashes when the
qdisc is reconfigured
- inet: ping: fix splat due to packet allocation refactoring in inet
- tcp: clean up kernel listener's reqsk in inet_twsk_purge(), fix UAF
due to races when per-netns hash table is used
Current release - new code bugs:
- eth: adin1110: check in netdev_event that netdev belongs to driver
- fixes for PTR_ERR() vs NULL bugs in driver code, from Dan and co.
Previous releases - regressions:
- ipv4: handle attempt to delete multipath route when fib_info
contains an nh reference, avoid oob access
- wifi: fix handful of bugs in the new Multi-BSSID code
- wifi: mt76: fix rate reporting / throughput regression on mt7915
and newer, fix checksum offload
- wifi: iwlwifi: mvm: fix double list_add at
iwl_mvm_mac_wake_tx_queue (other cases)
- wifi: mac80211: do not drop packets smaller than the LLC-SNAP
header on fast-rx
Previous releases - always broken:
- ieee802154: don't warn zero-sized raw_sendmsg()
- ipv6: ping: fix wrong checksum for large frames
- mctp: prevent double key removal and unref
- tcp/udp: fix memory leaks and races around IPV6_ADDRFORM
- hv_netvsc: fix race between VF offering and VF association message
Misc:
- remove -Warray-bounds silencing in the drivers, compilers fixed"
* tag 'net-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (73 commits)
sunhme: fix an IS_ERR() vs NULL check in probe
net: marvell: prestera: fix a couple NULL vs IS_ERR() checks
kcm: avoid potential race in kcm_tx_work
tcp: Clean up kernel listener's reqsk in inet_twsk_purge()
net: phy: micrel: Fixes FIELD_GET assertion
openvswitch: add nf_ct_is_confirmed check before assigning the helper
tcp: Fix data races around icsk->icsk_af_ops.
ipv6: Fix data races around sk->sk_prot.
tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct().
udp: Call inet6_destroy_sock() in setsockopt(IPV6_ADDRFORM).
tcp/udp: Fix memory leak in ipv6_renew_options().
mctp: prevent double key removal and unref
selftests: netfilter: Fix nft_fib.sh for all.rp_filter=1
netfilter: rpfilter/fib: Populate flowic_l3mdev field
selftests: netfilter: Test reverse path filtering
net/mlx5: Make ASO poll CQ usable in atomic context
tcp: cdg: allow tcp_cdg_release() to be called multiple times
inet: ping: fix recent breakage
ipv6: ping: fix wrong checksum for large frames
net: ethernet: ti: am65-cpsw: set correct devlink flavour for unused ports
...
Originally, inet6_sk(sk)->XXX were changed under lock_sock(), so we were
able to clean them up by calling inet6_destroy_sock() during the IPv6 ->
IPv4 conversion by IPV6_ADDRFORM. However, commit 03485f2adc ("udpv6:
Add lockless sendmsg() support") added a lockless memory allocation path,
which could cause a memory leak:
setsockopt(IPV6_ADDRFORM) sendmsg()
+-----------------------+ +-------+
- do_ipv6_setsockopt(sk, ...) - udpv6_sendmsg(sk, ...)
- sockopt_lock_sock(sk) ^._ called via udpv6_prot
- lock_sock(sk) before WRITE_ONCE()
- WRITE_ONCE(sk->sk_prot, &tcp_prot)
- inet6_destroy_sock() - if (!corkreq)
- sockopt_release_sock(sk) - ip6_make_skb(sk, ...)
- release_sock(sk) ^._ lockless fast path for
the non-corking case
- __ip6_append_data(sk, ...)
- ipv6_local_rxpmtu(sk, ...)
- xchg(&np->rxpmtu, skb)
^._ rxpmtu is never freed.
- goto out_no_dst;
- lock_sock(sk)
For now, rxpmtu is the only such case, but to avoid missing future changes
and given a similar bug fixed in commit e27326009a ("net: ping6: Fix
memleak in ipv6_renew_options()."), let's set a new function to IPv6
sk->sk_destruct() and call inet6_cleanup_sock() there. Since the
conversion does not change sk->sk_destruct(), we can guarantee that
the IPv6 resources are eventually cleaned up.
We can now remove all inet6_destroy_sock() calls from IPv6 protocol
specific ->destroy() functions, but such changes are invasive to
backport. So they can be posted as a follow-up later for net-next.
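A sketch of the destructor wiring described above (the exact function
names in the tree may differ slightly):

  void inet6_sock_destruct(struct sock *sk)
  {
      /* always release IPv6-specific state, even after the socket was
       * converted to IPv4 via IPV6_ADDRFORM
       */
      inet6_cleanup_sock(sk);
      inet_sock_destruct(sk);
  }

  /* ... and the IPv6 socket init paths set
   *     sk->sk_destruct = inet6_sock_destruct;
   */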
Fixes: 03485f2adc ("udpv6: Add lockless sendmsg() support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 4b340ae20d ("IPv6: Complete IPV6_DONTFRAG support") forgot
to add a change to free inet6_sk(sk)->rxpmtu while converting an IPv6
socket into IPv4 with IPV6_ADDRFORM. After conversion, sk_prot is
changed to udp_prot and ->destroy() never cleans it up, resulting in
a memory leak.
This is due to the discrepancy between inet6_destroy_sock() and
IPV6_ADDRFORM, so let's call inet6_destroy_sock() from IPV6_ADDRFORM
to remove the difference.
However, this is not enough for now because rxpmtu can be changed
without lock_sock() after commit 03485f2adc ("udpv6: Add lockless
sendmsg() support"). We will fix this case in the following patch.
Note we will rename inet6_destroy_sock() to inet6_cleanup_sock() and
remove unnecessary inet6_destroy_sock() calls in sk_prot->destroy()
in the future.
Fixes: 4b340ae20d ("IPv6: Complete IPV6_DONTFRAG support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This IEEE802154_HW_RX_DROP_BAD_CKSUM flag was only used by hwsim to
reflect the fact that it would not validate the checksum (FCS). So this
was only useful while the only filtering level hwsim was capable of was
"NONE". Now that the driver has been improved we no longer need this
flag.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/r/20221007085310.503366-7-miquel.raynal@bootlin.com
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
We have access to the address filters that should theoretically be applied,
and we also have access to the actual filtering level applied, so let's add
a proper frame validation sequence in hwsim.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/r/20221007085310.503366-6-miquel.raynal@bootlin.com
[stefan@datenfreihafen.org: fixup some checkpatch warnings]
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
The current filtering level is set on the first interface up on a wpan
phy. If we support scan functionality we need to change the filtering
level on the fly on an operational phy and switch back again.
This patch moves the receive mode parameters, e.g. address filter and
promiscuous mode, to the drv_start() functionality to allow changing the
receive mode on an operational phy, not on first ifup only. In the future this
should be handled at the driver layer because each hardware has its own way
to enter a specific filtering level. However, this should offer to switch
to mode IEEE802154_FILTERING_NONE and back to
IEEE802154_FILTERING_4_FRAME_FIELDS.
Only IEEE802154_FILTERING_4_FRAME_FIELDS and IEEE802154_FILTERING_NONE
are somewhat supported by current hardware. All other filtering levels
can be supported in future but will end in IEEE802154_FILTERING_NONE as
the receive part can kind of "emulate" those receive paths by doing
additional filtering routines.
There are in total three filtering levels in the code:
- the per-interface default level (should not be changed)
- the required per-interface level (mac commands may play with it)
- the actual per-PHY (hw) level that is currently in use
Signed-off-by: Alexander Aring <aahringo@redhat.com>
[<miquel.raynal@bootlin.com: Add the third filtering variable]
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/r/20221007085310.503366-4-miquel.raynal@bootlin.com
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Merge tag '9p-for-6.1' of https://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"Smaller buffers for small messages and fixes.
The highlight of this is Christian's patch to allocate smaller buffers
for most metadata requests: 9p with a big msize would try to allocate
large buffers when just 4 or 8k would be more than enough; this brings
in nice performance improvements.
There's also a few fixes for problems reported by syzkaller (thanks to
Schspa Shi, Tetsuo Handa for tests and feedback/patches) as well as
some minor cleanup"
* tag '9p-for-6.1' of https://github.com/martinetd/linux:
net/9p: clarify trans_fd parse_opt failure handling
net/9p: add __init/__exit annotations to module init/exit funcs
net/9p: use a dedicated spinlock for trans_fd
9p/trans_fd: always use O_NONBLOCK read/write
net/9p: allocate appropriate reduced message buffers
net/9p: add 'pooled_rbuffers' flag to struct p9_trans_module
net/9p: add p9_msg_buf_size()
9p: add P9_ERRMAX for 9p2000 and 9p2000.u
net/9p: split message size argument into 't_size' and 'r_size' pair
9p: trans_fd/p9_conn_cancel: drop client lock earlier
Start to align the TX handling to only use internal TX queues (iTXQs):
Provide a handler for drivers not having a custom wake_tx_queue
callback and update the documentation.
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In case of 4-way handshake offload, the transition disable policy
updated by the AP during EAPOL 3/4 is not propagated to the upper layer.
This results in a mismatch in transition disable policy
between the upper layer and the driver. This patch addresses this
issue by updating transition disable policy as part of port
authorization indication.
Signed-off-by: Vinayak Yadawad <vinayak.yadawad@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We might sometimes need to use RCU and locking in the same code
path, so add the two variants link_conf_dereference_check() and
link_sta_dereference_check().
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For assoc and connect result APIs, support reporting
failed links; they should still come with the BSS
pointer in the case of assoc, so they're released
correctly. In the case of connect result, this is
optional.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Create debugfs data per-link. For drivers, there is a new operation
link_sta_add_debugfs which will always be called.
For non-MLO, the station directory will be used directly rather than
creating a corresponding subdirectory. As such, non-MLO drivers can
simply continue to create the data from sta_debugfs_add.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
[add missing inlines if !CONFIG_MAC80211_DEBUGFS]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
While often not needed, this considerably simplifies going from a link
to the STA. This helps in cases such as debugfs where a single pointer
should allow accessing a specific link and the STA.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch adds handling to return -EINVAL for an unknown addr type. The
current behaviour is to return 0 as if successful, but the size of an
unknown addr type is not defined, so we should return an error like -EINVAL.
Fixes: 94160108a7 ("net/ieee802154: fix uninit value bug in dgram_sendmsg")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add P9_ERRMAX macro to 9P protocol header which reflects the maximum
error string length of Rerror replies for 9p2000 and 9p2000.u protocol
versions. Unfortunately a maximum error string length is not defined by
the 9p2000 spec, so pick 128 as the value for now, as this seems to be a
common maximum size for POSIX error strings in practice.
9p2000.L protocol version uses Rlerror replies instead which does not
contain an error string.
Link: https://lkml.kernel.org/r/3f23191d21032e7c14852b1e1a4ae26417a36739.1657920926.git.linux_oss@crudebyte.com
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Daniel Borkmann says:
====================
pull-request: bpf 2022-10-03
We've added 10 non-merge commits during the last 23 day(s) which contain
a total of 14 files changed, 130 insertions(+), 69 deletions(-).
The main changes are:
1) Fix dynptr helper API to gate behind CAP_BPF given it was not intended
for unprivileged BPF programs, from Kumar Kartikeya Dwivedi.
2) Fix need_wakeup flag inheritance from umem buffer pool for shared xsk
sockets, from Jalal Mostafa.
3) Fix truncated last_member_type_id in btf_struct_resolve() which had a
wrong storage type, from Lorenz Bauer.
4) Fix xsk back-pressure mechanism on tx when amount of produced
descriptors to CQ is lower than what was grabbed from xsk tx ring,
from Maciej Fijalkowski.
5) Fix wrong cgroup attach flags being displayed to effective progs,
from Pu Lehui.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
xsk: Inherit need_wakeup flag for shared sockets
bpf: Gate dynptr API behind CAP_BPF
selftests/bpf: Adapt cgroup effective query uapi change
bpftool: Fix wrong cgroup attach flags being assigned to effective progs
bpf, cgroup: Reject prog_attach_flags array when effective query
bpf: Ensure correct locking around vulnerable function find_vpid()
bpf: btf: fix truncated last_member_type_id in btf_struct_resolve
selftests/xsk: Add missing close() on netns fd
xsk: Fix backpressure mechanism on Tx
MAINTAINERS: Add include/linux/tnum.h to BPF CORE
====================
Link: https://lore.kernel.org/r/20221003201957.13149-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-10-03
We've added 143 non-merge commits during the last 27 day(s) which contain
a total of 151 files changed, 8321 insertions(+), 1402 deletions(-).
The main changes are:
1) Add kfuncs for PKCS#7 signature verification from BPF programs, from Roberto Sassu.
2) Add support for struct-based arguments for trampoline based BPF programs,
from Yonghong Song.
3) Fix entry IP for kprobe-multi and trampoline probes under IBT enabled, from Jiri Olsa.
4) Batch of improvements to veristat selftest tool in particular to add CSV output,
a comparison mode for CSV outputs and filtering, from Andrii Nakryiko.
5) Add preparatory changes needed for the BPF core for upcoming BPF HID support,
from Benjamin Tissoires.
6) Support for direct writes to nf_conn's mark field from tc and XDP BPF program
types, from Daniel Xu.
7) Initial batch of documentation improvements for BPF insn set spec, from Dave Thaler.
8) Add a new BPF_MAP_TYPE_USER_RINGBUF map which provides single-user-space-producer /
single-kernel-consumer semantics for BPF ring buffer, from David Vernet.
9) Follow-up fixes to BPF allocator under RT to always use raw spinlock for the BPF
hashtab's bucket lock, from Hou Tao.
10) Allow creating an iterator that loops through only the resources of one
task/thread instead of all, from Kui-Feng Lee.
11) Add support for kptrs in the per-CPU arraymap, from Kumar Kartikeya Dwivedi.
12) Add a new kfunc helper for nf to set src/dst NAT IP/port in a newly allocated CT
entry which is not yet inserted, from Lorenzo Bianconi.
13) Remove invalid recursion check for struct_ops for TCP congestion control BPF
programs, from Martin KaFai Lau.
14) Fix W^X issue with BPF trampoline and BPF dispatcher, from Song Liu.
15) Fix percpu_counter leakage in BPF hashtab allocation error path, from Tetsuo Handa.
16) Various cleanups in BPF selftests to use preferred ASSERT_* macros, from Wang Yufen.
17) Add invocation for cgroup/connect{4,6} BPF programs for ICMP pings, from YiFei Zhu.
18) Lift blinding decision under bpf_jit_harden = 1 to bpf_capable(), from Yauheni Kaliuta.
19) Various libbpf fixes and cleanups including a libbpf NULL pointer deref, from Xin Liu.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (143 commits)
net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c
Documentation: bpf: Add implementation notes documentations to table of contents
bpf, docs: Delete misformatted table.
selftests/xsk: Fix double free
bpftool: Fix error message of strerror
libbpf: Fix overrun in netlink attribute iteration
selftests/bpf: Fix spelling mistake "unpriviledged" -> "unprivileged"
samples/bpf: Fix typo in xdp_router_ipv4 sample
bpftool: Remove unused struct event_ring_info
bpftool: Remove unused struct btf_attach_point
bpf, docs: Add TOC and fix formatting.
bpf, docs: Add Clang note about BPF_ALU
bpf, docs: Move Clang notes to a separate file
bpf, docs: Linux byteswap note
bpf, docs: Move legacy packet instructions to a separate file
selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION)
bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function
bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt()
bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline
...
====================
Link: https://lore.kernel.org/r/20221003194915.11847-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
DECnet was removed by commit 1202cdd665 ("Remove DECnet support from
kernel"). Let's also revome its flow structure.
Compile-tested only (allmodconfig).
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ip_tunnel_netlink_parms to parse netlink msg of ip_tunnel_parm.
Reduces duplicate code, no actual functional changes.
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ip_tunnel_netlink_encap_parms to parse netlink msg of ip_tunnel_encap.
Reduces duplicate code, no actual functional changes.
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
1) Refactor selftests to use an array of structs in xfrm_fill_key().
From Gautam Menghani.
2) Drop an unused argument from xfrm_policy_match.
From Hongbin Wang.
3) Support collect metadata mode for xfrm interfaces.
From Eyal Birger.
4) Add netlink extack support to xfrm.
From Sabrina Dubroca.
Please note, there is a merge conflict in:
include/net/dst_metadata.h
between commit:
0a28bfd497 ("net/macsec: Add MACsec skb_metadata_dst Tx Data path support")
from the net-next tree and commit:
5182a5d48c ("net: allow storing xfrm interface metadata in metadata_dst")
from the ipsec-next tree.
Can be solved as done in linux-next.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
All the bind_class callbacks duplicate the same logic; this patch
introduces the tc_cls_bind_class() helper for common usage.
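The shared helper is roughly the sketch below (the exact signature may
differ; __tcf_bind_filter()/__tcf_unbind_filter() are the existing
primitives):

  static inline void tc_cls_bind_class(u32 classid, unsigned long cl,
                                       void *q, struct tcf_result *res,
                                       unsigned long base)
  {
      if (res->classid == classid) {
          if (cl)
              __tcf_bind_filter(q, res, base);
          else
              __tcf_unbind_filter(q, res);
      }
  }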
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since dsa_port_devlink_setup() and dsa_port_devlink_teardown() are
already called from code paths which only execute once per port (due to
the existing bool dp->setup), keeping another dp->devlink_port_setup is
redundant, because we can already manage to balance the calls properly
(and not call teardown when setup was never called, or call setup twice,
or things like that).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lifetime of some of the devlink objects, like regions, is currently
forced to be different for devlink instance and devlink port instance
(per-port regions). The reason is that for devlink ports, the internal
structures initialization happens only after devlink_port_register() is
called.
To resolve this inconsistency, introduce new set of helpers to allow
driver to initialize devlink pointer and region list before
devlink_register() is called. That allows port regions to be created
before devlink port registration and destroyed after devlink
port unregistration.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of relying on devlink pointer not being initialized, introduce
an extra flag to indicate if devlink port is registered. This is needed
as later on devlink pointer is going to be initialized even in case
devlink port is not registered yet.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'for-net-next-2022-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next
- Add RTL8761BUV device (Edimax BT-8500)
- Add a new PID/VID 13d3/3583 for MT7921
- Add Realtek RTL8852C support ID 0x13D3:0x3592
- Add VID/PID 0489/e0e0 for MediaTek MT7921
- Add a new VID/PID 0e8d/0608 for MT7921
- Add a new PID/VID 13d3/3578 for MT7921
- Add BT device 0cb8:c549 from RTW8852AE
- Add support for Intel Magnetor
* tag 'for-net-next-2022-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (49 commits)
Bluetooth: hci_sync: Fix not indicating power state
Bluetooth: L2CAP: Fix user-after-free
Bluetooth: Call shutdown for HCI_USER_CHANNEL
Bluetooth: Prevent double register of suspend
Bluetooth: hci_core: Fix not handling link timeouts propertly
Bluetooth: hci_event: Make sure ISO events don't affect non-ISO connections
Bluetooth: hci_debugfs: Fix not checking conn->debugfs
Bluetooth: hci_sysfs: Fix attempting to call device_add multiple times
Bluetooth: MGMT: fix zalloc-simple.cocci warnings
Bluetooth: hci_{ldisc,serdev}: check percpu_init_rwsem() failure
Bluetooth: use hdev->workqueue when queuing hdev->{cmd,ncmd}_timer works
Bluetooth: L2CAP: initialize delayed works at l2cap_chan_create()
Bluetooth: RFCOMM: Fix possible deadlock on socket shutdown/release
Bluetooth: hci_sync: allow advertise when scan without RPA
Bluetooth: btusb: Add a new VID/PID 0e8d/0608 for MT7921
Bluetooth: btusb: Add a new PID/VID 13d3/3583 for MT7921
Bluetooth: avoid hci_dev_test_and_set_flag() in mgmt_init_hdev()
Bluetooth: btintel: Mark Intel controller to support LE_STATES quirk
Bluetooth: btintel: Add support for Magnetor
Bluetooth: btusb: Add a new PID/VID 13d3/3578 for MT7921
...
====================
Link: https://lore.kernel.org/r/20221001004602.297366-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'wireless-next-2022-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.1
Few stack changes and lots of driver changes in this round. brcmfmac
has more activity as usual and it gets new hardware support. ath11k
improves WCN6750 support and also other smaller features. And of
course changes all over.
Note: in early September wireless tree was merged to wireless-next to
avoid some conflicts with mac80211 patches, this shouldn't cause any
problems but wanted to mention anyway.
Major changes:
mac80211
- refactoring and preparation for Wi-Fi 7 Multi-Link Operation (MLO)
feature continues
brcmfmac
- support CYW43439 SDIO chipset
- support BCM4378 on Apple platforms
- support CYW89459 PCIe chipset
rtw89
- more work to get rtw8852c supported
- P2P support
- support for enabling and disabling MSDU aggregation via nl80211
mt76
- tx status reporting improvements
ath11k
- cold boot calibration support on WCN6750
- Target Wake Time (TWT) debugfs support for STA interface
- support to connect to a non-transmit MBSSID AP profile
- enable remain-on-channel support on WCN6750
- implement SRAM dump debugfs interface
- enable threaded NAPI on all hardware
- WoW support for WCN6750
- support to provide transmit power from firmware via nl80211
- support to get power save duration for each client
- spectral scan support for 160 MHz
wcn36xx
- add SNR from a received frame as a source of system entropy
* tag 'wireless-next-2022-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (231 commits)
wifi: rtl8xxxu: Improve rtl8xxxu_queue_select
wifi: rtl8xxxu: Fix AIFS written to REG_EDCA_*_PARAM
wifi: rtl8xxxu: gen2: Enable 40 MHz channel width
wifi: rtw89: 8852b: configure DLE mem
wifi: rtw89: check DLE FIFO size with reserved size
wifi: rtw89: mac: correct register of report IMR
wifi: rtw89: pci: set power cut closed for 8852be
wifi: rtw89: pci: add to do PCI auto calibration
wifi: rtw89: 8852b: implement chip_ops::{enable,disable}_bb_rf
wifi: rtw89: add DMA busy checking bits to chip info
wifi: rtw89: mac: define DMA channel mask to avoid unsupported channels
wifi: rtw89: pci: mask out unsupported TX channels
iwlegacy: Replace zero-length arrays with DECLARE_FLEX_ARRAY() helper
ipw2x00: Replace zero-length array with DECLARE_FLEX_ARRAY() helper
wifi: iwlwifi: Track scan_cmd allocation size explicitly
brcmfmac: Remove the call to "dtim_assoc" IOVAR
brcmfmac: increase dcmd maximum buffer size
brcmfmac: Support 89459 pcie
brcmfmac: increase default max WOWL patterns to 16
cw1200: fix incorrect check to determine if no element is found in list
...
====================
Link: https://lore.kernel.org/r/20220930150413.A7984C433D6@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The previous commit removed the last usage of xsk_buff_discard in mlx5e,
so the function that is no longer used can be removed.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
CC: "Björn Töpel" <bjorn@kernel.org>
CC: Magnus Karlsson <magnus.karlsson@intel.com>
CC: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Drivers should be aware of the range of valid UMEM chunk sizes to be
able to allocate their internal structures of an appropriate size. It
will be used by mlx5e in the following patches.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
CC: "Björn Töpel" <bjorn@kernel.org>
CC: Magnus Karlsson <magnus.karlsson@intel.com>
CC: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit fixes a bug in the tracking of max_packets_out and
is_cwnd_limited. This bug can cause the connection to fail to remember
that is_cwnd_limited is true, causing the connection to fail to grow
cwnd when it should, causing throughput to be lower than it should be.
The following event sequence is an example that triggers the bug:
(a) The connection is cwnd_limited, but packets_out is not at its
peak due to TSO deferral deciding not to send another skb yet.
In such cases the connection can advance max_packets_seq and set
tp->is_cwnd_limited to true and max_packets_out to a small
number.
(b) Then later in the round trip the connection is pacing-limited (not
cwnd-limited), and packets_out is larger. In such cases the
connection would raise max_packets_out to a bigger number but
(unexpectedly) flip tp->is_cwnd_limited from true to false.
This commit fixes that bug.
One straightforward fix would be to separately track (a) the next
window after max_packets_out reaches a maximum, and (b) the next
window after tp->is_cwnd_limited is set to true. But this would
require consuming an extra u32 sequence number.
Instead, to save space we track only the most important
information. Specifically, we track the strongest available signal of
the degree to which the cwnd is fully utilized:
(1) If the connection is cwnd-limited then we remember that fact for
the current window.
(2) If the connection is not cwnd-limited then we track the maximum
number of outstanding packets in the current window.
In particular, note that the new logic cannot trigger the buggy
(a)/(b) sequence above because with the new logic a condition where
tp->packets_out > tp->max_packets_out can only trigger an update of
tp->is_cwnd_limited if tp->is_cwnd_limited is false.
This first showed up in a testing of a BBRv2 dev branch, but this
buggy behavior highlighted a general issue with the
tcp_cwnd_validate() logic that can cause cwnd to fail to increase at
the proper rate for any TCP congestion control, including Reno or
CUBIC.
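The resulting update condition in tcp_cwnd_validate() is roughly the
sketch below (an approximation of the patch; the sequence field marking
the current window is shown as cwnd_usage_seq):

  if (!before(tp->snd_una, tp->cwnd_usage_seq) ||
      is_cwnd_limited ||
      (!tp->is_cwnd_limited &&
       tp->packets_out > tp->max_packets_out)) {
      tp->is_cwnd_limited = is_cwnd_limited;
      tp->max_packets_out = tp->packets_out;
      tp->cwnd_usage_seq = tp->snd_nxt;
  }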
Fixes: ca8a226343 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IEEE 802.1Q clause 12.29.1.1 "The queueMaxSDUTable structure and data
types" and 8.6.8.4 "Enhancements for scheduled traffic" talk about the
existence of a per traffic class limitation of maximum frame sizes, with
a fallback on the port-based MTU.
As far as I am able to understand, the 802.1Q Service Data Unit (SDU)
represents the MAC Service Data Unit (MSDU, i.e. L2 payload), excluding
any number of prepended VLAN headers which may be otherwise present in
the MSDU. Therefore, the queueMaxSDU is directly comparable to the
device MTU (1500 means L2 payload sizes are accepted, or frame sizes of
1518 octets, or 1522 octets with one VLAN header). Drivers which offload this
are directly responsible for translating it into other units of measurement.
To keep the fast path checks optimized, we keep 2 arrays in the qdisc,
one for max_sdu translated into frame length (so that it's comparable to
skb->len), and another for offloading and for dumping back to the user.
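Schematically (names and the array bound are illustrative, not the
exact taprio layout), the qdisc keeps the two views side by side:

  struct taprio_sched_sketch {
      /* max_sdu translated into L2 frame length, compared directly
       * against skb->len in the fast path
       */
      u32 max_frm_len[TC_QOPT_MAX_QUEUE];
      /* raw queueMaxSDU values, used for offload and for dumping
       * back to user space
       */
      u32 max_sdu[TC_QOPT_MAX_QUEUE];
  };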
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When adding optional new features to Qdisc offloads, existing drivers
must reject the new configuration until they are coded up to act on it.
Since modifying all drivers in lockstep with the changes in the Qdisc
can create problems of its own, it would be nice if there existed an
automatic opt-in mechanism for offloading optional features.
Jakub proposes that we multiplex one more kind of call through
ndo_setup_tc(): one where the driver populates a Qdisc-specific
capability structure.
First user will be taprio in further changes. Here we are introducing
the definitions for the base functionality.
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220923163310.3192733-3-vladimir.oltean@nxp.com/
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The v6_rcv_saddr and rcv_saddr are inside a union in the
'struct inet_bind2_bucket'. When searching a bucket by following the
bhash2 hashtable chain, e.g. in inet_bind2_bucket_match, only
sk->sk_family is used and there is no way to check if the inet_bind2_bucket
holds a v6 or v4 address in the union. This leads to an uninit-value
KMSAN report in [0] and also potentially incorrect matches.
This patch fixes it by adding a family member to the inet_bind2_bucket
and then tests 'sk->sk_family != tb->family' before matching
the sk's address to the tb's address.
Cc: Joanne Koong <joannelkoong@gmail.com>
Fixes: 28044fc1d4 ("net: Add a bhash2 table hashed by port and address")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/r/20220927002544.3381205-1-kafai@fb.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It will be used to support TCP FastOpen with MPTCP in the following
commit.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net>
Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Practical experience (and advice from Alexei) tells us that bitfields in
structs lead to un-optimized assembly code. I've verified this change
does lead to better x86_64 assembly, both via objdump and playing with
code snippets in godbolt.org.
Using scripts/bloat-o-meter shows the code size is reduced by 24
bytes for xdp_convert_buff_to_frame(), which gets inlined e.g. in
i40e_xmit_xdp_tx_ring() which were used for microbenchmarking.
Microbenchmarking results do show improvements, but very small and
varying between 0.5 to 2 nanosec improvement per packet.
The member @metasize is changed from u8 to u32. Future users of this
area could split this into two u16 fields. I've also benchmarked with
two u16 fields showing equal performance gains and code size reduction.
Moving member @frame_sz doesn't change the struct size due to existing
padding. As in xdp_buff, member @frame_sz is placed next to @flags, which
allows the compiler to optimize assignment of these.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/166393728005.2213882.4162674859542409548.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All uses of dst_hold_and_use() have
been removed since commit 1202cdd665 ("Remove DECnet support
from kernel"), so remove it.
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All uses of sk_nulls_node_init() have
been removed since commit dbca1596bb ("ping: convert to RCU
lookups, get rid of rwlock"), so remove it.
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All uses of neigh_key_eq16() have
been removed since commit 1202cdd665 ("Remove DECnet support
from kernel"), so remove it.
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that the MACsec offloading preparation phase was removed from the
MACsec core implementation as well as from drivers implementing it, we
can safely remove the flag representing it.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The walk implementation of most qdisc class modules is basically the
same. That is, the values of count and skip are checked first. If
count is greater than or equal to skip, the registered fn function is
executed. Otherwise, increase the value of count. So we can reconstruct
them.
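The common pattern being factored out is roughly the sketch below
(helper name and placement are a best-effort guess at what the series
adds):

  static inline bool tc_qdisc_stats_dump(struct Qdisc *sch, unsigned long cl,
                                         struct qdisc_walker *arg)
  {
      if (arg->count >= arg->skip && arg->fn(sch, cl, arg) < 0) {
          arg->stop = 1;
          return false;
      }

      arg->count++;
      return true;
  }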
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce a cipher sizes descriptor. It helps reduce the amount of code
duplication and repeated switch/cases that assign the proper sizes
according to the cipher type.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The flag for need_wakeup is not set for xsks with `XDP_SHARED_UMEM`
flag and of different queue ids and/or devices. They should inherit
the flag from the first socket buffer pool since no flags can be
specified once `XDP_SHARED_UMEM` is specified.
Fixes: b5aea28dca ("xsk: Add shared umem support between queue ids")
Signed-off-by: Jalal Mostafa <jalal.a.mostapha@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220921135701.10199-1-jalal.a.mostapha@gmail.com
Currently, SMC uses smc->sk.sk_{rcv|snd}buf to create buffers for
send buffer and RMB. And the values of buffer size are from tcp_{w|r}mem
in clcsock.
The buffer sizes taken from the TCP socket don't fit SMC well. Generally,
SMC-R/-D buffers are larger than TCP's to get higher performance, since
they use different underlying devices and paths.
So this patch unbinds buffer size from TCP, and introduces two sysctl
knobs to tune them independently. Also, these knobs are per net
namespace and work for containers.
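Usage then looks something like the following (knob names given as
best-effort examples of what the series adds, values in bytes):

  # tune SMC send buffer and RMB sizes independently of tcp_{w,r}mem
  sysctl -w net.smc.wmem=262144
  sysctl -w net.smc.rmem=262144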
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
SMC-R tests the viability of link by sending out TEST_LINK LLC
messages over RoCE fabric when connections on link have been
idle for a time longer than keepalive interval (testlink time).
But using tcp_keepalive_time as the testlink time may not be quite
suitable because it defaults to no less than two hours [1], which
is too long for a single link to find a dead peer. The active host
will still use the dead peer's link (QP) to send messages, and can't
find out until it gets IB_WC_RETRY_EXC_ERR error CQEs, which normally
takes more time than the TEST_LINK timeout (SMC_LLC_WAIT_TIME).
So this patch introduces an independent sysctl for SMC-R to set the
link keepalive time, in order to detect link down in time. The
default value is 30 seconds.
[1] https://www.rfc-editor.org/rfc/rfc1122#page-101
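For example (knob name given as a best-effort example of what this
patch adds, value in seconds):

  # shorten the SMC-R link keepalive (TEST_LINK) interval to 10 seconds
  sysctl -w net.smc.smcr_testlink_time=10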
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The walk implementation of most tc cls modules is basically the same.
That is, the values of count and skip are checked first. If count is
greater than or equal to skip, the registered fn function is executed.
Otherwise, increase the value of count. So we can reconstruct them.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We're seeing the following new warnings on netdev/build_32bit and
netdev/build_allmodconfig_warn CI jobs:
../net/core/filter.c:8608:1: warning: symbol
'nf_conn_btf_access_lock' was not declared. Should it be static?
../net/core/filter.c:8611:5: warning: symbol 'nfct_bsa' was not
declared. Should it be static?
Fix by ensuring extern declaration is present while compiling filter.o.
Fixes: 864b656f82 ("bpf: Add support for writing to nf_conn:mark")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/2bd2e0283df36d8a4119605878edb1838d144174.1663683114.git.dxu@dxuuu.xyz
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
The more sockets we have in the hash table, the longer we spend looking
up the socket. While running a number of small workloads on the same
host, they penalise each other and cause performance degradation.
The root cause might be a single workload that consumes much more
resources than the others. It often happens on a cloud service where
different workloads share the same computing resource.
On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
entries), after running iperf3 in different netns, creating 24Mi sockets
without data transfer in the root netns causes about 10% performance
regression for the iperf3's connection.
  thash_entries    sockets    length    Gbps
         524288          1         1    50.7
         524288       24Mi        48    45.1
It is basically related to the length of the list of each hash bucket.
For testing purposes to see how performance drops along the length,
I set 131072 (1Mi / 8) to thash_entries, and here's the result.
  thash_entries    sockets    length    Gbps
         131072          1         1    50.7
         131072        1Mi         8    49.9
         131072        2Mi        16    48.9
         131072        4Mi        32    47.3
         131072        8Mi        64    44.6
         131072       16Mi       128    40.6
         131072       24Mi       192    36.3
         131072       32Mi       256    32.5
         131072       40Mi       320    27.0
         131072       48Mi       384    25.0
To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP, but it's just ehash, and we still share
the global bhash, bhash2 and lhash2.
With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours. In addition, we can reduce lock contention.
We can control the ehash size by a new sysctl knob. However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover,
we can fall back to using the global ehash in case we fail to allocate
enough memory for a new ehash. The maximum size is 16Mi, which is large
enough that even if we have 48Mi sockets, the average list length is 3,
and regression would be less than 1%.
We can check the current ehash size by another read-only sysctl knob,
net.ipv4.tcp_ehash_entries. A negative value means the netns shares
the global ehash (per-netns ehash is disabled or failed to allocate
memory).
# dmesg | cut -d ' ' -f 5- | grep "established hash"
TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
# sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries
# sysctl net.ipv4.tcp_child_ehash_entries
net.ipv4.tcp_child_ehash_entries = 0 # disabled by default
# ip netns add test1
# ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = -524288 # share the global ehash
# sysctl -w net.ipv4.tcp_child_ehash_entries=100
net.ipv4.tcp_child_ehash_entries = 100
# ip netns add test2
# ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets
When more than two processes in the same netns create per-netns ehash
concurrently with different sizes, we need to guarantee the size in
one of the following ways:
1) Share the global ehash and create per-netns ehash
First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
netns sysctl knobs where we can safely change tcp_child_ehash_entries
and clone()/unshare() to create a per-netns ehash.
2) Control write on sysctl by BPF
We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
sysctl knobs.
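For example, a BPF_PROG_TYPE_CGROUP_SYSCTL program along these lines could
reject writes to the knob from inside a container. This is only a sketch;
the program and section names are illustrative, and the helper usage follows
bpf-helpers(7):

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("cgroup/sysctl")
int deny_child_ehash_writes(struct bpf_sysctl *ctx)
{
        char name[64] = {};
        const char knob[] = "net/ipv4/tcp_child_ehash_entries";

        if (!ctx->write)
                return 1;               /* always allow reads */

        /* Full sysctl path, e.g. "net/ipv4/tcp_child_ehash_entries" */
        if (bpf_sysctl_get_name(ctx, name, sizeof(name), 0) < 0)
                return 1;

        if (__builtin_memcmp(name, knob, sizeof(knob) - 1) == 0)
                return 0;               /* reject the write */

        return 1;
}

char _license[] SEC("license") = "GPL";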
Note that the global ehash allocated at the boot time is spread over
available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
pages for each per-netns ehash depending on the current process's NUMA
policy. By default, the allocation is done in the local node only, so
the per-netns hash table could fully reside on a random node. Thus,
depending on the NUMA policy the netns is created with and the CPU the
current thread is running on, we could see some performance differences
for highly optimised networking applications.
Note also that the default values of two sysctl knobs depend on the ehash
size and should be tuned carefully:
tcp_max_tw_buckets : tcp_child_ehash_entries / 2
tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
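For example, if a netns is created with tcp_child_ehash_entries = 131072,
these default to tcp_max_tw_buckets = 65536 and
tcp_max_syn_backlog = max(128, 131072 / 128) = 1024.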
As a bonus, we can dismantle netns faster. Currently, while destroying
netns, we call inet_twsk_purge(), which walks through the global ehash.
It can be potentially big because it can have many sockets other than
TIME_WAIT in all netns. Splitting ehash changes that situation, where
it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
in each netns.
With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
keep it protocol-family-independent.
In the future, we could optimise ehash lookup/iteration further by removing
netns comparison for the per-netns ehash.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While destroying netns, we call inet_twsk_purge() in tcp_sk_exit_batch()
and tcpv6_net_exit_batch() for AF_INET and AF_INET6. These commands
trigger the kernel to walk through the potentially big ehash twice even
though the netns has no TIME_WAIT sockets.
# ip netns add test
# ip netns del test
or
# unshare -n /bin/true >/dev/null
When tw_refcount is 1, we need not call inet_twsk_purge() at least
for the net. We can save such unneeded iterations if all netns in
net_exit_list have no TIME_WAIT sockets. This change eliminates
the tax by the additional unshare() described in the next patch to
guarantee the per-netns ehash size.
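A sketch of the idea, assuming tcp_death_row is embedded in netns_ipv4 at
this point in the series (the helper name is illustrative):

static bool exit_batch_has_tw_socks(struct list_head *net_exit_list)
{
        struct net *net;

        /* tw_refcount starts at 1; anything above means the netns still
         * owns TIME_WAIT sockets and inet_twsk_purge() is needed.
         */
        list_for_each_entry(net, net_exit_list, exit_list)
                if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) > 1)
                        return true;

        return false;
}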
Tested:
# mount -t debugfs none /sys/kernel/debug/
# echo cleanup_net > /sys/kernel/debug/tracing/set_ftrace_filter
# echo inet_twsk_purge >> /sys/kernel/debug/tracing/set_ftrace_filter
# echo function > /sys/kernel/debug/tracing/current_tracer
# cat ./add_del_unshare.sh
for i in `seq 1 40`
do
(for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
done
wait;
# ./add_del_unshare.sh
Before the patch:
# cat /sys/kernel/debug/tracing/trace_pipe
kworker/u128:0-8 [031] ...1. 174.162765: cleanup_net <-process_one_work
kworker/u128:0-8 [031] ...1. 174.240796: inet_twsk_purge <-cleanup_net
kworker/u128:0-8 [032] ...1. 174.244759: inet_twsk_purge <-tcp_sk_exit_batch
kworker/u128:0-8 [034] ...1. 174.290861: cleanup_net <-process_one_work
kworker/u128:0-8 [039] ...1. 175.245027: inet_twsk_purge <-cleanup_net
kworker/u128:0-8 [046] ...1. 175.290541: inet_twsk_purge <-tcp_sk_exit_batch
kworker/u128:0-8 [037] ...1. 175.321046: cleanup_net <-process_one_work
kworker/u128:0-8 [024] ...1. 175.941633: inet_twsk_purge <-cleanup_net
kworker/u128:0-8 [025] ...1. 176.242539: inet_twsk_purge <-tcp_sk_exit_batch
After:
# cat /sys/kernel/debug/tracing/trace_pipe
kworker/u128:0-8 [038] ...1. 428.116174: cleanup_net <-process_one_work
kworker/u128:0-8 [038] ...1. 428.262532: cleanup_net <-process_one_work
kworker/u128:0-8 [030] ...1. 429.292645: cleanup_net <-process_one_work
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will soon introduce an optional per-netns ehash.
This means we cannot use the global sk->sk_prot->h.hashinfo
to fetch a TCP hashinfo.
Instead, set NULL to sk->sk_prot->h.hashinfo for TCP and get
a proper hashinfo from net->ipv4.tcp_death_row.hashinfo.
Note that we need not use sk->sk_prot->h.hashinfo if DCCP is
disabled.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will soon introduce an optional per-netns ehash and access hash
tables via net->ipv4.tcp_death_row->hashinfo instead of &tcp_hashinfo
in most places.
It could harm the fast path because dereferences of two fields in net
and tcp_death_row might incur two extra cache line misses. To save one
dereference, let's place tcp_death_row back in netns_ipv4 and fetch
hashinfo via net->ipv4.tcp_death_row.hashinfo.
Note tcp_death_row was initially placed in netns_ipv4, and commit
fbb8295248 ("tcp: allocate tcp_death_row outside of struct netns_ipv4")
changed it to a pointer so that we can fire TIME_WAIT timers after freeing
net. However, we don't do so after commit 04c494e68a ("Revert "tcp/dccp:
get rid of inet_twsk_purge()""), so we need not define tcp_death_row as a
pointer.
Also, we move refcount_dec_and_test(&tw_refcount) from tcp_sk_exit() to
tcp_sk_exit_batch() as a debug check.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are 2 ways in which a DSA user port may become handled by 2 CPU
ports in a LAG:
(1) its current DSA master joins a LAG
ip link del bond0 && ip link add bond0 type bond mode 802.3ad
ip link set eno2 master bond0
When this happens, all user ports with "eno2" as DSA master get
automatically migrated to "bond0" as DSA master.
(2) it is explicitly configured as such by the user
# Before, the DSA master was eno3
ip link set swp0 type dsa master bond0
The design of this configuration is that the LAG device dynamically
becomes a DSA master through dsa_master_setup() when the first physical
DSA master becomes a LAG slave, and stops being so through
dsa_master_teardown() when the last physical DSA master leaves.
A LAG interface is considered as a valid DSA master only if it contains
existing DSA masters, and no other lower interfaces. Therefore, we
mainly rely on method (1) to enter this configuration.
Each physical DSA master (LAG slave) retains its dev->dsa_ptr for when
it becomes a standalone DSA master again. But the LAG master also has a
dev->dsa_ptr, and this is actually duplicated from one of the physical
LAG slaves, and therefore needs to be balanced when LAG slaves come and
go.
To the switch driver, putting DSA masters in a LAG is seen as putting
their associated CPU ports in a LAG.
We need to prepare cross-chip host FDB notifiers for CPU ports in a LAG,
by calling the driver's ->lag_fdb_add method rather than ->port_fdb_add.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Drivers could refuse to offload a LAG configuration for a variety of
reasons, mainly having to do with its TX type. Additionally, since DSA
masters may now also be LAG interfaces, and this will translate into a
call to port_lag_join on the CPU ports, there may be extra restrictions
there. Propagate the netlink extack to this DSA method in order for
drivers to give a meaningful error message back to the user.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Some DSA switches have multiple CPU ports, which can be used to improve
CPU termination throughput, but DSA, through dsa_tree_setup_cpu_ports(),
sets up only the first one, leading to suboptimal use of hardware.
The desire is to not change the default configuration but to permit the
user to create a dynamic mapping between individual user ports and the
CPU port that they are served by, configurable through rtnetlink. It is
also intended to permit load balancing between CPU ports, and in that
case, the foreseen model is for the DSA master to be a bonding interface
whose lowers are the physical DSA masters.
To that end, we create a struct rtnl_link_ops for DSA user ports with
the "dsa" kind. We expose the IFLA_DSA_MASTER link attribute that
contains the ifindex of the newly desired DSA master.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There is a desire to support DSA masters in a LAG.
That configuration is intended to work by simply enslaving the master to
a bonding/team device. But the physical DSA master (the LAG slave) still
has a dev->dsa_ptr, and that cpu_dp still corresponds to the physical
CPU port.
However, we would like to be able to retrieve the LAG that's the upper
of the physical DSA master. In preparation for that, introduce a helper
called dsa_port_get_master() that replaces all occurrences of the
dp->cpu_dp->master pattern. The distinction between LAG and non-LAG will
be made later within the helper itself.
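A minimal sketch of the helper at this stage; the LAG-aware variant comes
later in the series:

static inline struct net_device *dsa_port_get_master(struct dsa_port *dp)
{
        /* For now this is a plain replacement of dp->cpu_dp->master;
         * once DSA masters can be in a LAG, the LAG upper is returned here.
         */
        return dp->cpu_dp->master;
}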
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Allow offloading L2TPv3 filters by adding flow_rule_match_l2tpv3.
Drivers can extract L2TPv3 specific fields from now on.
Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Allow dissecting the L2TPv3-specific field, which is:
- session ID (32 bits)
L2TPv3 might be transported over IP or over UDP,
this implementation is only about L2TPv3 over IP.
IP protocol carries L2TPv3 when ip_proto is
IPPROTO_L2TP (115).
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There are already a few definitions of arrays containing
MULTICAST_LACPDU_ADDR and the next patch will add one more use. These all
contain the same constant data so define one common instance for all
bonding code.
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is an uninit-value bug in the dgram_sendmsg function in
net/ieee802154/socket.c when the length of the valid data pointed to
by msg->msg_name isn't verified.
We introduce a helper function, ieee802154_sockaddr_check_size(), to
check namelen: first we check that addr_type is present in
ieee802154_addr_sa, then we check namelen according to addr_type
(see the sketch below).
The same check is also applied in raw_bind, dgram_bind and dgram_connect.
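A hedged sketch of the check; the length constant names are illustrative
and the real helper may differ in detail:

static int ieee802154_sockaddr_check_size(struct sockaddr_ieee802154 *daddr,
                                          int len)
{
        /* First make sure addr_type itself is within the passed length */
        if (len < IEEE802154_MIN_NAMELEN)
                return -EINVAL;

        switch (daddr->addr.addr_type) {
        case IEEE802154_ADDR_NONE:
                return 0;
        case IEEE802154_ADDR_SHORT:
                return len < IEEE802154_NAMELEN_SHORT ? -EINVAL : 0;
        case IEEE802154_ADDR_LONG:
                return len < IEEE802154_NAMELEN_LONG ? -EINVAL : 0;
        default:
                return -EINVAL;
        }
}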
Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support direct writes to nf_conn:mark from TC and XDP prog types. This
is useful when applications want to store per-connection metadata. This
is also particularly useful for applications that run both bpf and
iptables/nftables because the latter can trivially access this metadata.
One example use case would be if a bpf prog is responsible for advanced
packet classification and iptables/nftables is later used for routing
due to pre-existing/legacy code.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/ebca06dea366e3e7e861c12f375a548cc4c61108.1662568410.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Recent changes break HCIGETDEVINFO since they change the size of
hci_dev_info.
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Each tc action module has a corresponding net_id, so put net_id directly
into the structure tc_action_ops.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal says:
====================
The following set contains changes for your *net-next* tree:
- make conntrack ignore packets that are delayed (containing
data already acked). The current behaviour to flag them as INVALID
causes more harm than good, let them pass so peer can send an
immediate ACK for the most recent sequence number.
- make conntrack recognize when both peers have sent 'invalid' FINs:
This helps cleaning out stale connections faster for those cases where
conntrack is no longer in sync with the actual connection state.
- Now that DECNET is gone, we don't need to reserve space for DECNET
related information.
- compact common 'find a free port number for the new inbound
connection' code and move it to a helper, then cap number of tries
the new helper will make until it gives up.
- replace various instances of strlcpy with strscpy, from Wolfram Sang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes break HCIGETDEVINFO since they change the size of
hci_dev_info.
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
drivers/net/ethernet/freescale/fec.h
7d650df99d ("net: fec: add pm_qos support on imx6q platform")
40c79ce13b ("net: fec: add stop mode support for imx8 platform")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Almost all nat helpers reserve an expectation port the same way:
try the port indicated by the peer, then move to the next port if that
port is already in use.
We can squash this into a helper.
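The shared logic, roughly; this is a sketch only, and the helper name and
the exact cap on the number of tries are illustrative:

static u16 nat_helper_try_expect_ports(struct nf_conntrack_expect *exp,
                                       u16 port)
{
        int tries;

        for (tries = 0; tries < 128; tries++, port++) {
                int ret;

                /* try the port the peer indicated, then the following ones */
                exp->tuple.dst.u.tcp.port = htons(port);
                ret = nf_ct_expect_related(exp, 0);
                if (ret == 0)
                        return port;
                if (ret != -EBUSY)
                        break;
        }

        return 0;       /* give up, no free port found */
}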
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
As Eric reported, the 'reason' field is not presented when tracing the
kfree_skb event with perf:
$ perf record -e skb:kfree_skb -a sleep 10
$ perf script
ip_defrag 14605 [021] 221.614303: skb:kfree_skb:
skbaddr=0xffff9d2851242700 protocol=34525 location=0xffffffffa39346b1
reason:
The cause seems to be passing a kernel address directly to TP_printk(),
which is not right. As the enum 'skb_drop_reason' is not exported to
user space through TRACE_DEFINE_ENUM(), perf can't get the drop reason
string from the 'reason' field, which is a number.
Therefore, we introduce the macro DEFINE_DROP_REASON(), which is used
to define the trace enum by TRACE_DEFINE_ENUM(). With the help of
DEFINE_DROP_REASON(), now we can remove the auto-generation that we
introduced in commit ec43908dd5
("net: skb: use auto-generation to convert skb drop reason to string"),
and define the string array 'drop_reasons'.
Hmmmm...now we come back to the situation where we have to maintain drop
reasons in both enum skb_drop_reason and DEFINE_DROP_REASON. But they
are both in dropreason.h, which makes it easier.
After this commit, now the format of kfree_skb is like this:
$ cat /tracing/events/skb/kfree_skb/format
name: kfree_skb
ID: 1524
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:void * skbaddr; offset:8; size:8; signed:0;
field:void * location; offset:16; size:8; signed:0;
field:unsigned short protocol; offset:24; size:2; signed:0;
field:enum skb_drop_reason reason; offset:28; size:4; signed:0;
print fmt: "skbaddr=%p protocol=%u location=%p reason: %s", REC->skbaddr, REC->protocol, REC->location, __print_symbolic(REC->reason, { 1, "NOT_SPECIFIED" }, { 2, "NO_SOCKET" } ......
Fixes: ec43908dd5 ("net: skb: use auto-generation to convert skb drop reason to string")
Link: https://lore.kernel.org/netdev/CANn89i+bx0ybvE55iMYf5GJM48WwV1HNpdm9Q6t-HaEstqpCSA@mail.gmail.com/
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move some MACsec infrastructure, such as defines and functions,
in order to avoid code duplication for future drivers which
implement MACsec offload.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ben Ben-Ishay <benishay@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the current MACsec offload implementation, MACsec interfaces share
the same MAC address by default.
Therefore, HW can't distinguish which MACsec interface the traffic
originated from.
The MACsec stack will use skb_metadata_dst to store the SCI value, which
is unique per MACsec interface, and skb_metadata_dst will be used by the
offloading device driver to associate the SKB with the corresponding
offloaded interface (SCI).
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netlink allows specifying allowed ranges for integer types.
Unfortunately, nfnetlink passes integers in big endian, so the existing
NLA_POLICY_MAX() cannot be used.
At the moment, nfnetlink users, such as nf_tables, need to resort to
programmatic checking via helpers such as nft_parse_u32_check().
This is both cumbersome and error prone. This adds NLA_POLICY_MAX_BE
which adds range check support for BE16, BE32 and BE64 integers.
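A hedged usage sketch; the attribute names are made up for illustration and
the exact macro arguments should be treated as an assumption:

/* Accept a big-endian 32-bit attribute only if its host-order value is
 * <= 255; the range is checked by the core netlink policy code instead
 * of a hand-rolled nft_parse_u32_check() call.
 */
static const struct nla_policy example_policy[NFTA_EXAMPLE_MAX + 1] = {
        [NFTA_EXAMPLE_CSUM_OFFSET] = NLA_POLICY_MAX_BE(NLA_U32, 255),
};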
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-09-05
The following pull-request contains BPF updates for your *net-next* tree.
We've added 106 non-merge commits during the last 18 day(s) which contain
a total of 159 files changed, 5225 insertions(+), 1358 deletions(-).
There are two small merge conflicts, resolve them as follows:
1) tools/testing/selftests/bpf/DENYLIST.s390x
Commit 27e23836ce ("selftests/bpf: Add lru_bug to s390x deny list") in
bpf tree was needed to get BPF CI green on s390x, but it conflicted with
newly added tests on bpf-next. Resolve by adding both hunks, result:
[...]
lru_bug # prog 'printk': failed to auto-attach: -524
setget_sockopt # attach unexpected error: -524 (trampoline)
cb_refs # expected error message unexpected error: -524 (trampoline)
cgroup_hierarchical_stats # JIT does not support calling kernel function (kfunc)
htab_update # failed to attach: ERROR: strerror_r(-524)=22 (trampoline)
[...]
2) net/core/filter.c
Commit 1227c1771d ("net: Fix data-races around sysctl_[rw]mem_(max|default).")
from net tree conflicts with commit 29003875bd ("bpf: Change bpf_setsockopt(SOL_SOCKET)
to reuse sk_setsockopt()") from bpf-next tree. Take the code as it is from
bpf-next tree, result:
[...]
if (getopt) {
if (optname == SO_BINDTODEVICE)
return -EINVAL;
return sk_getsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval),
KERNEL_SOCKPTR(optlen));
}
return sk_setsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval), *optlen);
[...]
The main changes are:
1) Add any-context BPF specific memory allocator which is useful in particular for BPF
tracing with bonus of performance equal to full prealloc, from Alexei Starovoitov.
2) Big batch to remove duplicated code from bpf_{get,set}sockopt() helpers as an effort
to reuse the existing core socket code as much as possible, from Martin KaFai Lau.
3) Extend BPF flow dissector for BPF programs to just augment the in-kernel dissector
with custom logic. In other words, allow for partial replacement, from Shmulik Ladkani.
4) Add a new cgroup iterator to BPF with different traversal options, from Hao Luo.
5) Support for BPF to collect hierarchical cgroup statistics efficiently through BPF
integration with the rstat framework, from Yosry Ahmed.
6) Support bpf_{g,s}et_retval() under more BPF cgroup hooks, from Stanislav Fomichev.
7) BPF hash table and local storages fixes under fully preemptible kernel, from Hou Tao.
8) Add various improvements to BPF selftests and libbpf for compilation with gcc BPF
backend, from James Hilliard.
9) Fix verifier helper permissions and reference state management for synchronous
callbacks, from Kumar Kartikeya Dwivedi.
10) Add support for BPF selftest's xskxceiver to also be used against real devices that
support MAC loopback, from Maciej Fijalkowski.
11) Various fixes to the bpf-helpers(7) man page generation script, from Quentin Monnet.
12) Document BPF verifier's tnum_in(tnum_range(), ...) gotchas, from Shung-Hsi Yu.
13) Various minor misc improvements all over the place.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (106 commits)
bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
bpf: Remove usage of kmem_cache from bpf_mem_cache.
bpf: Remove prealloc-only restriction for sleepable bpf programs.
bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
bpf: Remove tracing program restriction on map types
bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
bpf: Add percpu allocation support to bpf_mem_alloc.
bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
bpf: Adjust low/high watermarks in bpf_mem_cache
bpf: Optimize call_rcu in non-preallocated hash map.
bpf: Optimize element count in non-preallocated hash map.
bpf: Relax the requirement to use preallocated hash maps in tracing progs.
samples/bpf: Reduce syscall overhead in map_perf_test.
selftests/bpf: Improve test coverage of test_maps
bpf: Convert hash map to bpf_mem_alloc.
bpf: Introduce any context BPF specific memory allocator.
selftest/bpf: Add test for bpf_getsockopt()
bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt()
bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt()
bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
...
====================
Link: https://lore.kernel.org/r/20220905161136.9150-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This introduces a "Mesh UUID" and an Experimental Feature bit to the
hdev mask, and depending all underlying Mesh functionality on it.
Signed-off-by: Brian Gix <brian.gix@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The patch adds state bits, storage and HCI command chains for sending
and receiving Bluetooth Mesh advertising packets, and delivery to
requesting user space processes. It specifically creates 4 new MGMT
commands and 2 new MGMT events:
MGMT_OP_SET_MESH_RECEIVER - Sets passive scan parameters and a list of
AD Types which will trigger Mesh Packet Received events
MGMT_OP_MESH_READ_FEATURES - Returns information on how many outbound
Mesh packets can be simultaneously queued, and what the currently queued
handles are.
MGMT_OP_MESH_SEND - Command to queue a specific outbound Mesh packet,
with the number of times it should be sent, and the BD Addr to use.
Discrete advertisements are added to the ADV Instance list.
MGMT_OP_MESH_SEND_CANCEL - Command to cancel a prior outbound message
request.
MGMT_EV_MESH_DEVICE_FOUND - Event to deliver entire received Mesh
Advertisement packet, along with timing information.
MGMT_EV_MESH_PACKET_CMPLT - Event to indicate that an outbound packet is
no longer queued for delivery.
Signed-off-by: Brian Gix <brian.gix@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Implement an API function and debugfs file to switch
active links.
Also provide an async version of the API so drivers
can call it in arbitrary contexts, e.g. while in the
authorized callback.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The A-MSDU data needs to be stored per-link and aggregated into a single
value for the station. Add a new struct ieee80211_sta_aggregates in
order to store this data and a new function
ieee80211_sta_recalc_aggregates to update the current data for the STA.
Note that in the non-MLO case the pointer in ieee80211_sta will directly
reference the data in deflink.agg, which means that recalculation may be
skipped in that case.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add macros (and an exported function) to allow checking some
link RCU protected accesses that are happening in callbacks
from mac80211 and are thus under the correct lock.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add a link_id parameter to ieee80211_nullfunc_get() to be
able to obtain a correctly addressed frame.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add a new API function ieee80211_find_sta_by_link_addrs()
that looks up the STA and link ID based on interface and
station link addresses.
We're going to use it for mac80211-hwsim to track on the
AP side which links are active.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In order to let the driver select active links and properly
make multi-link connections, as a first step isolate the
driver from inactive links, and set the active links to be
only the association link for client-side interfaces. For
AP side nothing changes since APs always have to have all
their links active.
To simplify things, update the for_each_sta_active_link()
API to include the appropriate vif pointer.
This also implies not allocating a chanctx for an inactive
link, which requires a few more changes.
Since we now no longer try to program multiple links to the
driver, remove the check in the MLME code.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The SMPS power save mode needs to be per-link rather than being shared
for all links. As such, move it into struct ieee80211_link_sta.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch changes bpf_getsockopt(SOL_IPV6) to reuse
do_ipv6_getsockopt(). It removes the duplicated code from
bpf_getsockopt(SOL_IPV6).
This also makes bpf_getsockopt(SOL_IPV6) supporting the same
set of optnames as in bpf_setsockopt(SOL_IPV6). In particular,
this adds IPV6_AUTOFLOWLABEL support to bpf_getsockopt(SOL_IPV6).
ipv6 could be compiled as a module. Like how other code solved it
with stubs in ipv6_stubs.h, this patch adds the do_ipv6_getsockopt
to the ipv6_bpf_stub.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002931.2896218-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch changes bpf_getsockopt(SOL_IP) to reuse
do_ip_getsockopt() and remove the duplicated code.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002925.2895416-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch changes bpf_getsockopt(SOL_TCP) to reuse
do_tcp_getsockopt(). It removes the duplicated code from
bpf_getsockopt(SOL_TCP).
Before this patch, there were some optnames available to
bpf_setsockopt(SOL_TCP) but missing in bpf_getsockopt(SOL_TCP).
For example, TCP_NODELAY, TCP_MAXSEG, TCP_KEEPIDLE, TCP_KEEPINTVL,
and a few more. It surprises users from time to time. This patch
automatically closes this gap without duplicating more code.
bpf_getsockopt(TCP_SAVED_SYN) does not free the saved_syn,
so it stays in sol_tcp_sockopt().
For string name value like TCP_CONGESTION, bpf expects it
is always null terminated, so sol_tcp_sockopt() decrements
optlen by one before calling do_tcp_getsockopt() and
the 'if (optlen < saved_optlen) memset(..,0,..);'
in __bpf_getsockopt() will always do a null termination.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002918.2894511-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch changes bpf_getsockopt(SOL_SOCKET) to reuse
sk_getsockopt(). It removes all duplicated code from
bpf_getsockopt(SOL_SOCKET).
Before this patch, there were some optnames available to
bpf_setsockopt(SOL_SOCKET) but missing in bpf_getsockopt(SOL_SOCKET).
It surprises users from time to time. For example, SO_REUSEADDR,
SO_KEEPALIVE, SO_RCVLOWAT, and SO_MAX_PACING_RATE. This patch
automatically closes this gap without duplicating more code.
The only exception is SO_BINDTODEVICE because it needs to acquire a
blocking lock. Thus, SO_BINDTODEVICE is not supported.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002912.2894040-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Similar to the earlier patch that changes sk_getsockopt() to
take a sockptr_t argument, this patch also changes
do_ipv6_getsockopt() to take a sockptr_t argument so that
a later patch can make bpf_getsockopt(SOL_IPV6) reuse
do_ipv6_getsockopt().
Note on the change in ip6_mc_msfget(). This function is to
return an array of sockaddr_storage in optval. This function
is shared between ipv6_get_msfilter() and compat_ipv6_get_msfilter().
However, the sockaddr_storage is stored at different offset of the
optval because of the difference between group_filter and
compat_group_filter. Thus, a new 'ss_offset' argument is
added to ip6_mc_msfget().
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002853.2892532-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When CONFIG_IEEE802154_NL802154_EXPERIMENTAL is disabled,
NL802154_CMD_DEL_SEC_LEVEL is undefined and results in a compilation
error:
net/ieee802154/nl802154.c:2503:19: error: 'NL802154_CMD_DEL_SEC_LEVEL' undeclared here (not in a function); did you mean 'NL802154_CMD_SET_CCA_ED_LEVEL'?
2503 | .resv_start_op = NL802154_CMD_DEL_SEC_LEVEL + 1,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
| NL802154_CMD_SET_CCA_ED_LEVEL
Unhide the experimental commands, having them defined in an enum
makes no difference.
Fixes: 9c5d03d362 ("genetlink: start to validate reserved header bytes")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Tested-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Link: https://lore.kernel.org/r/20220902030620.2737091-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Westphal says:
====================
netfilter: bug fixes for net
1. Fix IP address check in irc DCC conntrack helper, this should check
the opposite direction rather than the destination address of the
packets' direction, from David Leadbeater.
2. bridge netfilter needs to drop dst references, from Harsh Modi.
This was fine back in the day the code was originally written,
but nowadays various tunnels can pre-set metadata dsts on packets.
3. Remove nf_conntrack_helper sysctl and the modparam toggle, users
need to explicitly assign the helpers to use via nftables or
iptables. Conntrack helpers, by design, may be used to add dynamic
port redirections to internal machines, so it's necessary to restrict
which hosts/peers are allowed to use them.
It was discovered that improper checking in the irc DCC helper makes
it possible to trigger the 'please do dynamic port forward'
from outside by embedding a 'DCC' in a PING request; if the client
echoes that back, an expectation/port forward gets added.
The auto-assign-for-everything mechanism has been in "please don't do this"
territory since 2012. From Pablo.
4. Fix a memory leak in the netdev hook error unwind path, also from Pablo.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_conntrack_irc: Fix forged IP logic
netfilter: nf_tables: clean up hook list when offload flags check fails
netfilter: br_netfilter: Drop dst references before setting.
netfilter: remove nf_conntrack_helper sysctl and modparam toggles
====================
Link: https://lore.kernel.org/r/20220901071238.3044-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'rxrpc-fixes-20220901' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc fixes
Here are some fixes for AF_RXRPC:
(1) Fix the handling of ICMP/ICMP6 packets. This is a problem due to
rxrpc being switched to acting as a UDP tunnel, thereby allowing it to
steal the packets before they go through the UDP Rx queue. UDP
tunnels can't get ICMP/ICMP6 packets, however. This patch adds an
additional encap hook so that they can.
(2) Fix the encryption routines in rxkad to handle packets that have more
than three parts correctly. The problem is that ->nr_frags doesn't
count the initial fragment, so the sglist ends up too short.
(3) Fix a problem with destruction of the local endpoint potentially
getting repeated.
(4) Fix the calculation of the time at which to resend.
jiffies_to_usecs() gives microseconds, not nanoseconds.
(5) Fix AFS to work out when callback promises and locks expire based on
the time an op was issued rather than the time the first reply packet
arrives. We don't know how long the server took between calculating
the expiry interval and transmitting the reply.
(6) Given (5), rxrpc_get_reply_time() is no longer used, so remove it.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove rxrpc_get_reply_time() as that is no longer used now that the call
issue time is used instead of the reply time.
Signed-off-by: David Howells <dhowells@redhat.com>
Because rxrpc pretends to be a tunnel on top of a UDP/UDP6 socket, allowing
it to siphon off UDP packets early in the handling of received UDP packets
thereby avoiding the packet going through the UDP receive queue, it doesn't
get ICMP packets through the UDP ->sk_error_report() callback. In fact, it
doesn't appear that there's any usable option for getting hold of ICMP
packets.
Fix this by adding a new UDP encap hook to distribute error messages for
UDP tunnels. If the hook is set, then the tunnel driver will be able to
see ICMP packets. The hook provides the offset into the packet of the UDP
header of the original packet that caused the notification.
An alternative would be to call the ->error_handler() hook - but that
requires that the skbuff be cloned (as ip_icmp_error() or ipv6_icmp_error()
do), though cloning isn't really necessary or desirable in rxrpc's case as
we want to parse them there and then, not queue them.
Changes
=======
ver #3)
- Fixed an uninitialised variable.
ver #2)
- Fixed some missing CONFIG_AF_RXRPC_IPV6 conditionals.
Fixes: 5271953cad ("rxrpc: Use the UDP encap_rcv hook")
Signed-off-by: David Howells <dhowells@redhat.com>
Because per host rate limiting has been proven problematic (side channel
attacks can be based on it), per host rate limiting of challenge acks ideally
should be per netns and turned off by default.
This is a long due followup of following commits:
083ae30828 ("tcp: enable per-socket rate limiting of all 'challenge acks'")
f2b2c582e8 ("tcp: mitigate ACK loops for connections as tcp_sock")
75ff39ccc1 ("tcp: make challenge acks less predictable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The variable "other" in the struct red_stats is not used. Remove it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
hci_abort_conn() is a wrapper around a number of DISCONNECT and
CREATE_CONN_CANCEL commands that was being invoked from hci_request
request queues, which are now deprecated. There are two versions:
hci_abort_conn() which can be invoked from the hci_event thread, and
hci_abort_conn_sync() which can be invoked within a hci_sync cmd chain.
Signed-off-by: Brian Gix <brian.gix@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
__nf_ct_try_assign_helper() remains in place but it now requires a
template to configure the helper.
A toggle to disable automatic helper assignment was added by:
a900689264 ("netfilter: nf_ct_helper: allow to disable automatic helper assignment")
in 2012 to address the issues described in "Secure use of iptables and
connection tracking helpers". Automatic conntrack helper assignment was
disabled by:
3bb398d925 ("netfilter: nf_ct_helper: disable automatic helper assignment")
back in 2016.
This patch removes the sysctl and modparam toggles, users now have to
rely on explicit conntrack helper configuration via ruleset.
Update tools/testing/selftests/netfilter/nft_conntrack_helper.sh to
check that auto-assignment does not happen anymore.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Follow-up the removal of unused internal api of port params made by
commit 42ded61aa7 ("devlink: Delete not used port parameters APIs")
and stub the commands and add extack message to tell the user what is
going on.
If port params are needed later on, they could easily be re-introduced,
but until then it is dead code.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20220826082730.1399735-1-jiri@resnulli.us
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Being able to check attribute presence and set an extack message when
it is missing, all in one line, is handy; add helpers.
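A hedged sketch of how a generic netlink op might use such a helper; the
doit function and attribute names are made up for illustration:

static int example_doit(struct sk_buff *skb, struct genl_info *info)
{
        /* Sets "missing attribute" in the extack and evaluates to true
         * when the attribute is absent from the request.
         */
        if (GENL_REQ_ATTR_CHECK(info, EXAMPLE_ATTR_IFINDEX))
                return -EINVAL;

        /* ... use info->attrs[EXAMPLE_ATTR_IFINDEX] ... */
        return 0;
}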
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We had historically not checked that genlmsghdr.reserved
is 0 on input which prevents us from using those precious
bytes in the future.
One use case would be to extend the cmd field, which is
currently just 8 bits wide and 256 is not a lot of commands
for some core families.
To make sure that new families do the right thing by default
put the onus of opting out of validation on existing families.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com> (NetLabel)
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow specifying the xfrm interface if_id and link as part of a route
metadata using the lwtunnel infrastructure.
This allows for example using a single xfrm interface in collect_md
mode as the target of multiple routes each specifying a different if_id.
With the appropriate changes to iproute2, considering an xfrm device
ipsec1 in collect_md mode one can for example add a route specifying
an if_id like so:
ip route add <SUBNET> dev ipsec1 encap xfrm if_id 1
In which case traffic routed to the device via this route would use
if_id in the xfrm interface policy lookup.
Or in the context of vrf, one can also specify the "link" property:
ip route add <SUBNET> dev ipsec1 encap xfrm if_id 1 link_dev eth15
Note: LWT_XFRM_LINK uses NLA_U32 similar to IFLA_XFRM_LINK even though
internally "link" is signed. This is consistent with other _LINK
attributes in other devices as well as in bpf and should not have an
effect as device indexes can't be negative.
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This commit adds support for 'collect_md' mode on xfrm interfaces.
Each net can have one collect_md device, created by providing the
IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be
altered and has no if_id or link device attributes.
On transmit to this device, the if_id is fetched from the attached dst
metadata on the skb. If exists, the link property is also fetched from
the metadata. The dst metadata type used is METADATA_XFRM which holds
these properties.
On the receive side, xfrmi_rcv_cb() populates a dst metadata for each
packet received and attaches it to the skb. The if_id used in this case is
fetched from the xfrm state, and the link is fetched from the incoming
device. This information can later be used by upper layers such as tc,
ebpf, and ip rules.
Because the skb is scrubbed in xfrmi_rcv_cb(), the attachment of the dst
metadata is postponed until after scrubbing. Similarly, xfrm_input() is
adapted to avoid dropping metadata dsts by only dropping 'valid'
(skb_valid_dst(skb) == true) dsts.
Policy matching on packets arriving from collect_md xfrmi devices is
done by using the xfrm state existing in the skb's sec_path.
The xfrm_if_cb.decode_cb() interface implemented by xfrmi_decode_session()
is changed to keep the details of the if_id extraction tucked away
in xfrm_interface.c.
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
XFRM interfaces provide the association of various XFRM transformations
to a netdevice using an 'if_id' identifier common to both the XFRM data
structures (policies, states) and the interface. The if_id is configured by
the controlling entity (usually the IKE daemon) and can be used by the
administrator to define logical relations between different connections.
For example, different connections can share the if_id identifier so
that they pass through the same interface. However, currently it is
not possible for connections using a different if_id to use the same
interface while retaining the logical separation between them, without
using additional criteria such as skb marks or different traffic
selectors.
When having a large number of connections, it is useful to have the
logical separation offered by the if_id identifier but use a single
network interface. Similar to the way collect_md mode is used in IP
tunnels.
This patch attempts to enable different configuration mechanisms - such
as ebpf programs, LWT encapsulations, and TC - to attach metadata
to skbs which would carry the if_id. This way a single xfrm interface in
collect_md mode can demux traffic based on this configuration on tx and
provide this metadata on rx.
The XFRM metadata is somewhat similar to ip tunnel metadata in that it
has an "id", and shares similar configuration entities (bpf, tc, ...),
however, it does not necessarily represent an IP tunnel or use other
ip tunnel information, and also has an optional "link" property which
can be used for affecting underlying routing decisions.
Additional xfrm related criteria may also be added in the future.
Therefore, a new metadata type is introduced, to be used in subsequent
patches in the xfrm interface and configuration entities.
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Daniel Borkmann says:
====================
The following pull-request contains BPF updates for your *net* tree.
We've added 11 non-merge commits during the last 14 day(s) which contain
a total of 13 files changed, 61 insertions(+), 24 deletions(-).
The main changes are:
1) Fix BPF verifier's precision tracking around BPF ring buffer, from Kumar Kartikeya Dwivedi.
2) Fix regression in tunnel key infra when passing FLOWI_FLAG_ANYSRC, from Eyal Birger.
3) Fix insufficient permissions for bpf_sys_bpf() helper, from YiFei Zhu.
4) Fix splat from hitting BUG when purging effective cgroup programs, from Pu Lehui.
5) Fix range tracking for array poke descriptors, from Daniel Borkmann.
6) Fix corrupted packets for XDP_SHARED_UMEM in aligned mode, from Magnus Karlsson.
7) Fix NULL pointer splat in BPF sockmap sk_msg_recvmsg(), from Liu Jian.
8) Add READ_ONCE() to bpf_jit_limit when reading from sysctl, from Kuniyuki Iwashima.
9) Add BPF selftest lru_bug check to s390x deny list, from Daniel Müller.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The memory allocated by using kzalloc_node and kcalloc has been cleared.
Therefore, the structure members of the new qdisc are 0. So there's no
need to explicitly assign a value of 0.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'wireless-next-2022-08-26-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Various updates:
* rtw88: operation, locking, warning, and code style fixes
* rtw89: small updates
* cfg80211/mac80211: more EHT/MLO (802.11be, WiFi 7) work
* brcmfmac: a couple of fixes
* misc cleanups etc.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
hci_update_adv_data() is called from hci_event and hci_core due to
events from the controller. The prior function used the deprecated
hci_request method, and the new one uses hci_sync.c
Signed-off-by: Brian Gix <brian.gix@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This function has no dependencies on the deprecated hci_request
mechanism, so has been moved unchanged to hci_sync.c
Signed-off-by: Brian Gix <brian.gix@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The Advertising Instance expiration timer adv_instance_expire was
handled with the deprecated hci_request mechanism, rather than its
replacement: hci_sync.
Signed-off-by: Brian Gix <brian.gix@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Limit the acceptance of component name passed to cmd_flash_update() to
match one of the versions returned by info_get(), marked by version type.
This makes things clearer and enforces 1:1 mapping between exposed
version and accepted flash component.
Check VERSION_TYPE_COMPONENT version type during cmd_flash_update()
execution by calling info_get() with different "req" context.
That causes info_get() to lookup the component name instead of
filling-up the netlink message.
Remove "UPDATE_COMPONENT" flag which becomes used.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Whenever the driver is called through its info_get() op, it may put multiple
version names and values into the netlink message. Extend this by an
additional helper, devlink_info_version_running/stored_put_ext(), that
allows specifying a version type indicating when a particular version name
represents a flash component.
This is going to be used in a follow-up patch that calls info_get() during
the flash update command to check whether a version with this version type
exists.
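A hedged sketch of a driver's info_get() using the extended helper; the
driver function, version name and value are made up, and the exact helper
signature should be treated as an assumption:

static int example_info_get(struct devlink *devlink,
                            struct devlink_info_req *req,
                            struct netlink_ext_ack *extack)
{
        /* Mark "fw.mgmt" as a flash component: flash update may then only
         * be requested for component names exposed this way.
         */
        return devlink_info_version_running_put_ext(req, "fw.mgmt", "1.2.3",
                                        DEVLINK_INFO_VERSION_TYPE_COMPONENT);
}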
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
qdisc_reset() is clearing qdisc->q.qlen and qdisc->qstats.backlog
_after_ calling qdisc->ops->reset. There is no need to clear them
again in the specific reset function.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20220824005231.345727-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
To help drivers, e.g. if they have a lookup of the link_sta
pointer, add the link ID to the link_sta structure.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In MLO, when the address translation from link to MLD is done
in fw/hw, it is necessary to be able to have some information
on the link on which the frame has been received. Extend the
rx API to include link_id and a valid flag in ieee80211_rx_status.
Also make changes to the mac80211 rx APIs to make use of the reported
link_id after sanity checks.
Signed-off-by: Vasanthakumar Thiagarajan <quic_vthiagar@quicinc.com>
Link: https://lore.kernel.org/r/20220817104213.2531-2-quic_vthiagar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Implement key installation and lookup (on TX and RX)
for MLO, so we can use multiple GTKs/IGTKs/BIGTKs.
Co-authored-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for various key operations on MLD by adding new parameter
link_id. Pass the link_id received from userspace to driver for add_key,
get_key, del_key, set_default_key, set_default_mgmt_key and
set_default_beacon_key to support configuring keys specific to each MLO
link. Userspace must not specify link ID for MLO pairwise key since it
is common for all the MLO links.
Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
Link: https://lore.kernel.org/r/20220730052643.1959111-4-quic_vjakkam@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The Tx queue parameters are per link, so add the link ID
from nl80211 parameters to the API.
While at it, lock the wdev when calling into the driver
so it (and we) can check the link ID appropriately.
Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Introduce a simple helper function to replace a common pattern.
When accessing the GRO header, we fetch the pointer from frag0,
then test its validity and fetch it from the skb when necessary.
This leads to the pattern
skb_gro_header_fast -> skb_gro_header_hard -> skb_gro_header_slow
recurring many times throughout GRO code.
This patch replaces these patterns with a single inlined function
call, improving code readability.
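The helper boils down to something like this; a sketch based on the
description above, so the exact implementation may differ:

static inline void *skb_gro_header(struct sk_buff *skb, unsigned int hlen,
                                   unsigned int offset)
{
        void *ptr;

        /* Fast path: the header is still available in frag0 */
        ptr = skb_gro_header_fast(skb, offset);
        if (skb_gro_header_hard(skb, hlen))
                ptr = skb_gro_header_slow(skb, hlen, offset);  /* may pull */

        return ptr;
}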
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220823071034.GA56142@debian
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The current bind hashtable (bhash) is hashed by port only.
In the socket bind path, we have to check for bind conflicts by
traversing the specified port's inet_bind_bucket while holding the
hashbucket's spinlock (see inet_csk_get_port() and
inet_csk_bind_conflict()). In instances where there are tons of
sockets hashed to the same port at different addresses, the bind
conflict check is time-intensive and can cause softirq cpu lockups,
as well as stops new tcp connections since __inet_inherit_port()
also contests for the spinlock.
This patch adds a second bind table, bhash2, that hashes by
port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
Searching the bhash2 table leads to significantly faster conflict
resolution and less time holding the hashbucket spinlock.
Please note a few things:
* There can be the case where a socket's address changes after it
has been bound. There are two cases where this happens:
1) The case where there is a bind() call on INADDR_ANY (ipv4) or
IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
assign the socket an address when it handles the connect()
2) In inet_sk_reselect_saddr(), which is called when rebuilding the
sk header and a few pre-conditions are met (eg rerouting fails).
In these two cases, we need to update the bhash2 table by removing the
entry for the old address, and add a new entry reflecting the updated
address.
* The bhash2 table must have its own lock, even though concurrent
accesses on the same port are protected by the bhash lock. Bhash2 must
have its own lock to protect against cases where sockets on different
ports hash to different bhash hashbuckets but to the same bhash2
hashbucket.
This brings up a few stipulations:
1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
will always be acquired after the bhash lock and released before the
bhash lock is released.
2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
acquired+released before another bhash2 lock is acquired+released.
* The bhash table cannot be superseded by the bhash2 table because for
bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
bound to that port must be checked for a potential conflict. The bhash
table is the only source of port->socket associations.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Fix crash with malformed ebtables blob which do not provide all
entry points, from Florian Westphal.
2) Fix possible TCP connection clogging up with default 5-days
timeout in conntrack, from Florian.
3) Fix crash in nf_tables tproxy with unsupported chains, also from Florian.
4) Do not allow to update implicit chains.
5) Make table handle allocation per-netns to fix data race.
6) Do not truncate payload length and offset, and checksum offset.
Instead report EINVAL.
7) Enable chain stats update via static key iff no error occurs.
8) Restrict osf expression to ip, ip6 and inet families.
9) Restrict tunnel expression to netdev family.
10) Fix crash when trying to bind again an already bound chain.
11) Flowtable garbage collector might leave behind pending work to
delete entries. This patch comes with a previous preparation patch
as dependency.
12) Allow net.netfilter.nf_conntrack_frag6_high_thresh to be lowered,
from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_defrag_ipv6: allow nf_conntrack_frag6_high_thresh increases
netfilter: flowtable: fix stuck flows on cleanup due to pending work
netfilter: flowtable: add function to invoke garbage collection immediately
netfilter: nf_tables: disallow binding to already bound chain
netfilter: nft_tunnel: restrict it to netdev family
netfilter: nft_osf: restrict osf to ipv4, ipv6 and inet families
netfilter: nf_tables: do not leave chain stats enabled on error
netfilter: nft_payload: do not truncate csum_offset and csum_type
netfilter: nft_payload: report ERANGE for too long offset and length
netfilter: nf_tables: make table handle allocation per-netns friendly
netfilter: nf_tables: disallow updates of implicit chain
netfilter: nft_tproxy: restrict to prerouting hook
netfilter: conntrack: work around exceeded receive window
netfilter: ebtables: reject blobs that don't provide all entry points
====================
Link: https://lore.kernel.org/r/20220824220330.64283-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While reading gro_normal_batch, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
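A minimal sketch of the annotation, with illustrative surrounding context
from the batching check:

/* Before: a plain load may be torn or refetched while the sysctl changes. */
if (napi->rx_count >= gro_normal_batch)
        gro_normal_list(napi);

/* After: annotate the lockless read of the sysctl. */
if (napi->rx_count >= READ_ONCE(gro_normal_batch))
        gro_normal_list(napi);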
Fixes: 323ebb61e3 ("net: use listified RX for handling GRO_NORMAL skbs")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_net_busy_poll, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 0602129286 ("net: add low latency socket poll")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To clear the flow table on flow table free, the following sequence
normally happens in order:
1) gc_step work is stopped to disable any further stats/del requests.
2) All flow table entries are set to teardown state.
3) Run gc_step which will queue HW del work for each flow table entry.
4) Waiting for the above del work to finish (flush).
5) Run gc_step again, deleting all entries from the flow table.
6) Flow table is freed.
But if a flow table entry already has pending HW stats or HW add work,
step 3 will not queue HW del work (it will be skipped), step 4 will wait
for the pending add/stats to finish, and step 5 will queue HW del work
which might execute after the flow table has been freed.
To fix the above, this patch flushes the pending work, then it sets the
teardown flag to all flows in the flowtable and it forces a garbage
collector run to queue work to remove the flows from hardware, then it
flushes this new pending work and (finally) it forces another garbage
collector run to remove the entry from the software flowtable.
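A minimal sketch of the resulting teardown order; the helper names below
are illustrative, not the exact functions touched by the patch:

static void flowtable_free_sketch(struct nf_flowtable *flow_table)
{
        cancel_delayed_work_sync(&flow_table->gc_work); /* stop periodic gc */
        flowtable_offload_flush(flow_table);      /* flush pending add/stats work */
        flowtable_set_teardown_all(flow_table);   /* mark every flow for teardown */
        flowtable_gc_run(flow_table);             /* queue HW del work */
        flowtable_offload_flush(flow_table);      /* flush the del work */
        flowtable_gc_run(flow_table);             /* drop entries from the sw table */
}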
Stack trace:
[47773.882335] BUG: KASAN: use-after-free in down_read+0x99/0x460
[47773.883634] Write of size 8 at addr ffff888103b45aa8 by task kworker/u20:6/543704
[47773.885634] CPU: 3 PID: 543704 Comm: kworker/u20:6 Not tainted 5.12.0-rc7+ #2
[47773.886745] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)
[47773.888438] Workqueue: nf_ft_offload_del flow_offload_work_handler [nf_flow_table]
[47773.889727] Call Trace:
[47773.890214] dump_stack+0xbb/0x107
[47773.890818] print_address_description.constprop.0+0x18/0x140
[47773.892990] kasan_report.cold+0x7c/0xd8
[47773.894459] kasan_check_range+0x145/0x1a0
[47773.895174] down_read+0x99/0x460
[47773.899706] nf_flow_offload_tuple+0x24f/0x3c0 [nf_flow_table]
[47773.907137] flow_offload_work_handler+0x72d/0xbe0 [nf_flow_table]
[47773.913372] process_one_work+0x8ac/0x14e0
[47773.921325]
[47773.921325] Allocated by task 592159:
[47773.922031] kasan_save_stack+0x1b/0x40
[47773.922730] __kasan_kmalloc+0x7a/0x90
[47773.923411] tcf_ct_flow_table_get+0x3cb/0x1230 [act_ct]
[47773.924363] tcf_ct_init+0x71c/0x1156 [act_ct]
[47773.925207] tcf_action_init_1+0x45b/0x700
[47773.925987] tcf_action_init+0x453/0x6b0
[47773.926692] tcf_exts_validate+0x3d0/0x600
[47773.927419] fl_change+0x757/0x4a51 [cls_flower]
[47773.928227] tc_new_tfilter+0x89a/0x2070
[47773.936652]
[47773.936652] Freed by task 543704:
[47773.937303] kasan_save_stack+0x1b/0x40
[47773.938039] kasan_set_track+0x1c/0x30
[47773.938731] kasan_set_free_info+0x20/0x30
[47773.939467] __kasan_slab_free+0xe7/0x120
[47773.940194] slab_free_freelist_hook+0x86/0x190
[47773.941038] kfree+0xce/0x3a0
[47773.941644] tcf_ct_flow_table_cleanup_work
Original patch description and stack trace by Paul Blakey.
Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Reported-by: Paul Blakey <paulb@nvidia.com>
Tested-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
More logic will be added to dsa_tree_setup_master() and
dsa_tree_teardown_master() in upcoming changes.
Reduce the indentation by one level in these functions by introducing
and using a dedicated iterator for CPU ports of a tree.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This adds 'vsock_data_ready()', which must be called by the transport to kick
sleeping data readers. It checks the SO_RCVLOWAT value before waking the
user, thus preventing spurious wake-ups. It is based on 'tcp_data_ready()' logic.
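A minimal sketch of the helper, modelled on tcp_data_ready(); the exact
data-availability check depends on the transport:

void vsock_data_ready(struct sock *sk)
{
        struct vsock_sock *vsk = vsock_sk(sk);

        if (vsock_stream_has_data(vsk) >= sk->sk_rcvlowat ||
            sock_flag(sk, SOCK_DONE))
                sk->sk_data_ready(sk);
}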
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This adds a transport-specific callback for SO_RCVLOWAT, because in some
transports it may be difficult to know the current number of bytes available
to read. Thus, when SO_RCVLOWAT is set, the transport may reject it.
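A minimal sketch of how the callback could be wired into the SO_RCVLOWAT
handler; the callback name and the fallback path are assumptions:

static int vsock_set_rcvlowat(struct sock *sk, int val)
{
        const struct vsock_transport *transport;
        struct vsock_sock *vsk = vsock_sk(sk);

        if (val > vsk->buffer_size)
                return -EINVAL;

        transport = vsk->transport;

        /* Let the transport handle (or reject) the new value itself. */
        if (transport && transport->set_rcvlowat)
                return transport->set_rcvlowat(vsk, val);

        WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
        return 0;
}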
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The value is only ever set once in bond_3ad_initialize and only ever
read otherwise. There seems to be no reason to set the variable via
bond_3ad_initialize when setting the global variable will do. Change
ad_ticks_per_sec to a const to enforce its read-only usage.
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
DECnet is an obsolete network protocol that receives more attention
from kernel janitors than users. It belongs in a computer protocol
history museum, not in the Linux kernel.
It has been "Orphaned" in the kernel since 2010. The iproute2 support
for DECnet was dropped in the 5.0 release. The documentation link on
SourceForge says it is abandoned there as well.
Leave the UAPI alone to keep userspace programs compiling.
This means that there is still an empty neighbour table
for AF_DECNET.
The table of /proc/sys/net entries was updated to match
current directories and reformatted to be alphabetical.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: David Ahern <dsahern@kernel.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expose the MACSEC_SALT_LEN definition to user space so it can be
used in user space applications such as iproute, which will use it
as part of adding macsec extended packet number support.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Link: https://lore.kernel.org/r/20220818153229.4721-1-ehakim@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After the prep work in the previous patches,
this patch removes the dup code from bpf_setsockopt(SOL_IPV6)
and reuses the implementation in do_ipv6_setsockopt().
ipv6 could be compiled as a module. Like how other code solved it
with stubs in ipv6_stubs.h, this patch adds the do_ipv6_setsockopt
to the ipv6_bpf_stub.
The current bpf_setsockopt(IPV6_TCLASS) does not take
INET_ECN_MASK into account for tcp. The
do_ipv6_setsockopt(IPV6_TCLASS) will handle it correctly.
The existing optname white-list is refactored into a new
function sol_ipv6_setsockopt().
After this last SOL_IPV6 dup code removal, the __bpf_setsockopt()
is simplified enough that the extra "{ }" around the if statement
can be removed.
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061834.4181198-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After the prep work in the previous patches,
this patch removes the dup code from bpf_setsockopt(SOL_IP)
and reuses the implementation in do_ip_setsockopt().
The existing optname white-list is refactored into a new
function sol_ip_setsockopt().
NOTE,
the current bpf_setsockopt(IP_TOS) is quite different from
do_ip_setsockopt(IP_TOS). For example, it does not take
INET_ECN_MASK into account for tcp and also does not adjust
sk->sk_priority. It looks like the current bpf_setsockopt(IP_TOS)
was referencing the IPV6_TCLASS implementation instead of IP_TOS.
This patch tries to rectify that by using the do_ip_setsockopt(IP_TOS).
While this is a behavior change, the do_ip_setsockopt(IP_TOS) behavior
is arguably what the user is expecting. At least, the INET_ECN_MASK bits
should be masked out for tcp.
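A minimal sketch of the ECN masking performed by do_ip_setsockopt(IP_TOS)
for tcp sockets, which the old bpf path skipped; field access is simplified:

/* Do not let a TOS update clobber the ECN bits of a TCP socket. */
if (sk->sk_type == SOCK_STREAM) {
        val &= ~INET_ECN_MASK;
        val |= inet_sk(sk)->tos & INET_ECN_MASK;
}
if (inet_sk(sk)->tos != val) {
        inet_sk(sk)->tos = val;
        sk->sk_priority = rt_tos2priority(val);
        sk_dst_reset(sk);
}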
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061826.4180990-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After the prep work in the previous patches,
this patch removes all the dup code from bpf_setsockopt(SOL_TCP)
and reuses the do_tcp_setsockopt().
The existing optname white-list is refactored into a new
function sol_tcp_setsockopt(). The sol_tcp_setsockopt()
also calls the bpf_sol_tcp_setsockopt() to handle
the TCP_BPF_XXX specific optnames.
bpf_setsockopt(TCP_SAVE_SYN) now also allows a value of 2 to
save the eth header as well, and this comes for free from
do_tcp_setsockopt().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061819.4180146-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After the prep work in the previous patches,
this patch removes most of the dup code from bpf_setsockopt(SOL_SOCKET)
and reuses the implementation in sk_setsockopt().
The sock ptr test is added to the SO_RCVLOWAT because
the sk->sk_socket could be NULL in some of the bpf hooks.
The existing optname white-list is refactored into a new
function sol_socket_setsockopt().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061804.4178920-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When a bpf program calls bpf_setsockopt(SOL_SOCKET),
it could be running in softirq context, where it does not make sense to do
the capable check. There was a similar situation in bpf_setsockopt(TCP_CONGESTION).
In commit 8d650cdeda ("tcp: fix tcp_set_congestion_control() use from bpf hook"),
tcp_set_congestion_control(..., cap_net_admin) was added to skip
the cap check for bpf prog.
This patch adds sockopt_ns_capable() and sockopt_capable() for
sk_setsockopt() to use. They check has_current_bpf_ctx()
before doing the ns_capable() and capable() tests.
They are exported with EXPORT_SYMBOL for the ipv6 module to use in a later patch.
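A minimal sketch of the two helpers; a bpf program was already vetted when
it was loaded, so the capability check is skipped in that context:

bool sockopt_ns_capable(struct user_namespace *ns, int cap)
{
        return has_current_bpf_ctx() || ns_capable(ns, cap);
}

bool sockopt_capable(int cap)
{
        return has_current_bpf_ctx() || capable(cap);
}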
Suggested-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061723.4175820-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
sk_setsockopt(). The number of supported optnames keeps
increasing, and so does the duplicated code.
One issue in reusing sk_setsockopt() is that the bpf prog
has already acquired the sk lock. This patch adds a
has_current_bpf_ctx() to tell if the sk_setsockopt() is called from
a bpf prog. The bpf prog calling bpf_setsockopt() is either running
in_task() or in_serving_softirq(). Both cases have the current->bpf_ctx
initialized. Thus, the has_current_bpf_ctx() only needs to
test !!current->bpf_ctx.
This patch also adds sockopt_{lock,release}_sock() helpers
for sk_setsockopt() to use. These helpers test
has_current_bpf_ctx() before acquiring/releasing the lock. They are
exported with EXPORT_SYMBOL for the ipv6 module to use in a later patch.
Note on the change in sock_setbindtodevice(). sockopt_lock_sock()
is done in sock_setbindtodevice() instead of doing the lock_sock
in sock_bindtoindex(..., lock_sk = true).
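A minimal sketch of the helpers described above:

/* A bpf prog always runs with current->bpf_ctx set. */
static inline bool has_current_bpf_ctx(void)
{
        return !!current->bpf_ctx;
}

void sockopt_lock_sock(struct sock *sk)
{
        /* When called from a bpf prog, the sk lock is already held. */
        if (has_current_bpf_ctx())
                return;

        lock_sock(sk);
}

void sockopt_release_sock(struct sock *sk)
{
        if (has_current_bpf_ctx())
                return;

        release_sock(sk);
}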
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061717.4175589-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 451ef36bd2 ("ip_tunnels: Add new flow flags field to ip_tunnel_key")
added a "flow_flags" member to struct ip_tunnel_key which was later used by
the commit in the fixes tag to avoid dropping packets with sources that
aren't locally configured when set in bpf_set_tunnel_key().
VXLAN and GENEVE were made to respect this flag; ip tunnels like IPIP and GRE
were not.
This commit fixes this omission by making ip_tunnel_init_flow() receive
the flow flags from the tunnel key in the relevant collect_md paths.
Fixes: b8fff74852 ("bpf: Set flow flag to allow any source IP in bpf_tunnel_key")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Paul Chaignon <paul@isovalent.com>
Link: https://lore.kernel.org/bpf/20220818074118.726639-1-eyal.birger@gmail.com
Andrii Nakryiko says:
====================
bpf-next 2022-08-17
We've added 45 non-merge commits during the last 14 day(s) which contain
a total of 61 files changed, 986 insertions(+), 372 deletions(-).
The main changes are:
1) New bpf_ktime_get_tai_ns() BPF helper to access CLOCK_TAI, from Kurt
Kanzenbach and Jesper Dangaard Brouer.
2) Few clean ups and improvements for libbpf 1.0, from Andrii Nakryiko.
3) Expose crash_kexec() as kfunc for BPF programs, from Artem Savkov.
4) Add ability to define sleepable-only kfuncs, from Benjamin Tissoires.
5) Teach libbpf's bpf_prog_load() and bpf_map_create() to gracefully handle
unsupported names on old kernels, from Hangbin Liu.
6) Allow opting out from auto-attaching BPF programs by libbpf's BPF skeleton,
from Hao Luo.
7) Relax libbpf's requirement for shared libs to be marked executable, from
Hengqi Chen.
8) Improve bpf_iter internals handling of error returns, from Hao Luo.
9) Few accommodations in libbpf to support GCC-BPF quirks, from James Hilliard.
10) Fix BPF verifier logic around tracking dynptr ref_obj_id, from Joanne Koong.
11) bpftool improvements to handle full BPF program names better, from Manu
Bretelle.
12) bpftool fixes around libcap use, from Quentin Monnet.
13) BPF map internals clean ups and improvements around memory allocations,
from Yafang Shao.
14) Allow to use cgroup_get_from_file() on cgroupv1, allowing BPF cgroup
iterator to work on cgroupv1, from Yosry Ahmed.
15) BPF verifier internal clean ups, from Dave Marchevsky and Joanne Koong.
16) Various fixes and clean ups for selftests/bpf and vmtest.sh, from Daniel
Xu, Artem Savkov, Joanne Koong, Andrii Nakryiko, Shibin Koikkara Reeny.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
selftests/bpf: Few fixes for selftests/bpf built in release mode
libbpf: Clean up deprecated and legacy aliases
libbpf: Streamline bpf_attr and perf_event_attr initialization
libbpf: Fix potential NULL dereference when parsing ELF
selftests/bpf: Tests libbpf autoattach APIs
libbpf: Allows disabling auto attach
selftests/bpf: Fix attach point for non-x86 arches in test_progs/lsm
libbpf: Making bpf_prog_load() ignore name if kernel doesn't support
selftests/bpf: Update CI kconfig
selftests/bpf: Add connmark read test
selftests/bpf: Add existing connection bpf_*_ct_lookup() test
bpftool: Clear errno after libcap's checks
bpf: Clear up confusion in bpf_skb_adjust_room()'s documentation
bpftool: Fix a typo in a comment
libbpf: Add names for auxiliary maps
bpf: Use bpf_map_area_alloc consistently on bpf map creation
bpf: Make __GFP_NOWARN consistent in bpf map creation
bpf: Use bpf_map_area_free instread of kvfree
bpf: Remove unneeded memset in queue_stack_map creation
libbpf: preserve errno across pr_warn/pr_info/pr_debug
...
====================
Link: https://lore.kernel.org/r/20220817215656.1180215-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Westphal says:
====================
netfilter: conntrack and nf_tables bug fixes
The following patchset contains netfilter fixes for net.
Broken since 5.19:
A few ancient connection tracking helpers assume TCP packets cannot
exceed 64kb in size, but this isn't the case anymore with 5.19 when
BIG TCP got merged, from myself.
Regressions since 5.19:
1. 'conntrack -E expect' won't display anything because nfnetlink failed
to enable events for expectations, only for normal conntrack events.
2. partially revert change that added resched calls to a function that can
be in atomic context. Both broken and fixed up by myself.
Broken for several releases (up to original merge of nf_tables):
Several fixes for nf_tables control plane, from Pablo.
This fixes up resource leaks in error paths and adds more sanity
checks for mutually exclusive attributes/flags.
Kconfig:
NF_CONNTRACK_PROCFS is very old and doesn't provide all info provided
via ctnetlink, so it should not default to y. From Geert Uytterhoeven.
Selftests:
rework nft_flowtable.sh: it frequently indicated failure; the way it
tried to detect an offload failure did not work reliably.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
testing: selftests: nft_flowtable.sh: rework test to detect offload failure
testing: selftests: nft_flowtable.sh: use random netns names
netfilter: conntrack: NF_CONNTRACK_PROCFS should no longer default to y
netfilter: nf_tables: check NFT_SET_CONCAT flag if field_count is specified
netfilter: nf_tables: disallow NFT_SET_ELEM_CATCHALL and NFT_SET_ELEM_INTERVAL_END
netfilter: nf_tables: NFTA_SET_ELEM_KEY_END requires concat and interval flags
netfilter: nf_tables: validate NFTA_SET_ELEM_OBJREF based on NFT_SET_OBJECT flag
netfilter: nf_tables: really skip inactive sets when allocating name
netfilter: nfnetlink: re-enable conntrack expectation events
netfilter: nf_tables: fix scheduling-while-atomic splat
netfilter: nf_ct_irc: cap packet search space to 4k
netfilter: nf_ct_ftp: prefer skb_linearize
netfilter: nf_ct_h323: cap packet size at 64k
netfilter: nf_ct_sane: remove pseudo skb linearization
netfilter: nf_tables: possible module reference underflow in error path
netfilter: nf_tables: disallow NFTA_SET_ELEM_KEY_END with NFT_SET_ELEM_INTERVAL_END flag
netfilter: nf_tables: use READ_ONCE and WRITE_ONCE for shared generation id access
====================
Link: https://lore.kernel.org/r/20220817140015.25843-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bpf_sk_reuseport_detach() calls __rcu_dereference_sk_user_data_with_flags()
to obtain the value of sk->sk_user_data, but that function is only usable
if the RCU read lock is held, and neither that function nor any of its
callers hold it.
Fix this by adding a new helper, __locked_read_sk_user_data_with_flags(),
which checks that sk->sk_callback_lock is held, and use that here
instead.
Alternatively, making __rcu_dereference_sk_user_data_with_flags() use
rcu_dereference_checked() might suffice.
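A minimal sketch of the new helper; it follows the existing sk_user_data
flag scheme, with the lockdep expression being the key difference from the
RCU-based variant:

static inline void *
__locked_read_sk_user_data_with_flags(const struct sock *sk,
                                      uintptr_t flags)
{
        uintptr_t sk_user_data =
                (uintptr_t)rcu_dereference_check(__sk_user_data(sk),
                                                 lockdep_is_held(&sk->sk_callback_lock));

        WARN_ON_ONCE(flags & SK_USER_DATA_PTRMASK);

        if ((sk_user_data & flags) == flags)
                return (void *)(sk_user_data & SK_USER_DATA_PTRMASK);
        return NULL;
}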
Without this, the following warning can be occasionally observed:
=============================
WARNING: suspicious RCU usage
6.0.0-rc1-build2+ #563 Not tainted
-----------------------------
include/net/sock.h:592 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
5 locks held by locktest/29873:
#0: ffff88812734b550 (&sb->s_type->i_mutex_key#9){+.+.}-{3:3}, at: __sock_release+0x77/0x121
#1: ffff88812f5621b0 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_close+0x1c/0x70
#2: ffff88810312f5c8 (&h->lhash2[i].lock){+.+.}-{2:2}, at: inet_unhash+0x76/0x1c0
#3: ffffffff83768bb8 (reuseport_lock){+...}-{2:2}, at: reuseport_detach_sock+0x18/0xdd
#4: ffff88812f562438 (clock-AF_INET){++..}-{2:2}, at: bpf_sk_reuseport_detach+0x24/0xa4
stack backtrace:
CPU: 1 PID: 29873 Comm: locktest Not tainted 6.0.0-rc1-build2+ #563
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
<TASK>
dump_stack_lvl+0x4c/0x5f
bpf_sk_reuseport_detach+0x6d/0xa4
reuseport_detach_sock+0x75/0xdd
inet_unhash+0xa5/0x1c0
tcp_set_state+0x169/0x20f
? lockdep_sock_is_held+0x3a/0x3a
? __lock_release.isra.0+0x13e/0x220
? reacquire_held_locks+0x1bb/0x1bb
? hlock_class+0x31/0x96
? mark_lock+0x9e/0x1af
__tcp_close+0x50/0x4b6
tcp_close+0x28/0x70
inet_release+0x8e/0xa7
__sock_release+0x95/0x121
sock_close+0x14/0x17
__fput+0x20f/0x36a
task_work_run+0xa3/0xcc
exit_to_user_mode_prepare+0x9c/0x14d
syscall_exit_to_user_mode+0x18/0x44
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: cf8c1e9672 ("net: refactor bpf_sk_reuseport_detach()")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hawkins Jiawei <yin31149@gmail.com>
Link: https://lore.kernel.org/r/166064248071.3502205.10036394558814861778.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Right now we have a neigh parameter, PROXY_QLEN, which specifies the maximum
length of neigh_table->proxy_queue. But in fact, this limitation doesn't work
well because the check condition looks like:
tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)
The problem is that p (struct neigh_parms) is a per-device thing,
but tbl (struct neigh_table) is a system-wide global thing.
It seems reasonable to make the proxy_queue limit per-device.
v2:
- nothing changed in this patch
v3:
- rebase to net tree
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: kernel@openvz.org
Cc: devel@openvz.org
Suggested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A little longer PR than usual but it's all fixes, no late features.
It's long partially because of timing, and partially because of
follow ups to stuff that got merged a week or so before the merge
window and wasn't as widely tested. Maybe the Bluetooth fixes are
a little alarming so we'll address that, but the rest seems okay
and not scary.
Notably we're including a fix for the netfilter Kconfig [1], your
WiFi warning [2] and a bluetooth fix which should unblock syzbot [3].
Current release - regressions:
- Bluetooth:
- don't try to cancel uninitialized works [3]
- L2CAP: fix use-after-free caused by l2cap_chan_put
- tls: rx: fix device offload after recent rework
- devlink: fix UAF on failed reload and leftover locks in mlxsw
Current release - new code bugs:
- netfilter:
- flowtable: fix incorrect Kconfig dependencies [1]
- nf_tables: fix crash when nf_trace is enabled
- bpf:
- use proper target btf when exporting attach_btf_obj_id
- arm64: fixes for bpf trampoline support
- Bluetooth:
- ISO: unlock on error path in iso_sock_setsockopt()
- ISO: fix info leak in iso_sock_getsockopt()
- ISO: fix iso_sock_getsockopt for BT_DEFER_SETUP
- ISO: fix memory corruption on iso_pinfo.base
- ISO: fix not using the correct QoS
- hci_conn: fix updating ISO QoS PHY
- phy: dp83867: fix get nvmem cell fail
Previous releases - regressions:
- wifi: cfg80211: fix validating BSS pointers in
__cfg80211_connect_result [2]
- atm: bring back zatm uAPI after ATM had been removed
- properly fix old bug making bonding ARP monitor mode not being
able to work with software devices with lockless Tx
- tap: fix null-deref on skb->dev in dev_parse_header_protocol
- revert "net: usb: ax88179_178a needs FLAG_SEND_ZLP" it helps
some devices and breaks others
- netfilter:
- nf_tables: many fixes rejecting cross-object linking
which may lead to UAFs
- nf_tables: fix null deref due to zeroed list head
- nf_tables: validate variable length element extension
- bgmac: fix a BUG triggered by wrong bytes_compl
- bcmgenet: indicate MAC is in charge of PHY PM
Previous releases - always broken:
- bpf:
- fix bad pointer deref in bpf_sys_bpf() injected via test infra
- disallow non-builtin bpf programs calling the prog_run command
- don't reinit map value in prealloc_lru_pop
- fix UAFs during the read of map iterator fd
- fix invalidity check for values in sk local storage map
- reject sleepable program for non-resched map iterator
- mptcp:
- move subflow cleanup in mptcp_destroy_common()
- do not queue data on closed subflows
- virtio_net: fix memory leak inside XDP_TX with mergeable
- vsock: fix memory leak when multiple threads try to connect()
- rework sk_user_data sharing to prevent psock leaks
- geneve: fix TOS inheriting for ipv4
- tunnels & drivers: do not use RT_TOS for IPv6 flowlabel
- phy: c45 baset1: do not skip aneg configuration if clock role
is not specified
- rose: avoid overflow when /proc displays timer information
- x25: fix call timeouts in blocking connects
- can: mcp251x: fix race condition on receive interrupt
- can: j1939:
- replace user-reachable WARN_ON_ONCE() with netdev_warn_once()
- fix memory leak of skbs in j1939_session_destroy()
Misc:
- docs: bpf: clarify that many things are not uAPI
- seg6: initialize induction variable to first valid array index
(to silence clang vs objtool warning)
- can: ems_usb: fix clang 14's -Wunaligned-access warning
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmL1TtkACgkQMUZtbf5S
Iruz8Q/+O5xFFsjxuyZD0Mw9d3Jeo3ZI9PeeDvcYl5dZXVegpxqorujTFntxv1Ad
JC8o5qqms3kO51d+W/yai6iDacEHX2YcJrupZve+vGvpOEVmBRY5O0E1AckJ18+u
ItmjSVESkybUP5P08/An7Y0dMmj9Xb2z84dGkLe+n8lg6/fimo6Ki6yZjcOBOALu
AYquMXUcnwztRMbTFjscbJjBd4xFMKZEtthljYtPdIReIN976wmMNYYx+jcPK7ha
g39Kv6maklp4euerkGIJ/AMnOWHaOGCFjIaz7rr4444NDfrKdt/jeirUXJaz77Jo
TJM2UOwgOeg6WZkSa3cmdq6UdjdkJ6LTe2CJFf1wJ1qfhAi+s8yWoszsM2Enp+66
c/mo9jTCMAjmgEJF11idZuz2S697/5j0hvbfM3ZPgNyNBgn8qxz/Z56fNOisx95u
TkoKKFnGH+mcm/et+omBcyLBtBVK2+/6B6mpl6btf4DOkPn5KFYWHV67uV3ksHzQ
ye+pnzidoIG0yKbRM2EQKXk7ELKROpl52xUHko93ZinMJt0Q7jBm7tZhJozNFEzi
hWgUvpmNXgawzLYQcJ9jJmKw3PmYZnRhvYZB/1r91YamM28Hd58k9WfpWtUtjYJN
N0X58L6JSnKPqzR70pcFppz6iBlh0tHdcEQGWhhKU5ScS3FDxGc=
=C5Ck
-----END PGP SIGNATURE-----
Merge tag 'net-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, bpf, can and netfilter.
A little larger than usual but it's all fixes, no late features. It's
large partially because of timing, and partially because of follow ups
to stuff that got merged a week or so before the merge window and
wasn't as widely tested. Maybe the Bluetooth fixes are a little
alarming so we'll address that, but the rest seems okay and not scary.
Notably we're including a fix for the netfilter Kconfig [1], your WiFi
warning [2] and a bluetooth fix which should unblock syzbot [3].
Current release - regressions:
- Bluetooth:
- don't try to cancel uninitialized works [3]
- L2CAP: fix use-after-free caused by l2cap_chan_put
- tls: rx: fix device offload after recent rework
- devlink: fix UAF on failed reload and leftover locks in mlxsw
Current release - new code bugs:
- netfilter:
- flowtable: fix incorrect Kconfig dependencies [1]
- nf_tables: fix crash when nf_trace is enabled
- bpf:
- use proper target btf when exporting attach_btf_obj_id
- arm64: fixes for bpf trampoline support
- Bluetooth:
- ISO: unlock on error path in iso_sock_setsockopt()
- ISO: fix info leak in iso_sock_getsockopt()
- ISO: fix iso_sock_getsockopt for BT_DEFER_SETUP
- ISO: fix memory corruption on iso_pinfo.base
- ISO: fix not using the correct QoS
- hci_conn: fix updating ISO QoS PHY
- phy: dp83867: fix get nvmem cell fail
Previous releases - regressions:
- wifi: cfg80211: fix validating BSS pointers in
__cfg80211_connect_result [2]
- atm: bring back zatm uAPI after ATM had been removed
- properly fix old bug making bonding ARP monitor mode not being able
to work with software devices with lockless Tx
- tap: fix null-deref on skb->dev in dev_parse_header_protocol
- revert "net: usb: ax88179_178a needs FLAG_SEND_ZLP" it helps some
devices and breaks others
- netfilter:
- nf_tables: many fixes rejecting cross-object linking which may
lead to UAFs
- nf_tables: fix null deref due to zeroed list head
- nf_tables: validate variable length element extension
- bgmac: fix a BUG triggered by wrong bytes_compl
- bcmgenet: indicate MAC is in charge of PHY PM
Previous releases - always broken:
- bpf:
- fix bad pointer deref in bpf_sys_bpf() injected via test infra
- disallow non-builtin bpf programs calling the prog_run command
- don't reinit map value in prealloc_lru_pop
- fix UAFs during the read of map iterator fd
- fix invalidity check for values in sk local storage map
- reject sleepable program for non-resched map iterator
- mptcp:
- move subflow cleanup in mptcp_destroy_common()
- do not queue data on closed subflows
- virtio_net: fix memory leak inside XDP_TX with mergeable
- vsock: fix memory leak when multiple threads try to connect()
- rework sk_user_data sharing to prevent psock leaks
- geneve: fix TOS inheriting for ipv4
- tunnels & drivers: do not use RT_TOS for IPv6 flowlabel
- phy: c45 baset1: do not skip aneg configuration if clock role is
not specified
- rose: avoid overflow when /proc displays timer information
- x25: fix call timeouts in blocking connects
- can: mcp251x: fix race condition on receive interrupt
- can: j1939:
- replace user-reachable WARN_ON_ONCE() with netdev_warn_once()
- fix memory leak of skbs in j1939_session_destroy()
Misc:
- docs: bpf: clarify that many things are not uAPI
- seg6: initialize induction variable to first valid array index (to
silence clang vs objtool warning)
- can: ems_usb: fix clang 14's -Wunaligned-access warning"
* tag 'net-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (117 commits)
net: atm: bring back zatm uAPI
dpaa2-eth: trace the allocated address instead of page struct
net: add missing kdoc for struct genl_multicast_group::flags
nfp: fix use-after-free in area_cache_get()
MAINTAINERS: use my korg address for mt7601u
mlxsw: minimal: Fix deadlock in ports creation
bonding: fix reference count leak in balance-alb mode
net: usb: qmi_wwan: Add support for Cinterion MV32
bpf: Shut up kern_sys_bpf warning.
net/tls: Use RCU API to access tls_ctx->netdev
tls: rx: device: don't try to copy too much on detach
tls: rx: device: bound the frag walk
net_sched: cls_route: remove from list when handle is 0
selftests: forwarding: Fix failing tests with old libnet
net: refactor bpf_sk_reuseport_detach()
net: fix refcount bug in sk_psock_get (2)
selftests/bpf: Ensure sleepable program is rejected by hash map iter
selftests/bpf: Add write tests for sk local storage map iterator
selftests/bpf: Add tests for reading a dangling map iter fd
bpf: Only allow sleepable program for resched-able iterator
...
Multicast group flags were added in commit 4d54cc3211 ("mptcp: avoid
lock_fast usage in accept path"), but it missed adding the kdoc.
Mention which flags go into that field, and do the same for
op structs.
Link: https://lore.kernel.org/r/20220809232012.403730-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To avoid allocation of the conntrack extension area when possible,
the default behaviour was changed to only allocate the event extension
if a userspace program is subscribed to a notification group.
The problem is that while 'conntrack -E' does enable the event allocation
behind the scenes, 'conntrack -E expect' does not: no expectation events
are delivered unless the user sets
"net.netfilter.nf_conntrack_events" back to 1 (always on).
Fix the autodetection to also consider EXP type group.
We need to track the 6 event groups (3+3, new/update/destroy for events and
for expectations each) independently, else we'd disable events again
if an expectation group becomes empty while there is still an active
event group.
Fixes: 2794cdb0b9 ("netfilter: nfnetlink: allow to detect if ctnetlink listeners exist")
Reported-by: Yi Chen <yiche@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Currently, tls_device_down synchronizes with tls_device_resync_rx using
RCU, however, the pointer to netdev is stored using WRITE_ONCE and
loaded using READ_ONCE.
Although such an approach is technically correct (rcu_dereference is
essentially a READ_ONCE, and rcu_assign_pointer uses WRITE_ONCE to store
NULL), using the dedicated RCU helpers for pointers is preferable, as they
include additional checks and might change the implementation
transparently to the callers.
Mark the netdev pointer as __rcu and use the correct RCU helpers to
access it. For non-concurrent access pass the right conditions that
guarantee safe access (locks taken, refcount value). Also use the
correct helper in mlx5e, where even READ_ONCE was missing.
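A minimal sketch of the reader side after the conversion; the resync call
is illustrative:

struct net_device *netdev;

rcu_read_lock();
netdev = rcu_dereference(tls_ctx->netdev);
if (netdev)
        netdev->tlsdev_ops->tls_dev_resync(netdev, sk, seq, rcd_sn,
                                           TLS_OFFLOAD_CTX_DIR_RX);
rcu_read_unlock();

The writer side would correspondingly store NULL with rcu_assign_pointer()
before synchronizing.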
The transition to RCU exposes existing issues, fixed by this commit:
1. bond_tls_device_xmit could read netdev twice, and it could become
NULL the second time, after the NULL check passed.
2. Drivers shouldn't stop processing the last packet if tls_device_down
just set netdev to NULL, before tls_dev_del was called. This prevents a
possible packet drop when transitioning to the fallback software mode.
Fixes: 89df6a8104 ("net/bonding: Implement TLS TX device offload")
Fixes: c55dcdd435 ("net/tls: Fix use-after-free after the TLS device goes down and up")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Link: https://lore.kernel.org/r/20220810081602.1435800-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
bpf 2022-08-10
We've added 23 non-merge commits during the last 7 day(s) which contain
a total of 19 files changed, 424 insertions(+), 35 deletions(-).
The main changes are:
1) Several fixes for BPF map iterator such as UAFs along with selftests, from Hou Tao.
2) Fix BPF syscall program's {copy,strncpy}_from_bpfptr() to not fault, from Jinghao Jia.
3) Reject BPF syscall programs calling BPF_PROG_RUN, from Alexei Starovoitov and YiFei Zhu.
4) Fix attach_btf_obj_id info to pick proper target BTF, from Stanislav Fomichev.
5) BPF design Q/A doc update to clarify what is not stable ABI, from Paul E. McKenney.
6) Fix BPF map's prealloc_lru_pop to not reinitialize, from Kumar Kartikeya Dwivedi.
7) Fix bpf_trampoline_put to avoid leaking ftrace hash, from Jiri Olsa.
8) Fix arm64 JIT to address sparse errors around BPF trampoline, from Xu Kuohai.
9) Fix arm64 JIT to use kvcalloc instead of kcalloc for internal program address
offset buffer, from Aijun Sun.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits)
selftests/bpf: Ensure sleepable program is rejected by hash map iter
selftests/bpf: Add write tests for sk local storage map iterator
selftests/bpf: Add tests for reading a dangling map iter fd
bpf: Only allow sleepable program for resched-able iterator
bpf: Check the validity of max_rdwr_access for sock local storage map iterator
bpf: Acquire map uref in .init_seq_private for sock{map,hash} iterator
bpf: Acquire map uref in .init_seq_private for sock local storage map iterator
bpf: Acquire map uref in .init_seq_private for hash map iterator
bpf: Acquire map uref in .init_seq_private for array map iterator
bpf: Disallow bpf programs call prog_run command.
bpf, arm64: Fix bpf trampoline instruction endianness
selftests/bpf: Add test for prealloc_lru_pop bug
bpf: Don't reinit map value in prealloc_lru_pop
bpf: Allow calling bpf_prog_test kfuncs in tracing programs
bpf, arm64: Allocate program buffer using kvcalloc instead of kcalloc
selftests/bpf: Excercise bpf_obj_get_info_by_fd for bpf2bpf
bpf: Use proper target btf when exporting attach_btf_obj_id
mptcp, btf: Add struct mptcp_sock definition when CONFIG_MPTCP is disabled
bpf: Cleanup ftrace hash in bpf_trampoline_put
BPF: Fix potential bad pointer dereference in bpf_sys_bpf()
...
====================
Link: https://lore.kernel.org/r/20220810190624.10748-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller reports refcount bug as follows:
------------[ cut here ]------------
refcount_t: saturated; leaking memory.
WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
Modules linked in:
CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda7d90 #0
<TASK>
__refcount_add_not_zero include/linux/refcount.h:163 [inline]
__refcount_inc_not_zero include/linux/refcount.h:227 [inline]
refcount_inc_not_zero include/linux/refcount.h:245 [inline]
sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439
tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091
tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983
tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057
tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659
tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682
sk_backlog_rcv include/net/sock.h:1061 [inline]
__release_sock+0x134/0x3b0 net/core/sock.c:2849
release_sock+0x54/0x1b0 net/core/sock.c:3404
inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909
__sys_shutdown_sock net/socket.c:2331 [inline]
__sys_shutdown_sock net/socket.c:2325 [inline]
__sys_shutdown+0xf1/0x1b0 net/socket.c:2343
__do_sys_shutdown net/socket.c:2351 [inline]
__se_sys_shutdown net/socket.c:2349 [inline]
__x64_sys_shutdown+0x50/0x70 net/socket.c:2349
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
During the SMC fallback process in the connect syscall, the kernel
replaces TCP with SMC. In order to forward wakeups to the
smc socket waitqueue after fallback, the kernel sets
clcsk->sk_user_data to the original smc socket in
smc_fback_replace_callbacks().
Later, in the shutdown syscall, the kernel calls
sk_psock_get(), which treats clcsk->sk_user_data
as a psock, triggering the refcnt warning.
So the root cause is that smc and psock both use the
sk_user_data field, and they can easily mismatch on it.
This patch solves it by using another bit (defined as
SK_USER_DATA_PSOCK) in PTRMASK to mark whether
sk_user_data points to a psock object or not.
This patch depends on the PTRMASK introduced in commit f1ff5ce2cd
("net, sk_msg: Clear sk_user_data pointer on clone if tagged").
Since there will possibly be more flags in the sk_user_data field,
this patch also refactors the sk_user_data flags code to be more generic
to improve its maintainability.
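A minimal sketch of the resulting flag layout; the bit assignments below
are illustrative of the scheme, not a promise of the exact values:

/* Low pointer bits of sk->sk_user_data carry flags. */
#define SK_USER_DATA_NOCOPY     1UL     /* do not copy on clone */
#define SK_USER_DATA_BPF        2UL     /* managed by a bpf map */
#define SK_USER_DATA_PSOCK      4UL     /* points to a psock    */
#define SK_USER_DATA_PTRMASK    ~(SK_USER_DATA_NOCOPY | SK_USER_DATA_BPF | \
                                  SK_USER_DATA_PSOCK)

sk_psock_get() can then bail out early when the PSOCK bit is not set,
instead of treating an smc socket pointer as a psock.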
Reported-and-tested-by: syzbot+5f26f85569bd179c18ce@syzkaller.appspotmail.com
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extend struct nft_data_desc to add a flag field that specifies
nft_data_init() is being called for set element data.
Use it to disallow jumping to an implicit chain from a set element; only
jumping to a chain via an immediate expression is allowed.
Fixes: d0e2c7de92 ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Instead of parsing the data and then validating that the type and length
are correct, pass a description of the expected data so it can be validated
upfront, before parsing it, to bail out earlier.
This patch adds a new .size field to specify the maximum size of the
data area. The .len field is optional and it is used as an input/output
field, it provides the specific length of the expected data in the input
path. If the .len field is not specified, then the length obtained from the
netlink attribute is stored. This is required by cmp, bitwise, range and
immediate, which provide no netlink attribute that describes the data
length. The immediate expression uses the destination register type to
infer the expected data type.
Relying on opencoded validation of the expected data might lead to
subtle bugs as described in 7e6bc1f6ca ("netfilter: nf_tables:
stricter validation of element data").
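A minimal sketch of the descriptor as it ends up looking after these
nf_tables changes; the field comments summarize the usage described above:

struct nft_data_desc {
        enum nft_data_types     type;
        unsigned int            size;   /* maximum size of the data area */
        unsigned int            len;    /* expected length in, obtained length out */
        unsigned int            flags;
};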
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Update template to validate variable length extensions. This patch adds
a new .ext_len[id] field to the template to store the expected extension
length. This is used to sanity check the initialization of the variable
length extension.
Use PTR_ERR() in nft_set_elem_init() to report errors since, after this
update, there are two reason why this might fail, either because of
ENOMEM or insufficient room in the extension field (EINVAL).
Kernels up until 7e6bc1f6ca ("netfilter: nf_tables: stricter
validation of element data") allowed copying more data to the extension
than was allocated. This ext_len field allows validating that the
destination has the correct size, as an additional check.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The current ifdefry for code shared by the BPF and ctnetlink side looks
ugly. As per Pablo's request, simplify this by unconditionally compiling
in the code. This can be revisited when the shared code between the two
grows further.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220725085130.11553-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The btf_sock_ids array needs struct mptcp_sock BTF ID for the
bpf_skc_to_mptcp_sock helper.
When CONFIG_MPTCP is disabled, the 'struct mptcp_sock' is not
defined and resolve_btfids will complain with:
[...]
BTFIDS vmlinux
WARN: resolve_btfids: unresolved symbol mptcp_sock
[...]
Add an empty definition for struct mptcp_sock when CONFIG_MPTCP
is disabled.
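A minimal sketch of the fix, assuming the stub sits next to the existing
CONFIG_MPTCP guards:

#if IS_ENABLED(CONFIG_MPTCP)
/* the real struct mptcp_sock definition */
#else
struct mptcp_sock { };
#endif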
Fixes: 3bc253c2e6 ("bpf: Add bpf_skc_to_mptcp_sock_proto")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220802163324.1873044-1-jolsa@kernel.org
- a couple of fixes
- add a tracepoint for fid refcounting
- some cleanup/followup on fid lookup
- some cleanup around req refcounting
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmLuKqkACgkQq06b7GqY
5nA+RBAAvuA6AjKQSvsNxHsqSFMahwoE3cCPlF/QlmAnnZa1DzmRI3kKTAuuDKkE
Q4c7DjxzDJCWRTlkpNUCGZycHqpu1QQbfoTo43sIqob7W8xlajeemc5Fxqtw5sPM
m+0SzN7vvNJpCr+6pxMXwwGHSZmZOvpFjwj7cUjhpF/V1WO8bNxfCGzQdF0hX1Vn
2HoBbFUOmacL9Z/pF3O/ZG9LwCFuRQsH0EmbFBlJy1WdDtlVHTXlzDaGQ7EGaY4D
17UR6iITsj9ozacLhvk094PLIc3/RHDGLrm3C4Ka3zmUI7BsiYWPDmai3Pu/DNqn
JJ5sZkdrVowxyBbGxw8GpZ4YDJtGsU5XglFPdkw+ZazxhZLNEIstPXg0HTZybrOe
GE+WskWB0qS+RpX0tnYEcX6qOHWm3/63Yq5NG6A9tLQUSFku02jS/bCQSLBrmGWW
Js24IWvFSTvl6XytHeldYhJP618pNUxXRSqgYv96vT/LI3mUrIMN+IVBNPujO6p2
jIYXNoaqLoY3efXKW/WQmp7C/52ZP4ly4fOiz7qtHTQsCcIk8Xo6zwHtm/FkNEqc
sMZdqgLrxKPNBAlT8iEtt//wU2fB7mFt988p0pc+5lAK5t0h67KZJV6vDwTAMObX
wV6Ht+QhOHtJwO779fk8FhZPNRPYZYMutIyFNlRx+4gCpBuqJPc=
=941k
-----END PGP SIGNATURE-----
Merge tag '9p-for-5.20' of https://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
- a couple of fixes
- add a tracepoint for fid refcounting
- some cleanup/followup on fid lookup
- some cleanup around req refcounting
* tag '9p-for-5.20' of https://github.com/martinetd/linux:
net/9p: Initialize the iounit field during fid creation
net: 9p: fix refcount leak in p9_read_work() error handling
9p: roll p9_tag_remove into p9_req_put
9p: Add client parameter to p9_req_put()
9p: Drop kref usage
9p: Fix some kernel-doc comments
9p fid refcount: cleanup p9_fid_put calls
9p fid refcount: add a 9p_fid_ref tracepoint
9p fid refcount: add p9_fid_get/put wrappers
9p: Fix minor typo in code comment
9p: Remove unnecessary variable for old fids while walking from d_parent
9p: Make the path walk logic more clear about when cloning is required
9p: Track the root fid with its own variable during lookups
The bonding driver piggybacks on time stamps kept by the network stack
for the purpose of the netdev TX watchdog, and this is problematic
because it does not work with NETIF_F_LLTX devices.
It is hard to say why the driver looks at dev_trans_start() of the
slave->dev, considering that this is updated even by non-ARP/NS probes
sent by us, and even by traffic not sent by us at all (for example PTP
on physical slave devices). ARP monitoring in active-backup mode appears
to still work even if we track only the last TX time of actual ARP
probes.
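A minimal sketch of tracking the probe TX time on the slave itself instead
of reading dev_trans_start(); the member and helper names are assumptions:

/* Record the moment we actually emit an ARP/NS probe... */
static inline void slave_update_last_tx(struct slave *slave)
{
        WRITE_ONCE(slave->last_tx, jiffies);
}

/* ...and have the ARP monitor read it instead of dev_trans_start(slave->dev). */
static inline unsigned long slave_last_tx(struct slave *slave)
{
        return READ_ONCE(slave->last_tx);
}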
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Core
----
- Refactor the forward memory allocation to better cope with memory
pressure with many open sockets, moving from a per socket cache to
a per-CPU one
- Replace rwlocks with RCU for better fairness in ping, raw sockets
and IP multicast router.
- Network-side support for IO uring zero-copy send.
- A few skb drop reason improvements, including codegen the source file
with string mapping instead of using macro magic.
- Rename reference tracking helpers to a more consistent
netdev_* schema.
- Adapt u64_stats_t type to address load/store tearing issues.
- Refine debug helper usage to reduce the log noise caused by bots.
BPF
---
- Improve socket map performance, avoiding skb cloning on read
operation.
- Add support for 64 bits enum, to match types exposed by kernel.
- Introduce support for sleepable uprobes program.
- Introduce support for enum textual representation in libbpf.
- New helpers to implement synproxy with eBPF/XDP.
- Improve loop performance, inlining indirect calls when
possible.
- Removed all the deprecated libbpf APIs.
- Implement new eBPF-based LSM flavor.
- Add type match support, which allows accurate queries to the
eBPF used types.
- A few TCP congestion control framework usability improvements.
- Add new infrastructure to manipulate CT entries via eBPF programs.
- Allow for livepatch (KLP) and BPF trampolines to attach to the same
kernel function.
Protocols
---------
- Introduce per network namespace lookup tables for unix sockets,
increasing scalability and reducing contention.
- Preparation work for Wi-Fi 7 Multi-Link Operation (MLO) support.
- Add support to forcibly close TIME_WAIT TCP sockets via user-space
tools.
- Significant performance improvement for the TLS 1.3 receive path,
both for zero-copy and not-zero-copy.
- Support for changing the initial MPTCP subflow priority/backup
status
- Introduce virtually contiguous buffers for sockets over RDMA,
to cope better with memory pressure.
- Extend CAN ethtool support with timestamping capabilities
- Refactor CAN build infrastructure to allow building only the needed
features.
Driver API
----------
- Remove devlink mutex to allow parallel commands on multiple links.
- Add support for pause stats in distributed switch.
- Implement devlink helpers to query and flash line cards.
- New helper for phy mode to register conversion.
New hardware / drivers
----------------------
- Ethernet DSA driver for the rockchip mt7531 on BPI-R2 Pro.
- Ethernet DSA driver for the Renesas RZ/N1 A5PSW switch.
- Ethernet DSA driver for the Microchip LAN937x switch.
- Ethernet PHY driver for the Aquantia AQR113C EPHY.
- CAN driver for the OBD-II ELM327 interface.
- CAN driver for RZ/N1 SJA1000 CAN controller.
- Bluetooth: Infineon CYW55572 Wi-Fi plus Bluetooth combo device.
Drivers
-------
- Intel Ethernet NICs:
- i40e: add support for vlan pruning
- i40e: add support for XDP fragmented packets
- ice: improved vlan offload support
- ice: add support for PPPoE offload
- Mellanox Ethernet (mlx5)
- refactor packet steering offload for performance and scalability
- extend support for TC offload
- refactor devlink code to clean-up the locking schema
- support stacked vlans for bridge offloads
- use TLS objects pool to improve connection rate
- Netronome Ethernet NICs (nfp):
- extend support for IPv6 fields mangling offload
- add support for vepa mode in HW bridge
- better support for virtio data path acceleration (VDPA)
- enable TSO by default
- Microsoft vNIC driver (mana)
- add support for XDP redirect
- Others Ethernet drivers:
- bonding: add per-port priority support
- microchip lan743x: extend phy support
- Fungible funeth: support UDP segmentation offload and XDP xmit
- Solarflare EF100: add support for virtual function representors
- MediaTek SoC: add XDP support
- Mellanox Ethernet/IB switch (mlxsw):
- dropped support for unreleased H/W (XM router).
- improved stats accuracy
- unified bridge model conversion improving scalability
(parts 1-6)
- support for PTP in Spectrum-2 asics
- Broadcom PHYs
- add PTP support for BCM54210E
- add support for the BCM53128 internal PHY
- Marvell Ethernet switches (prestera):
- implement support for multicast forwarding offload
- Embedded Ethernet switches:
- refactor OcteonTx MAC filter for better scalability
- improve TC H/W offload for the Felix driver
- refactor the Microchip ksz8 and ksz9477 drivers to share
the probe code (parts 1, 2), add support for phylink
mac configuration
- Other WiFi:
- Microchip wilc1000: disable WEP support and enable WPA3
- Atheros ath10k: encapsulation offload support
Old code removal:
- Neterion vxge ethernet driver: it has been untouched for more than
10 years.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmLqN+oSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkB9kQAI9VqW0c3SfiTJnkVBEIovZ6Tnh5stD2
UYFkh1BdchLsYxi7W4XMpVPSzRztiTP87mIx5c/KvIzj+QNeWL1XWRJSPdI9HhTD
pTAA/tM2OG7bqrbyQiKDNfpQdNl7+kk1RwnYd+f9RFl1QVuIJaYhmjVwrsN5xF/+
jUsotpROarM2dGFWiFwJbKhP2zMDT+6qEEahM8pEPggKhv8wRLYjany2cZVEe4e0
WGUpbINAS8gEKm0Ob922WaDfDrcK/N1Z0jNz/kMaENkK18Vvc7F6bCO0DzAawKX9
QZMMwm6mHp3EThflJAMAzCGIYiIcwLhykgdyj8rrjPhFrWbMD2Sdsbo21HOXU/8j
u4aAhVl+d+h7emmbgBoJ8sycVJ7BQlXz7lX20sTgADv9xI4/dPhQ17CMRuwX6fXX
JSrn6P6e1LTV5CEg6vrlSPnKPY6uhFn/cPw47FxCjRwJ9phVnp+8uZWQmf9Pz3yf
Ok/tcj+juFbsmuOshHy2cbRkuNZNS0oRWlSTBo5795ZwOLSakMonR3L+ev2aOvzz
DVrFp2Y/iIVwMSFdCbouYdYnhArPRhOAtCmZc2afY8aBN7aaMgrdTy3+mzUoHy3I
FG3K+VuKpfi0vY4zn6ZoLZDIpyXIoJJ93RcSGltD32t3Dp1RaQMVEI4s45k05PVm
1nYpXKHA8qML
=hxEG
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking changes from Paolo Abeni:
"Core:
- Refactor the forward memory allocation to better cope with memory
pressure with many open sockets, moving from a per socket cache to
a per-CPU one
- Replace rwlocks with RCU for better fairness in ping, raw sockets
and IP multicast router.
- Network-side support for IO uring zero-copy send.
- A few skb drop reason improvements, including codegen the source
file with string mapping instead of using macro magic.
- Rename reference tracking helpers to a more consistent netdev_*
schema.
- Adapt u64_stats_t type to address load/store tearing issues.
- Refine debug helper usage to reduce the log noise caused by bots.
BPF:
- Improve socket map performance, avoiding skb cloning on read
operation.
- Add support for 64 bits enum, to match types exposed by kernel.
- Introduce support for sleepable uprobes program.
- Introduce support for enum textual representation in libbpf.
- New helpers to implement synproxy with eBPF/XDP.
- Improve loop performance, inlining indirect calls when possible.
- Removed all the deprecated libbpf APIs.
- Implement new eBPF-based LSM flavor.
- Add type match support, which allows accurate queries to the eBPF
used types.
- A few TCP congestion control framework usability improvements.
- Add new infrastructure to manipulate CT entries via eBPF programs.
- Allow for livepatch (KLP) and BPF trampolines to attach to the same
kernel function.
Protocols:
- Introduce per network namespace lookup tables for unix sockets,
increasing scalability and reducing contention.
- Preparation work for Wi-Fi 7 Multi-Link Operation (MLO) support.
- Add support to forcibly close TIME_WAIT TCP sockets via user-space
tools.
- Significant performance improvement for the TLS 1.3 receive path,
both for zero-copy and not-zero-copy.
- Support for changing the initial MPTCP subflow priority/backup
status
- Introduce virtually contiguous buffers for sockets over RDMA, to
cope better with memory pressure.
- Extend CAN ethtool support with timestamping capabilities
- Refactor CAN build infrastructure to allow building only the needed
features.
Driver API:
- Remove devlink mutex to allow parallel commands on multiple links.
- Add support for pause stats in distributed switch.
- Implement devlink helpers to query and flash line cards.
- New helper for phy mode to register conversion.
New hardware / drivers:
- Ethernet DSA driver for the rockchip mt7531 on BPI-R2 Pro.
- Ethernet DSA driver for the Renesas RZ/N1 A5PSW switch.
- Ethernet DSA driver for the Microchip LAN937x switch.
- Ethernet PHY driver for the Aquantia AQR113C EPHY.
- CAN driver for the OBD-II ELM327 interface.
- CAN driver for RZ/N1 SJA1000 CAN controller.
- Bluetooth: Infineon CYW55572 Wi-Fi plus Bluetooth combo device.
Drivers:
- Intel Ethernet NICs:
- i40e: add support for vlan pruning
- i40e: add support for XDP fragmented packets
- ice: improved vlan offload support
- ice: add support for PPPoE offload
- Mellanox Ethernet (mlx5)
- refactor packet steering offload for performance and scalability
- extend support for TC offload
- refactor devlink code to clean-up the locking schema
- support stacked vlans for bridge offloads
- use TLS objects pool to improve connection rate
- Netronome Ethernet NICs (nfp):
- extend support for IPv6 fields mangling offload
- add support for vepa mode in HW bridge
- better support for virtio data path acceleration (VDPA)
- enable TSO by default
- Microsoft vNIC driver (mana)
- add support for XDP redirect
- Others Ethernet drivers:
- bonding: add per-port priority support
- microchip lan743x: extend phy support
- Fungible funeth: support UDP segmentation offload and XDP xmit
- Solarflare EF100: add support for virtual function representors
- MediaTek SoC: add XDP support
- Mellanox Ethernet/IB switch (mlxsw):
- dropped support for unreleased H/W (XM router).
- improved stats accuracy
- unified bridge model conversion improving scalability (parts 1-6)
- support for PTP in Spectrum-2 asics
- Broadcom PHYs
- add PTP support for BCM54210E
- add support for the BCM53128 internal PHY
- Marvell Ethernet switches (prestera):
- implement support for multicast forwarding offload
- Embedded Ethernet switches:
- refactor OcteonTx MAC filter for better scalability
- improve TC H/W offload for the Felix driver
- refactor the Microchip ksz8 and ksz9477 drivers to share the
probe code (parts 1, 2), add support for phylink mac
configuration
- Other WiFi:
- Microchip wilc1000: disable WEP support and enable WPA3
- Atheros ath10k: encapsulation offload support
Old code removal:
- Neterion vxge ethernet driver: it has been untouched for more than 10 years"
* tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1890 commits)
doc: sfp-phylink: Fix a broken reference
wireguard: selftests: support UML
wireguard: allowedips: don't corrupt stack when detecting overflow
wireguard: selftests: update config fragments
wireguard: ratelimiter: use hrtimer in selftest
net/mlx5e: xsk: Discard unaligned XSK frames on striding RQ
net: usb: ax88179_178a: Bind only to vendor-specific interface
selftests: net: fix IOAM test skip return code
net: usb: make USB_RTL8153_ECM non user configurable
net: marvell: prestera: remove reduntant code
octeontx2-pf: Reduce minimum mtu size to 60
net: devlink: Fix missing mutex_unlock() call
net/tls: Remove redundant workqueue flush before destroy
net: txgbe: Fix an error handling path in txgbe_probe()
net: dsa: Fix spelling mistakes and cleanup code
Documentation: devlink: add add devlink-selftests to the table of contents
dccp: put dccp_qpolicy_full() and dccp_qpolicy_push() in the same lock
net: ionic: fix error check for vlan flags in ionic_set_nic_features()
net: ice: fix error NETIF_F_HW_VLAN_CTAG_FILTER check in ice_vsi_sync_fltr()
nfp: flower: add support for tunnel offload without key ID
...