Store rtm->rtm_tos in a dscp_t variable, which can then be used for
setting fl4.flowi4_tos and also be passed as parameter of
ip_route_input_rcu().
The .flowi4_tos field is going to be converted to dscp_t to ensure ECN
bits aren't erroneously taken into account during route lookups. Having
a dscp_t variable available will simplify that conversion, as we'll
just have to drop the inet_dscp_to_dsfield() call.
Note that we can't just convert rtm->rtm_tos to dscp_t because this
structure is exported to user space.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/7bc1c7dc47ad1393569095d334521fae59af5bc7.1736944951.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use ip4h_dscp() to get the tunnel DSCP option as dscp_t, instead of
manually masking the raw tos field with INET_DSCP_MASK. This will ease
the conversion of fl4->flowi4_tos to dscp_t, which just becomes a
matter of dropping the inet_dscp_to_dsfield() call.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/6c05a11afdc61530f1a4505147e0909ad51feb15.1736941806.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Following fields of 'struct mr_mfc' can be updated
concurrently (no lock protection) from ip_mr_forward()
and ip6_mr_forward()
- bytes
- pkt
- wrong_if
- lastuse
They also can be read from other functions.
Convert bytes, pkt and wrong_if to atomic_long_t,
and use READ_ONCE()/WRITE_ONCE() for lastuse.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250114221049.1190631-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Prior patch in the series added TCP_RFC7323_PAWS_ACK drop reason.
This patch adds the corresponding SNMP counter, for folks
using nstat instead of tracing for TCP diagnostics.
nstat -az | grep PAWSOldAck
Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250113135558.3180360-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
XPS can cause reorders because of the relaxed OOO
conditions for pure ACK packets.
For hosts not using RFS, what can happpen is that ACK
packets are sent on behalf of the cpu processing NIC
interrupts, selecting TX queue A for ACK packet P1.
Then a subsequent sendmsg() can run on another cpu.
TX queue selection uses the socket hash and can choose
another queue B for packets P2 (with payload).
If queue A is more congested than queue B,
the ACK packet P1 could be sent on the wire after
P2.
A linux receiver when processing P1 (after P2) currently increments
LINUX_MIB_PAWSESTABREJECTED (TcpExtPAWSEstab)
and use TCP_RFC7323_PAWS drop reason.
It might also send a DUPACK if not rate limited.
In order to better understand this pattern, this
patch adds a new drop_reason : TCP_RFC7323_PAWS_ACK.
For old ACKS like these, we no longer increment
LINUX_MIB_PAWSESTABREJECTED and no longer sends a DUPACK,
keeping credit for other more interesting DUPACK.
perf record -e skb:kfree_skb -a
perf script
...
swapper 0 [148] 27475.438637: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [208] 27475.438706: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [208] 27475.438908: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [148] 27475.439010: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [148] 27475.439214: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
swapper 0 [208] 27475.439286: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250113135558.3180360-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Following patch is adding a new drop_reason to tcp_validate_incoming().
Change tcp_disordered_ack() to not return a boolean anymore,
but a drop reason.
Change its name to tcp_disordered_ack_check()
Refactor tcp_validate_incoming() to ease the code
review of the following patch, and reduce indentation
level.
This patch is a refactor, with no functional change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250113135558.3180360-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As discussed in [0], rehash4 could be missed in udp_lib_rehash() when
udp hash4 changes while hash2 doesn't change. This patch fixes this by
moving rehash4 codes out of rehash2 checking, and then rehash2 and
rehash4 are done separately.
By doing this, we no longer need to call rehash4 explicitly in
udp_lib_hash4(), as the rehash callback in __ip4_datagram_connect takes
it. Thus, now udp_lib_hash4() returns directly if the sk is already
hashed.
Note that uhash4 may fail to work under consecutive connect(<dst
address>) calls because rehash() is not called with every connect(). To
overcome this, connect(<AF_UNSPEC>) needs to be called after the next
connect to a new destination.
[0]
https://lore.kernel.org/all/4761e466ab9f7542c68cdc95f248987d127044d2.1733499715.git.pabeni@redhat.com/
Fixes: 78c91ae2c6 ("ipv4/udp: Add 4-tuple hash for connected socket")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Link: https://patch.msgid.link/20250110010810.107145-1-lulie@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmd/mNkACgkQrB3Eaf9P
W7cQww/9Hnv7+wosuBxFW2o2ptgZThiwj/Ovz0oiWGWvctpN7VlpwmrDOsXQ+XMn
xF6JBFEjnJanYoDBb78D0dMffJcMcUKZopTUU/ZMitSNr8aIHYiuB4SWPG1tqxl4
Ete7Mr2m3tS96YePQNnAaRZzEuGsx3BQb28VLTWl9So81MByD2OK4fsAbYz22Gg8
7A6tDHn1mUd9b2VG+LeeBZaDDFG8C0O2x4E/8Z3DX3z1N8y3LABPwZ38jcgTviKO
1ZldGrJT+PownBydu23bWDassKE2TuVvGH9e/SOPeQj8DJ4Lmd0bafMZTY6xwfNT
RJCwhlzZUpYRXFzvcf3+U3egsqEWEemV7/LzAapdT0V9OqfLWMUh3b1jMA4KcblZ
qmlm/MhZyXutDoeuASwtM4jgM3wGwovOofrKKsb13hD9VLBs5jFZmFSw5TlbmwE3
sjv7V4pFwNyfJnwQtyMmfuuHiy8w+fzqAA2GCg8mF3OosHABH/FOvEBP8xg1Vqu1
iKlLkByfyaCFD+GxTPzqSDvSB8nDzeZBgBM/ILGuwH0OWr+0gxBzl1sxrhAkw9hC
Gf+4J3wg7EknTxfrJk4LyqfyS50GvUIzpLSkHYxDtQRd4zv75bxRzMvoZvuapTtN
GGpWYftKO8n2kgLORQ0dTtFee3c4w/No+KXWsFdtRVNpyk3N0Yw=
=U79q
-----END PGP SIGNATURE-----
Merge tag 'ipsec-next-2025-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
ipsec-next-2025-01-09
1) Implement the AGGFRAG protocol and basic IP-TFS (RFC9347) functionality.
From Christian Hopps.
2) Support ESN context update to hardware for TX.
From Jianbo Liu.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When jumping to 'martian_destination' a drop reason is always set but
that label falls-through the 'e_nobufs' one, overriding the value.
The behavior was introduced by the mentioned commit. The logic went
from,
goto martian_destination;
...
martian_destination:
...
e_inval:
err = -EINVAL;
goto out;
e_nobufs:
err = -ENOBUFS;
goto out;
to,
reason = ...;
goto martian_destination;
...
martian_destination:
...
e_nobufs:
reason = SKB_DROP_REASON_NOMEM;
goto out;
A 'goto out' is clearly missing now after 'martian_destination' to avoid
overriding the drop reason.
Fixes: 5b92112acd ("net: ip: make ip_route_input_slow() return drop reasons")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Cc: Menglong Dong <menglong8.dong@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250108165725.404564-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a follow-up to 3c5b4d69c3 ("net: annotate data-races around
sk->sk_mark"). sk->sk_mark can be read and written without holding
the socket lock. IPv6 equivalent is already covered with READ_ONCE()
annotation in tcp_v6_send_response().
Fixes: 3c5b4d69c3 ("net: annotate data-races around sk->sk_mark")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/f459d1fc44f205e13f6d8bdca2c8bfb9902ffac9.1736244569.git.daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
two-thirds of our normal weekly fix count, but delaying sending these
until -rc7 seemed like a really bad idea.
AFAIK we have no bugs under investigation. One or two reverts for
stuff for which we haven't gotten a proper fix will likely come in
the next PR.
Including fixes from wireles and netfilter.
Current release - fix to a fix:
- netfilter: nft_set_hash: unaligned atomic read on struct nft_set_ext
- eth: gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup
Previous releases - regressions:
- net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets
- mptcp:
- fix sleeping rcvmsg sleeping forever after bad recvbuffer adjust
- fix TCP options overflow
- prevent excessive coalescing on receive, fix throughput
- net: fix memory leak in tcp_conn_request() if map insertion fails
- wifi: cw1200: fix potential NULL dereference after conversion
to GPIO descriptors
- phy: micrel: dynamically control external clock of KSZ PHY,
fix suspend behavior
Previous releases - always broken:
- af_packet: fix VLAN handling with MSG_PEEK
- net: restrict SO_REUSEPORT to inet sockets
- netdev-genl: avoid empty messages in NAPI get
- dsa: microchip: fix set_ageing_time function on KSZ9477 and LAN937X
- eth: gve: XDP fixes around transmit, queue wakeup etc.
- eth: ti: icssg-prueth: fix firmware load sequence to prevent time
jump which breaks timesync related operations
Misc:
- netlink: specs: mptcp: add missing attr and improve documentation
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmd4FCgACgkQMUZtbf5S
IruB7hAAkppr4+4kywDq/PIqgkisjYzpNIQIrphebUeIHIcAizWbJEL9CiHXnCQD
w95+nVhzDmsWXdzE2AvwM/2YAByLP6CpgKEa+7rI3K5BQZ86lbzm+ftOOFYiF3eV
2v5yLfjgMVbzwqzuP3ivkNrtG95IGQFPobLl7tI9Wpgm3yD0+H8BCqtZNAwA4N6g
mEdZD1jMOzkdINoJBg35O70B2GDq/WSS1N8dgxtj1F7EPDuQdO1kijJmqFjYT3fw
30/NdjlfGtbFro6zp+8ZY6P9RG3CLNhYXu+nUqdOHJOJw65aXUxOPjWCRp3o6m12
HIIp22imDTvNTSCfn0wlbGJfC+v9HdIgB9gqApAaxNOS02bQgEy5tuxxXNTo2B1I
bqH7GJ+6a7axsQ6BnpcOcDhu2fnbtguj1W/zC/MlGQmZZj2g+UNCf4QqPir/XkcG
r43/iy2/L8cgZWYXzH1skfUMgbymI5uLytO+n+CLsb6P/N3WLa7wxHNV5Lni+Xng
qDyd0TwUccCr2KBfZqjgFeGbvQ7qub+eeQCYx9kUAaFhKUsQB/eZ5i9C7SsfutMa
xzwITqCkcvmdyU3gG3v/oABtQkxdTcRqWCfYDGcXm0zb1jzMf5ZGN4E/gABYM7p9
EewL1kCAt4ULmBc8Hqrj+2Erlzh+xsodvvfpgpJandMVXaJW44k=
=YKS4
-----END PGP SIGNATURE-----
Merge tag 'net-6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from wireles and netfilter.
Nothing major here. Over the last two weeks we gathered only around
two-thirds of our normal weekly fix count, but delaying sending these
until -rc7 seemed like a really bad idea.
AFAIK we have no bugs under investigation. One or two reverts for
stuff for which we haven't gotten a proper fix will likely come in the
next PR.
Current release - fix to a fix:
- netfilter: nft_set_hash: unaligned atomic read on struct
nft_set_ext
- eth: gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup
Previous releases - regressions:
- net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets
- mptcp:
- fix sleeping rcvmsg sleeping forever after bad recvbuffer adjust
- fix TCP options overflow
- prevent excessive coalescing on receive, fix throughput
- net: fix memory leak in tcp_conn_request() if map insertion fails
- wifi: cw1200: fix potential NULL dereference after conversion to
GPIO descriptors
- phy: micrel: dynamically control external clock of KSZ PHY, fix
suspend behavior
Previous releases - always broken:
- af_packet: fix VLAN handling with MSG_PEEK
- net: restrict SO_REUSEPORT to inet sockets
- netdev-genl: avoid empty messages in NAPI get
- dsa: microchip: fix set_ageing_time function on KSZ9477 and LAN937X
- eth:
- gve: XDP fixes around transmit, queue wakeup etc.
- ti: icssg-prueth: fix firmware load sequence to prevent time
jump which breaks timesync related operations
Misc:
- netlink: specs: mptcp: add missing attr and improve documentation"
* tag 'net-6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
net: ti: icssg-prueth: Fix clearing of IEP_CMP_CFG registers during iep_init
net: ti: icssg-prueth: Fix firmware load sequence.
mptcp: prevent excessive coalescing on receive
mptcp: don't always assume copied data in mptcp_cleanup_rbuf()
mptcp: fix recvbuffer adjust on sleeping rcvmsg
ila: serialize calls to nf_register_net_hooks()
af_packet: fix vlan_get_protocol_dgram() vs MSG_PEEK
af_packet: fix vlan_get_tci() vs MSG_PEEK
net: wwan: iosm: Properly check for valid exec stage in ipc_mmio_init()
net: restrict SO_REUSEPORT to inet sockets
net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets
net: sfc: Correct key_len for efx_tc_ct_zone_ht_params
net: wwan: t7xx: Fix FSM command timeout issue
sky2: Add device ID 11ab:4373 for Marvell 88E8075
mptcp: fix TCP options overflow.
net: mv643xx_eth: fix an OF node reference leak
gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup
eth: bcmsysport: fix call balance of priv->clk handling routines
net: llc: reset skb->transport_header
netlink: specs: mptcp: fix missing doc
...
The "struct sock *sk" parameter in ip_rcv_finish_core is unused, which
leads the compiler to optimize it out. As a result, the
"struct sk_buff *skb" parameter is passed using x1. And this make kprobe
hard to use.
Signed-off-by: Yu Tian <tianyu2@kernelsoft.com>
Link: https://patch.msgid.link/20241231023610.1657926-1-tianyu2@kernelsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Corrected the netlink message size calculation for multicast group
join/leave notifications. The previous calculation did not account for
the inclusion of both IPv4/IPv6 addresses and ifa_cacheinfo in the
payload. This fix ensures that the allocated message size is
sufficient to hold all necessary information.
This patch also includes the following improvements:
* Uses GFP_KERNEL instead of GFP_ATOMIC when holding the RTNL mutex.
* Uses nla_total_size(sizeof(struct in6_addr)) instead of
nla_total_size(16).
* Removes unnecessary EXPORT_SYMBOL().
Fixes: 2c2b61d213 ("netlink: add IGMP/MLD join/leave notifications")
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Link: https://patch.msgid.link/20241221100007.1910089-1-yuyanghuang@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The device denoted by tunnel->parms.link resides in the underlay net
namespace. Therefore pass tunnel->net to ip_tunnel_init_flow().
Fixes: db53cd3d88 ("net: Handle l3mdev in ip_tunnel_init_flow")
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20241219130336.103839-1-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If a UDP socket changes its local address while it's receiving
datagrams, as a result of connect(), there is a period during which
a lookup operation might fail to find it, after the address is changed
but before the secondary hash (port and address) and the four-tuple
hash (local and remote ports and addresses) are updated.
Secondary hash chains were introduced by commit 30fff9231f ("udp:
bind() optimisation") and, as a result, a rehash operation became
needed to make a bound socket reachable again after a connect().
This operation was introduced by commit 719f835853 ("udp: add
rehash on connect()") which isn't however a complete fix: the
socket will be found once the rehashing completes, but not while
it's pending.
This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
client sending datagrams to it. After the server receives the first
datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
the address of the sender, in order to set up a directed flow.
Now, if the client, running on a different CPU thread, happens to
send a (subsequent) datagram while the server's socket changes its
address, but is not rehashed yet, this will result in a failed
lookup and a port unreachable error delivered to the client, as
apparent from the following reproducer:
LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in
while :; do
taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
sleep 0.1 || sleep 1
taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
wait
done
where the client will eventually get ECONNREFUSED on a write()
(typically the second or third one of a given iteration):
2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused
This issue was first observed as a seldom failure in Podman's tests
checking UDP functionality while using pasta(1) to connect the
container's network namespace, which leads us to a reproducer with
the lookup error resulting in an ICMP packet on a tap device:
LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"
while :; do
./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
sleep 0.2 || sleep 1
socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
wait
cmp tmp.in tmp.out
done
Once this fails:
tmp.in tmp.out differ: char 8193, line 29
we can finally have a look at what's going on:
$ tshark -r pasta.pcap
1 0.000000 :: ? ff02::16 ICMPv6 110 Multicast Listener Report Message v2
2 0.168690 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
3 0.168767 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
4 0.168806 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
5 0.168827 c6:47:05:8d:dc:04 ? Broadcast ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
6 0.168851 9a:55:9a:55:9a:55 ? c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
7 0.168875 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
8 0.168896 88.198.0.164 ? 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
9 0.168926 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
10 0.168959 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
11 0.168989 88.198.0.161 ? 88.198.0.164 UDP 4138 60260 ? 1337 Len=4096
12 0.169010 88.198.0.161 ? 88.198.0.164 UDP 42 60260 ? 1337 Len=0
On the third datagram received, the network namespace of the container
initiates an ARP lookup to deliver the ICMP message.
In another variant of this reproducer, starting the client with:
strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &
and connecting to the socat server using a loopback address:
socat OPEN:tmp.in UDP4:localhost:1337,shut-null
we can more clearly observe a sendmmsg() call failing after the
first datagram is delivered:
[pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
[...]
[pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
[...]
[pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)
and, somewhat confusingly, after a connect() on the same socket
succeeded.
Until commit 4cdeeee925 ("net: udp: prefer listeners bound to an
address"), the race between receive address change and lookup didn't
actually cause visible issues, because, once the lookup based on the
secondary hash chain failed, we would still attempt a lookup based on
the primary hash (destination port only), and find the socket with the
outdated secondary hash.
That change, however, dropped port-only lookups altogether, as side
effect, making the race visible.
To fix this, while avoiding the need to make address changes and
rehash atomic against lookups, reintroduce primary hash lookups as
fallback, if lookups based on four-tuple and secondary hashes fail.
To this end, introduce a simplified lookup implementation, which
doesn't take care of SO_REUSEPORT groups: if we have one, there are
multiple sockets that would match the four-tuple or secondary hash,
meaning that we can't run into this race at all.
v2:
- instead of synchronising lookup operations against address change
plus rehash, reintroduce a simplified version of the original
primary hash lookup as fallback
v1:
- fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr
usage (Kuniyuki Iwashima)
- directly use sk_rcv_saddr for IPv4 receive addresses instead of
fetching inet_rcv_saddr (Kuniyuki Iwashima)
- move inet_update_saddr() to inet_hashtables.h and use that
to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
- rebase onto net-next, update commit message accordingly
Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/24147
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 30fff9231f ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bpf kselftest sockhash::test_txmsg_cork_hangs in test_sockmap.c triggers a
kernel NULL pointer dereference:
BUG: kernel NULL pointer dereference, address: 0000000000000008
? __die_body+0x6e/0xb0
? __die+0x8b/0xa0
? page_fault_oops+0x358/0x3c0
? local_clock+0x19/0x30
? lock_release+0x11b/0x440
? kernelmode_fixup_or_oops+0x54/0x60
? __bad_area_nosemaphore+0x4f/0x210
? mmap_read_unlock+0x13/0x30
? bad_area_nosemaphore+0x16/0x20
? do_user_addr_fault+0x6fd/0x740
? prb_read_valid+0x1d/0x30
? exc_page_fault+0x55/0xd0
? asm_exc_page_fault+0x2b/0x30
? splice_to_socket+0x52e/0x630
? shmem_file_splice_read+0x2b1/0x310
direct_splice_actor+0x47/0x70
splice_direct_to_actor+0x133/0x300
? do_splice_direct+0x90/0x90
do_splice_direct+0x64/0x90
? __ia32_sys_tee+0x30/0x30
do_sendfile+0x214/0x300
__se_sys_sendfile64+0x8e/0xb0
__x64_sys_sendfile64+0x25/0x30
x64_sys_call+0xb82/0x2840
do_syscall_64+0x75/0x110
entry_SYSCALL_64_after_hwframe+0x4b/0x53
This is caused by tcp_bpf_sendmsg() returning a larger value(12289) than
size (8192), which causes the while loop in splice_to_socket() to release
an uninitialized pipe buf.
The underlying cause is that this code assumes sk_msg_memcopy_from_iter()
will copy all bytes upon success but it actually might only copy part of
it.
This commit changes it to use the real copied bytes.
Signed-off-by: Levi Zim <rsworktech@outlook.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@kernel.org>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241130-tcp-bpf-sendmsg-v1-2-bae583d014f3@outlook.com
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in __ip_queue_xmit() instead of passing parameters manually
to ip_route_output_ports().
Override ->flowi4_tos with the value passed as parameter since that's
required by SCTP.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/37e64ffbd9adac187b14aa9097b095f5c86e85be.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPv4 code commonly has to initialise a flowi4 structure from an IPv4
socket. This requires looking at potential IPv4 options to set the
proper destination address, call flowi4_init_output() with the correct
set of parameters and run the sk_classify_flow security hook.
Instead of reimplementing these operations in different parts of the
stack, let's define inet_sk_init_flowi4() which does all these
operations.
The first user is inet_sk_rebuild_header(), where inet_sk_init_flowi4()
replaces ip_route_output_ports(). Unlike ip_route_output_ports(), which
sets the flowi4 structure and performs the route lookup in one go,
inet_sk_init_flowi4() only initialises the flow. The route lookup is
then done by ip_route_output_flow(). Decoupling flow initialisation
from route lookup makes this new interface applicable more broadly as
it will allow some users to overwrite specific struct flowi4 members
before the route lookup.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/fd416275262b1f518d5abfcef740ce4f4a1a6522.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Under DOS, inet_peer_xrlim_allow() might be called millions
of times per second from different cpus.
Make sure to write over peer->rate_tokens and peer->rate_last
only when really needed.
Note the inherent races of this function are still there,
we do not care of precise ICMP rate limiting.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241219150330.3159027-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When we do sk_psock_verdict_apply->sk_psock_skb_ingress, an sk_msg will
be created out of the skb, and the rmem accounting of the sk_msg will be
handled by the skb.
For skmsgs in __SK_REDIRECT case of tcp_bpf_send_verdict, when redirecting
to the ingress of a socket, although we sk_rmem_schedule and add sk_msg to
the ingress_msg of sk_redir, we do not update sk_rmem_alloc. As a result,
except for the global memory limit, the rmem of sk_redir is nearly
unlimited. Thus, add sk_rmem_alloc related logic to limit the recv buffer.
Since the function sk_msg_recvmsg and __sk_psock_purge_ingress_msg are
used in these two paths. We use "msg->skb" to test whether the sk_msg is
skb backed up. If it's not, we shall do the memory accounting explicitly.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241210012039.1669389-3-zijianzhang@bytedance.com
When bpf_tcp_ingress() is called, the skmsg is being redirected to the
ingress of the destination socket. Therefore, we should charge its
receive socket buffer, instead of sending socket buffer.
Because sk_rmem_schedule() tests pfmemalloc of skb, we need to
introduce a wrapper and call it for skmsg.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241210012039.1669389-2-zijianzhang@bytedance.com
We already have enough variants of ip_route_output*() functions. We
don't need a GRE specific one in the generic route.h header file.
Furthermore, ip_route_output_gre() is only used once, in ipgre_open(),
where it can be easily replaced by a simple call to
ip_route_output_key().
While there, and for clarity, explicitly set .flowi4_scope to
RT_SCOPE_UNIVERSE instead of relying on the implicit zero
initialisation.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/ab7cba47b8558cd4bfe2dc843c38b622a95ee48e.1734527729.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPv4 FIB rules cannot match on flow label so reject requests that try to
add such rules. Do that in the IPv4 configure callback as the netlink
policy resides in the core and used by both IPv4 and IPv6.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
All inet_getpeer() callers except ip4_frag_init() don't need
to acquire a permanent refcount on the inetpeer.
They can switch to full RCU protection.
Move the refcount_inc_not_zero() into ip4_frag_init(),
so that all the other callers no longer have to
perform a pair of expensive atomic operations on
a possibly contended cache line.
inet_putpeer() no longer needs to be exported.
After this patch, my DUT can receive 8,400,000 UDP packets
per second targeting closed ports, using 50% less cpu cycles
than before.
Also change two calls to l3mdev_master_ifindex() by
l3mdev_master_ifindex_rcu() (Ido ideas)
Fixes: 8c2bd38b95 ("icmp: change the order of rate limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241215175629.1248773-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet_putpeer() will be removed in the following patch,
because we will no longer use refcounts.
Update inetpeer timestamp (p->dtime) at lookup time.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241215175629.1248773-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The Linux socket API currently allows setting SO_PRIORITY at the
socket level, applying a uniform priority to all packets sent through
that socket. The exception to this is IP_TOS, when the priority value
is calculated during the handling of
ancillary data, as implemented in commit f02db315b8 ("ipv4: IP_TOS
and IP_TTL can be specified as ancillary data").
However, this is a computed
value, and there is currently no mechanism to set a custom priority
via control messages prior to this patch.
According to this patch, if SO_PRIORITY is specified as ancillary data,
the packet is sent with the priority value set through
sockc->priority, overriding the socket-level values
set via the traditional setsockopt() method. This is analogous to
the existing support for SO_MARK, as implemented in
commit c6af0c227a ("ip: support SO_MARK cmsg").
If both cmsg SO_PRIORITY and IP_TOS are passed, then the one that
takes precedence is the last one in the cmsg list.
This patch has the side effect that raw_send_hdrinc now interprets cmsg
IP_TOS.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Suggested-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com>
Link: https://patch.msgid.link/20241213084457.45120-3-annaemesenyiri@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
adding a route metric greater than 0x7fff_ffff leads to an
unintended wrap when printing the underlying u32 as an
unsigned int (`%d`) thus incorrectly rendering the metric
as negative. Formatting using `%u` corrects the issue.
Signed-off-by: Maximilian Güntner <code@mguentner.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241212161911.51598-1-code@mguentner.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This change introduces netlink notifications for multicast address
changes. The following features are included:
* Addition and deletion of multicast addresses are reported using
RTM_NEWMULTICAST and RTM_DELMULTICAST messages with AF_INET and
AF_INET6.
* Two new notification groups: RTNLGRP_IPV4_MCADDR and
RTNLGRP_IPV6_MCADDR are introduced for receiving these events.
This change allows user space applications (e.g., ip monitor) to
efficiently track multicast group memberships by listening for netlink
events. Previously, applications relied on inefficient polling of
procfs, introducing delays. With netlink notifications, applications
receive realtime updates on multicast group membership changes,
enabling more precise metrics collection and system monitoring.
This change also unlocks the potential for implementing a wide range
of sophisticated multicast related features in user space by allowing
applications to combine kernel provided multicast address information
with user space data and communicate decisions back to the kernel for
more fine grained control. This mechanism can be used for various
purposes, including multicast filtering, IGMP/MLD offload, and
IGMP/MLD snooping.
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Co-developed-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
Link: https://lore.kernel.org/r/20180906091056.21109-1-pruddy@vyatta.att-mail.com
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current release - fix to a fix:
- rtnetlink: fix error code in rtnl_newlink()
- tipc: fix NULL deref in cleanup_bearer()
Current release - regressions:
- ip: fix warning about invalid return from in ip_route_input_rcu()
Current release - new code bugs:
- udp: fix L4 hash after reconnect
- eth: lan969x: fix cyclic dependency between modules
- eth: bnxt_en: fix potential crash when dumping FW log coredump
Previous releases - regressions:
- wifi: mac80211:
- fix a queue stall in certain cases of channel switch
- wake the queues in case of failure in resume
- splice: do not checksum AF_UNIX sockets
- virtio_net: fix BUG()s in BQL support due to incorrect accounting
of purged packets during interface stop
- eth: stmmac: fix TSO DMA API mis-usage causing oops
- eth: bnxt_en: fixes for HW GRO: GSO type on 5750X chips and
oops due to incorrect aggregation ID mask on 5760X chips
Previous releases - always broken:
- Bluetooth: improve setsockopt() handling of malformed user input
- eth: ocelot: fix PTP timestamping in presence of packet loss
- ptp: kvm: x86: avoid "fail to initialize ptp_kvm" when simply
not supported
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmdbCXgACgkQMUZtbf5S
Irtizg/9GVtTtu0OeQpHlxOpXOdqRciHDHBcjc0+rYihHazA47wtOszPg2BDiebV
1D+uTaPoxuJUZo9jDAGMerUpy6gmC8+4h9gp72oSU9uGNHTrDWsylsn16foFkmpg
hMsq+bzYr9ayekIXoI4T//PQ8MO8fqLFPdJmFPIKjkTtsrCzzARck9R4uDlWzrJj
v5cQY+q/6qnwZTvvto67ahjdKUw8k3XIRZxLDqrDiW+zUzdk9XRwK46AdP3eybcx
OCMHvXmWx6DTbjeEbzhq5YwDGAnBOE9rP4vJmpV9y+PcPDCmPzt7IDNWACcEPHY4
3vuZv3JJP/5MIqGHidDn1JYgWl/Y3iv5ZfKInG585XH+5VWemq3WL1JOS2ua6Xmu
hoGhwNTGea4KtCeutE8xSwMSBTxswkdPb93ZFPt28zKAN118chBvGLRv2jepSvQR
3AQhJ9bgGuErHMYh5vdiluRVj/4bwSIFqEH6vr6w9+DUDFiTSKERLXSJ8dc8S+9K
ghd/I8POb4VTfjZIyHzo1DJOulPXe84KGMcOuAfh0AV7o5HcuP+oNdR3+qS2Lf+G
EByIX8osZsHjqaVr5ba+KnZz2XrdO7mbE54fCKa9ZUwkNIbcCEqOJBqcMlPWxvtK
whrGDOS8ifYYK6fL6IFO5CtxBvWmQgMOYV6Sjp9J27PD4jiMrms=
=TDKt
-----END PGP SIGNATURE-----
Merge tag 'net-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, netfilter and wireless.
Current release - fix to a fix:
- rtnetlink: fix error code in rtnl_newlink()
- tipc: fix NULL deref in cleanup_bearer()
Current release - regressions:
- ip: fix warning about invalid return from in ip_route_input_rcu()
Current release - new code bugs:
- udp: fix L4 hash after reconnect
- eth: lan969x: fix cyclic dependency between modules
- eth: bnxt_en: fix potential crash when dumping FW log coredump
Previous releases - regressions:
- wifi: mac80211:
- fix a queue stall in certain cases of channel switch
- wake the queues in case of failure in resume
- splice: do not checksum AF_UNIX sockets
- virtio_net: fix BUG()s in BQL support due to incorrect accounting
of purged packets during interface stop
- eth:
- stmmac: fix TSO DMA API mis-usage causing oops
- bnxt_en: fixes for HW GRO: GSO type on 5750X chips and oops
due to incorrect aggregation ID mask on 5760X chips
Previous releases - always broken:
- Bluetooth: improve setsockopt() handling of malformed user input
- eth: ocelot: fix PTP timestamping in presence of packet loss
- ptp: kvm: x86: avoid "fail to initialize ptp_kvm" when simply not
supported"
* tag 'net-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
net: dsa: tag_ocelot_8021q: fix broken reception
net: dsa: microchip: KSZ9896 register regmap alignment to 32 bit boundaries
net: renesas: rswitch: fix initial MPIC register setting
Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
Bluetooth: iso: Fix circular lock in iso_conn_big_sync
Bluetooth: iso: Fix circular lock in iso_listen_bis
Bluetooth: SCO: Add support for 16 bits transparent voice setting
Bluetooth: iso: Fix recursive locking warning
Bluetooth: iso: Always release hdev at the end of iso_listen_bis
Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating
Bluetooth: hci_core: Fix sleeping function called from invalid context
team: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
team: Fix initial vlan_feature set in __team_compute_features
bonding: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
bonding: Fix initial {vlan,mpls}_feature set in bond_compute_features
net, team, bonding: Add netdev_base_features helper
net/sched: netem: account for backlog updates from child qdisc
net: dsa: felix: fix stuck CPU-injected packets with short taprio windows
splice: do not checksum AF_UNIX sockets
net: usb: qmi_wwan: add Telit FE910C04 compositions
...
Today we have a hardcoded delay of 1 sec before a TIME-WAIT socket can be
reused by reopening a connection. This is a safe choice based on an
assumption that the other TCP timestamp clock frequency, which is unknown
to us, may be as low as 1 Hz (RFC 7323, section 5.4).
However, this means that in the presence of short lived connections with an
RTT of couple of milliseconds, the time during which a 4-tuple is blocked
from reuse can be orders of magnitude longer that the connection lifetime.
Combined with a reduced pool of ephemeral ports, when using
IP_LOCAL_PORT_RANGE to share an egress IP address between hosts [1], the
long TIME-WAIT reuse delay can lead to port exhaustion, where all available
4-tuples are tied up in TIME-WAIT state.
Turn the reuse delay into a per-netns setting so that sysadmins can make
more aggressive assumptions about remote TCP timestamp clock frequency and
shorten the delay in order to allow connections to reincarnate faster.
Note that applications can completely bypass the TIME-WAIT delay protection
already today by locking the local port with bind() before connecting. Such
immediate connection reuse may result in PAWS failing to detect old
duplicate segments, leaving us with just the sequence number check as a
safety net.
This new configurable offers a trade off where the sysadmin can balance
between the risk of PAWS detection failing to act versus exhausting ports
by having sockets tied up in TIME-WAIT state for too long.
[1] https://lpc.events/event/16/contributions/1349/
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20241209-jakub-krn-909-poc-msec-tw-tstamp-v2-2-66aca0eed03e@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Prepare ground for TIME-WAIT socket reuse with subsecond delay.
Today the last TS.Recent update timestamp, recorded in seconds and stored
tp->ts_recent_stamp and tw->tw_ts_recent_stamp fields, has two purposes.
Firstly, it is used to track the age of the last recorded TS.Recent value
to detect when that value becomes outdated due to potential wrap-around of
the other TCP timestamp clock (RFC 7323, section 5.5).
For this purpose a second-based timestamp is completely sufficient as even
in the worst case scenario of a peer using a high resolution microsecond
timestamp, the wrap-around interval is ~36 minutes long.
Secondly, it serves as a threshold value for allowing TIME-WAIT socket
reuse. A TIME-WAIT socket can be reused only once the virtual 1 Hz clock,
ktime_get_seconds, is past the TS.Recent update timestamp.
The purpose behind delaying the TIME-WAIT socket reuse is to wait for the
other TCP timestamp clock to tick at least once before reusing the
connection. It is only then that the PAWS mechanism for the reopened
connection can detect old duplicate segments from the previous connection
incarnation (RFC 7323, appendix B.2).
In this case using a timestamp with second resolution not only blocks the
way toward allowing faster TIME-WAIT reuse after shorter subsecond delay,
but also makes it impossible to reliably delay TW reuse by one second.
As Eric Dumazet has pointed out [1], due to timestamp rounding, the TW
reuse delay will actually be between (0, 1] seconds, and 0.5 seconds on
average. We delay TW reuse for one full second only when last TS.Recent
update coincides with our virtual 1 Hz clock tick.
Considering the above, introduce a dedicated field to store a millisecond
timestamp of transition into the TIME-WAIT state. Place it in an existing
4-byte hole inside inet_timewait_sock structure to avoid an additional
memory cost.
Use the new timestamp to (i) reliably delay TIME-WAIT reuse by one second,
and (ii) prepare for configurable subsecond reuse delay in the subsequent
change.
We assume here that a full one second delay was the original intention in
[2] because it accounts for the worst case scenario of the other TCP using
the slowest recommended 1 Hz timestamp clock.
A more involved alternative would be to change the resolution of the last
TS.Recent update timestamp, tw->tw_ts_recent_stamp, to milliseconds.
[1] https://lore.kernel.org/netdev/CANn89iKB4GFd8sVzCbRttqw_96o3i2wDhX-3DraQtsceNGYwug@mail.gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20241209-jakub-krn-909-poc-msec-tw-tstamp-v2-1-66aca0eed03e@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ensure there is enough space before adding MPTCP options in
tcp_syn_options().
Without this check, 'remaining' could underflow, and causes issues. If
there is not enough space, MPTCP should not be used.
Signed-off-by: MoYuanhao <moyuanhao3676@163.com>
Fixes: cec37a6e41 ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Cc: stable@vger.kernel.org
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
[ Matt: Add Fixes, cc Stable, update Description ]
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241209-net-mptcp-check-space-syn-v1-1-2da992bb6f74@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After the blamed commit below, udp_rehash() is supposed to be called
with both local and remote addresses set.
Currently that is already the case for IPv6 sockets, but for IPv4 the
destination address is updated after rehashing.
Address the issue moving the destination address and port initialization
before rehashing.
Fixes: 1b29a730ef ("ipv6/udp: Add 4-tuple hash for connected socket")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/4761e466ab9f7542c68cdc95f248987d127044d2.1733499715.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
- Fix several issues for BPF LPM trie map which were found by
syzbot and during addition of new test cases (Hou Tao)
- Fix a missing process_iter_arg register type check in the
BPF verifier (Kumar Kartikeya Dwivedi, Tao Lyu)
- Fix several correctness gaps in the BPF verifier when
interacting with the BPF stack without CAP_PERFMON
(Kumar Kartikeya Dwivedi, Eduard Zingerman, Tao Lyu)
- Fix OOB BPF map writes when deleting elements for the case of
xsk map as well as devmap (Maciej Fijalkowski)
- Fix xsk sockets to always clear DMA mapping information when
unmapping the pool (Larysa Zaremba)
- Fix sk_mem_uncharge logic in tcp_bpf_sendmsg to only uncharge
after sent bytes have been finalized (Zijian Zhang)
- Fix BPF sockmap with vsocks which was missing a queue check
in poll and sockmap cleanup on close (Michal Luczaj)
- Fix tools infra to override makefile ARCH variable if defined
but empty, which addresses cross-building tools. (Björn Töpel)
- Fix two resolve_btfids build warnings on unresolved bpf_lsm
symbols (Thomas Weißschuh)
- Fix a NULL pointer dereference in bpftool (Amir Mohammadi)
- Fix BPF selftests to check for CONFIG_PREEMPTION instead of
CONFIG_PREEMPT (Sebastian Andrzej Siewior)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-----BEGIN PGP SIGNATURE-----
iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ1N8bhUcZGFuaWVsQGlv
Z2VhcmJveC5uZXQACgkQ2yufC7HISIO6ZAD+ITpujJgxvFGC0R7E9o3XJ7V1SpmR
SlW0lGpj6vOHTUAA/2MRoZurJSTbdT3fbWiCUgU1rMcwkoErkyxUaPuBci0D
=kgXL
-----END PGP SIGNATURE-----
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Daniel Borkmann::
- Fix several issues for BPF LPM trie map which were found by syzbot
and during addition of new test cases (Hou Tao)
- Fix a missing process_iter_arg register type check in the BPF
verifier (Kumar Kartikeya Dwivedi, Tao Lyu)
- Fix several correctness gaps in the BPF verifier when interacting
with the BPF stack without CAP_PERFMON (Kumar Kartikeya Dwivedi,
Eduard Zingerman, Tao Lyu)
- Fix OOB BPF map writes when deleting elements for the case of xsk map
as well as devmap (Maciej Fijalkowski)
- Fix xsk sockets to always clear DMA mapping information when
unmapping the pool (Larysa Zaremba)
- Fix sk_mem_uncharge logic in tcp_bpf_sendmsg to only uncharge after
sent bytes have been finalized (Zijian Zhang)
- Fix BPF sockmap with vsocks which was missing a queue check in poll
and sockmap cleanup on close (Michal Luczaj)
- Fix tools infra to override makefile ARCH variable if defined but
empty, which addresses cross-building tools. (Björn Töpel)
- Fix two resolve_btfids build warnings on unresolved bpf_lsm symbols
(Thomas Weißschuh)
- Fix a NULL pointer dereference in bpftool (Amir Mohammadi)
- Fix BPF selftests to check for CONFIG_PREEMPTION instead of
CONFIG_PREEMPT (Sebastian Andrzej Siewior)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (31 commits)
selftests/bpf: Add more test cases for LPM trie
selftests/bpf: Move test_lpm_map.c to map_tests
bpf: Use raw_spinlock_t for LPM trie
bpf: Switch to bpf mem allocator for LPM trie
bpf: Fix exact match conditions in trie_get_next_key()
bpf: Handle in-place update for full LPM trie correctly
bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
bpf: Remove unnecessary check when updating LPM trie
selftests/bpf: Add test for narrow spill into 64-bit spilled scalar
selftests/bpf: Add test for reading from STACK_INVALID slots
selftests/bpf: Introduce __caps_unpriv annotation for tests
bpf: Fix narrow scalar spill onto 64-bit spilled scalar slots
bpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_misc
samples/bpf: Remove unnecessary -I flags from libbpf EXTRA_CFLAGS
bpf: Zero index arg error string for dynptr and iter
selftests/bpf: Add tests for iter arg check
bpf: Ensure reg is PTR_TO_STACK in process_iter_arg
tools: Override makefile ARCH variable if defined, but empty
selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap
...
Define `XFRM_MODE_IPTFS` and `IPSEC_MODE_IPTFS` constants, and add these to
switch case and conditionals adjacent with the existing TUNNEL modes.
Signed-off-by: Christian Hopps <chopps@labn.net>
Tested-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
UDP send path suffers from one indirect call to ip_generic_getfrag()
We can use INDIRECT_CALL_1() to avoid it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Link: https://patch.msgid.link/20241203173617.2595451-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 612b1c0dec. On a
scenario with multiple threads blocking on a recvfrom(), we need to call
sock_def_readable() on every __udp_enqueue_schedule_skb() otherwise the
threads won't be woken up as __skb_wait_for_more_packets() is using
prepare_to_wait_exclusive().
Link: https://bugzilla.redhat.com/2308477
Fixes: 612b1c0dec ("udp: avoid calling sock_def_readable() if possible")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241202155620.1719-1-ffmancera@riseup.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
arp link failure may trigger ip_rt_bug while xfrm enabled, call trace is:
WARNING: CPU: 0 PID: 0 at net/ipv4/route.c:1241 ip_rt_bug+0x14/0x20
Modules linked in:
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-rc6-00077-g2e1b3cc9d7f7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:ip_rt_bug+0x14/0x20
Call Trace:
<IRQ>
ip_send_skb+0x14/0x40
__icmp_send+0x42d/0x6a0
ipv4_link_failure+0xe2/0x1d0
arp_error_report+0x3c/0x50
neigh_invalidate+0x8d/0x100
neigh_timer_handler+0x2e1/0x330
call_timer_fn+0x21/0x120
__run_timer_base.part.0+0x1c9/0x270
run_timer_softirq+0x4c/0x80
handle_softirqs+0xac/0x280
irq_exit_rcu+0x62/0x80
sysvec_apic_timer_interrupt+0x77/0x90
The script below reproduces this scenario:
ip xfrm policy add src 0.0.0.0/0 dst 0.0.0.0/0 \
dir out priority 0 ptype main flag localok icmp
ip l a veth1 type veth
ip a a 192.168.141.111/24 dev veth0
ip l s veth0 up
ping 192.168.141.155 -c 1
icmp_route_lookup() create input routes for locally generated packets
while xfrm relookup ICMP traffic.Then it will set input route
(dst->out = ip_rt_bug) to skb for DESTUNREACH.
For ICMP err triggered by locally generated packets, dst->dev of output
route is loopback. Generally, xfrm relookup verification is not required
on loopback interfaces (net.ipv4.conf.lo.disable_xfrm = 1).
Skip icmp relookup for locally generated packets to fix it.
Fixes: 8b7817f3a9 ("[IPSEC]: Add ICMP host relookup support")
Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241127040850.1513135-1-dongchenchen2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>