Current release - regressions:
- core: hold instance lock during NETDEV_CHANGE
- rtnetlink: fix bad unlock balance in do_setlink().
- ipv6:
- fix null-ptr-deref in addrconf_add_ifaddr().
- align behavior across nexthops during path selection
Previous releases - regressions:
- sctp: prevent transport UaF in sendmsg
- mptcp: only inc MPJoinAckHMacFailure for HMAC failures
Previous releases - always broken:
- sched:
- make ->qlen_notify() idempotent
- ensure sufficient space when sending filter netlink notifications
- sch_sfq: really don't allow 1 packet limit
- netfilter: fix incorrect avx2 match of 5th field octet
- tls: explicitly disallow disconnect
- eth: octeontx2-pf: fix VF root node parent queue priority
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmf3xusSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkud0P/iWOQB0oj0nvxl2ionPzgJEPduxuF0V6
YPyDBUzLC7Gq6NmTcdDlNJt8fE6UmKUIneghUm9Ss7LRpKv0/TPvorKMSK44Zt53
a5q49JeoI0TvnnhJesdHjiF31hrInqZmcX8OjSH8Q/SCKuy7rsgzao0vjvhd7lxm
wA6LlWnJO1Pf991nNpbjUSoAZ7CMNlEIewGkdq0+6UADC7D9VagKTgIkFKw1BvRw
2Eb2pzvdO9Pj02+l/mjdRhUzMZlr+FG+WBqXk5oKR0YZ2t3CS4O9/UUBoAn775tM
gCfzepNuAUXGX0I6h+DANCNuswWuG/IvYTdhy+hRWblYeCkILU60E8eVMlh7tpII
fUd5GSRhX1NpGNHUlDG/4b6IcjMO3ebtce2cm2Y9t2CUe7EqB0HZyvTczNroTxip
KXrXcCBuEkzxXCZhaN/CrBu8Piu8vJk/rMH5ha1khce9CkmYY+m9ruvsYjZmPI+/
P/SFkRdb/yV/SIOmay8FCJsy60t4FOtLnlDDrnygq4Q/9a7VwafebVpKS1fbTELG
ZTiELN3/PN2GUnfREf0DVLPfn9sdqrMZaclLOJpp/Zi1/RZpo52WHceXJShiu9pe
A8B+3SuPgOaLfhwyqiHlWm5moc9kNF26vlrWfFjK1GrJdxMisYwQoWD5eHfQFhDX
UaxlAmndwZa9
=wkGz
-----END PGP SIGNATURE-----
Merge tag 'net-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter.
Current release - regressions:
- core: hold instance lock during NETDEV_CHANGE
- rtnetlink: fix bad unlock balance in do_setlink()
- ipv6:
- fix null-ptr-deref in addrconf_add_ifaddr()
- align behavior across nexthops during path selection
Previous releases - regressions:
- sctp: prevent transport UaF in sendmsg
- mptcp: only inc MPJoinAckHMacFailure for HMAC failures
Previous releases - always broken:
- sched:
- make ->qlen_notify() idempotent
- ensure sufficient space when sending filter netlink notifications
- sch_sfq: really don't allow 1 packet limit
- netfilter: fix incorrect avx2 match of 5th field octet
- tls: explicitly disallow disconnect
- eth: octeontx2-pf: fix VF root node parent queue priority"
* tag 'net-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
ethtool: cmis_cdb: Fix incorrect read / write length extension
selftests: netfilter: add test case for recent mismatch bug
nft_set_pipapo: fix incorrect avx2 match of 5th field octet
net: ppp: Add bound checking for skb data on ppp_sync_txmung
net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.
ipv6: Align behavior across nexthops during path selection
net: phy: allow MDIO bus PM ops to start/stop state machine for phylink-controlled PHY
net: phy: move phy_link_change() prior to mdio_bus_phy_may_suspend()
selftests/tc-testing: sfq: check that a derived limit of 1 is rejected
net_sched: sch_sfq: move the limit validation
net_sched: sch_sfq: use a temporary work area for validating configuration
net: libwx: handle page_pool_dev_alloc_pages error
selftests: mptcp: validate MPJoin HMacFailure counters
mptcp: only inc MPJoinAckHMacFailure for HMAC failures
rtnetlink: Fix bad unlock balance in do_setlink().
net: ethtool: Don't call .cleanup_data when prepare_data fails
tc: Ensure we have enough buffer space when sending filter netlink notifications
net: libwx: Fix the wrong Rx descriptor field
octeontx2-pf: qos: fix VF root node parent queue index
selftests: tls: check that disconnect does nothing
...
A nexthop is only chosen when the calculated multipath hash falls in the
nexthop's hash region (i.e., the hash is smaller than the nexthop's hash
threshold) and when the nexthop is assigned a non-negative score by
rt6_score_route().
Commit 4d0ab3a688 ("ipv6: Start path selection from the first
nexthop") introduced an unintentional difference between the first
nexthop and the rest when the score is negative.
When the first nexthop matches, but has a negative score, the code will
currently evaluate subsequent nexthops until one is found with a
non-negative score. On the other hand, when a different nexthop matches,
but has a negative score, the code will fallback to the nexthop with
which the selection started ('match').
Align the behavior across all nexthops and fallback to 'match' when the
first nexthop matches, but has a negative score.
Fixes: 3d709f69a3 ("ipv6: Use hash-threshold instead of modulo-N")
Fixes: 4d0ab3a688 ("ipv6: Start path selection from the first nexthop")
Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Closes: https://lore.kernel.org/netdev/67efef607bc41_1ddca82948c@willemb.c.googlers.com.notmuch/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250408084316.243559-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Most UDP tunnels bind a socket to a local port, with ANY address, no
peer and no interface index specified.
Additionally it's quite common to have a single tunnel device per
namespace.
Track in each namespace the UDP tunnel socket respecting the above.
When only a single one is present, store a reference in the netns.
When such reference is not NULL, UDP tunnel GRO lookup just need to
match the incoming packet destination port vs the socket local port.
The tunnel socket never sets the reuse[port] flag[s]. When bound to no
address and interface, no other socket can exist in the same netns
matching the specified local port.
Matching packets with non-local destination addresses will be
aggregated, and eventually segmented as needed - no behavior changes
intended.
Restrict the optimization to kernel sockets only: it covers all the
relevant use-cases, and user-space owned sockets could be disconnected
and rebound after setup_udp_tunnel_sock(), breaking the uniqueness
assumption
Note that the UDP tunnel socket reference is stored into struct
netns_ipv4 for both IPv4 and IPv6 tunnels. That is intentional to keep
all the fastpath-related netns fields in the same struct and allow
cacheline-based optimization. Currently both the IPv4 and IPv6 socket
pointer share the same cacheline as the `udp_table` field.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/41d16bc8d1257d567f9344c445b4ae0b4a91ede4.1744040675.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Nexthops whose link is down are not supposed to be considered during
path selection when the "ignore_routes_with_linkdown" sysctl is set.
This is done by assigning them a negative region boundary.
However, when comparing the computed hash (unsigned) with the region
boundary (signed), the negative region boundary is treated as unsigned,
resulting in incorrect nexthop selection.
Fix by treating the computed hash as signed. Note that the computed hash
is always in range of [0, 2^31 - 1].
Fixes: 3d709f69a3 ("ipv6: Use hash-threshold instead of modulo-N")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250402114224.293392-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cited commit transitioned IPv6 path selection to use hash-threshold
instead of modulo-N. With hash-threshold, each nexthop is assigned a
region boundary in the multipath hash function's output space and a
nexthop is chosen if the calculated hash is smaller than the nexthop's
region boundary.
Hash-threshold does not work correctly if path selection does not start
with the first nexthop. For example, if fib6_select_path() is always
passed the last nexthop in the group, then it will always be chosen
because its region boundary covers the entire hash function's output
space.
Fix this by starting the selection process from the first nexthop and do
not consider nexthops for which rt6_score_route() provided a negative
score.
Fixes: 3d709f69a3 ("ipv6: Use hash-threshold instead of modulo-N")
Reported-by: Stanislav Fomichev <stfomichev@gmail.com>
Closes: https://lore.kernel.org/netdev/Z9RIyKZDNoka53EO@mini-arch/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250402114224.293392-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv6_add_dev might call dev_disable_lro which unconditionally grabs
instance lock, so it will deadlock during NETDEV_REGISTER. Switch
to netif_disable_lro.
Make sure all callers hold the instance lock as well.
Cc: Cosmin Ratiu <cratiu@nvidia.com>
Fixes: ad7c7b2172 ("net: hold netdev instance lock during sysfs operations")
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250401163452.622454-4-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using RTEXT_FILTER_SKIP_STATS is incorrectly skipping non-stats IPv6
netlink attributes on link dump. This causes issues on userspace tools,
e.g iproute2 is not rendering address generation mode as it should due
to missing netlink attribute.
Move the filling of IFLA_INET6_STATS and IFLA_INET6_ICMP6STATS to a
helper function guarded by a flag check to avoid hitting the same
situation in the future.
Fixes: d5566fd72e ("rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250402121751.3108-1-ffmancera@riseup.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When calling netlbl_conn_setattr(), addr->sa_family is used
to determine the function behavior. If sk is an IPv4 socket,
but the connect function is called with an IPv6 address,
the function calipso_sock_setattr() is triggered.
Inside this function, the following code is executed:
sk_fullsock(__sk) ? inet_sk(__sk)->pinet6 : NULL;
Since sk is an IPv4 socket, pinet6 is NULL, leading to a
null pointer dereference.
This patch fixes the issue by checking if inet6_sk(sk)
returns a NULL pointer before accessing pinet6.
Signed-off-by: Debin Zhu <mowenroot@163.com>
Signed-off-by: Bitao Ouyang <1985755126@qq.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Fixes: ceba1832b1 ("calipso: Set the calipso socket label to match the secattr.")
Link: https://patch.msgid.link/20250401124018.4763-1-mowenroot@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
icsk->icsk_timeout can be replaced by icsk->icsk_retransmit_timer.expires
This saves 8 bytes in TCP/DCCP sockets and helps for better cache locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250324203607.703850-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Revert "udp_tunnel: use static call for GRO hooks when possible"
This reverts commit 311b36574c.
Revert "udp_tunnel: create a fastpath GRO lookup."
This reverts commit 8d4880db37.
There are multiple small issues with the series. In the interest
of unblocking the merge window let's opt for a revert.
Link: https://lore.kernel.org/cover.1742557254.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmff3F0ACgkQ1w0aZmrP
KyGZDQ/7BzMeVJNjr8gRAfYtjqt+NWGr1vf6Tz8GDsGEHfgqeFrX/GyRMZi90kOV
YMB8K+HiIou5yJtY2ZWgSkGsId8aK7fHMmN5KnP4l0XL6bcwi7yP/sck2m+KdN9k
cApi/iVMVtJ4/+4MPrD6rgPcsDonj+wHwMQ3WItGNgenYDTOnEmqeEL7AK6HGTAg
kTUmjVnyws+9UllNRzgJ/67OVewzPWy8imixFl1H+ZEfM0rTuNtr0zzl6rttXIU2
w6FK6Kw3WBZYYfLelLLmtZ2UoxqVD90Y6DOPip1mMjj95jrJPSedsZfUsZivDTNn
JOIn/zLtwGjJ2hO/2rFxEEoeiqG79Fskg7fGzQ5mxVtJ1/otDc53WMHjNtQQpYNz
3xpPrwVOdCNQvorDLoDL2cInoc91ZADyJGFmLAou5NQdMbAWKsGKXEQolEiG0JEh
hmWlrzkY5cns/dSGeZDAZvyhpVSF8dnClUP2BsPU3vVYN2MbCEBH10dwOkWcUhiq
kj+1sNPnxkDiy054e708N3w0OKToHwtgJkfpEENxtI7dtCj/6sz9JHaN77RPiuzf
aCIyjrhlUslkB6q5bLznyGoiQTaqzjOVWIGPcMKNT7XbElmhxIUMh3U05SktdlXz
F9m1jIvThxPKj492i8ZEDjZQ9iBCEYm5KmnRD89aW+zf4UPbSYg=
=b5n2
-----END PGP SIGNATURE-----
Merge tag 'nf-next-25-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter updates for net-next:
1) Use kvmalloc in xt_hashlimit, from Denis Kirjanov.
2) Tighten nf_conntrack sysctl accepted values for nf_conntrack_max
and nf_ct_expect_max, from Nicolas Bouchinet.
3) Avoid lookup in nft_fib if socket is available, from Florian Westphal.
4) Initialize struct lsm_context in nfnetlink_queue to avoid
hypothetical ENOMEM errors, Chenyuan Yang.
5) Use strscpy() instead of _pad when initializing xtables table name,
kzalloc is already used to initialized the table memory area.
From Thorsten Blum.
6) Missing socket lookup by conntrack information for IPv6 traffic
in nft_socket, there is a similar chunk in IPv4, this was never
added when IPv6 NAT was introduced. From Maxim Mikityanskiy.
7) Fix clang issues with nf_tables CONFIG_MITIGATION_RETPOLINE,
from WangYuli.
* tag 'nf-next-25-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: Only use nf_skip_indirect_calls() when MITIGATION_RETPOLINE
netfilter: socket: Lookup orig tuple for IPv6 SNAT
netfilter: xtables: Use strscpy() instead of strscpy_pad()
netfilter: nfnetlink_queue: Initialize ctx to avoid memory allocation error
netfilter: fib: avoid lookup if socket is available
netfilter: conntrack: Bound nf_conntrack sysctl writes
netfilter: xt_hashlimit: replace vmalloc calls with kvmalloc
====================
Link: https://patch.msgid.link/20250323100922.59983-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet_connection_sock_af_ops.addr2sockaddr() hasn't been used at all
in the git era.
$ git grep addr2sockaddr $(git rev-list HEAD | tail -n 1)
Let's remove it.
Note that there was a 4 bytes hole after sockaddr_len and now it's
6 bytes, so the binary layout is not changed.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250318060112.3729-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
nf_sk_lookup_slow_v4 does the conntrack lookup for IPv4 packets to
restore the original 5-tuple in case of SNAT, to be able to find the
right socket (if any). Then socket_match() can correctly check whether
the socket was transparent.
However, the IPv6 counterpart (nf_sk_lookup_slow_v6) lacks this
conntrack lookup, making xt_socket fail to match on the socket when the
packet was SNATed. Add the same logic to nf_sk_lookup_slow_v6.
IPv6 SNAT is used in Kubernetes clusters for pod-to-world packets, as
pods' addresses are in the fd00::/8 ULA subnet and need to be replaced
with the node's external address. Cilium leverages Envoy to enforce L7
policies, and Envoy uses transparent sockets. Cilium inserts an iptables
prerouting rule that matches on `-m socket --transparent` and redirects
the packets to localhost, but it fails to match SNATed IPv6 packets due
to that missing conntrack lookup.
Closes: https://github.com/cilium/cilium/issues/37932
Fixes: eb31628e37 ("netfilter: nf_tables: Add support for IPv6 NAT")
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In case the fib match is used from the input hook we can avoid the fib
lookup if early demux assigned a socket for us: check that the input
interface matches sk-cached one.
Rework the existing 'lo bypass' logic to first check sk, then
for loopback interface type to elide the fib lookup.
This speeds up fib matching a little, before:
93.08 GBit/s (no rules at all)
75.1 GBit/s ("fib saddr . iif oif missing drop" in prerouting)
75.62 GBit/s ("fib saddr . iif oif missing drop" in input)
After:
92.48 GBit/s (no rules at all)
75.62 GBit/s (fib rule in prerouting)
90.37 GBit/s (fib rule in input).
Numbers for the 'no rules' and 'prerouting' are expected to
closely match in-between runs, the 3rd/input test case exercises the
the 'avoid lookup if cached ifindex in sk matches' case.
Test used iperf3 via veth interface, lo can't be used due to existing
loopback test.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Cross-merge networking fixes after downstream PR (net-6.14-rc8).
Conflict:
tools/testing/selftests/net/Makefile
03544faad7 ("selftest: net: add proc_net_pktgen")
3ed61b8938 ("selftests: net: test for lwtunnel dst ref loops")
tools/testing/selftests/net/config:
85cb3711ac ("selftests: net: Add test cases for link and peer netns")
3ed61b8938 ("selftests: net: test for lwtunnel dst ref loops")
Adjacent commits:
tools/testing/selftests/net/Makefile
c935af429e ("selftests: net: add support for testing SO_RCVMARK and SO_RCVPRIORITY")
355d940f4d ("Revert "selftests: Add IPv6 link-local address generation tests for GRE devices."")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Fix the lwtunnel_output() reentry loop in ioam6_iptunnel when the
destination is the same after transformation. Note that a check on the
destination address was already performed, but it was not enough. This
is the example of a lwtunnel user taking care of loops without relying
only on the last resort detection offered by lwtunnel.
Fixes: 8cb3bf8bff ("ipv6: ioam: Add support for the ip6ip6 encapsulation")
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Link: https://patch.msgid.link/20250314120048.12569-3-justin.iurman@uliege.be
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As mentioned in commit 648700f76b ("inet: frags:
use rhashtables for reassembly units"):
A followup patch will even remove the refcount hold/release
left from prior implementation and save a couple of atomic
operations.
This patch implements this idea, seven years later.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250312082250.1803501-5-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In the following patch, we no longer assume inet_frag_kill()
callers own a reference.
Consuming two refcounts from inet_frag_kill() would lead in UAF.
Propagate the pointer to the refs that will be consumed later
by the final inet_frag_putn() call.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250312082250.1803501-4-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
inet_frag_putn() can release multiple references
in one step.
Use it in inet_frags_free_cb().
Replace inet_frag_put(X) with inet_frag_putn(X, 1)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250312082250.1803501-2-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
While creating a new IPv6, we could get a weird -ENOMEM when
RTA_NH_ID is set and either of the conditions below is true:
1) CONFIG_IPV6_SUBTREES is enabled and rtm_src_len is specified
2) nexthop_get() fails
e.g.)
# strace ip -6 route add fe80::dead:beef:dead:beef nhid 1 from ::
recvmsg(3, {msg_iov=[{iov_base=[...[
{error=-ENOMEM, msg=[... [...]]},
[{nla_len=49, nla_type=NLMSGERR_ATTR_MSG}, "Nexthops can not be used with so"...]
]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148
Let's set err explicitly after ip_fib_metrics_init() in
ip6_route_info_create().
Fixes: f88d8ea67f ("ipv6: Plumb support for nexthop object in a fib6_info")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250312013854.61125-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
fib_check_nh_v6_gw() expects that fib6_nh_init() cleans up everything
when it fails.
Commit 7dd73168e2 ("ipv6: Always allocate pcpu memory in a fib6_nh")
moved fib_nh_common_init() before alloc_percpu_gfp() within fib6_nh_init()
but forgot to add cleanup for fib6_nh->nh_common.nhc_pcpu_rth_output in
case it fails to allocate fib6_nh->rt6i_pcpu, resulting in memleak.
Let's call fib_nh_common_release() and clear nhc_pcpu_rth_output in the
error path.
Note that we can remove the fib6_nh_release() call in nh_create_ipv6()
later in net-next.git.
Fixes: 7dd73168e2 ("ipv6: Always allocate pcpu memory in a fib6_nh")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250312010333.56001-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When updating the source/destination address, the TCP/UDP checksum needs to
be updated as well.
Fixes: bee88cd5bd ("net: add support for segmenting TCP fraglist GSO packets")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://patch.msgid.link/20250311212530.91519-1-nbd@nbd.name
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Most UDP tunnels bind a socket to a local port, with ANY address, no
peer and no interface index specified.
Additionally it's quite common to have a single tunnel device per
namespace.
Track in each namespace the UDP tunnel socket respecting the above.
When only a single one is present, store a reference in the netns.
When such reference is not NULL, UDP tunnel GRO lookup just need to
match the incoming packet destination port vs the socket local port.
The tunnel socket never sets the reuse[port] flag[s]. When bound to no
address and interface, no other socket can exist in the same netns
matching the specified local port.
Matching packets with non-local destination addresses will be
aggregated, and eventually segmented as needed - no behavior changes
intended.
Note that the UDP tunnel socket reference is stored into struct
netns_ipv4 for both IPv4 and IPv6 tunnels. That is intentional to keep
all the fastpath-related netns fields in the same struct and allow
cacheline-based optimization. Currently both the IPv4 and IPv6 socket
pointer share the same cacheline as the `udp_table` field.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/4d5c319c4471161829f50cb8436841de81a5edae.1741718157.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ECN bits in TOS are always cleared when sending in ACKs in TW. Clearing
them is problematic for TCP flows that used Accurate ECN because ECN bits
decide which service queue the packet is placed into (L4S vs Classic).
Effectively, TW ACKs are always downgraded from L4S to Classic queue
which might impact, e.g., delay the ACK will experience on the path
compared with the other packets of the flow.
Change the TW ACK sending code to differentiate:
- In tcp_v4_send_reset(), commit ba9e04a7dd ("ip: fix tos reflection
in ack and reset packets") cleans ECN bits for TW reset and this is
not affected.
- In tcp_v4_timewait_ack(), ECN bits for all TW ACKs are cleaned. But now
only ECN bits of ACKs for oow data or paws_reject are cleaned, and ECN
bits of other ACKs will not be cleaned.
- In tcp_v4_reqsk_send_ack(), commit 66b13d99d9 ("ipv4: tcp: fix TOS
value in ACK messages sent from TIME_WAIT") did not clean ECN bits of
ACKs for oow data or paws_reject. But now the ECN bits rae cleaned for
these ACKs.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With AccECN, there's one additional TCP flag to be used (AE)
and ACE field that overloads the definition of AE, CWR, and
ECE flags. As tcp_flags was previously only 1 byte, the
byte-order stuff needs to be added to it's handling.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cross-merge networking fixes after downstream PR (net-6.14-rc6).
Conflicts:
tools/testing/selftests/drivers/net/ping.py
75cc19c8ff ("selftests: drv-net: add xdp cases for ping.py")
de94e86974 ("selftests: drv-net: store addresses in dict indexed by ipver")
https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/
net/core/devmem.c
a70f891e0f ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
1d22d3060b ("net: drop rtnl_lock for queue_mgmt operations")
https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/
Adjacent changes:
tools/testing/selftests/net/Makefile
6f50175cca ("selftests: Add IPv6 link-local address generation tests for GRE devices.")
2e5584e0f9 ("selftests/net: expand cmsg_ipv6.sh with ipv4")
drivers/net/ethernet/broadcom/bnxt/bnxt.c
661958552e ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic")
fe96d717d3 ("bnxt_en: Extend queue stop/start for TX rings")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Use addrconf_addr_gen() to generate IPv6 link-local addresses on GRE
devices in most cases and fall back to using add_v4_addrs() only in
case the GRE configuration is incompatible with addrconf_addr_gen().
GRE used to use addrconf_addr_gen() until commit e5dd729460
("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL
address") restricted this use to gretap and ip6gretap devices, and
created add_v4_addrs() (borrowed from SIT) for non-Ethernet GRE ones.
The original problem came when commit 9af28511be ("addrconf: refuse
isatap eui64 for INADDR_ANY") made __ipv6_isatap_ifid() fail when its
addr parameter was 0. The commit says that this would create an invalid
address, however, I couldn't find any RFC saying that the generated
interface identifier would be wrong. Anyway, since gre over IPv4
devices pass their local tunnel address to __ipv6_isatap_ifid(), that
commit broke their IPv6 link-local address generation when the local
address was unspecified.
Then commit e5dd729460 ("ip/ip6_gre: use the same logic as SIT
interfaces when computing v6LL address") tried to fix that case by
defining add_v4_addrs() and calling it to generate the IPv6 link-local
address instead of using addrconf_addr_gen() (apart for gretap and
ip6gretap devices, which would still use the regular
addrconf_addr_gen(), since they have a MAC address).
That broke several use cases because add_v4_addrs() isn't properly
integrated into the rest of IPv6 Neighbor Discovery code. Several of
these shortcomings have been fixed over time, but add_v4_addrs()
remains broken on several aspects. In particular, it doesn't send any
Router Sollicitations, so the SLAAC process doesn't start until the
interface receives a Router Advertisement. Also, add_v4_addrs() mostly
ignores the address generation mode of the interface
(/proc/sys/net/ipv6/conf/*/addr_gen_mode), thus breaking the
IN6_ADDR_GEN_MODE_RANDOM and IN6_ADDR_GEN_MODE_STABLE_PRIVACY cases.
Fix the situation by using add_v4_addrs() only in the specific scenario
where the normal method would fail. That is, for interfaces that have
all of the following characteristics:
* run over IPv4,
* transport IP packets directly, not Ethernet (that is, not gretap
interfaces),
* tunnel endpoint is INADDR_ANY (that is, 0),
* device address generation mode is EUI64.
In all other cases, revert back to the regular addrconf_addr_gen().
Also, remove the special case for ip6gre interfaces in add_v4_addrs(),
since ip6gre devices now always use addrconf_addr_gen() instead.
Fixes: e5dd729460 ("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/559c32ce5c9976b269e6337ac9abb6a96abe5096.1741375285.git.gnault@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When spanning datagram construction over multiple send calls using
MSG_MORE, per datagram settings are configured on the first send.
That is when ip(6)_setup_cork stores these settings for subsequent use
in __ip(6)_append_data and others.
The only flag that escaped this was dontfrag. As a result, a datagram
could be constructed with df=0 on the first sendmsg, but df=1 on a
next. Which is what cmsg_ip.sh does in an upcoming MSG_MORE test in
the "diff" scenario.
Changing datagram conditions in the middle of constructing an skb
makes this already complex code path even more convoluted. It is here
unintentional. Bring this flag in line with expected sockopt/cmsg
behavior.
And stop passing ipc6 to __ip6_append_data, to avoid such issues
in the future. This is already the case for __ip_append_data.
inet6_cork had a 6 byte hole, so the 1B flag has no impact.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250307033620.411611-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As of the blamed commit ipc6.dontfrag is always initialized at the
start of udpv6_sendmsg, by ipcm6_init_sk, to either 0 or 1.
Later checks against -1 are no longer needed and the branches are now
dead code.
The blamed commit had removed those branches. But I had overlooked
this one case.
UDP has both a lockless fast path and a slower path for corked
requests. This branch remained in the fast path.
Fixes: 096208592b ("ipv6: replace ipcm6_init calls with ipcm6_init_sk")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250307033620.411611-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the more esoteric helpers for netdev instance lock to
a dedicated header. This avoids growing netdevice.h to infinity
and makes rebuilding the kernel much faster (after touching
the header with the helpers).
The main netdev_lock() / netdev_unlock() functions are used
in static inlines in netdevice.h and will probably be used
most commonly, so keep them in netdevice.h.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
SKB_DROP_REASON_UDP_CSUM can be used in four locations
when dropping a packet because of a wrong UDP checksum.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250307102002.2095238-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to speedup __inet_hash_connect(), we want to ensure hash values
for <source address, port X, destination address, destination port>
are not randomly spread, but monotonically increasing.
Goal is to allow __inet_hash_connect() to derive the hash value
of a candidate 4-tuple with a single addition in the following
patch in the series.
Given :
hash_0 = inet_ehashfn(saddr, 0, daddr, dport)
hash_sport = inet_ehashfn(saddr, sport, daddr, dport)
Then (hash_sport == hash_0 + sport) for all sport values.
As far as I know, there is no security implication with this change.
After this patch, when __inet_hash_connect() has to try XXXX candidates,
the hash table buckets are contiguous and packed, allowing
a better use of cpu caches and hardware prefetchers.
Tested:
Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
Before this patch:
utime_start=0.271607
utime_end=3.847111
stime_start=18.407684
stime_end=1997.485557
num_transactions=1350742
latency_min=0.014131929
latency_max=17.895073144
latency_mean=0.505675853
latency_stddev=2.125164772
num_samples=307884
throughput=139866.80
perf top on client:
56.86% [kernel] [k] __inet6_check_established
17.96% [kernel] [k] __inet_hash_connect
13.88% [kernel] [k] inet6_ehashfn
2.52% [kernel] [k] rcu_all_qs
2.01% [kernel] [k] __cond_resched
0.41% [kernel] [k] _raw_spin_lock
After this patch:
utime_start=0.286131
utime_end=4.378886
stime_start=11.952556
stime_end=1991.655533
num_transactions=1446830
latency_min=0.001061085
latency_max=12.075275028
latency_mean=0.376375302
latency_stddev=1.361969596
num_samples=306383
throughput=151866.56
perf top:
50.01% [kernel] [k] __inet6_check_established
20.65% [kernel] [k] __inet_hash_connect
15.81% [kernel] [k] inet6_ehashfn
2.92% [kernel] [k] rcu_all_qs
2.34% [kernel] [k] __cond_resched
0.50% [kernel] [k] _raw_spin_lock
0.34% [kernel] [k] sched_balance_trigger
0.24% [kernel] [k] queued_spin_lock_slowpath
There is indeed an increase of throughput and reduction of latency.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250305034550.879255-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.14-rc6).
Conflicts:
net/ethtool/cabletest.c
2bcf4772e4 ("net: ethtool: try to protect all callback with netdev instance lock")
637399bf7e ("net: ethtool: netlink: Allow NULL nlattrs when getting a phy_device")
No Adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add missing skb_dst_drop() to drop reference to the old dst before
adding the new dst to the skb.
Fixes: 79ff2fc31e ("ila: Cache a route to translated address")
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Link: https://patch.msgid.link/20250305081655.19032-1-justin.iurman@uliege.be
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This patch follows commit 92191dd107 ("net: ipv6: fix dst ref loops in
rpl, seg6 and ioam6 lwtunnels") and, on a second thought, the same patch
is also needed for ila (even though the config that triggered the issue
was pathological, but still, we don't want that to happen).
Fixes: 79ff2fc31e ("ila: Cache a route to translated address")
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Link: https://patch.msgid.link/20250304181039.35951-1-justin.iurman@uliege.be
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
After blamed commit rtm_to_fib_config() now calls
lwtunnel_valid_encap_type{_attr}() without RTNL held,
triggering an unlock balance in __rtnl_unlock,
as reported by syzbot [1]
IPv6 and rtm_to_nh_config() are not yet converted.
Add a temporary @rtnl_is_held parameter to lwtunnel_valid_encap_type()
and lwtunnel_valid_encap_type_attr().
While we are at it replace the two rcu_dereference()
in lwtunnel_valid_encap_type() with more appropriate
rcu_access_pointer().
[1]
syz-executor245/5836 is trying to release lock (rtnl_mutex) at:
[<ffffffff89d0e38c>] __rtnl_unlock+0x6c/0xf0 net/core/rtnetlink.c:142
but there are no more locks to release!
other info that might help us debug this:
no locks held by syz-executor245/5836.
stack backtrace:
CPU: 0 UID: 0 PID: 5836 Comm: syz-executor245 Not tainted 6.14.0-rc4-syzkaller-00873-g3424291dd242 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_unlock_imbalance_bug+0x25b/0x2d0 kernel/locking/lockdep.c:5289
__lock_release kernel/locking/lockdep.c:5518 [inline]
lock_release+0x47e/0xa30 kernel/locking/lockdep.c:5872
__mutex_unlock_slowpath+0xec/0x800 kernel/locking/mutex.c:891
__rtnl_unlock+0x6c/0xf0 net/core/rtnetlink.c:142
lwtunnel_valid_encap_type+0x38a/0x5f0 net/core/lwtunnel.c:169
lwtunnel_valid_encap_type_attr+0x113/0x270 net/core/lwtunnel.c:209
rtm_to_fib_config+0x949/0x14e0 net/ipv4/fib_frontend.c:808
inet_rtm_newroute+0xf6/0x2a0 net/ipv4/fib_frontend.c:917
rtnetlink_rcv_msg+0x791/0xcf0 net/core/rtnetlink.c:6919
netlink_rcv_skb+0x206/0x480 net/netlink/af_netlink.c:2534
netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339
netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883
sock_sendmsg_nosec net/socket.c:709 [inline]
Fixes: 1dd2af7963 ("ipv4: fib: Convert RTM_NEWROUTE and RTM_DELROUTE to per-netns RTNL.")
Reported-by: syzbot+3f18ef0f7df107a3f6a0@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/67c6f87a.050a0220.38b91b.0147.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250304125918.2763514-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When __inet_hash_connect() has to try many 4-tuples before
finding an available one, we see a high spinlock cost from
the many spin_lock_bh(&head->lock) performed in its loop.
This patch adds an RCU lookup to avoid the spinlock cost.
check_established() gets a new @rcu_lookup argument.
First reason is to not make any changes while head->lock
is not held.
Second reason is to not make this RCU lookup a second time
after the spinlock has been acquired.
Tested:
Server:
ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
Client:
ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
Before series:
utime_start=0.288582
utime_end=1.548707
stime_start=20.637138
stime_end=2002.489845
num_transactions=484453
latency_min=0.156279245
latency_max=20.922042756
latency_mean=1.546521274
latency_stddev=3.936005194
num_samples=312537
throughput=47426.00
perf top on the client:
49.54% [kernel] [k] _raw_spin_lock
25.87% [kernel] [k] _raw_spin_lock_bh
5.97% [kernel] [k] queued_spin_lock_slowpath
5.67% [kernel] [k] __inet_hash_connect
3.53% [kernel] [k] __inet6_check_established
3.48% [kernel] [k] inet6_ehashfn
0.64% [kernel] [k] rcu_all_qs
After this series:
utime_start=0.271607
utime_end=3.847111
stime_start=18.407684
stime_end=1997.485557
num_transactions=1350742
latency_min=0.014131929
latency_max=17.895073144
latency_mean=0.505675853 # Nice reduction of latency metrics
latency_stddev=2.125164772
num_samples=307884
throughput=139866.80 # 190 % increase
perf top on client:
56.86% [kernel] [k] __inet6_check_established
17.96% [kernel] [k] __inet_hash_connect
13.88% [kernel] [k] inet6_ehashfn
2.52% [kernel] [k] rcu_all_qs
2.01% [kernel] [k] __cond_resched
0.41% [kernel] [k] _raw_spin_lock
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250302124237.3913746-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When __inet_hash_connect() has to try many 4-tuples before
finding an available one, we see a high spinlock cost from
__inet_check_established() and/or __inet6_check_established().
This patch adds an RCU lookup to avoid the spinlock
acquisition when the 4-tuple is found in the hash table.
Note that there are still spin_lock_bh() calls in
__inet_hash_connect() to protect inet_bind_hashbucket,
this will be fixed later in this series.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250302124237.3913746-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The name 'netns_local' is confusing. A following commit will export it via
netlink, so let's use a more explicit name.
Reported-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Instead of using sock_kmalloc() to allocate an ip_options and then
immediately duplicate another ip_options to the newly allocated one in
ipv6_dup_options(), mptcp_copy_ip_options() and sctp_v4_copy_ip_options(),
the newly added sock_kmemdup() helper can be used to simplify the code.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/91ae749d66600ec6fb679e0e518fda6acb5c3e6f.1740735165.git.tanggeliang@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit 8d52da23b6 ("tcp: Defer ts_recent changes
until req is owned"), req->ts_recent is not changed anymore.
It is set once in tcp_openreq_init(), bpf_sk_assign_tcp_reqsk()
or cookie_tcp_reqsk_alloc() before the req can be seen by other
cpus/threads.
This completes the revert of eba20811f3 ("tcp: annotate
data-races around tcp_rsk(req)->ts_recent").
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wang Hai <wanghai38@huawei.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250301201424.2046477-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp4_check_fraglist_gro(), tcp6_check_fraglist_gro(),
udp4_gro_lookup_skb() and udp6_gro_lookup_skb()
assume RCU is held so that the net structure does not disappear.
Use dev_net_rcu() instead of dev_net() to get LOCKDEP support.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250301201424.2046477-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TCP uses of dev_net() are under RCU protection, change them
to dev_net_rcu() to get LOCKDEP support.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250301201424.2046477-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We want to add new drop reasons for packets dropped in 3WHS in the
following patches.
tcp_rcv_state_process() has to set reason to TCP_FASTOPEN,
because tcp_check_req() will conditionally overwrite the drop_reason.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250301201424.2046477-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ping_rcv() callers currently call skb_free() or consume_skb(),
forcing ping_rcv() to clone the skb.
After this patch ping_rcv() is now 'consuming' the original skb,
either moving to a socket receive queue, or dropping it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250226183437.1457318-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extend IPv6 FIB rules to match on DSCP using a mask. Unlike IPv4, also
initialize the DSCP mask when a non-zero 'tos' is specified as there is
no difference in matching between 'tos' and 'dscp'. As a side effect,
this makes it possible to match on 'dscp 0', like in IPv4.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20250220080525.831924-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When link_net is set, use it as link netns instead of dev_net(). This
prepares for rtnetlink core to create device in target netns directly,
in which case the two namespaces may be different.
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-9-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently some IPv6 tunnel drivers set tnl->net to dev_net(dev) in
ndo_init(), which is called in register_netdevice(). However, it lacks
the context of link-netns when we enable cross-net tunnels at device
registration time.
Let's move the init of tunnel link-netns before register_netdevice().
ip6_gre has already initialized netns, so just remove the redundant
assignment.
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-8-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are 4 net namespaces involved when creating links:
- source netns - where the netlink socket resides,
- target netns - where to put the device being created,
- link netns - netns associated with the device (backend),
- peer netns - netns of peer device.
Currently, two nets are passed to newlink() callback - "src_net"
parameter and "dev_net" (implicitly in net_device). They are set as
follows, depending on netlink attributes in the request.
+------------+-------------------+---------+---------+
| peer netns | IFLA_LINK_NETNSID | src_net | dev_net |
+------------+-------------------+---------+---------+
| | absent | source | target |
| absent +-------------------+---------+---------+
| | present | link | link |
+------------+-------------------+---------+---------+
| | absent | peer | target |
| present +-------------------+---------+---------+
| | present | peer | link |
+------------+-------------------+---------+---------+
When IFLA_LINK_NETNSID is present, the device is created in link netns
first and then moved to target netns. This has some side effects,
including extra ifindex allocation, ifname validation and link events.
These could be avoided if we create it in target netns from
the beginning.
On the other hand, the meaning of src_net parameter is ambiguous. It
varies depending on how parameters are passed. It is the effective
link (or peer netns) by design, but some drivers ignore it and use
dev_net instead.
To provide more netns context for drivers, this patch packs existing
newlink() parameters, along with the source netns, link netns and peer
netns, into a struct. The old "src_net" is renamed to "net" to avoid
confusion with real source netns, and will be deprecated later. The use
of src_net are converted to params->net trivially.
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-3-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extend IPv6 FIB rules to match on source and destination ports using a
mask. Note that the mask is only set when not matching on a range.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This initializes tclass and dontfrag before cmsg parsing, removing the
need for explicit checks against -1 in each caller.
Leave hlimit set to -1, because its full initialization
(in ip6_sk_dst_hoplimit) requires more state (dst, flowi6, ..).
This also prepares for calling sockcm_init in a follow-on patch.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214222720.3205500-7-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ndisc_send_redirect() is always called under rcu_read_lock().
It can use dev_net_rcu() and avoid one redundant
rcu_read_lock()/rcu_read_unlock() pair.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250214140705.2105890-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current inet_sock_set_state trace from inet_csk_clone_lock() is missing
many details :
... sock:inet_sock_set_state: family=AF_INET6 protocol=IPPROTO_TCP \
sport=4901 dport=0 \
saddr=127.0.0.6 daddr=0.0.0.0 \
saddrv6=:: daddrv6=:: \
oldstate=TCP_LISTEN newstate=TCP_SYN_RECV
Only the sport gives the listener port, no other parts of the n-tuple are correct.
In this patch, I initialize relevant fields of the new socket before
calling inet_sk_set_state(newsk, TCP_SYN_RECV).
We now have a trace including all the source/destination bits.
... sock:inet_sock_set_state: family=AF_INET6 protocol=IPPROTO_TCP \
sport=4901 dport=47648 \
saddr=127.0.0.6 daddr=127.0.0.6 \
saddrv6=2002:a05:6830:1f85:: daddrv6=2001:4860:f803:65::3 \
oldstate=TCP_LISTEN newstate=TCP_SYN_RECV
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250212131328.1514243-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mld_newpack() can be called without RTNL or RCU being held.
Note that we no longer can use sock_alloc_send_skb() because
ipv6.igmp_sk uses GFP_KERNEL allocations which can sleep.
Instead use alloc_skb() and charge the net->ipv6.igmp_sk
socket under RCU protection.
Fixes: b8ad0cbc58 ("[NETNS][IPV6] mcast - handle several network namespace")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250212141021.1663666-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The following patch will not set skb->sk from VRF path.
Let's fetch net from fib_rule->fr_net instead of sock_net(skb->sk)
in fib[46]_rule_configure().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250207072502.87775-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
igmp6_send() can be called without RTNL or RCU being held.
Extend RCU protection so that we can safely fetch the net pointer
and avoid a potential UAF.
Note that we no longer can use sock_alloc_send_skb() because
ipv6.igmp_sk uses GFP_KERNEL allocations which can sleep.
Instead use alloc_skb() and charge the net->ipv6.igmp_sk
socket under RCU protection.
Fixes: b8ad0cbc58 ("[NETNS][IPV6] mcast - handle several network namespace")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250207135841.1948589-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ndisc_send_skb() can be called without RTNL or RCU held.
Acquire rcu_read_lock() earlier, so that we can use dev_net_rcu()
and avoid a potential UAF.
Fixes: 1762f7e88e ("[NETNS][IPV6] ndisc - make socket control per namespace")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250207135841.1948589-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ndisc_alloc_skb() can be called without RTNL or RCU being held.
Add RCU protection to avoid possible UAF.
Fixes: de09334b93 ("ndisc: Introduce ndisc_alloc_skb() helper.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250207135841.1948589-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ndisc_send_redirect() is called under RCU protection, not RTNL.
It must use dev_get_by_index_rcu() instead of __dev_get_by_index()
Fixes: 2f17becfbe ("vrf: check the original netdevice for generating redirect")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250207135841.1948589-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of grabbing rcu_read_lock() from ip6_input_finish(),
do it earlier in is caller, so that ip6_input() access
to dev_net() can be validated by LOCKDEP.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250205155120.1676781-13-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
icmp6_send() must acquire rcu_read_lock() sooner to ensure
the dev_net() call done from a safe context.
Other ICMPv6 uses of dev_net() seem safe, change them to
dev_net_rcu() to get LOCKDEP support to catch bugs.
Fixes: 9a43b709a2 ("[NETNS][IPV6] icmp6 - make icmpv6_socket per namespace")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250205155120.1676781-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ip6_default_advmss() needs rcu protection to make
sure the net structure it reads does not disappear.
Fixes: 5578689a4e ("[NETNS][IPV6] route6 - make route6 per namespace")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250205155120.1676781-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 4094871db1 ("udp: only do GSO if # of segs > 1") avoided GSO
for small packets. But the kernel currently dismisses GSO requests only
after checking MTU/PMTU on gso_size. This means any packets, regardless
of their payload sizes, could be dropped when PMTU becomes smaller than
requested gso_size. We encountered this issue in production and it
caused a reliability problem that new QUIC connection cannot be
established before PMTU cache expired, while non GSO sockets still
worked fine at the same time.
Ideally, do not check any GSO related constraints when payload size is
smaller than requested gso_size, and return EMSGSIZE instead of EINVAL
on MTU/PMTU check failure to be more specific on the error cause.
Fixes: 4094871db1 ("udp: only do GSO if # of segs > 1")
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some lwtunnels have a dst cache for post-transformation dst.
If the packet destination did not change we may end up recording
a reference to the lwtunnel in its own cache, and the lwtunnel
state will never be freed.
Discovered by the ioam6.sh test, kmemleak was recently fixed
to catch per-cpu memory leaks. I'm not sure if rpl and seg6
can actually hit this, but in principle I don't see why not.
Fixes: 8cb3bf8bff ("ipv6: ioam: Add support for the ip6ip6 encapsulation")
Fixes: 6c8702c60b ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
Fixes: a7a29f9c36 ("net: ipv6: add rpl sr tunnel")
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250130031519.2716843-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmeXIF4ACgkQrB3Eaf9P
W7fRMA/+Js2x0HNA3+6SMb5nJzY6lywi1BIRAzstyfd6EsxbHlgfdWYCCpixboA0
/ZfDe7yPND/ewPIQLT9eO6hk9YzuAVhYUkdIcDC5jdFDNbh9dDqBdyu5P/5spsi9
9SdFEucoOsKBP4ejmSvtwGsVNIf/1vB8hFqYxB+vh8+d/g8PHrI3xxk+2b7KkIGS
ms+IyDCoVdCGQUOp4BGtEQbzXtx67diH5dcfwg8/DJpSMbfqO3ZFRG7gPu8C5Igt
cxVSCW67rv/zzPkGPv8B+nczAdVUZ3OFXgEWxdDCN/mUbFKwxUcIDxZVJMfBBAUP
lcjsbzmNfj2PNMLZFe/5LuU6o+sFEZdxmTPmvbb+lSYrRHx2oz2/Jb871gEj8rTC
vNZ+1Lu1k7QRjEPiO1fe85vWdmU4G81+WAzC88nD0KYLDUN4c+MmxUFQkKbAxf6p
e6VCihcKqi5Sa6R73Ohm87iyiSuv8WvkyVSM0XgQrkXWDFy5Jp2Bo25pW0QgVxK+
l/aHhDA+YHFEOZTcjZsh/EdKlQRIxBNJ3ualITkjd2T+A1WyWm0A3S+kYZQCKqiM
WGGWM3oVNXkUAaRxvURNvmXqO+hPeKfIElDeVrOUjG8zQ+EktKcg4KpDQb2BGJCj
s9ksFj0pplR4GHxUrFmkEPxJWYKpFqUYCZMJDnBnHFm1ykC7QGM=
=pg+h
-----END PGP SIGNATURE-----
Merge tag 'ipsec-2025-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2025-01-27
1) Fix incrementing the upper 32 bit sequence numbers for GSO skbs.
From Jianbo Liu.
2) Fix an out-of-bounds read on xfrm state lookup.
From Florian Westphal.
3) Fix secpath handling on packet offload mode.
From Alexandre Cassen.
4) Fix the usage of skb->sk in the xfrm layer.
5) Don't disable preemption while looking up cache state
to fix PREEMPT_RT.
From Sebastian Sewior.
* tag 'ipsec-2025-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: Don't disable preemption while looking up cache state.
xfrm: Fix the usage of skb->sk
xfrm: delete intermediate secpath entry in packet offload mode
xfrm: state: fix out-of-bounds read during lookup
xfrm: replay: Fix the update of replay_esn->oseq_hi for GSO
====================
Link: https://patch.msgid.link/20250127060757.3946314-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Let's register inet6_rtm_deladdr() with RTNL_FLAG_DOIT_PERNET and
hold rtnl_net_lock() before inet6_addr_del().
Now that inet6_addr_del() is always called under per-netns RTNL.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-12-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Let's register inet6_rtm_newaddr() with RTNL_FLAG_DOIT_PERNET
and hold rtnl_net_lock() before __dev_get_by_index().
Now that inet6_addr_add() and inet6_addr_modify() are always
called under per-netns RTNL.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-11-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet6_addr_add() and inet6_addr_modify() have the same code to validate
IPv6 lifetime that is done under RTNL.
Let's factorise it out to inet6_rtm_newaddr() so that we can validate
the lifetime without RTNL later.
Note that inet6_addr_add() is called from addrconf_add_ifaddr(), but the
lifetime is INFINITY_LIFE_TIME in the path, so expires and flags are 0.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-10-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will convert inet6_rtm_newaddr() to per-netns RTNL.
Except for IFA_F_OPTIMISTIC, cfg.ifa_flags can be set before
__dev_get_by_index().
Let's move ifa_flags setup before __dev_get_by_index() so that
we can set ifa_flags without RTNL.
Also, now it's moved before tb[IFA_CACHEINFO] in preparing for
the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet6_addr_add() is called from inet6_rtm_newaddr() and
addrconf_add_ifaddr().
inet6_addr_add() looks up dev by __dev_get_by_index(), but
it's already done in inet6_rtm_newaddr().
Let's move the 2nd lookup to addrconf_add_ifaddr() and pass
dev to inet6_addr_add().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These functions are called from inet6_ioctl() with a socket's netns
and hold RTNL.
* SIOCSIFADDR : addrconf_add_ifaddr()
* SIOCDIFADDR : addrconf_del_ifaddr()
* SIOCSIFDSTADDR : addrconf_set_dstaddr()
Let's use rtnl_net_lock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
addrconf_init() holds RTNL for blackhole_netdev, which is the global
device in init_net.
addrconf_cleanup() holds RTNL to clean up devices in init_net too.
Let's use rtnl_net_lock(&init_net) there.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
addrconf_dad_work() is per-address work and holds RTNL internally.
We can fetch netns as dev_net(ifp->idev->dev).
Let's use rtnl_net_lock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
addrconf_verify_work() is per-netns work to call addrconf_verify_rtnl()
under RTNL.
Let's use rtnl_net_lock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xfrm assumed to always have a full socket at skb->sk.
This is not always true, so fix it by converting to a
full socket before it is used.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Following fields of 'struct mr_mfc' can be updated
concurrently (no lock protection) from ip_mr_forward()
and ip6_mr_forward()
- bytes
- pkt
- wrong_if
- lastuse
They also can be read from other functions.
Convert bytes, pkt and wrong_if to atomic_long_t,
and use READ_ONCE()/WRITE_ONCE() for lastuse.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250114221049.1190631-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmd/mNkACgkQrB3Eaf9P
W7cQww/9Hnv7+wosuBxFW2o2ptgZThiwj/Ovz0oiWGWvctpN7VlpwmrDOsXQ+XMn
xF6JBFEjnJanYoDBb78D0dMffJcMcUKZopTUU/ZMitSNr8aIHYiuB4SWPG1tqxl4
Ete7Mr2m3tS96YePQNnAaRZzEuGsx3BQb28VLTWl9So81MByD2OK4fsAbYz22Gg8
7A6tDHn1mUd9b2VG+LeeBZaDDFG8C0O2x4E/8Z3DX3z1N8y3LABPwZ38jcgTviKO
1ZldGrJT+PownBydu23bWDassKE2TuVvGH9e/SOPeQj8DJ4Lmd0bafMZTY6xwfNT
RJCwhlzZUpYRXFzvcf3+U3egsqEWEemV7/LzAapdT0V9OqfLWMUh3b1jMA4KcblZ
qmlm/MhZyXutDoeuASwtM4jgM3wGwovOofrKKsb13hD9VLBs5jFZmFSw5TlbmwE3
sjv7V4pFwNyfJnwQtyMmfuuHiy8w+fzqAA2GCg8mF3OosHABH/FOvEBP8xg1Vqu1
iKlLkByfyaCFD+GxTPzqSDvSB8nDzeZBgBM/ILGuwH0OWr+0gxBzl1sxrhAkw9hC
Gf+4J3wg7EknTxfrJk4LyqfyS50GvUIzpLSkHYxDtQRd4zv75bxRzMvoZvuapTtN
GGpWYftKO8n2kgLORQ0dTtFee3c4w/No+KXWsFdtRVNpyk3N0Yw=
=U79q
-----END PGP SIGNATURE-----
Merge tag 'ipsec-next-2025-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
ipsec-next-2025-01-09
1) Implement the AGGFRAG protocol and basic IP-TFS (RFC9347) functionality.
From Christian Hopps.
2) Support ESN context update to hardware for TX.
From Jianbo Liu.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This change introduces a mechanism for notifying userspace
applications about changes to IPv6 anycast addresses via netlink. It
includes:
* Addition and deletion of IPv6 anycast addresses are reported using
RTM_NEWANYCAST and RTM_DELANYCAST.
* A new netlink group (RTNLGRP_IPV6_ACADDR) for subscribing to these
notifications.
This enables user space applications(e.g. ip monitor) to efficiently
track anycast addresses through netlink messages, improving metrics
collection and system monitoring. It also unlocks the potential for
advanced anycast management in user space, such as hardware offload
control and fine grained network control.
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250107114355.1766086-1-yuyanghuang@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Corrected the netlink message size calculation for multicast group
join/leave notifications. The previous calculation did not account for
the inclusion of both IPv4/IPv6 addresses and ifa_cacheinfo in the
payload. This fix ensures that the allocated message size is
sufficient to hold all necessary information.
This patch also includes the following improvements:
* Uses GFP_KERNEL instead of GFP_ATOMIC when holding the RTNL mutex.
* Uses nla_total_size(sizeof(struct in6_addr)) instead of
nla_total_size(16).
* Removes unnecessary EXPORT_SYMBOL().
Fixes: 2c2b61d213 ("netlink: add IGMP/MLD join/leave notifications")
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Link: https://patch.msgid.link/20241221100007.1910089-1-yuyanghuang@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If a UDP socket changes its local address while it's receiving
datagrams, as a result of connect(), there is a period during which
a lookup operation might fail to find it, after the address is changed
but before the secondary hash (port and address) and the four-tuple
hash (local and remote ports and addresses) are updated.
Secondary hash chains were introduced by commit 30fff9231f ("udp:
bind() optimisation") and, as a result, a rehash operation became
needed to make a bound socket reachable again after a connect().
This operation was introduced by commit 719f835853 ("udp: add
rehash on connect()") which isn't however a complete fix: the
socket will be found once the rehashing completes, but not while
it's pending.
This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
client sending datagrams to it. After the server receives the first
datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
the address of the sender, in order to set up a directed flow.
Now, if the client, running on a different CPU thread, happens to
send a (subsequent) datagram while the server's socket changes its
address, but is not rehashed yet, this will result in a failed
lookup and a port unreachable error delivered to the client, as
apparent from the following reproducer:
LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in
while :; do
taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
sleep 0.1 || sleep 1
taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
wait
done
where the client will eventually get ECONNREFUSED on a write()
(typically the second or third one of a given iteration):
2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused
This issue was first observed as a seldom failure in Podman's tests
checking UDP functionality while using pasta(1) to connect the
container's network namespace, which leads us to a reproducer with
the lookup error resulting in an ICMP packet on a tap device:
LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"
while :; do
./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
sleep 0.2 || sleep 1
socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
wait
cmp tmp.in tmp.out
done
Once this fails:
tmp.in tmp.out differ: char 8193, line 29
we can finally have a look at what's going on:
$ tshark -r pasta.pcap
1 0.000000 :: ? ff02::16 ICMPv6 110 Multicast Listener Report Message v2
2 0.168690 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
3 0.168767 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
4 0.168806 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
5 0.168827 c6:47:05:8d:dc:04 ? Broadcast ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
6 0.168851 9a:55:9a:55:9a:55 ? c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
7 0.168875 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
8 0.168896 88.198.0.164 ? 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
9 0.168926 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
10 0.168959 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
11 0.168989 88.198.0.161 ? 88.198.0.164 UDP 4138 60260 ? 1337 Len=4096
12 0.169010 88.198.0.161 ? 88.198.0.164 UDP 42 60260 ? 1337 Len=0
On the third datagram received, the network namespace of the container
initiates an ARP lookup to deliver the ICMP message.
In another variant of this reproducer, starting the client with:
strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &
and connecting to the socat server using a loopback address:
socat OPEN:tmp.in UDP4:localhost:1337,shut-null
we can more clearly observe a sendmmsg() call failing after the
first datagram is delivered:
[pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
[...]
[pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
[...]
[pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)
and, somewhat confusingly, after a connect() on the same socket
succeeded.
Until commit 4cdeeee925 ("net: udp: prefer listeners bound to an
address"), the race between receive address change and lookup didn't
actually cause visible issues, because, once the lookup based on the
secondary hash chain failed, we would still attempt a lookup based on
the primary hash (destination port only), and find the socket with the
outdated secondary hash.
That change, however, dropped port-only lookups altogether, as side
effect, making the race visible.
To fix this, while avoiding the need to make address changes and
rehash atomic against lookups, reintroduce primary hash lookups as
fallback, if lookups based on four-tuple and secondary hashes fail.
To this end, introduce a simplified lookup implementation, which
doesn't take care of SO_REUSEPORT groups: if we have one, there are
multiple sockets that would match the four-tuple or secondary hash,
meaning that we can't run into this race at all.
v2:
- instead of synchronising lookup operations against address change
plus rehash, reintroduce a simplified version of the original
primary hash lookup as fallback
v1:
- fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr
usage (Kuniyuki Iwashima)
- directly use sk_rcv_saddr for IPv4 receive addresses instead of
fetching inet_rcv_saddr (Kuniyuki Iwashima)
- move inet_update_saddr() to inet_hashtables.h and use that
to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
- rebase onto net-next, update commit message accordingly
Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/24147
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 30fff9231f ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The default IPv6 multipath hash policy takes the flow label into account
when calculating a multipath hash and previous patches added a flow
label selector to IPv6 FIB rules.
Allow user space to specify a flow label in route get requests by adding
a new netlink attribute and using its value to populate the "flowlabel"
field in the IPv6 flow info structure prior to a route lookup.
Deny the attribute in RTM_{NEW,DEL}ROUTE requests by checking for it in
rtm_to_fib6_config() and returning an error if present.
A subsequent patch will use this capability to test the new flow label
selector in IPv6 FIB rules.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Implement support for the new flow label selector which allows IPv6 FIB
rules to match on the flow label with a mask. Ensure that both flow
label attributes are specified (or none) and that the mask is valid.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
All inet_getpeer() callers except ip4_frag_init() don't need
to acquire a permanent refcount on the inetpeer.
They can switch to full RCU protection.
Move the refcount_inc_not_zero() into ip4_frag_init(),
so that all the other callers no longer have to
perform a pair of expensive atomic operations on
a possibly contended cache line.
inet_putpeer() no longer needs to be exported.
After this patch, my DUT can receive 8,400,000 UDP packets
per second targeting closed ports, using 50% less cpu cycles
than before.
Also change two calls to l3mdev_master_ifindex() by
l3mdev_master_ifindex_rcu() (Ido ideas)
Fixes: 8c2bd38b95 ("icmp: change the order of rate limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241215175629.1248773-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The Linux socket API currently allows setting SO_PRIORITY at the
socket level, applying a uniform priority to all packets sent through
that socket. The exception to this is IP_TOS, when the priority value
is calculated during the handling of
ancillary data, as implemented in commit f02db315b8 ("ipv4: IP_TOS
and IP_TTL can be specified as ancillary data").
However, this is a computed
value, and there is currently no mechanism to set a custom priority
via control messages prior to this patch.
According to this patch, if SO_PRIORITY is specified as ancillary data,
the packet is sent with the priority value set through
sockc->priority, overriding the socket-level values
set via the traditional setsockopt() method. This is analogous to
the existing support for SO_MARK, as implemented in
commit c6af0c227a ("ip: support SO_MARK cmsg").
If both cmsg SO_PRIORITY and IP_TOS are passed, then the one that
takes precedence is the last one in the cmsg list.
This patch has the side effect that raw_send_hdrinc now interprets cmsg
IP_TOS.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Suggested-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com>
Link: https://patch.msgid.link/20241213084457.45120-3-annaemesenyiri@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This change introduces netlink notifications for multicast address
changes. The following features are included:
* Addition and deletion of multicast addresses are reported using
RTM_NEWMULTICAST and RTM_DELMULTICAST messages with AF_INET and
AF_INET6.
* Two new notification groups: RTNLGRP_IPV4_MCADDR and
RTNLGRP_IPV6_MCADDR are introduced for receiving these events.
This change allows user space applications (e.g., ip monitor) to
efficiently track multicast group memberships by listening for netlink
events. Previously, applications relied on inefficient polling of
procfs, introducing delays. With netlink notifications, applications
receive realtime updates on multicast group membership changes,
enabling more precise metrics collection and system monitoring.
This change also unlocks the potential for implementing a wide range
of sophisticated multicast related features in user space by allowing
applications to combine kernel provided multicast address information
with user space data and communicate decisions back to the kernel for
more fine grained control. This mechanism can be used for various
purposes, including multicast filtering, IGMP/MLD offload, and
IGMP/MLD snooping.
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Co-developed-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
Link: https://lore.kernel.org/r/20180906091056.21109-1-pruddy@vyatta.att-mail.com
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
psf->sf_count[MCAST_XXX] fields are read locklessly from
ipv6_chk_mcast_addr() and igmp6_mcf_seq_show().
Add READ_ONCE() and WRITE_ONCE() annotations accordingly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241210183352.86530-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mc->mca_sfcount[MCAST_EXCLUDE] is read locklessly from
ipv6_chk_mcast_addr().
Add READ_ONCE() and WRITE_ONCE() annotations accordingly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241210183352.86530-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a label and two gotos to shorten lines by two tabulations,
to ease code review of following patches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241210183352.86530-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch mitigates the two-reallocations issue with rpl_iptunnel by
providing the dst_entry (in the cache) to the first call to
skb_cow_head(). As a result, the very first iteration would still
trigger two reallocations (i.e., empty cache), while next iterations
would only trigger a single reallocation.
Performance tests before/after applying this patch, which clearly shows
there is no impact (it even shows improvement):
- before: https://ibb.co/nQJhqwc
- after: https://ibb.co/4ZvW6wV
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Cc: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This patch mitigates the two-reallocations issue with seg6_iptunnel by
providing the dst_entry (in the cache) to the first call to
skb_cow_head(). As a result, the very first iteration would still
trigger two reallocations (i.e., empty cache), while next iterations
would only trigger a single reallocation.
Performance tests before/after applying this patch, which clearly shows
the improvement:
- before: https://ibb.co/3Cg4sNH
- after: https://ibb.co/8rQ350r
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Cc: David Lebrun <dlebrun@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This patch mitigates the two-reallocations issue with ioam6_iptunnel by
providing the dst_entry (in the cache) to the first call to
skb_cow_head(). As a result, the very first iteration may still trigger
two reallocations (i.e., empty cache), while next iterations would only
trigger a single reallocation.
Performance tests before/after applying this patch, which clearly shows
the improvement:
- inline mode:
- before: https://ibb.co/LhQ8V63
- after: https://ibb.co/x5YT2bS
- encap mode:
- before: https://ibb.co/3Cjm5m0
- after: https://ibb.co/TwpsxTC
- encap mode with tunsrc:
- before: https://ibb.co/Gpy9QPg
- after: https://ibb.co/PW1bZFT
This patch also fixes an incorrect behavior: after the insertion, the
second call to skb_cow_head() makes sure that the dev has enough
headroom in the skb for layer 2 and stuff. In that case, the "old"
dst_entry was used, which is now fixed. After discussing with Paolo, it
appears that both patches can be merged into a single one -this one-
(for the sake of readability) and target net-next.
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Define `XFRM_MODE_IPTFS` and `IPSEC_MODE_IPTFS` constants, and add these to
switch case and conditionals adjacent with the existing TUNNEL modes.
Signed-off-by: Christian Hopps <chopps@labn.net>
Tested-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
UDP send path suffers from one indirect call to ip_generic_getfrag()
We can use INDIRECT_CALL_1() to avoid it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Link: https://patch.msgid.link/20241203173617.2595451-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dst objects get leaked in ip6_negative_advice() when this function is
executed for an expired IPv6 route located in the exception table. There
are several conditions that must be fulfilled for the leak to occur:
* an ICMPv6 packet indicating a change of the MTU for the path is received,
resulting in an exception dst being created
* a TCP connection that uses the exception dst for routing packets must
start timing out so that TCP begins retransmissions
* after the exception dst expires, the FIB6 garbage collector must not run
before TCP executes ip6_negative_advice() for the expired exception dst
When TCP executes ip6_negative_advice() for an exception dst that has
expired and if no other socket holds a reference to the exception dst, the
refcount of the exception dst is 2, which corresponds to the increment
made by dst_init() and the increment made by the TCP socket for which the
connection is timing out. The refcount made by the socket is never
released. The refcount of the dst is decremented in sk_dst_reset() but
that decrement is counteracted by a dst_hold() intentionally placed just
before the sk_dst_reset() in ip6_negative_advice(). After
ip6_negative_advice() has finished, there is no other object tied to the
dst. The socket lost its reference stored in sk_dst_cache and the dst is
no longer in the exception table. The exception dst becomes a leaked
object.
As a result of this dst leak, an unbalanced refcount is reported for the
loopback device of a net namespace being destroyed under kernels that do
not contain e5f80fcf86 ("ipv6: give an IPv6 dev to blackhole_netdev"):
unregister_netdevice: waiting for lo to become free. Usage count = 2
Fix the dst leak by removing the dst_hold() in ip6_negative_advice(). The
patch that introduced the dst_hold() in ip6_negative_advice() was
92f1655aa2 ("net: fix __dst_negative_advice() race"). But 92f1655aa2
merely refactored the code with regards to the dst refcount so the issue
was present even before 92f1655aa2. The bug was introduced in
54c1a859ef ("ipv6: Don't drop cache route entry unless timer actually
expired.") where the expired cached route is deleted and the sk_dst_cache
member of the socket is set to NULL by calling dst_negative_advice() but
the refcount belonging to the socket is left unbalanced.
The IPv4 version - ipv4_negative_advice() - is not affected by this bug.
When the TCP connection times out ipv4_negative_advice() merely resets the
sk_dst_cache of the socket while decrementing the refcount of the
exception dst.
Fixes: 92f1655aa2 ("net: fix __dst_negative_advice() race")
Fixes: 54c1a859ef ("ipv6: Don't drop cache route entry unless timer actually expired.")
Link: https://lore.kernel.org/netdev/20241113105611.GA6723@incl/T/#u
Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241128085950.GA4505@incl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Several places call ip6mr_get_table() with no RCU nor RTNL lock.
Add RCU protection inside such helper and provide a lockless variant
for the few callers that already acquired the relevant lock.
Note that some users additionally reference the table outside the RCU
lock. That is actually safe as the table deletion can happen only
after all table accesses are completed.
Fixes: e2d57766e6 ("net: Provide compat support for SIOCGETMIFCNT_IN6 and SIOCGETSGCNT_IN6.")
Fixes: d7c31cbde4 ("net: ip6mr: add RTM_GETROUTE netlink op")
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The multicast route tables lifecycle, for both ipv4 and ipv6, is
protected by RCU using the RTNL lock for write access. In many
places a table pointer escapes the RCU (or RTNL) protected critical
section, but such scenarios are actually safe because tables are
deleted only at namespace cleanup time or just after allocation, in
case of default rule creation failure.
Tables freed at namespace cleanup time are assured to be alive for the
whole netns lifetime; tables freed just after creation time are never
exposed to other possible users.
Ensure that the free conditions are respected in ip{,6}mr_free_table, to
document the locking schema and to prevent future possible introduction
of 'table del' operation from breaking it.
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
RFC8981 section 3.4 says that existing temporary addresses must have their
lifetimes adjusted so that no temporary addresses should ever remain "valid"
or "preferred" longer than the incoming SLAAC Prefix Information. This would
strongly imply in Linux's case that if the "mngtmpaddr" address is deleted or
un-flagged as such, its corresponding temporary addresses must be cleared out
right away.
But now the temporary address is renewed even after ‘mngtmpaddr’ is removed
or becomes unmanaged as manage_tempaddrs() set temporary addresses
prefered/valid time to 0, and later in addrconf_verify_rtnl() all checkings
failed to remove the addresses. Fix this by deleting the temporary address
directly for these situations.
Fixes: 778964f2fd ("ipv6/addrconf: fix timing bug in tempaddr regen")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Implement ipv6 udp hash4 like that in ipv4. The major difference is that
the hash value should be calculated with udp6_ehashfn(). Besides,
ipv4-mapped ipv6 address is handled before hash() and rehash(). Export
udp_ehashfn because now we use it in udpv6 rehash.
Core procedures of hash/unhash/rehash are same as ipv4, and udpv4 and
udpv6 share the same udptable, so some functions in ipv4 hash4 can also
be shared.
Co-developed-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Co-developed-by: Fred Chen <fred.cc@alibaba-inc.com>
Signed-off-by: Fred Chen <fred.cc@alibaba-inc.com>
Co-developed-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the udp_table has two hash table, the port hash and portaddr
hash. Usually for UDP servers, all sockets have the same local port and
addr, so they are all on the same hash slot within a reuseport group.
In some applications, UDP servers use connect() to manage clients. In
particular, when firstly receiving from an unseen 4 tuple, a new socket
is created and connect()ed to the remote addr:port, and then the fd is
used exclusively by the client.
Once there are connected sks in a reuseport group, udp has to score all
sks in the same hash2 slot to find the best match. This could be
inefficient with a large number of connections, resulting in high
softirq overhead.
To solve the problem, this patch implement 4-tuple hash for connected
udp sockets. During connect(), hash4 slot is updated, as well as a
corresponding counter, hash4_cnt, in hslot2. In __udp4_lib_lookup(),
hslot4 will be searched firstly if the counter is non-zero. Otherwise,
hslot2 is used like before. Note that only connected sockets enter this
hash4 path, while un-connected ones are not affected.
hlist_nulls is used for hash4, because we probably move to another hslot
wrongly when lookup with concurrent rehash. Then we check nulls at the
list end to see if we should restart lookup. Because udp does not use
SLAB_TYPESAFE_BY_RCU, we don't need to touch sk_refcnt when lookup.
Stress test results (with 1 cpu fully used) are shown below, in pps:
(1) _un-connected_ socket as server
[a] w/o hash4: 1,825176
[b] w/ hash4: 1,831750 (+0.36%)
(2) 500 _connected_ sockets as server
[c] w/o hash4: 290860 (only 16% of [a])
[d] w/ hash4: 1,889658 (+3.1% compared with [b])
With hash4, compute_score is skipped when lookup, so [d] is slightly
better than [b].
Co-developed-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Co-developed-by: Fred Chen <fred.cc@alibaba-inc.com>
Signed-off-by: Fred Chen <fred.cc@alibaba-inc.com>
Co-developed-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preparing for udp 4-tuple hash (uhash4 for short).
To implement uhash4 without cache line missing when lookup, hslot2 is
used to record the number of hashed sockets in hslot4. Thus adding a new
struct udp_hslot_main with field hash4_cnt, which is used by hash2. The
new struct is used to avoid doubling the size of udp_hslot.
Before uhash4 lookup, firstly checking hash4_cnt to see if there are
hashed sks in hslot4. Because hslot2 is always used in lookup, there is
no cache line miss.
Related helpers are updated, and use the helpers as possible.
uhash4 is implemented in following patches.
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmc3A/gACgkQrB3Eaf9P
W7fNew//XCIhIvFYaQcP2x84T4EYB679NkGlwMATxXgn40+sp7muSwVweynEWNIu
FltfBAwYD/MxD7g519abVPMWXs/iYI5duw3vvqnxmkOoebWLLocg2VoqFIdVXlQw
/hj+1X/oNT4OKcaQAw/FAGRuYvkc90YB/rRG51RwAIR0tyBjRwfUsozMM8QX/zQI
I0cLCgGAf/kylQre+dhvUkMhXaLogMF5v0qzPxhyMBD02JaUpe6+5cdHQcmKOhqa
ksTpySYnIKIHZrLizeFGDZpinaDIph20vGaDvDXpqTYFuwvCQsZczJy02dF4otf2
2dZz6+2La+ZM+WsGIqpALqKCNhr8fOcQxCRH3eGLPBwoXXt5CFAMgJKob8hKuonW
FgJaYMBZOjYbgGah8WbEe/YsWq4y3uRs48pFtY+T5cn7AskNxIvUoLNjSS83Hlqu
PJbveiKsZygig966Q/zUFATYnvj3zEgjVEcSbK6LRyBXL79Njr8l+PZ0Zoz76tc4
bF1Xv0x+lRYmwa9rvOFaeqrP/GTe0xvlitFzuCN7HnXiN8URKnnDY2odkXYzo+Z7
MBbP8wR/CaoiAvdMw74116nAIFOW95LPtvdGJTvlS9jAOt1P7dWQ3/mFKEpItndv
cJjWzI7HKl0+85FcCDw+tmsDWWGbALUyPw96i8UgUcDGyqVKUgA=
=Ioo8
-----END PGP SIGNATURE-----
Merge tag 'ipsec-next-2024-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
ipsec-next-11-15
1) Add support for RFC 9611 per cpu xfrm state handling.
2) Add inbound and outbound xfrm state caches to speed up
state lookups.
3) Convert xfrm to dscp_t. From Guillaume Nault.
4) Fix error handling in build_aevent.
From Everest K.C.
5) Replace strncpy with strscpy_pad in copy_to_user_auth.
From Daniel Yang.
6) Fix an uninitialized symbol during acquire state insertion.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In this commit, we make ip_route_input() return skb drop reasons that come
from ip_route_input_noref().
Meanwhile, adjust all the call to it.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Soft lockups have been observed on a cluster of Linux-based edge routers
located in a highly dynamic environment. Using the `bird` service, these
routers continuously update BGP-advertised routes due to frequently
changing nexthop destinations, while also managing significant IPv6
traffic. The lockups occur during the traversal of the multipath
circular linked-list in the `fib6_select_path` function, particularly
while iterating through the siblings in the list. The issue typically
arises when the nodes of the linked list are unexpectedly deleted
concurrently on a different core—indicated by their 'next' and
'previous' elements pointing back to the node itself and their reference
count dropping to zero. This results in an infinite loop, leading to a
soft lockup that triggers a system panic via the watchdog timer.
Apply RCU primitives in the problematic code sections to resolve the
issue. Where necessary, update the references to fib6_siblings to
annotate or use the RCU APIs.
Include a test script that reproduces the issue. The script
periodically updates the routing table while generating a heavy load
of outgoing IPv6 traffic through multiple iperf3 clients. It
consistently induces infinite soft lockups within a couple of minutes.
Kernel log:
0 [ffffbd13003e8d30] machine_kexec at ffffffff8ceaf3eb
1 [ffffbd13003e8d90] __crash_kexec at ffffffff8d0120e3
2 [ffffbd13003e8e58] panic at ffffffff8cef65d4
3 [ffffbd13003e8ed8] watchdog_timer_fn at ffffffff8d05cb03
4 [ffffbd13003e8f08] __hrtimer_run_queues at ffffffff8cfec62f
5 [ffffbd13003e8f70] hrtimer_interrupt at ffffffff8cfed756
6 [ffffbd13003e8fd0] __sysvec_apic_timer_interrupt at ffffffff8cea01af
7 [ffffbd13003e8ff0] sysvec_apic_timer_interrupt at ffffffff8df1b83d
-- <IRQ stack> --
8 [ffffbd13003d3708] asm_sysvec_apic_timer_interrupt at ffffffff8e000ecb
[exception RIP: fib6_select_path+299]
RIP: ffffffff8ddafe7b RSP: ffffbd13003d37b8 RFLAGS: 00000287
RAX: ffff975850b43600 RBX: ffff975850b40200 RCX: 0000000000000000
RDX: 000000003fffffff RSI: 0000000051d383e4 RDI: ffff975850b43618
RBP: ffffbd13003d3800 R8: 0000000000000000 R9: ffff975850b40200
R10: 0000000000000000 R11: 0000000000000000 R12: ffffbd13003d3830
R13: ffff975850b436a8 R14: ffff975850b43600 R15: 0000000000000007
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
9 [ffffbd13003d3808] ip6_pol_route at ffffffff8ddb030c
10 [ffffbd13003d3888] ip6_pol_route_input at ffffffff8ddb068c
11 [ffffbd13003d3898] fib6_rule_lookup at ffffffff8ddf02b5
12 [ffffbd13003d3928] ip6_route_input at ffffffff8ddb0f47
13 [ffffbd13003d3a18] ip6_rcv_finish_core.constprop.0 at ffffffff8dd950d0
14 [ffffbd13003d3a30] ip6_list_rcv_finish.constprop.0 at ffffffff8dd96274
15 [ffffbd13003d3a98] ip6_sublist_rcv at ffffffff8dd96474
16 [ffffbd13003d3af8] ipv6_list_rcv at ffffffff8dd96615
17 [ffffbd13003d3b60] __netif_receive_skb_list_core at ffffffff8dc16fec
18 [ffffbd13003d3be0] netif_receive_skb_list_internal at ffffffff8dc176b3
19 [ffffbd13003d3c50] napi_gro_receive at ffffffff8dc565b9
20 [ffffbd13003d3c80] ice_receive_skb at ffffffffc087e4f5 [ice]
21 [ffffbd13003d3c90] ice_clean_rx_irq at ffffffffc0881b80 [ice]
22 [ffffbd13003d3d20] ice_napi_poll at ffffffffc088232f [ice]
23 [ffffbd13003d3d80] __napi_poll at ffffffff8dc18000
24 [ffffbd13003d3db8] net_rx_action at ffffffff8dc18581
25 [ffffbd13003d3e40] __do_softirq at ffffffff8df352e9
26 [ffffbd13003d3eb0] run_ksoftirqd at ffffffff8ceffe47
27 [ffffbd13003d3ec0] smpboot_thread_fn at ffffffff8cf36a30
28 [ffffbd13003d3ee8] kthread at ffffffff8cf2b39f
29 [ffffbd13003d3f28] ret_from_fork at ffffffff8ce5fa64
30 [ffffbd13003d3f50] ret_from_fork_asm at ffffffff8ce03cbb
Fixes: 66f5d6ce53 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Adrian Oliver <kernel@aoliver.ca>
Signed-off-by: Omid Ehtemam-Haghighi <omid.ehtemamhaghighi@menlosecurity.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Ido Schimmel <idosch@idosch.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241106010236.1239299-1-omid.ehtemamhaghighi@menlosecurity.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Most of the original conversion is from the spatch below,
but I edited some and left out other instances that were
either buggy after conversion (where default values don't
fit into the type) or just looked strange.
@@
expression attr, def;
expression val;
identifier fn =~ "^nla_get_.*";
fresh identifier dfn = fn ## "_default";
@@
(
-if (attr)
- val = fn(attr);
-else
- val = def;
+val = dfn(attr, def);
|
-if (!attr)
- val = def;
-else
- val = fn(attr);
+val = dfn(attr, def);
|
-if (!attr)
- return def;
-return fn(attr);
+return dfn(attr, def);
|
-attr ? fn(attr) : def
+dfn(attr, def)
|
-!attr ? def : fn(attr)
+dfn(attr, def)
)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>
Link: https://patch.msgid.link/20241108114145.0580b8684e7f.I740beeaa2f70ebfc19bfca1045a24d6151992790@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmcr+e0ACgkQ1w0aZmrP
KyFS9RAAoB5S0NZBNf60fpG6b5ZlfDEfoOUXfp91CRT1WY7KaCekH19aLf5O5HAg
iJNjdOs7R4YCMfUfbWzDj0ZYnayz0h618Nin/EufIJbAMoOnBMnb12r5DUhnMpnC
M9noDQJzyXPhlE1gR3py7I9VgwdqfRa3+EfS1uTm1NL9tv7MLuej+Z6nnmQ2Sw2/
tUfuhLyKvs3iIiegeojrsTGix4YnMNrIrQUqpJXq7jCfXHFPz10MmlR7fZuBK99s
QuphQ9Onf7SXow2bxkdhB2cS4i3+BK5fLFXHDW9cLROL+PKPenAsdnKcsXe5ViL5
Ck1/N4KxydXt3djmDWfvXFF2/BaxMHnO5S9p+1ZAE18auCz6KKzefjBo134rftgz
GN8Fu8A2OQ9pZzod/21M0m2BDAoOG3kKRq6MjoUydfc8/T0q/FVtInprjh1twIvg
3uZludlQUKANVeH3bRR16fE0Z5k8fdTX3BXxRgQNo9hrDjpnmAyJmh6XTCuS9XWJ
L0VR94QLnN/yi7U7EWdEk2VV944Pfj6aeNjFC8AWHbhP+DzvELFFZvDvQ7AHaiK1
fOZfuX4nO4i2WrnCHeVLhYxdkvGixyWxorMYcDNmRDSQirQLaARMaAMFfgI4TKkm
R/wjJKXitntN9cLiiHuParkVQZAiFH1AwuvxQgCI5uaN6Gxpivo=
=TkS8
-----END PGP SIGNATURE-----
Merge tag 'nf-next-24-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following series contains Netfilter updates for net-next:
1) Make legacy xtables configs user selectable, from Breno Leitao.
2) Fix a few sparse warnings related to percpu, from Uros Bizjak.
3) Use strscpy_pad, from Justin Stitt.
4) Use nft_trans_elem_alloc() in catchall flush, from Florian Westphal.
5) A series of 7 patches to fix false positive with CONFIG_RCU_LIST=y.
Florian also sees possible issue with 10 while module load/removal
when requesting an expression that is available via module. As for
patch 11, object is being updated so reference on the module already
exists so I don't see any real issue.
Florian says:
"Unfortunately there are many more errors, and not all are false positives.
First patches pass lockdep_commit_lock_is_held() to the rcu list traversal
macro so that those splats are avoided.
The last two patches are real code change as opposed to
'pass the transaction mutex to relax rcu check':
Those two lists are not protected by transaction mutex so could be altered
in parallel.
This targets nf-next because these are long-standing issues."
netfilter pull request 24-11-07
* tag 'nf-next-24-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: must hold rcu read lock while iterating object type list
netfilter: nf_tables: must hold rcu read lock while iterating expression type list
netfilter: nf_tables: avoid false-positive lockdep splats with basechain hook
netfilter: nf_tables: avoid false-positive lockdep splats in set walker
netfilter: nf_tables: avoid false-positive lockdep splats with flowtables
netfilter: nf_tables: avoid false-positive lockdep splats with sets
netfilter: nf_tables: avoid false-positive lockdep splat on rule deletion
netfilter: nf_tables: prefer nft_trans_elem_alloc helper
netfilter: nf_tables: replace deprecated strncpy with strscpy_pad
netfilter: nf_tables: Fix percpu address space issues in nf_tables_api.c
netfilter: Make legacy configs user selectable
====================
Link: https://patch.msgid.link/20241106234625.168468-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The CI is hitting some aperiodic hangup at device removal time in the
pmtu.sh self-test:
unregister_netdevice: waiting for veth_A-R1 to become free. Usage count = 6
ref_tracker: veth_A-R1@ffff888013df15d8 has 1/5 users at
dst_init+0x84/0x4a0
dst_alloc+0x97/0x150
ip6_dst_alloc+0x23/0x90
ip6_rt_pcpu_alloc+0x1e6/0x520
ip6_pol_route+0x56f/0x840
fib6_rule_lookup+0x334/0x630
ip6_route_output_flags+0x259/0x480
ip6_dst_lookup_tail.constprop.0+0x5c2/0x940
ip6_dst_lookup_flow+0x88/0x190
udp_tunnel6_dst_lookup+0x2a7/0x4c0
vxlan_xmit_one+0xbde/0x4a50 [vxlan]
vxlan_xmit+0x9ad/0xf20 [vxlan]
dev_hard_start_xmit+0x10e/0x360
__dev_queue_xmit+0xf95/0x18c0
arp_solicit+0x4a2/0xe00
neigh_probe+0xaa/0xf0
While the first suspect is the dst_cache, explicitly tracking the dst
owing the last device reference via probes proved such dst is held by
the nexthop in the originating fib6_info.
Similar to commit f5b51fe804 ("ipv6: route: purge exception on
removal"), we need to explicitly release the originating fib info when
disconnecting a to-be-removed device from a live ipv6 dst: move the
fib6_info cleanup into ip6_dst_ifdown().
Tested running:
./pmtu.sh cleanup_ipv6_exception
in a tight loop for more than 400 iterations with no spat, running an
unpatched kernel I observed a splat every ~10 iterations.
Fixes: f88d8ea67f ("ipv6: Plumb support for nexthop object in a fib6_info")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/604c45c188c609b732286b47ac2a451a40f6cf6d.1730828007.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Under CONFIG_PROVE_RCU_LIST + CONFIG_RCU_EXPERT
hlist_for_each_entry_rcu() provides very helpful splats, which help
to find possible issues. I missed CONFIG_RCU_EXPERT=y in my testing
config the same as described in
a3e4bf7f96 ("configs/debug: make sure PROVE_RCU_LIST=y takes effect").
The fix itself is trivial: add the very same lockdep annotations
as were used to dereference ao_info from the socket.
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20241028152645.35a8be66@kernel.org/
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://patch.msgid.link/20241030-tcp-ao-hlist-lockdep-annotate-v1-1-bf641a64d7c6@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that we can have percpu xfrm states, the number of active
states might increase. To get a better lookup performance,
we add a percpu cache to cache the used inbound xfrm states.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Tested-by: Antony Antony <antony.antony@secunet.com>
Tested-by: Tobias Brunner <tobias@strongswan.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmcXbBEACgkQrB3Eaf9P
W7e0hQ//XiBdyhArA8kYIgsCylrOr+y/uCErnIhzUTqo20uE3dMPvzQHwY1GIgiU
HYXKg49WLVxSuFtLRu32qCr0G+muU1UI5OL58IQuQ+TxKzj0hnV4BqAx+rNYhaFb
JxJhgAcQQu7VCL7/qgqGsQnhq/hhg29Rfqa1VTEZ4RthEMahPDbwyibjyOfqwSgm
fCPIl2FTkB7E0PZnwZJGxmaOJXS7g/djb+CPmBI6zxLQHG5VXY/UGNyObUvTLD9K
gV+N0u0ieyDTxpvpgh6HMAFSkORLS/PIUCAX0SZEW48+7DLbBeKMMYwegtxxJZ3D
3zaWi8uKGh5rjOslQbU4ZlpxJr7yvIV6RhGJhOPDYz5Es4EXHU7c0tZ/pma46eb0
2PJxQyTHW4O9fbybQvl0w9fUQlhjKMbv/TygJgpOIk9YUr2y8Yxc8yhmWi+669ly
e7PEi/33lqJI44gisu0BMresxJcPA3eFWje+Dzw/7N/tlLJzbWt3psRqB9u/JwVH
LD0YvXraZYvaRNzeGUfbXTrvmouhLcl15zAE8RFJBTgGJbpILviJ9NfUMOIO7Yor
BBKEWlylCm/4x5iOdVb17gFCi7uERiahbxNg3+hltAQuMvEdrhhWXp1N7esTRvkf
D1o0qR5C2k2jyc9LQNqfiGWDEOgTCt1DCdhpo2F/EtF5kSerp6s=
=Ai21
-----END PGP SIGNATURE-----
Merge tag 'ipsec-2024-10-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2024-10-22
1) Fix routing behavior that relies on L4 information
for xfrm encapsulated packets.
From Eyal Birger.
2) Remove leftovers of pernet policy_inexact lists.
From Florian Westphal.
3) Validate new SA's prefixlen when the selector family is
not set from userspace.
From Sabrina Dubroca.
4) Fix a kernel-infoleak when dumping an auth algorithm.
From Petr Vaganov.
Please pull or let me know if there are problems.
ipsec-2024-10-22
* tag 'ipsec-2024-10-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: fix one more kernel-infoleak in algo dumping
xfrm: validate new SA's prefixlen using SA family when sel.family is unset
xfrm: policy: remove last remnants of pernet inexact list
xfrm: respect ip protocols rules criteria when performing dst lookups
xfrm: extract dst lookup parameters into a struct
====================
Link: https://patch.msgid.link/20241022092226.654370-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
kernel test robot reported a section mismatch in ip6_mr_cleanup().
WARNING: modpost: vmlinux: section mismatch in reference: ip6_mr_cleanup+0x0 (section: .text) -> 0xffffffff (section: .init.rodata)
WARNING: modpost: vmlinux: section mismatch in reference: ip6_mr_cleanup+0x14 (section: .text) -> ip6mr_rtnl_msg_handlers (section: .init.rodata)
ip6_mr_cleanup() uses ip6mr_rtnl_msg_handlers[] that has
__initconst_or_module qualifier.
ip6_mr_cleanup() is only called from inet6_init() but does
not have __init qualifier.
Let's add __init to ip6_mr_cleanup().
Fixes: 3ac84e31b3 ("ipmr: Use rtnl_register_many().")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202410180139.B3HeemsC-lkp@intel.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20241017174732.39487-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The next patch will add init_srcu_struct() in rtnl_af_register(),
then we need to handle its error.
Let's add the error handling in advance to make the following
patch cleaner.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cross-merge networking fixes after downstream PR (net-6.12-rc4).
Conflicts:
107a034d5c ("net/mlx5: qos: Store rate groups in a qos domain")
1da9cfd6c4 ("net/mlx5: Unregister notifier on eswitch init failure")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We will remove rtnl_register() and rtnl_register_module() in favour
of rtnl_register_many().
When it succeeds for built-in callers, rtnl_register_many() guarantees
all rtnetlink types in the passed array are supported, and there is no
chance that a part of message types is not supported.
Let's use rtnl_register_many() instead.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241014201828.91221-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will remove rtnl_register_module() in favour of rtnl_register_many().
rtnl_register_many() will unwind the previous successful registrations
on failure and simplify module error handling.
Let's use rtnl_register_many() instead.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241014201828.91221-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sock_init_data() attaches the allocated sk pointer to the provided sock
object. If inet6_create() fails later, the sk object is released, but the
sock object retains the dangling sk pointer, which may cause use-after-free
later.
Clear the sock sk pointer on error.
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241014153808.51894-8-ignat@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If:
1) the user requested USO, but
2) there is not enough payload for GSO to kick in, and
3) the egress device doesn't offer checksum offload, then
we want to compute the L4 checksum in software early on.
In the case when we are not taking the GSO path, but it has been requested,
the software checksum fallback in skb_segment doesn't get a chance to
compute the full checksum, if the egress device can't do it. As a result we
end up sending UDP datagrams with only a partial checksum filled in, which
the peer will discard.
Fixes: 10154dbded ("udp: Allow GSO transmit from devices with no checksum offload")
Reported-by: Ivan Babrou <ivan@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20241011-uso-swcsum-fixup-v2-1-6e1ddc199af9@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since SLOB was removed and since
commit 6c6c47b063 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()"),
it is not necessary to use call_rcu when the callback only performs
kmem_cache_free. Use kfree_rcu() directly.
The changes were made using Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patch.msgid.link/20241013201704.49576-5-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This option makes legacy Netfilter Kconfig user selectable, giving users
the option to configure iptables without enabling any other config.
Make the following KConfig entries user selectable:
* BRIDGE_NF_EBTABLES_LEGACY
* IP_NF_ARPTABLES
* IP_NF_IPTABLES_LEGACY
* IP6_NF_IPTABLES_LEGACY
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
tcp_v6_send_response() send orphaned 'control packets'.
These are RST packets and also ACK packets sent from TIME_WAIT.
Some eBPF programs would prefer to have a meaningful skb->sk
pointer as much as possible.
This means that TCP can now attach TIME_WAIT sockets to outgoing
skbs.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Link: https://patch.msgid.link/20241010174817.1543642-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After we made sure no fib_seq_read() handlers needs RTNL anymore,
we can remove RTNL from fib_seq_sum().
Note that after RTNL was dropped, fib_seq_sum() result was possibly
outdated anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241009184405.3752829-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mr_call_vif_notifiers() and mr_call_mfc_notifiers() already
uses WRITE_ONCE() on the write side.
Using RTNL to protect the reads seems a big hammer.
Constify 'struct net' argument of ip6mr_rules_seq_read()
and ipmr_rules_seq_read().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241009184405.3752829-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using RTNL to protect ops->fib_rules_seq reads seems a big hammer.
Writes are protected by RTNL.
We can use READ_ONCE() when reading it.
Constify 'struct net' argument of fib6_tables_seq_read() and
fib6_rules_seq_read().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241009184405.3752829-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
commit 2384d02520 ("net/ipv6: Add anycast addresses to a global hashtable")
added inet6_acaddr_hash(), using ipv6_addr_hash() and net_hash_mix()
to get hash spreading for typical users.
However ipv6_addr_hash() is highly predictable and a malicious user
could abuse a specific hash bucket.
Switch to __ipv6_addr_jhash(). We could use a dedicated
secret, or reuse net_hash_mix() as I did in this patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241008121307.800040-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In commit 3f27fb2321 ("ipv6: addrconf: add per netns perturbation
in inet6_addr_hash()"), I added net_hash_mix() in inet6_addr_hash()
to get better hash dispersion, at a time all netns were sharing the
hash table.
Since then, commit 21a216a8fc ("ipv6/addrconf: allocate a per
netns hash table") made the hash table per netns.
We could remove the net_hash_mix() from inet6_addr_hash(), but
there is still an issue with ipv6_addr_hash().
It is highly predictable and a malicious user can easily create
thousands of IPv6 addresses all stored in the same hash bucket.
Switch to __ipv6_addr_jhash(). We could use a dedicated
secret, or reuse net_hash_mix() as I did in this patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241008120101.734521-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We need to init l3mdev unconditionally, else main routing table is searched
and incorrect result is returned unless strict (iif keyword) matching is
requested.
Next patch adds a selftest for this.
Fixes: 2a8a7c0eaa ("netfilter: nft_fib: Fix for rpath check with VRF devices")
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1761
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
icsk->icsk_pending can be read locklessly already.
Following patch in the series will add another lockless read.
Add smp_load_acquire() and smp_store_release() annotations
because following patch will add a test in tcp_write_timer(),
and READ_ONCE()/WRITE_ONCE() alone would possibly lead to races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241002173042.917928-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The last type of sockets which supports SOF_TIMESTAMPING_OPT_ID is RAW
sockets. To add new option this patch converts all callers (direct and
indirect) of _sock_tx_timestamp to provide sockcm_cookie instead of
tsflags. And while here fix __sock_tx_timestamp to receive tsflags as
__u32 instead of __u16.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://patch.msgid.link/20241001125716.2832769-3-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX
timestamps and packets sent via socket. Unfortunately, there is no way
to reliably predict socket timestamp ID value in case of error returned
by sendmsg. For UDP sockets it's impossible because of lockless
nature of UDP transmit, several threads may send packets in parallel. In
case of RAW sockets MSG_MORE option makes things complicated. More
details are in the conversation [1].
This patch adds new control message type to give user-space
software an opportunity to control the mapping between packets and
values by providing ID with each sendmsg for UDP sockets.
The documentation is also added in this patch.
[1] https://lore.kernel.org/netdev/CALCETrU0jB+kg0mhV6A8mrHfTE1D1pr1SD_B9Eaa9aDPfgHdtA@mail.gmail.com/
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://patch.msgid.link/20241001125716.2832769-2-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to ip_route_input(), instead of a plain u8, to
prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_route_input() to consider are:
* input_action_end_dx4_finish() and input_action_end_dt4() in
net/ipv6/seg6_local.c. These functions set the tos parameter to 0,
which is already a valid dscp_t value, so they don't need to be
adjusted for the new prototype.
* icmp_route_lookup(), which already has a dscp_t variable to pass as
parameter. We just need to remove the inet_dscp_to_dsfield()
conversion.
* br_nf_pre_routing_finish(), ip_options_rcv_srr() and ip4ip6_err(),
which get the DSCP directly from IPv4 headers. Define a helper to
read the .tos field of struct iphdr as dscp_t, so that these
function don't have to do the conversion manually.
While there, declare *iph as const in br_nf_pre_routing_finish(),
declare its local variables in reverse-christmas-tree order and move
the "err = ip_route_input()" assignment out of the conditional to avoid
checkpatch warning.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/e9d40781d64d3d69f4c79ac8a008b8d67a033e8d.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current release - regressions:
- eth: mlx5: fix wrong reserved field in hca_cap_2 in mlx5_ifc
- eth: am65-cpsw: fix forever loop in cleanup code
Current release - new code bugs:
- eth: mlx5: HWS, fixed double-free in error flow of creating SQ
Previous releases - regressions:
- core: avoid potential underflow in qdisc_pkt_len_init() with UFO
- core: test for not too small csum_start in virtio_net_hdr_to_skb()
- vrf: revert "vrf: remove unnecessary RCU-bh critical section"
- bluetooth:
- fix uaf in l2cap_connect
- fix possible crash on mgmt_index_removed
- dsa: improve shutdown sequence
- eth: mlx5e: SHAMPO, fix overflow of hd_per_wq
- eth: ip_gre: fix drops of small packets in ipgre_xmit
Previous releases - always broken:
- core: fix gso_features_check to check for both dev->gso_{ipv4_,}max_size
- core: fix tcp fraglist segmentation after pull from frag_list
- netfilter: nf_tables: prevent nf_skb_duplicated corruption
- sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
- mac802154: fix potential RCU dereference issue in mac802154_scan_worker
- eth: fec: restart PPS after link state change
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmb+giESHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkDowP/25YsDA8uaH5yelI85vUgp1T50MWgFxJ
ARm58Pzxr8byX6eIup95xSsjLvMbLaWj5LIA2Y49AV0fWVgGn0U8yx4mPy0Czhdg
J1oxtyoV1pR2V/okWzD4yhZV2on7OGsS73I6J1s6BAowezr19A+aa5Un57dW/103
ccwBuBOYlSIOIHmarOxuFhWMYcwXreNBHa9K7J6JtDFn9F56fUn+ZoIUJ7x27cSO
eWhh9bIkeEb+xYeUXAjNP3pBvJ1xpwIyZv+JMTp40jNsAXPjSpI3Jwd1YlAAMuT9
J2dW0Zs8uwm5LzBPFvI9iM0WHEmVy6+b32NjnKVwPn2+XGGWQss52bmRElNcJkrw
4NeG6/6CPIE0xuczBECuMa0X68NDKIZsjy3Q3OahV82ef2cwhRk6FexyIg5oiMPx
KmMi5B+UQw6ZY3ZF/ME/0jJx/H5ayOC01yNBaTUPrLJr8gjquWEMjZXEqJsdyixJ
5OoZeKG5oN6HkN7g/IxoFjg/W/g93OULO3qH+IzLQG4NlVs6Zp4ykL7dT+Py2zzc
Ru3n5+HA4PqDn2u7gmP1mu2g/lmKUIZEEvR+msP81Cywlz5qtWIH1a6oIeVC7bjt
JNhgBgzKGGMGdgmhYNzXw213WCEbz0+as2SNlvlbiqMP5FKQPLzzBVuJoz4AtJVn
cyVy7D66HuMW
=cq2I
-----END PGP SIGNATURE-----
Merge tag 'net-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from ieee802154, bluetooth and netfilter.
Current release - regressions:
- eth: mlx5: fix wrong reserved field in hca_cap_2 in mlx5_ifc
- eth: am65-cpsw: fix forever loop in cleanup code
Current release - new code bugs:
- eth: mlx5: HWS, fixed double-free in error flow of creating SQ
Previous releases - regressions:
- core: avoid potential underflow in qdisc_pkt_len_init() with UFO
- core: test for not too small csum_start in virtio_net_hdr_to_skb()
- vrf: revert "vrf: remove unnecessary RCU-bh critical section"
- bluetooth:
- fix uaf in l2cap_connect
- fix possible crash on mgmt_index_removed
- dsa: improve shutdown sequence
- eth: mlx5e: SHAMPO, fix overflow of hd_per_wq
- eth: ip_gre: fix drops of small packets in ipgre_xmit
Previous releases - always broken:
- core: fix gso_features_check to check for both
dev->gso_{ipv4_,}max_size
- core: fix tcp fraglist segmentation after pull from frag_list
- netfilter: nf_tables: prevent nf_skb_duplicated corruption
- sctp: set sk_state back to CLOSED if autobind fails in
sctp_listen_start
- mac802154: fix potential RCU dereference issue in
mac802154_scan_worker
- eth: fec: restart PPS after link state change"
* tag 'net-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (48 commits)
sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
dt-bindings: net: xlnx,axi-ethernet: Add missing reg minItems
doc: net: napi: Update documentation for napi_schedule_irqoff
net/ncsi: Disable the ncsi work before freeing the associated structure
net: phy: qt2025: Fix warning: unused import DeviceId
gso: fix udp gso fraglist segmentation after pull from frag_list
bridge: mcast: Fail MDB get request on empty entry
vrf: revert "vrf: Remove unnecessary RCU-bh critical section"
net: ethernet: ti: am65-cpsw: Fix forever loop in cleanup code
net: phy: realtek: Check the index value in led_hw_control_get
ppp: do not assume bh is held in ppp_channel_bridge_input()
selftests: rds: move include.sh to TEST_FILES
net: test for not too small csum_start in virtio_net_hdr_to_skb()
net: gso: fix tcp fraglist segmentation after pull from frag_list
ipv4: ip_gre: Fix drops of small packets in ipgre_xmit
net: stmmac: dwmac4: extend timeout for VLAN Tag register busy bit check
net: add more sanity checks to qdisc_pkt_len_init()
net: avoid potential underflow in qdisc_pkt_len_init() with UFO
net: ethernet: ti: cpsw_ale: Fix warning on some platforms
net: microchip: Make FDMA config symbol invisible
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmb9qqsACgkQ1V2XiooU
IOSplRAAsv0Rr2WRA+pDpQwcmMNWoemGtu0qB7L6IchM36P64GvldMhEgfSPCh1h
6HdV8WlkGE5Q/bOPCNbkLg/INBelADoioaOlsdOO5oc+rGUw/Z4Swcq/1PF60Vaz
tz8AOU0opAD3X50U5bqD1Z2xToonS9nz9Ql7OWAbTdn9red/2SY+H1fyDz00VIHU
X4y2GWND5Hi6KIsAGTu9OiyQKy9hb1oA5xNU1OeNY+gNsr+r+NSbX0BOMSRJTvLv
MyY0kzP+S+yTx2FGcDMqgKfo60Sb4Ru6rJXl3XKd6QxhW9Mt6adcmmlqa5edoWU3
bJYkzugl66XKh1pDkC9u7om7zOOzBhjvLObDMbcYfAVJCctsErGcRDJIvS8M+ECB
tRsxRFU2CSud4HzIeKfQUP7b16KghnBa4kTsc0r8MLcfU5D/aR/WMR62W/ua00IS
noyWqtpdNk/7yR9HMzaCbsjgm+OZbtJbOSWCNaDo4TsXf+g+jQ+cf1Nl26cE73gB
xWGcc3LKIkcjQpOU+Zu0fluF7OdnDNNTEoHprnahilBHDOtmSBDMwxAoJichCZMt
mEN1CThG0B+YwlWH9yFL1bOQs1zHHFjHfJspdtqCok+UeD20p8QD1V8mlEsAkkT/
alw0Gxa6T2KepuOF9KcMnx4IcpkqwpgkwcGXvwRWWchANgbi1ao=
=UUcp
-----END PGP SIGNATURE-----
Merge tag 'nf-24-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix incorrect documentation in uapi/linux/netfilter/nf_tables.h
regarding flowtable hooks, from Phil Sutter.
2) Fix nft_audit.sh selftests with newer nft binaries, due to different
(valid) audit output, also from Phil.
3) Disable BH when duplicating packets via nf_dup infrastructure,
otherwise race on nf_skb_duplicated for locally generated traffic.
From Eric.
4) Missing return in callback of selftest C program, from zhang jiao.
netfilter pull request 24-10-02
* tag 'nf-24-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: Add missing return value
netfilter: nf_tables: prevent nf_skb_duplicated corruption
selftests: netfilter: Fix nft_audit.sh for newer nft binaries
netfilter: uapi: NFTA_FLOWTABLE_HOOK is NLA_NESTED
====================
Link: https://patch.msgid.link/20241002202421.1281311-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Detect tcp gso fraglist skbs with corrupted geometry (see below) and
pass these to skb_segment instead of skb_segment_list, as the first
can segment them correctly.
Valid SKB_GSO_FRAGLIST skbs
- consist of two or more segments
- the head_skb holds the protocol headers plus first gso_size
- one or more frag_list skbs hold exactly one segment
- all but the last must be gso_size
Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can
modify these skbs, breaking these invariants.
In extreme cases they pull all data into skb linear. For TCP, this
causes a NULL ptr deref in __tcpv4_gso_segment_list_csum at
tcp_hdr(seg->next).
Detect invalid geometry due to pull, by checking head_skb size.
Don't just drop, as this may blackhole a destination. Convert to be
able to pass to regular skb_segment.
Approach and description based on a patch by Willem de Bruijn.
Link: https://lore.kernel.org/netdev/20240428142913.18666-1-shiming.cheng@mediatek.com/
Link: https://lore.kernel.org/netdev/20240922150450.3873767-1-willemdebruijn.kernel@gmail.com/
Fixes: bee88cd5bd ("net: add support for segmenting TCP fraglist GSO packets")
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240926085315.51524-1-nbd@nbd.name
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.
auto-generated by the following:
for i in `git grep -l -w asm/unaligned.h`; do
sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmb1P8AACgkQ1V2XiooU
IOT2KQ/9Gpf66VH41Byae9qzpgS+iRWUkN3Apn/5m7io/v0AuEmDfDRCPcOH/k8N
61m5RGBzuZETR3YhmlzzvMv5WXmHJmUCGjWm5M2b6Byji13GsdgTqJ3VXwgQXINI
tuE2bRTRzm5oBOsJvTENb5X7A3Bmjnk93N4jJSQgQNzO+fTNgiUQxszrUc2llQLS
D85VC94AtNu3fKbv+sv76yWGdR+srq2ePeN+6lDT/Hx6sqnU+uWziYaSXLTmWd9S
va+yOgi2t0gJkCZqfR/Aw8fQJSpCLWFIy4LBJa1fFX6ni462w2c7VOMPHnJ3PlOy
QG+UAH2brpRyIVn3IBzEeBDb1ZhrsHKsEaUz84LHs22XbZCCZ4xAfe0DsFmxC0o3
TW9f0RA9geRlnZOxHJRHc8I6Edi4B3oBcvbEe6PaoHeQJCUqfVJp8dgkLT0IvySJ
TWYQEx8A/fSBKmr8QQ9L/wEomTTnvLuW5GW4dyOsfoyS7DKd9wgIycujakqmowIA
ZnaXmosCtopNGrf5lxKsWYDac4VKLJufzjCj/4b7Q1BBaJXmSj0xVD0/0fSJeijk
t9nfvvOwBKBYOoZOwYK2KD+YmMwxSuHz48yE0WZANoRnTP/gwFhY9bDmonqOi7+e
L5Vbtv6QZtnChnHCSkRzXEkmKUIlzMoi607suV1jYmmDiEQoa+A=
=a9OT
-----END PGP SIGNATURE-----
Merge tag 'nf-24-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
v2: with kdoc fixes per Paolo Abeni.
The following patchset contains Netfilter fixes for net:
Patch #1 and #2 handle an esoteric scenario: Given two tasks sending UDP
packets to one another, two packets of the same flow in each direction
handled by different CPUs that result in two conntrack objects in NEW
state, where reply packet loses race. Then, patch #3 adds a testcase for
this scenario. Series from Florian Westphal.
1) NAT engine can falsely detect a port collision if it happens to pick
up a reply packet as NEW rather than ESTABLISHED. Add extra code to
detect this and suppress port reallocation in this case.
2) To complete the clash resolution in the reply direction, extend conntrack
logic to detect clashing conntrack in the reply direction to existing entry.
3) Adds a test case.
Then, an assorted list of fixes follow:
4) Add a selftest for tproxy, from Antonio Ojea.
5) Guard ctnetlink_*_size() functions under
#if defined(CONFIG_NETFILTER_NETLINK_GLUE_CT) || defined(CONFIG_NF_CONNTRACK_EVENTS)
From Andy Shevchenko.
6) Use -m socket --transparent in iptables tproxy documentation.
From XIE Zhibang.
7) Call kfree_rcu() when releasing flowtable hooks to address race with
netlink dump path, from Phil Sutter.
8) Fix compilation warning in nf_reject with CONFIG_BRIDGE_NETFILTER=n.
From Simon Horman.
9) Guard ctnetlink_label_size() under CONFIG_NF_CONNTRACK_EVENTS which
is its only user, to address a compilation warning. From Simon Horman.
10) Use rcu-protected list iteration over basechain hooks from netlink
dump path.
11) Fix memcg for nf_tables, use GFP_KERNEL_ACCOUNT is not complete.
12) Remove old nfqueue conntrack clash resolution. Instead trying to
use same destination address consistently which requires double DNAT,
use the existing clash resolution which allows clashing packets
go through with different destination. Antonio Ojea originally
reported an issue from the postrouting chain, I proposed a fix:
https://lore.kernel.org/netfilter-devel/ZuwSwAqKgCB2a51-@calendula/T/
which he reported it did not work for him.
13) Adds a selftest for patch 12.
14) Fixes ipvs.sh selftest.
netfilter pull request 24-09-26
* tag 'nf-24-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: Avoid hanging ipvs.sh
kselftest: add test for nfqueue induced conntrack race
netfilter: nfnetlink_queue: remove old clash resolution logic
netfilter: nf_tables: missing objects with no memcg accounting
netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path
netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS
netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n
netfilter: nf_tables: Keep deleted flowtable hooks until after RCU
docs: tproxy: ignore non-transparent sockets in iptables
netfilter: ctnetlink: Guard possible unused functions
selftests: netfilter: nft_tproxy.sh: add tcp tests
selftests: netfilter: add reverse-clash resolution test case
netfilter: conntrack: add clash resolution for reverse collisions
netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash
====================
Link: https://patch.msgid.link/20240926110717.102194-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If CONFIG_BRIDGE_NETFILTER is not enabled, which is the case for x86_64
defconfig, then building nf_reject_ipv4.c and nf_reject_ipv6.c with W=1
using gcc-14 results in the following warnings, which are treated as
errors:
net/ipv4/netfilter/nf_reject_ipv4.c: In function 'nf_send_reset':
net/ipv4/netfilter/nf_reject_ipv4.c:243:23: error: variable 'niph' set but not used [-Werror=unused-but-set-variable]
243 | struct iphdr *niph;
| ^~~~
cc1: all warnings being treated as errors
net/ipv6/netfilter/nf_reject_ipv6.c: In function 'nf_send_reset6':
net/ipv6/netfilter/nf_reject_ipv6.c:286:25: error: variable 'ip6h' set but not used [-Werror=unused-but-set-variable]
286 | struct ipv6hdr *ip6h;
| ^~~~
cc1: all warnings being treated as errors
Address this by reducing the scope of these local variables to where
they are used, which is code only compiled when CONFIG_BRIDGE_NETFILTER
enabled.
Compile tested and run through netfilter selftests.
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Closes: https://lore.kernel.org/netfilter-devel/20240906145513.567781-1-andriy.shevchenko@linux.intel.com/
Signed-off-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The series in the "fixes" tag added the ability to consider L4 attributes
in routing rules.
The dst lookup on the outer packet of encapsulated traffic in the xfrm
code was not adapted to this change, thus routing behavior that relies
on L4 information is not respected.
Pass the ip protocol information when performing dst lookups.
Fixes: a25724b05a ("Merge branch 'fib_rules-support-sport-dport-and-proto-match'")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Tested-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Preparation for adding more fields to dst lookup functions without
changing their signatures.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The rpl sr tunnel code contains calls to dst_cache_*() which are
only present when the dst cache is built.
Select DST_CACHE to build the dst cache, similar to other kconfig
options in the same file.
Compiling the rpl sr tunnel without DST_CACHE will lead to linker
errors.
Fixes: a7a29f9c36 ("net: ipv6: add rpl sr tunnel")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement support for the new DSCP selector that allows IPv6 FIB rules
to match on the entire DSCP field. This is done despite the fact that
the above can be achieved using the existing TOS selector, so that user
space program will be able to work with IPv4 and IPv6 rules in the same
way.
Differentiate between both selectors by adding a new bit in the IPv6 FIB
rule structure that is only set when the 'FRA_DSCP' attribute is
specified by user space. Reject rules that use both selectors.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240911093748.3662015-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Free the skb before returning from rpl_input when skb_cow_head() fails.
Use a "drop" label and goto instructions.
Fixes: a7a29f9c36 ("net: ipv6: add rpl sr tunnel")
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240911174557.11536-1-justin.iurman@uliege.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Make skb_frag_page() fail in the case where the frag is not backed
by a page, and fix its relevant callers to handle this case.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-8-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmbaOvIACgkQ1V2XiooU
IOT/oQ/+JTsIkXRn8XAOgjsbxOEOvrUPAzb72Atz/cCA0RPQkHXbdZtxLDPbcN1v
lQG6R+ZK+trS70fIMqnfSbEB/eaCWum+/kd9ZSp5RCFW4M9OVde+KTJj+IfEzsQZ
spZRR53VnAN5jSeI2U3w4iYnyCWn5Xtp2sGETrjh43yK3cirvo7sZd/+477gZiGp
qBDEgZrzcDzfm8IxJCCUeJdcNeM7ytoMhuyITT9YrvUt0Qo6+qPsx5hVFwMFly/M
WkvxCR/1DR+Unhp4a30STEPPxDR0f284WoaiuxEvNAN2yP7p7O35mcStzyfhlOh+
wB/Cc4ESBa3fPRhA+l3FDsdyrlHsi3c8VUwBWcXVryeD5e1mzyveXye9O2HtWmET
wBtukfdPORu8JBBHxf3kmv+ZLAJLjAwyO1G1DHFruL/yEAJIDq4gluxlR+71rg7n
qAZUvvV3MGQMCNIO3GlQ6ODtl0UcIUTHwW5//MEaxOC/aqWN/fr/keSz8xGE2Qkt
47TFbBiGC6UR0KD+wWGAWfOlWN4G9m7E4SG++vCkXJGio4bvyGl8TxorWsh99vCv
BMq59ZRtsS1xiEcWF48Q0Y5YtURIdCih/LcfDdbIQFzkNlHzzGpo68MHN/anqgu/
GE4JTdgjf79lfDqJDqdnQiio7P44NZqhkeUT8yQTE1xbIKsQRNY=
=Uxb1
-----END PGP SIGNATURE-----
Merge tag 'nf-next-24-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
Patch #1 adds ctnetlink support for kernel side filtering for
deletions, from Changliang Wu.
Patch #2 updates nft_counter support to Use u64_stats_t,
from Sebastian Andrzej Siewior.
Patch #3 uses kmemdup_array() in all xtables frontends,
from Yan Zhen.
Patch #4 is a oneliner to use ERR_CAST() in nf_conntrack instead
opencoded casting, from Shen Lichuan.
Patch #5 removes unused argument in nftables .validate interface,
from Florian Westphal.
Patch #6 is a oneliner to correct a typo in nftables kdoc,
from Simon Horman.
Patch #7 fixes missing kdoc in nftables, also from Simon.
Patch #8 updates nftables to handle timeout less than CONFIG_HZ.
Patch #9 rejects element expiration if timeout is zero,
otherwise it is silently ignored.
Patch #10 disallows element expiration larger than timeout.
Patch #11 removes unnecessary READ_ONCE annotation while mutex is held.
Patch #12 adds missing READ_ONCE/WRITE_ONCE annotation in dynset.
Patch #13 annotates data-races around element expiration.
Patch #14 allocates timeout and expiration in one single set element
extension, they are tighly couple, no reason to keep them
separated anymore.
Patch #15 updates nftables to interpret zero timeout element as never
times out. Note that it is already possible to declare sets
with elements that never time out but this generalizes to all
kind of set with timeouts.
Patch #16 supports for element timeout and expiration updates.
* tag 'nf-next-24-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: set element timeout update support
netfilter: nf_tables: zero timeout means element never times out
netfilter: nf_tables: consolidate timeout extension for elements
netfilter: nf_tables: annotate data-races around element expiration
netfilter: nft_dynset: annotate data-races around set timeout
netfilter: nf_tables: remove annotation to access set timeout while holding lock
netfilter: nf_tables: reject expiration higher than timeout
netfilter: nf_tables: reject element expiration with no timeout
netfilter: nf_tables: elements with timeout below CONFIG_HZ never expire
netfilter: nf_tables: Add missing Kernel doc
netfilter: nf_tables: Correct spelling in nf_tables.h
netfilter: nf_tables: drop unused 3rd argument from validate callback ops
netfilter: conntrack: Convert to use ERR_CAST()
netfilter: Use kmemdup_array instead of kmemdup for multiple allocation
netfilter: nft_counter: Use u64_stats_t for statistic.
netfilter: ctnetlink: support CTA_FILTER for flush
====================
Link: https://patch.msgid.link/20240905232920.5481-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
list_head can be initialized automatically with LIST_HEAD()
instead of calling INIT_LIST_HEAD(). Here we can simplify
the code.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://patch.msgid.link/20240904093243.3345012-5-lihongbo22@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Unmask the upper DSCP bits when calling ip_route_output_ports() so that
in the future it could perform the FIB lookup according to the full DSCP
value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240903135327.2810535-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Unmask the upper DSCP bits when calling ip_route_output_ports() so that
in the future it could perform the FIB lookup according to the full DSCP
value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240903135327.2810535-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch improves two checks on user data.
The first one prevents bit 23 from being set, as specified by RFC 9197
(Sec 4.4.1):
Bit 23 Reserved; MUST be set to zero upon transmission and be
ignored upon receipt. This bit is reserved to allow for
future extensions of the IOAM Trace-Type bit field.
The second one checks that the tunnel destination address !=
IPV6_ADDR_ANY, just like we already do for the tunnel source address.
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Link: https://patch.msgid.link/20240830191919.51439-1-justin.iurman@uliege.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
"Interface can't change network namespaces" is rather an attribute,
not a feature, and it can't be changed via Ethtool.
Make it a "cold" private flag instead of a netdev_feature and free
one more bit.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
NETIF_F_LLTX can't be changed via Ethtool and is not a feature,
rather an attribute, very similar to IFF_NO_QUEUE (and hot).
Free one netdev_features_t bit and make it a "hot" private flag.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When we are allocating an array, using kmemdup_array() to take care about
multiplication and possible overflows.
Also it makes auditing the code easier.
Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The function calls flowi4_init_output() to initialize an IPv4 flow key
with which it then performs a FIB lookup using ip_route_output_flow().
The 'tos' variable with which the TOS value in the IPv4 flow key
(flowi4_tos) is initialized contains the full DS field. Unmask the upper
DSCP bits so that in the future the FIB lookup could be performed
according to the full DSCP value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Host wide ICMP ratelimiter should be per netns, to provide better isolation.
Following patch in this series makes the sysctl per netns.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240829144641.3880376-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ICMP messages are ratelimited :
After the blamed commits, the two rate limiters are applied in this order:
1) host wide ratelimit (icmp_global_allow())
2) Per destination ratelimit (inetpeer based)
In order to avoid side-channels attacks, we need to apply
the per destination check first.
This patch makes the following change :
1) icmp_global_allow() checks if the host wide limit is reached.
But credits are not yet consumed. This is deferred to 3)
2) The per destination limit is checked/updated.
This might add a new node in inetpeer tree.
3) icmp_global_consume() consumes tokens if prior operations succeeded.
This means that host wide ratelimit is still effective
in keeping inetpeer tree small even under DDOS.
As a bonus, I removed icmp_global.lock as the fast path
can use a lock-free operation.
Fixes: c0303efeab ("net: reduce cycles spend on ICMP replies that gets rate limited")
Fixes: 4cdf507d54 ("icmp: add a global rate limitation")
Reported-by: Keyu Man <keyu.man@email.ucr.edu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20240829144641.3880376-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When doing further testing of the recently added SO_PEEK_OFF feature
for TCP I realized I had omitted to add it for TCP/IPv6.
I do that here.
Fixes: 05ea491641 ("tcp: add support for SO_PEEK_OFF socket option")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20240828183752.660267-2-jmaloy@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
No lock protects tcp tw fields.
tcptw->tw_rcv_nxt can be changed from twsk_rcv_nxt_update()
while other threads might read this field.
Add READ_ONCE()/WRITE_ONCE() annotations, and make sure
tcp_timewait_state_process() reads tcptw->tw_rcv_nxt only once.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20240827015250.3509197-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using a volatile qualifier for a specific struct field is unusual.
Use instead READ_ONCE()/WRITE_ONCE() where necessary.
tcp_timewait_state_process() can change tw_substate while other
threads are reading this field.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20240827015250.3509197-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv6_setsockopt() can directly call ip_setsockopt()
instead of going through udp_prot.setsockopt()
ipv6_getsockopt() can directly call ip_getsockopt()
instead of going through udp_prot.getsockopt()
These indirections predate git history, not sure why they
were there.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240823140019.3727643-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When coping sockaddr in ip6_mc_msfget(), the time of copies
depends on the minimum value between sl_count and gf_numsrc.
Using min() here is very semantic.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240822133908.1042240-7-lizetao1@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmbHtrsACgkQ1V2XiooU
IOTZ8g/+OW8W468NmBHA2zrTWei19irA1iCBLvXMPakice0+ADU51eqVp6uUrCeP
iBUZGMtCq4WzFBrAEBePK3UNxxMHquWvAsA0kO/XW95KVM++s9ykF62q89jugMb3
CADEv/TxgJrkzpLWclxHNTCWMKpURijlkjT+kCMR4fKbeQnB6e/jI+2sdl7l5iRG
tHHm8ieewNNKE+jlSUJUrPEIM3tXRaZh9+JmbClfsF6wUw7qLmT/7P92aHBX4Owp
tpqi/Xc5/2k+Ud96a8u1NYrLLG+L70uz3SaeE7PvhaRavFuYftk2XLB4L2umtEfb
ZCZO/lCadH23XrVAUs5EtCDk4Tu3rZdTDsKYm2qS66uBsh/e6hg+j/cIPSO8jsNq
5Zbs/XzPFJ1PUpXVy8Sfs9vxH+cDuiqhy9nfKrbQotsqtoW+z52UoFH4WAjfmpqb
XMI+yeSTXYl1KIo2LV408VFRFRGcstBvXE7bOn7ufSrltRZcFdx7wqQBgVbh1zvA
1NTzIguZ+Lf2hPcNLPQd/f2vghKRTI7gUwzlDRw6so6NOWUM6/yV5KyoiZKekHjC
S6+M8cdiyMH8DmSsvAb46YKtDxYuIHqLVxuVqjfBHrMo1hLIo5smMCeCRA1Vabd5
/E4DTwpN5tVWX+HZl1wcAtQpXhcTktWM0qGSPnlRS11gwbAGWvk=
=O8tu
-----END PGP SIGNATURE-----
Merge tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter updates for net-next:
Patch #1 fix checksum calculation in nfnetlink_queue with SCTP,
segment GSO packet since skb_zerocopy() does not support
GSO_BY_FRAGS, from Antonio Ojea.
Patch #2 extend nfnetlink_queue coverage to handle SCTP packets,
from Antonio Ojea.
Patch #3 uses consume_skb() instead of kfree_skb() in nfnetlink,
from Donald Hunter.
Patch #4 adds a dedicate commit list for sets to speed up
intra-transaction lookups, from Florian Westphal.
Patch #5 skips removal of element from abort path for the pipapo
backend, ditching the shadow copy of this datastructure
is sufficient.
Patch #6 moves nf_ct_netns_get() out of nf_conncount_init() to
let users of conncoiunt decide when to enable conntrack,
this is needed by openvswitch, from Xin Long.
Patch #7 pass context to all nft_parse_register_load() in
preparation for the next patch.
Patches #8 and #9 reject loads from uninitialized registers from
control plane to remove register initialization from
datapath. From Florian Westphal.
* tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: don't initialize registers in nft_do_chain()
netfilter: nf_tables: allow loads only when register is initialized
netfilter: nf_tables: pass context structure to nft_parse_register_load
netfilter: move nf_ct_netns_get out of nf_conncount_init
netfilter: nf_tables: do not remove elements if set backend implements .abort
netfilter: nf_tables: store new sets in dedicated list
netfilter: nfnetlink: convert kfree_skb to consume_skb
selftests: netfilter: nft_queue.sh: sctp coverage
netfilter: nfnetlink_queue: unbreak SCTP traffic
====================
Link: https://patch.msgid.link/20240822221939.157858-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/broadcom/bnxt/bnxt.h
c948c0973d ("bnxt_en: Don't clear ntuple filters and rss contexts during ethtool ops")
f2878cdeb7 ("bnxt_en: Add support to call FW to update a VNIC")
Link: https://patch.msgid.link/20240822210125.1542769-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch provides a new feature (i.e., "tunsrc") for the tunnel (i.e.,
"encap") mode of ioam6. Just like seg6 already does, except it is
attached to a route. The "tunsrc" is optional: when not provided (by
default), the automatic resolution is applied. Using "tunsrc" when
possible has a benefit: performance. See the comparison:
- before (= "encap" mode): https://ibb.co/bNCzvf7
- after (= "encap" mode with "tunsrc"): https://ibb.co/PT8L6yq
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This patch prepares the next one by correcting the alignment of some
lines.
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If skb_expand_head() returns NULL, skb has been freed
and the associated dst/idev could also have been freed.
We must use rcu_read_lock() to prevent a possible UAF.
Fixes: 0c9f227bee ("ipv6: use skb_expand_head in ip6_xmit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vasily Averin <vasily.averin@linux.dev>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240820160859.3786976-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If skb_expand_head() returns NULL, skb has been freed
and associated dst/idev could also have been freed.
We need to hold rcu_read_lock() to make sure the dst and
associated idev are alive.
Fixes: 5796015fa9 ("ipv6: allocate enough headroom in ip6_finish_output2()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vasily Averin <vasily.averin@linux.dev>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240820160859.3786976-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
err varibale will be set everytime,like -ENOBUFS and in if (err < 0),
when code gets into this path. This check will just slowdown
the execution and that's all.
Signed-off-by: Xi Huang <xuiagnh@gmail.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20240820115442.49366-1-xuiagnh@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>