Since SLOB was removed and since
commit 6c6c47b063 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()"),
it is not necessary to use call_rcu when the callback only performs
kmem_cache_free. Use kfree_rcu() directly.
The changes were made using Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patch.msgid.link/20241013201704.49576-4-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since SLOB was removed and since
commit 6c6c47b063 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()"),
it is not necessary to use call_rcu when the callback only performs
kmem_cache_free. Use kfree_rcu() directly.
The changes were made using Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patch.msgid.link/20241013201704.49576-3-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This option makes legacy Netfilter Kconfig user selectable, giving users
the option to configure iptables without enabling any other config.
Make the following KConfig entries user selectable:
* BRIDGE_NF_EBTABLES_LEGACY
* IP_NF_ARPTABLES
* IP_NF_IPTABLES_LEGACY
* IP6_NF_IPTABLES_LEGACY
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ip_send_unicast_reply() send orphaned 'control packets'.
These are RST packets and also ACK packets sent from TIME_WAIT.
Some eBPF programs would prefer to have a meaningful skb->sk
pointer as much as possible.
This means that TCP can now attach TIME_WAIT sockets to outgoing
skbs.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Link: https://patch.msgid.link/20241010174817.1543642-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This can be used to attach a socket to an skb,
taking a reference on sk->sk_refcnt.
This helper might be a NOP if sk->sk_refcnt is zero.
Use it from tcp_make_synack().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Link: https://patch.msgid.link/20241010174817.1543642-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit 8d7017fd62 ("blackhole_netdev: use blackhole_netdev to
invalidate dst entries"), blackhole_netdev was introduced to invalidate
dst cache entries on the TX path whenever the cache times out or is
flushed.
When two UDP sockets (sk1 and sk2) send messages to the same destination
simultaneously, they are using the same dst cache. If the dst cache is
invalidated on one path (sk2) while the other (sk1) is still transmitting,
sk1 may try to use the invalid dst entry.
CPU1 CPU2
udp_sendmsg(sk1) udp_sendmsg(sk2)
udp_send_skb()
ip_output()
<--- dst timeout or flushed
dst_dev_put()
ip_finish_output2()
ip_neigh_for_gw()
This results in a scenario where ip_neigh_for_gw() returns -EINVAL because
blackhole_dev lacks an in_dev, which is needed to initialize the neigh in
arp_constructor(). This error is then propagated back to userspace,
breaking the UDP application.
The patch fixes this issue by assigning an in_dev to blackhole_dev for
IPv4, similar to what was done for IPv6 in commit e5f80fcf86 ("ipv6:
give an IPv6 dev to blackhole_netdev"). This ensures that even when the
dst entry is invalidated with blackhole_dev, it will not fail to create
the neigh entry.
As devinet_init() is called ealier than blackhole_netdev_init() in system
booting, it can not assign the in_dev to blackhole_dev in devinet_init().
As Paolo suggested, add a separate late_initcall() in devinet.c to ensure
inet_blackhole_dev_init() is called after blackhole_netdev_init().
Fixes: 8d7017fd62 ("blackhole_netdev: use blackhole_netdev to invalidate dst entries")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/3000792d45ca44e16c785ebe2b092e610e5b3df1.1728499633.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After we made sure no fib_seq_read() handlers needs RTNL anymore,
we can remove RTNL from fib_seq_sum().
Note that after RTNL was dropped, fib_seq_sum() result was possibly
outdated anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241009184405.3752829-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mr_call_vif_notifiers() and mr_call_mfc_notifiers() already
uses WRITE_ONCE() on the write side.
Using RTNL to protect the reads seems a big hammer.
Constify 'struct net' argument of ip6mr_rules_seq_read()
and ipmr_rules_seq_read().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241009184405.3752829-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using RTNL to protect ops->fib_rules_seq reads seems a big hammer.
Writes are protected by RTNL.
We can use READ_ONCE() when reading it.
Constify 'struct net' argument of fib4_rules_seq_read()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241009184405.3752829-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmcG9rEACgkQ1V2XiooU
IOSKFg//Yme4n8+qtNN2OEMsiYy8MN8AwkE74pWPPV3/2yG62iARBdgnMWcq20OX
CzhloCgzDVTHn8LhzaoI1Ja8o2Zrirq6QdTFnVlZjfILf0Q4kOwvLE6AFP6SCpsB
9/oQ5t8ShCWenHJerjT2u33LI8x6MMgtvP+sxTvaE7IQvQVVkGRZ6nTYU9ZnHdd3
S46Qa3X+qGYdmPpt8hH8qXjJLqvl/mPh66p7TCxFFpv108BaC8yjQeMRTL4fslEa
wD9AgVpOpSqLZstHjf5xMZ6uMaGocmyorRKsglHypGE3HGS0E1kgbTN7AEdeoOhD
gVN7iCUbtq7v7PsXyrXc4wK7Hiu38S5fPsHoqIxS1o8brpnb1hDoOsyVy10PyMiN
wZVt7xP3MsxJCnpS0ky6QTTv2dJVxXSeuCYTAV22HmpHln/EHakxJkdTZlM+ErMo
P9lrB6mNOWmqvOQMmPeptSZ5P8nK6roi5A3oICi6aQLuu1sQYRnpCQneZLQhCBEm
ruxYfrGDLBVVsIAg/OPrBqaWoqoxkGyZqOHkV8nDZ0XcIsvExmL4NBu5RT8f1pur
CG9EoqkrccGz4p5t/L++1B1wkeUGloywXh9IBz2soK83nz+lIe/Kd68ybNP+CRH0
0au+hGbBzUkgkgS5tCoUJeCGW/6/HZNkVLdzYkKwbLZv8GW6Tlg=
=LKqW
-----END PGP SIGNATURE-----
Merge tag 'nf-24-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Restrict xtables extensions to families that are safe, syzbot found
a way to combine ebtables with extensions that are never used by
userspace tools. From Florian Westphal.
2) Set l3mdev inconditionally whenever possible in nft_fib to fix lookup
mismatch, also from Florian.
netfilter pull request 24-10-09
* tag 'nf-24-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: conntrack_vrf.sh: add fib test case
netfilter: fib: check correct rtable in vrf setups
netfilter: xtables: avoid NFPROTO_UNSPEC where needed
====================
Link: https://patch.msgid.link/20241009213858.3565808-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
No one uses inet_addr_lst anymore, so let's remove it.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241008172906.1326-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Each IPv4 address could have a lifetime, which is useful for DHCP,
and GC is periodically executed as check_lifetime_work.
check_lifetime() does the actual GC under RTNL.
1. Acquire RTNL
2. Iterate inet_addr_lst
3. Remove IPv4 address if expired
4. Release RTNL
Namespacifying the GC is required for per-netns RTNL, but using the
per-netns hash table will shorten the time on the hash bucket iteration
under RTNL.
Let's add per-netns GC work and use the per-netns hash table.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241008172906.1326-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now, all IPv4 addresses are put in the per-netns hash table.
Let's use it in inet_lookup_ifaddr_rcu().
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241008172906.1326-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As a prep for per-netns RTNL conversion, we want to namespacify
the IPv4 address hash table and the GC work.
Let's allocate the per-netns IPv4 address hash table to
net->ipv4.inet_addr_lst and link IPv4 addresses into it.
The actual users will be converted later.
Note that the IPv6 address hash table is already namespacified.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241008172906.1326-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 783237e8da ("net-tcp: Fast Open client - sending SYN-data")
introduces tcp_connect_queue_skb() and it would overwrite tcp->write_seq,
so it is no need to update tp->write_seq before invoking
tcp_connect_queue_skb().
Signed-off-by: xin.guo <guoxin0309@gmail.com>
Link: https://patch.msgid.link/1728289544-4611-1-git-send-email-guoxin0309@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to __fib_validate_source(), instead of a plain
u8, to prevent accidental setting of ECN bits in ->flowi4_tos.
Only fib_validate_source() actually calls __fib_validate_source().
Since it already has a dscp_t variable to pass as parameter, we only
need to remove the inet_dscp_to_dsfield() conversion.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/8206b0a64a21a208ed94774e261a251c8d7bc251.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to fib_validate_source(), instead of a plain u8,
to prevent accidental setting of ECN bits in ->flowi4_tos.
All callers of fib_validate_source() already have a dscp_t variable to
pass as parameter. We just need to remove the inet_dscp_to_dsfield()
conversions.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/08612a4519bc5a3578bb493fbaad82437ebb73dc.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to ip_mc_validate_source(), instead of a plain
u8, to prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_mc_validate_source() to consider are:
* ip_route_input_mc() which already has a dscp_t variable to pass as
parameter. We just need to remove the inet_dscp_to_dsfield()
conversion.
* udp_v4_early_demux() which gets the DSCP directly from the IPv4
header and can simply use the ip4h_dscp() helper.
Also, stop including net/inet_dscp.h in udp.c as we don't use any of
its declarations anymore.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/c91b2cca04718b7ee6cf5b9c1d5b40507d65a8d4.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to ip_route_input_mc(), instead of a plain u8,
to prevent accidental setting of ECN bits in ->flowi4_tos.
Only ip_route_input_rcu() actually calls ip_route_input_mc(). Since it
already has a dscp_t variable to pass as parameter, we only need to
remove the inet_dscp_to_dsfield() conversion.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/0cc653ef59bbc0a28881f706d34896c61eba9e01.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to __mkroute_input(), instead of a plain u8, to
prevent accidental setting of ECN bits in ->flowi4_tos.
Only ip_mkroute_input() actually calls __mkroute_input(). Since it
already has a dscp_t variable to pass as parameter, we only need to
remove the inet_dscp_to_dsfield() conversion.
While there, reorganise the function parameters to fill up horizontal
space.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/40853c720aee4d608e6b1b204982164c3b76697d.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to ip_mkroute_input(), instead of a plain u8, to
prevent accidental setting of ECN bits in ->flowi4_tos.
Only ip_route_input_slow() actually calls ip_mkroute_input(). Since it
already has a dscp_t variable to pass as parameter, we only need to
remove the inet_dscp_to_dsfield() conversion.
While there, reorganise the function parameters to fill up horizontal
space.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/6aa71e28f9ff681cbd70847080e1ab6b526f94f1.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to ip_route_use_hint(), instead of a plain u8,
to prevent accidental setting of ECN bits in ->flowi4_tos.
Only ip_rcv_finish_core() actually calls ip_route_use_hint(). Use the
ip4h_dscp() helper to get the DSCP from the IPv4 header.
While there, modify the declaration of ip_route_use_hint() in
include/net/route.h so that it matches the prototype of its
implementation in net/ipv4/route.c.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/c40994fdf804db7a363d04fdee01bf48dddda676.1728302212.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We need to init l3mdev unconditionally, else main routing table is searched
and incorrect result is returned unless strict (iif keyword) matching is
requested.
Next patch adds a selftest for this.
Fixes: 2a8a7c0eaa ("netfilter: nft_fib: Fix for rpath check with VRF devices")
Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1761
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The runtime cost of joining a single multicast group in the current
implementation of ____ip_mc_inc_group grows linearly with the number of
existing memberships. This is caused by the linear search for an
existing group record in the multicast address list.
This linear complexity results in quadratic complexity when successively
adding memberships, which becomes a performance bottleneck when setting
up large numbers of multicast memberships.
If available, use the existing multicast hash map mc_hash to quickly
search for an existing group membership record. This leads to
near-constant complexity on the addition of a new multicast record,
significantly improving performance for workloads involving many
multicast memberships.
On profiling with a loopback device, this patch presented a speedup of
around 6 when successively setting up 2000 multicast groups using
setsockopt without measurable drawbacks on smaller numbers of
multicast groups.
Signed-off-by: Jonas Rebmann <jre@pengutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upcoming per-netns RTNL conversion needs to get rid
of shared hash tables.
fib_info_devhash[] is one of them.
It is unclear why we used a hash table, because
a single hlist_head per net device was cheaper and scalable.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241004134720.579244-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After the prior patch, fib_info_lock became redundant
because all of its users are holding RTNL.
BH protection is not needed.
Remove the READ_ONCE()/WRITE_ONCE() annotations around fib_info_cnt,
since it is protected by RTNL.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241004134720.579244-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fib_info_devhash[] is not resized in fib_info_hash_move().
fib_nh structs are already freed after an rcu grace period.
This will allow to remove fib_info_lock in the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241004134720.579244-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fib_devindex_hashfn() converts a 32bit ifindex value to a 8bit hash.
It makes no sense doing this from fib_info_hashfn() and
fib_find_info_nh().
It is better to keep as many bits as possible to let
fib_info_hashfn_result() have better spread.
Only fib_info_devhash_bucket() needs to make this operation,
we can 'inline' trivial fib_devindex_hashfn() in it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241004134720.579244-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For now, we refresh the tcp_mstamp for delayed acks and keepalives, but
not for the compressed ack in tcp_compressed_ack_kick().
I have not found out the effact of the tcp_mstamp when sending ack, but
we can still refresh it for the compressed ack to keep consistent.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241003082231.759759-1-dongml2@chinatelecom.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
delack timer is not stopped from inet_csk_clear_xmit_timer()
because we do not define INET_CSK_CLEAR_TIMERS.
This is a conscious choice : inet_csk_clear_xmit_timer()
is often called from another cpu. Calling del_timer()
would cause false sharing and lock contention.
This means that very often, tcp_delack_timer() is called
at the timer expiration, while there is no ACK to transmit.
This can be detected very early, avoiding the socket spinlock.
Notes:
- test about tp->compressed_ack is racy,
but in the unlikely case there is a race, the dedicated
compressed_ack_timer hrtimer would close it.
- Even if the fast path is not taken, reading
icsk->icsk_ack.pending and tp->compressed_ack
before acquiring the socket spinlock reduces
acquisition time and chances of contention.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241002173042.917928-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
retransmit timer is not stopped from inet_csk_clear_xmit_timer()
because we do not define INET_CSK_CLEAR_TIMERS.
This is a conscious choice : for active TCP flows, it is better
to only call mod_timer(), because there is more chances of
keeping the timer unchanged. Also inet_csk_clear_xmit_timer()
is often called from another cpu, and calling del_timer()
would cause false sharing and lock contention.
This means that very often, tcp_write_timer() is called
at the timer expiration, while there is nothing to retransmit.
This can be detected very early, avoiding the socket spinlock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241002173042.917928-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
icsk->icsk_pending can be read locklessly already.
Following patch in the series will add another lockless read.
Add smp_load_acquire() and smp_store_release() annotations
because following patch will add a test in tcp_write_timer(),
and READ_ONCE()/WRITE_ONCE() alone would possibly lead to races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241002173042.917928-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The last type of sockets which supports SOF_TIMESTAMPING_OPT_ID is RAW
sockets. To add new option this patch converts all callers (direct and
indirect) of _sock_tx_timestamp to provide sockcm_cookie instead of
tsflags. And while here fix __sock_tx_timestamp to receive tsflags as
__u32 instead of __u16.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://patch.msgid.link/20241001125716.2832769-3-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX
timestamps and packets sent via socket. Unfortunately, there is no way
to reliably predict socket timestamp ID value in case of error returned
by sendmsg. For UDP sockets it's impossible because of lockless
nature of UDP transmit, several threads may send packets in parallel. In
case of RAW sockets MSG_MORE option makes things complicated. More
details are in the conversation [1].
This patch adds new control message type to give user-space
software an opportunity to control the mapping between packets and
values by providing ID with each sendmsg for UDP sockets.
The documentation is also added in this patch.
[1] https://lore.kernel.org/netdev/CALCETrU0jB+kg0mhV6A8mrHfTE1D1pr1SD_B9Eaa9aDPfgHdtA@mail.gmail.com/
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://patch.msgid.link/20241001125716.2832769-2-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to ip_route_input_slow(), instead of a plain u8,
to prevent accidental setting of ECN bits in ->flowi4_tos.
Only ip_route_input_rcu() actually calls ip_route_input_slow(). Since
it already has a dscp_t variable to pass as parameter, we only need to
remove the inet_dscp_to_dsfield() conversion.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/d6bca5f87eea9e83a3861e6e05594cdd252583c9.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to ip_route_input_rcu(), instead of a plain u8,
to prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_route_input_rcu() to consider are:
* ip_route_input_noref(), which already has a dscp_t variable to pass
as parameter. We just need to remove the inet_dscp_to_dsfield()
conversion.
* inet_rtm_getroute(), which receives a u8 from user space and needs
to convert it with inet_dsfield_to_dscp().
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/c4dbb5aa9cbc79c4fcb317abbffa7c7156bc56a7.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to ip_route_input_noref(), instead of a plain
u8, to prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_route_input_noref() to consider are:
* arp_process() in net/ipv4/arp.c. This function sets the tos
parameter to 0, which is already a valid dscp_t value, so it
doesn't need to be adjusted for the new prototype.
* ip_route_input(), which already has a dscp_t variable to pass as
parameter. We just need to remove the inet_dscp_to_dsfield()
conversion.
* ipvlan_l3_rcv(), bpf_lwt_input_reroute(), ip_expire(),
ip_rcv_finish_core(), xfrm4_rcv_encap_finish() and
xfrm4_rcv_encap(), which get the DSCP directly from IPv4 headers
and can simply use the ip4h_dscp() helper.
While there, declare the IPv4 header pointers as const in
ipvlan_l3_rcv() and bpf_lwt_input_reroute().
Also, modify the declaration of ip_route_input_noref() in
include/net/route.h so that it matches the prototype of its
implementation in net/ipv4/route.c.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/a8a747bed452519c4d0cc06af32c7e7795d7b627.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to ip_route_input(), instead of a plain u8, to
prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_route_input() to consider are:
* input_action_end_dx4_finish() and input_action_end_dt4() in
net/ipv6/seg6_local.c. These functions set the tos parameter to 0,
which is already a valid dscp_t value, so they don't need to be
adjusted for the new prototype.
* icmp_route_lookup(), which already has a dscp_t variable to pass as
parameter. We just need to remove the inet_dscp_to_dsfield()
conversion.
* br_nf_pre_routing_finish(), ip_options_rcv_srr() and ip4ip6_err(),
which get the DSCP directly from IPv4 headers. Define a helper to
read the .tos field of struct iphdr as dscp_t, so that these
function don't have to do the conversion manually.
While there, declare *iph as const in br_nf_pre_routing_finish(),
declare its local variables in reverse-christmas-tree order and move
the "err = ip_route_input()" assignment out of the conditional to avoid
checkpatch warning.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/e9d40781d64d3d69f4c79ac8a008b8d67a033e8d.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass a dscp_t variable to icmp_route_lookup(), instead of a plain u8,
to prevent accidental setting of ECN bits in ->flowi4_tos. Rename that
variable ("tos" -> "dscp") to make the intent clear.
While there, reorganise the function parameters to fill up horizontal
space.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/294fead85c6035bcdc5fcf9a6bb4ce8798c45ba1.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix tcp_rcv_synrecv_state_fastopen() to not zero retrans_stamp
if retransmits are outstanding.
tcp_fastopen_synack_timer() sets retrans_stamp, so typically we'll
need to zero retrans_stamp here to prevent spurious
retransmits_timed_out(). The logic to zero retrans_stamp is from this
2019 commit:
commit cd736d8b67 ("tcp: fix retrans timestamp on passive Fast Open")
However, in the corner case where the ACK of our TFO SYNACK carried
some SACK blocks that caused us to enter TCP_CA_Recovery then that
non-zero retrans_stamp corresponds to the active fast recovery, and we
need to leave retrans_stamp with its current non-zero value, for
correct ETIMEDOUT and undo behavior.
Fixes: cd736d8b67 ("tcp: fix retrans timestamp on passive Fast Open")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241001200517.2756803-4-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix tcp_enter_recovery() so that if there are no retransmits out then
we zero retrans_stamp when entering fast recovery. This is necessary
to fix two buggy behaviors.
Currently a non-zero retrans_stamp value can persist across multiple
back-to-back loss recovery episodes. This is because we generally only
clears retrans_stamp if we are completely done with loss recoveries,
and get to tcp_try_to_open() and find !tcp_any_retrans_done(sk). This
behavior causes two bugs:
(1) When a loss recovery episode (CA_Loss or CA_Recovery) is followed
immediately by a new CA_Recovery, the retrans_stamp value can persist
and can be a time before this new CA_Recovery episode starts. That
means that timestamp-based undo will be using the wrong retrans_stamp
(a value that is too old) when comparing incoming TS ecr values to
retrans_stamp to see if the current fast recovery episode can be
undone.
(2) If there is a roughly minutes-long sequence of back-to-back fast
recovery episodes, one after another (e.g. in a shallow-buffered or
policed bottleneck), where each fast recovery successfully makes
forward progress and recovers one window of sequence space (but leaves
at least one retransmit in flight at the end of the recovery),
followed by several RTOs, then the ETIMEDOUT check may be using the
wrong retrans_stamp (a value set at the start of the first fast
recovery in the sequence). This can cause a very premature ETIMEDOUT,
killing the connection prematurely.
This commit changes the code to zero retrans_stamp when entering fast
recovery, when this is known to be safe (no retransmits are out in the
network). That ensures that when starting a fast recovery episode, and
it is safe to do so, retrans_stamp is set when we send the fast
retransmit packet. That addresses both bug (1) and bug (2) by ensuring
that (if no retransmits are out when we start a fast recovery) we use
the initial fast retransmit of this fast recovery as the time value
for undo and ETIMEDOUT calculations.
This makes intuitive sense, since the start of a new fast recovery
episode (in a scenario where no lost packets are out in the network)
means that the connection has made forward progress since the last RTO
or fast recovery, and we should thus "restart the clock" used for both
undo and ETIMEDOUT logic.
Note that if when we start fast recovery there *are* retransmits out
in the network, there can still be undesirable (1)/(2) issues. For
example, after this patch we can still have the (1) and (2) problems
in cases like this:
+ round 1: sender sends flight 1
+ round 2: sender receives SACKs and enters fast recovery 1,
retransmits some packets in flight 1 and then sends some new data as
flight 2
+ round 3: sender receives some SACKs for flight 2, notes losses, and
retransmits some packets to fill the holes in flight 2
+ fast recovery has some lost retransmits in flight 1 and continues
for one or more rounds sending retransmits for flight 1 and flight 2
+ fast recovery 1 completes when snd_una reaches high_seq at end of
flight 1
+ there are still holes in the SACK scoreboard in flight 2, so we
enter fast recovery 2, but some retransmits in the flight 2 sequence
range are still in flight (retrans_out > 0), so we can't execute the
new retrans_stamp=0 added here to clear retrans_stamp
It's not yet clear how to fix these remaining (1)/(2) issues in an
efficient way without breaking undo behavior, given that retrans_stamp
is currently used for undo and ETIMEDOUT. Perhaps the optimal (but
expensive) strategy would be to set retrans_stamp to the timestamp of
the earliest outstanding retransmit when entering fast recovery. But
at least this commit makes things better.
Note that this does not change the semantics of retrans_stamp; it
simply makes retrans_stamp accurate in some cases where it was not
before:
(1) Some loss recovery, followed by an immediate entry into a fast
recovery, where there are no retransmits out when entering the fast
recovery.
(2) When a TFO server has a SYNACK retransmit that sets retrans_stamp,
and then the ACK that completes the 3-way handshake has SACK blocks
that trigger a fast recovery. In this case when entering fast recovery
we want to zero out the retrans_stamp from the TFO SYNACK retransmit,
and set the retrans_stamp based on the timestamp of the fast recovery.
We introduce a tcp_retrans_stamp_cleanup() helper, because this
two-line sequence already appears in 3 places and is about to appear
in 2 more as a result of this bug fix patch series. Once this bug fix
patches series in the net branch makes it into the net-next branch
we'll update the 3 other call sites to use the new helper.
This is a long-standing issue. The Fixes tag below is chosen to be the
oldest commit at which the patch will apply cleanly, which is from
Linux v3.5 in 2012.
Fixes: 1fbc340514 ("tcp: early retransmit: tcp_enter_recovery()")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241001200517.2756803-3-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix the TCP loss recovery undo logic in tcp_packet_delayed() so that
it can trigger undo even if TSQ prevents a fast recovery episode from
reaching tcp_retransmit_skb().
Geumhwan Yu <geumhwan.yu@samsung.com> recently reported that after
this commit from 2019:
commit bc9f38c832 ("tcp: avoid unconditional congestion window undo
on SYN retransmit")
...and before this fix we could have buggy scenarios like the
following:
+ Due to reordering, a TCP connection receives some SACKs and enters a
spurious fast recovery.
+ TSQ prevents all invocations of tcp_retransmit_skb(), because many
skbs are queued in lower layers of the sending machine's network
stack; thus tp->retrans_stamp remains 0.
+ The connection receives a TCP timestamp ECR value echoing a
timestamp before the fast recovery, indicating that the fast
recovery was spurious.
+ The connection fails to undo the spurious fast recovery because
tp->retrans_stamp is 0, and thus tcp_packet_delayed() returns false,
due to the new logic in the 2019 commit: commit bc9f38c832 ("tcp:
avoid unconditional congestion window undo on SYN retransmit")
This fix tweaks the logic to be more similar to the
tcp_packet_delayed() logic before bc9f38c832, except that we take
care not to be fooled by the FLAG_SYN_ACKED code path zeroing out
tp->retrans_stamp (the bug noted and fixed by Yuchung in
bc9f38c832).
Note that this returns the high-level behavior of tcp_packet_delayed()
to again match the comment for the function, which says: "Nothing was
retransmitted or returned timestamp is less than timestamp of the
first retransmission." Note that this comment is in the original
2005-04-16 Linux git commit, so this is evidently long-standing
behavior.
Fixes: bc9f38c832 ("tcp: avoid unconditional congestion window undo on SYN retransmit")
Reported-by: Geumhwan Yu <geumhwan.yu@samsung.com>
Diagnosed-by: Geumhwan Yu <geumhwan.yu@samsung.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241001200517.2756803-2-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mix netns into all IPv4 FIB hashes to avoid massive collision when
inserting the same address in many netns.
Signed-off-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241001231438.3855035-1-alexandre.ferrieux@orange.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current release - regressions:
- eth: mlx5: fix wrong reserved field in hca_cap_2 in mlx5_ifc
- eth: am65-cpsw: fix forever loop in cleanup code
Current release - new code bugs:
- eth: mlx5: HWS, fixed double-free in error flow of creating SQ
Previous releases - regressions:
- core: avoid potential underflow in qdisc_pkt_len_init() with UFO
- core: test for not too small csum_start in virtio_net_hdr_to_skb()
- vrf: revert "vrf: remove unnecessary RCU-bh critical section"
- bluetooth:
- fix uaf in l2cap_connect
- fix possible crash on mgmt_index_removed
- dsa: improve shutdown sequence
- eth: mlx5e: SHAMPO, fix overflow of hd_per_wq
- eth: ip_gre: fix drops of small packets in ipgre_xmit
Previous releases - always broken:
- core: fix gso_features_check to check for both dev->gso_{ipv4_,}max_size
- core: fix tcp fraglist segmentation after pull from frag_list
- netfilter: nf_tables: prevent nf_skb_duplicated corruption
- sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
- mac802154: fix potential RCU dereference issue in mac802154_scan_worker
- eth: fec: restart PPS after link state change
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmb+giESHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkDowP/25YsDA8uaH5yelI85vUgp1T50MWgFxJ
ARm58Pzxr8byX6eIup95xSsjLvMbLaWj5LIA2Y49AV0fWVgGn0U8yx4mPy0Czhdg
J1oxtyoV1pR2V/okWzD4yhZV2on7OGsS73I6J1s6BAowezr19A+aa5Un57dW/103
ccwBuBOYlSIOIHmarOxuFhWMYcwXreNBHa9K7J6JtDFn9F56fUn+ZoIUJ7x27cSO
eWhh9bIkeEb+xYeUXAjNP3pBvJ1xpwIyZv+JMTp40jNsAXPjSpI3Jwd1YlAAMuT9
J2dW0Zs8uwm5LzBPFvI9iM0WHEmVy6+b32NjnKVwPn2+XGGWQss52bmRElNcJkrw
4NeG6/6CPIE0xuczBECuMa0X68NDKIZsjy3Q3OahV82ef2cwhRk6FexyIg5oiMPx
KmMi5B+UQw6ZY3ZF/ME/0jJx/H5ayOC01yNBaTUPrLJr8gjquWEMjZXEqJsdyixJ
5OoZeKG5oN6HkN7g/IxoFjg/W/g93OULO3qH+IzLQG4NlVs6Zp4ykL7dT+Py2zzc
Ru3n5+HA4PqDn2u7gmP1mu2g/lmKUIZEEvR+msP81Cywlz5qtWIH1a6oIeVC7bjt
JNhgBgzKGGMGdgmhYNzXw213WCEbz0+as2SNlvlbiqMP5FKQPLzzBVuJoz4AtJVn
cyVy7D66HuMW
=cq2I
-----END PGP SIGNATURE-----
Merge tag 'net-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from ieee802154, bluetooth and netfilter.
Current release - regressions:
- eth: mlx5: fix wrong reserved field in hca_cap_2 in mlx5_ifc
- eth: am65-cpsw: fix forever loop in cleanup code
Current release - new code bugs:
- eth: mlx5: HWS, fixed double-free in error flow of creating SQ
Previous releases - regressions:
- core: avoid potential underflow in qdisc_pkt_len_init() with UFO
- core: test for not too small csum_start in virtio_net_hdr_to_skb()
- vrf: revert "vrf: remove unnecessary RCU-bh critical section"
- bluetooth:
- fix uaf in l2cap_connect
- fix possible crash on mgmt_index_removed
- dsa: improve shutdown sequence
- eth: mlx5e: SHAMPO, fix overflow of hd_per_wq
- eth: ip_gre: fix drops of small packets in ipgre_xmit
Previous releases - always broken:
- core: fix gso_features_check to check for both
dev->gso_{ipv4_,}max_size
- core: fix tcp fraglist segmentation after pull from frag_list
- netfilter: nf_tables: prevent nf_skb_duplicated corruption
- sctp: set sk_state back to CLOSED if autobind fails in
sctp_listen_start
- mac802154: fix potential RCU dereference issue in
mac802154_scan_worker
- eth: fec: restart PPS after link state change"
* tag 'net-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (48 commits)
sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
dt-bindings: net: xlnx,axi-ethernet: Add missing reg minItems
doc: net: napi: Update documentation for napi_schedule_irqoff
net/ncsi: Disable the ncsi work before freeing the associated structure
net: phy: qt2025: Fix warning: unused import DeviceId
gso: fix udp gso fraglist segmentation after pull from frag_list
bridge: mcast: Fail MDB get request on empty entry
vrf: revert "vrf: Remove unnecessary RCU-bh critical section"
net: ethernet: ti: am65-cpsw: Fix forever loop in cleanup code
net: phy: realtek: Check the index value in led_hw_control_get
ppp: do not assume bh is held in ppp_channel_bridge_input()
selftests: rds: move include.sh to TEST_FILES
net: test for not too small csum_start in virtio_net_hdr_to_skb()
net: gso: fix tcp fraglist segmentation after pull from frag_list
ipv4: ip_gre: Fix drops of small packets in ipgre_xmit
net: stmmac: dwmac4: extend timeout for VLAN Tag register busy bit check
net: add more sanity checks to qdisc_pkt_len_init()
net: avoid potential underflow in qdisc_pkt_len_init() with UFO
net: ethernet: ti: cpsw_ale: Fix warning on some platforms
net: microchip: Make FDMA config symbol invisible
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmb9qqsACgkQ1V2XiooU
IOSplRAAsv0Rr2WRA+pDpQwcmMNWoemGtu0qB7L6IchM36P64GvldMhEgfSPCh1h
6HdV8WlkGE5Q/bOPCNbkLg/INBelADoioaOlsdOO5oc+rGUw/Z4Swcq/1PF60Vaz
tz8AOU0opAD3X50U5bqD1Z2xToonS9nz9Ql7OWAbTdn9red/2SY+H1fyDz00VIHU
X4y2GWND5Hi6KIsAGTu9OiyQKy9hb1oA5xNU1OeNY+gNsr+r+NSbX0BOMSRJTvLv
MyY0kzP+S+yTx2FGcDMqgKfo60Sb4Ru6rJXl3XKd6QxhW9Mt6adcmmlqa5edoWU3
bJYkzugl66XKh1pDkC9u7om7zOOzBhjvLObDMbcYfAVJCctsErGcRDJIvS8M+ECB
tRsxRFU2CSud4HzIeKfQUP7b16KghnBa4kTsc0r8MLcfU5D/aR/WMR62W/ua00IS
noyWqtpdNk/7yR9HMzaCbsjgm+OZbtJbOSWCNaDo4TsXf+g+jQ+cf1Nl26cE73gB
xWGcc3LKIkcjQpOU+Zu0fluF7OdnDNNTEoHprnahilBHDOtmSBDMwxAoJichCZMt
mEN1CThG0B+YwlWH9yFL1bOQs1zHHFjHfJspdtqCok+UeD20p8QD1V8mlEsAkkT/
alw0Gxa6T2KepuOF9KcMnx4IcpkqwpgkwcGXvwRWWchANgbi1ao=
=UUcp
-----END PGP SIGNATURE-----
Merge tag 'nf-24-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix incorrect documentation in uapi/linux/netfilter/nf_tables.h
regarding flowtable hooks, from Phil Sutter.
2) Fix nft_audit.sh selftests with newer nft binaries, due to different
(valid) audit output, also from Phil.
3) Disable BH when duplicating packets via nf_dup infrastructure,
otherwise race on nf_skb_duplicated for locally generated traffic.
From Eric.
4) Missing return in callback of selftest C program, from zhang jiao.
netfilter pull request 24-10-02
* tag 'nf-24-10-02' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: Add missing return value
netfilter: nf_tables: prevent nf_skb_duplicated corruption
selftests: netfilter: Fix nft_audit.sh for newer nft binaries
netfilter: uapi: NFTA_FLOWTABLE_HOOK is NLA_NESTED
====================
Link: https://patch.msgid.link/20241002202421.1281311-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Detect gso fraglist skbs with corrupted geometry (see below) and
pass these to skb_segment instead of skb_segment_list, as the first
can segment them correctly.
Valid SKB_GSO_FRAGLIST skbs
- consist of two or more segments
- the head_skb holds the protocol headers plus first gso_size
- one or more frag_list skbs hold exactly one segment
- all but the last must be gso_size
Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can
modify these skbs, breaking these invariants.
In extreme cases they pull all data into skb linear. For UDP, this
causes a NULL ptr deref in __udpv4_gso_segment_list_csum at
udp_hdr(seg->next)->dest.
Detect invalid geometry due to pull, by checking head_skb size.
Don't just drop, as this may blackhole a destination. Convert to be
able to pass to regular skb_segment.
Link: https://lore.kernel.org/netdev/20240428142913.18666-1-shiming.cheng@mediatek.com/
Fixes: 9fd1ff5d2a ("udp: Support UDP fraglist GRO/GSO.")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20241001171752.107580-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Detect tcp gso fraglist skbs with corrupted geometry (see below) and
pass these to skb_segment instead of skb_segment_list, as the first
can segment them correctly.
Valid SKB_GSO_FRAGLIST skbs
- consist of two or more segments
- the head_skb holds the protocol headers plus first gso_size
- one or more frag_list skbs hold exactly one segment
- all but the last must be gso_size
Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can
modify these skbs, breaking these invariants.
In extreme cases they pull all data into skb linear. For TCP, this
causes a NULL ptr deref in __tcpv4_gso_segment_list_csum at
tcp_hdr(seg->next).
Detect invalid geometry due to pull, by checking head_skb size.
Don't just drop, as this may blackhole a destination. Convert to be
able to pass to regular skb_segment.
Approach and description based on a patch by Willem de Bruijn.
Link: https://lore.kernel.org/netdev/20240428142913.18666-1-shiming.cheng@mediatek.com/
Link: https://lore.kernel.org/netdev/20240922150450.3873767-1-willemdebruijn.kernel@gmail.com/
Fixes: bee88cd5bd ("net: add support for segmenting TCP fraglist GSO packets")
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240926085315.51524-1-nbd@nbd.name
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.
auto-generated by the following:
for i in `git grep -l -w asm/unaligned.h`; do
sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
Regression Description:
Depending on the options specified for the GRE tunnel device, small
packets may be dropped. This occurs because the pskb_network_may_pull
function fails due to the packet's insufficient length.
For example, if only the okey option is specified for the tunnel device,
original (before encapsulation) packets smaller than 28 bytes (including
the IPv4 header) will be dropped. This happens because the required
length is calculated relative to the network header, not the skb->head.
Here is how the required length is computed and checked:
* The pull_len variable is set to 28 bytes, consisting of:
* IPv4 header: 20 bytes
* GRE header with Key field: 8 bytes
* The pskb_network_may_pull function adds the network offset, shifting
the checkable space further to the beginning of the network header and
extending it to the beginning of the packet. As a result, the end of
the checkable space occurs beyond the actual end of the packet.
Instead of ensuring that 28 bytes are present in skb->head, the function
is requesting these 28 bytes starting from the network header. For small
packets, this requested length exceeds the actual packet size, causing
the check to fail and the packets to be dropped.
This issue affects both locally originated and forwarded packets in
DMVPN-like setups.
How to reproduce (for local originated packets):
ip link add dev gre1 type gre ikey 1.9.8.4 okey 1.9.8.4 \
local <your-ip> remote 0.0.0.0
ip link set mtu 1400 dev gre1
ip link set up dev gre1
ip address add 192.168.13.1/24 dev gre1
ip neighbor add 192.168.13.2 lladdr <remote-ip> dev gre1
ping -s 1374 -c 10 192.168.13.2
tcpdump -vni gre1
tcpdump -vni <your-ext-iface> 'ip proto 47'
ip -s -s -d link show dev gre1
Solution:
Use the pskb_may_pull function instead the pskb_network_may_pull.
Fixes: 80d875cfc9 ("ipv4: ip_gre: Avoid skb_pull() failure in ipgre_xmit()")
Signed-off-by: Anton Danilov <littlesmilingcloud@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240924235158.106062-1-littlesmilingcloud@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Previous releases - regressions:
- netfilter:
- nf_reject_ipv6: fix nf_reject_ip6_tcphdr_put()
- nf_tables: keep deleted flowtable hooks until after RCU
- tcp: check skb is non-NULL in tcp_rto_delta_us()
- phy: aquantia: fix -ETIMEDOUT PHY probe failure when firmware not present
- eth: virtio_net: fix mismatched buf address when unmapping for small packets
- eth: stmmac: fix zero-division error when disabling tc cbs
- eth: bonding: fix unnecessary warnings and logs from bond_xdp_get_xmit_slave()
Previous releases - always broken:
- netfilter:
- fix clash resolution for bidirectional flows
- fix allocation with no memcg accounting
- eth: r8169: add tally counter fields added with RTL8125
- eth: ravb: fix rx and tx frame size limit
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmb1bHASHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkxUAP/3cnsANzqmulU+zXLRCyYqQkMnLDrXuC
yb1sy4gf/2vih+UPAK0Gw+NXMnL/Ftlv2EMV9RQKFjIWV4D0AYGEmKdnPhe2ycRN
0Gr7zSZdP2KlA7HgYSehxmWjrNFatAmyGvIEYs+9JBzLnoZCkRlsrYE8HO7fk8+a
4FDyh+FyiniDKR3+W/tgPoZy/U+FS9AUftOrAjCM/o6c0WPugwgHDxwlyrBg3lAp
Mkx8Q3IPWESOfPcUmJ+AezljfL1W3xAG/4cxALpN9lboeJaZNjvMQgMyqC1uVyHS
VJOkOuhQEVfXpc9139j5DxPHhacmLBQGfDw6ZXevwRC9NwgaLcRh9cf3rUafA7uC
qT7P5dt5y3kGOqp7pltUsFT7C47VD7ZlFz4J6eqTVCVTopjpMipZajvWZEIDNqPa
ftsMW0ZIbjpJVTJAvhlrKySxsRFte6b3aa9VdttkevgQPMneEXyePe8Me6Fbrv+t
hF5R8we6842xclLfjBCJT1d4e7yW8B5o69eygQbyaqRK9EhbaF+4R0V+NK9eVnd9
qZudNZBznnfdVgjjgcu12qievHEazIAFkyjs+ZCt2xYNcRg8cLwr/TclOB8fEMBO
VpjPci4j1Ln158EbGJf30VQpZJzXSrxZ4HFZU1Be+d3fW58o1H9zMfvweOcvxI/v
AQWSy3aMoWHB
=l8TJ
-----END PGP SIGNATURE-----
Merge tag 'net-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter.
It looks like that most people are still traveling: both the ML volume
and the processing capacity are low.
Previous releases - regressions:
- netfilter:
- nf_reject_ipv6: fix nf_reject_ip6_tcphdr_put()
- nf_tables: keep deleted flowtable hooks until after RCU
- tcp: check skb is non-NULL in tcp_rto_delta_us()
- phy: aquantia: fix -ETIMEDOUT PHY probe failure when firmware not
present
- eth: virtio_net: fix mismatched buf address when unmapping for
small packets
- eth: stmmac: fix zero-division error when disabling tc cbs
- eth: bonding: fix unnecessary warnings and logs from
bond_xdp_get_xmit_slave()
Previous releases - always broken:
- netfilter:
- fix clash resolution for bidirectional flows
- fix allocation with no memcg accounting
- eth: r8169: add tally counter fields added with RTL8125
- eth: ravb: fix rx and tx frame size limit"
* tag 'net-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (35 commits)
selftests: netfilter: Avoid hanging ipvs.sh
kselftest: add test for nfqueue induced conntrack race
netfilter: nfnetlink_queue: remove old clash resolution logic
netfilter: nf_tables: missing objects with no memcg accounting
netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path
netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS
netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n
netfilter: nf_tables: Keep deleted flowtable hooks until after RCU
docs: tproxy: ignore non-transparent sockets in iptables
netfilter: ctnetlink: Guard possible unused functions
selftests: netfilter: nft_tproxy.sh: add tcp tests
selftests: netfilter: add reverse-clash resolution test case
netfilter: conntrack: add clash resolution for reverse collisions
netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash
selftests/net: packetdrill: increase timing tolerance in debug mode
usbnet: fix cyclical race on disconnect with work queue
net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is enabled
virtio_net: Fix mismatched buf address when unmapping for small packets
bonding: Fix unnecessary warnings and logs from bond_xdp_get_xmit_slave()
r8169: add missing MODULE_FIRMWARE entry for RTL8126A rev.b
...
If CONFIG_BRIDGE_NETFILTER is not enabled, which is the case for x86_64
defconfig, then building nf_reject_ipv4.c and nf_reject_ipv6.c with W=1
using gcc-14 results in the following warnings, which are treated as
errors:
net/ipv4/netfilter/nf_reject_ipv4.c: In function 'nf_send_reset':
net/ipv4/netfilter/nf_reject_ipv4.c:243:23: error: variable 'niph' set but not used [-Werror=unused-but-set-variable]
243 | struct iphdr *niph;
| ^~~~
cc1: all warnings being treated as errors
net/ipv6/netfilter/nf_reject_ipv6.c: In function 'nf_send_reset6':
net/ipv6/netfilter/nf_reject_ipv6.c:286:25: error: variable 'ip6h' set but not used [-Werror=unused-but-set-variable]
286 | struct ipv6hdr *ip6h;
| ^~~~
cc1: all warnings being treated as errors
Address this by reducing the scope of these local variables to where
they are used, which is code only compiled when CONFIG_BRIDGE_NETFILTER
enabled.
Compile tested and run through netfilter selftests.
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Closes: https://lore.kernel.org/netfilter-devel/20240906145513.567781-1-andriy.shevchenko@linux.intel.com/
Signed-off-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The series in the "fixes" tag added the ability to consider L4 attributes
in routing rules.
The dst lookup on the outer packet of encapsulated traffic in the xfrm
code was not adapted to this change, thus routing behavior that relies
on L4 information is not respected.
Pass the ip protocol information when performing dst lookups.
Fixes: a25724b05a ("Merge branch 'fib_rules-support-sport-dport-and-proto-match'")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Tested-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Preparation for adding more fields to dst lookup functions without
changing their signatures.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmbk/nIACgkQ6rmadz2v
bTqxuBAAnqW81Rr0nORIxeJMbyo4EiFuYHGk6u5BYP9NPzqHroUPCLVmSP7Hp/Ta
CJjsiZeivZsGa6Qlc3BCa4hHNpqP5WE1C/73svSDn7/99EfxdSBtirpMVFUPsUtn
DDb5chNpvnxKNS8Mw5Ty8wBrdbXHMlSx+IfaFHpv0Yn6EAcuF4UdoEUq2l3PqhfD
Il9Zm127eViPGAP+o+TBZFfW+rRw8d0ngqeRq2GvJ8ibNEDWss+GmBI1Dod7d+fC
dUDg96Ipdm1a5Xz7dnH80eXz9JHdpu6qhQrQMKKArnlpJElrKiOf9b17ZcJoPQOR
ZnstEnUyVnrWROZxUuKY72+2tx3TuSf+L9uZqFHNx3Ix5FIoS+tFbHf4b8SxtsOb
hb2X7SigdGqhQDxUT+IPeO5hsJlIvG1/VYxMXxgc++rh9DjL06hDLUSH1WBSU0fC
kFQ7HrcpAlVHtWmGbwwUyVjD+KC/qmZBTAnkcYT4C62WZVytSCnihIuSFAvV1tpZ
SSIhVPyQ599UoZIiQYihp0S4qP74FotCtErWSrThneh2Cl8kDsRq//lV1nj/PTV8
CpTvz4VCFDFTgthCfd62fP95EwW5K+aE3NjGTPW/9Hx/0+J/1tT+yqWsrToGaruf
TbrqtzQhpclz9UEqA+696cVAXNj9uRU4AoD3YIg72kVnRlkgYd0=
=MDwh
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Introduce '__attribute__((bpf_fastcall))' for helpers and kfuncs with
corresponding support in LLVM.
It is similar to existing 'no_caller_saved_registers' attribute in
GCC/LLVM with a provision for backward compatibility. It allows
compilers generate more efficient BPF code assuming the verifier or
JITs will inline or partially inline a helper/kfunc with such
attribute. bpf_cast_to_kern_ctx, bpf_rdonly_cast,
bpf_get_smp_processor_id are the first set of such helpers.
- Harden and extend ELF build ID parsing logic.
When called from sleepable context the relevants parts of ELF file
will be read to find and fetch .note.gnu.build-id information. Also
harden the logic to avoid TOCTOU, overflow, out-of-bounds problems.
- Improvements and fixes for sched-ext:
- Allow passing BPF iterators as kfunc arguments
- Make the pointer returned from iter_next method trusted
- Fix x86 JIT convergence issue due to growing/shrinking conditional
jumps in variable length encoding
- BPF_LSM related:
- Introduce few VFS kfuncs and consolidate them in
fs/bpf_fs_kfuncs.c
- Enforce correct range of return values from certain LSM hooks
- Disallow attaching to other LSM hooks
- Prerequisite work for upcoming Qdisc in BPF:
- Allow kptrs in program provided structs
- Support for gen_epilogue in verifier_ops
- Important fixes:
- Fix uprobe multi pid filter check
- Fix bpf_strtol and bpf_strtoul helpers
- Track equal scalars history on per-instruction level
- Fix tailcall hierarchy on x86 and arm64
- Fix signed division overflow to prevent INT_MIN/-1 trap on x86
- Fix get kernel stack in BPF progs attached to tracepoint:syscall
- Selftests:
- Add uprobe bench/stress tool
- Generate file dependencies to drastically improve re-build time
- Match JIT-ed and BPF asm with __xlated/__jited keywords
- Convert older tests to test_progs framework
- Add support for RISC-V
- Few fixes when BPF programs are compiled with GCC-BPF backend
(support for GCC-BPF in BPF CI is ongoing in parallel)
- Add traffic monitor
- Enable cross compile and musl libc
* tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (260 commits)
btf: require pahole 1.21+ for DEBUG_INFO_BTF with default DWARF version
btf: move pahole check in scripts/link-vmlinux.sh to lib/Kconfig.debug
btf: remove redundant CONFIG_BPF test in scripts/link-vmlinux.sh
bpf: Call the missed kfree() when there is no special field in btf
bpf: Call the missed btf_record_free() when map creation fails
selftests/bpf: Add a test case to write mtu result into .rodata
selftests/bpf: Add a test case to write strtol result into .rodata
selftests/bpf: Rename ARG_PTR_TO_LONG test description
selftests/bpf: Fix ARG_PTR_TO_LONG {half-,}uninitialized test
bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error
bpf: Improve check_raw_mode_ok test for MEM_UNINIT-tagged types
bpf: Fix helper writes to read-only maps
bpf: Remove truncation test in bpf_strtol and bpf_strtoul helpers
bpf: Fix bpf_strtol and bpf_strtoul helpers for 32bit
selftests/bpf: Add tests for sdiv/smod overflow cases
bpf: Fix a sdiv overflow issue
libbpf: Add bpf_object__token_fd accessor
docs/bpf: Add missing BPF program types to docs
docs/bpf: Add constant values for linkages
bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmbn5g0ACgkQu+CwddJF
iJq+Uwf/aqnLNEpjUBzwUUhSojCpPnTtiyjv+AILTxoSTHmbu8OvN0W79+Rpbdmk
O4QapAK+BCs+VL2VATwCCufcJ75Z78txO+buQE0DgwluFTIYZ+IwpUMPsK04ln6A
FD1/uvP1QFx60heqcp2c4zWFBUpg4DE6ufx2A5kieO268lFcWLxyVlcdgRU79ZCt
uAcV2yDLk3GvPGfxZwPKEmZUo/FmuSoBv0XgT+eWxmTu/R7hcpFse49OyjBH8Tvb
8d/RCIFgXOr8dTIjtds7eenwB/is4TkRlctezEQ0jO9/JwL/BVOgXZjD1qCtNWqz
is4TWK7VV+vdq1RD+0xC2hV/+uGEwQ==
=+WAm
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
"This time it's mostly refactoring and improving APIs for slab users in
the kernel, along with some debugging improvements.
- kmem_cache_create() refactoring (Christian Brauner)
Over the years have been growing new parameters to
kmem_cache_create() where most of them are needed only for a small
number of caches - most recently the rcu_freeptr_offset parameter.
To avoid adding new parameters to kmem_cache_create() and adjusting
all its callers, or creating new wrappers such as
kmem_cache_create_rcu(), we can now pass extra parameters using the
new struct kmem_cache_args. Not explicitly initialized fields
default to values interpreted as unused.
kmem_cache_create() is for now a wrapper that works both with the
new form: kmem_cache_create(name, object_size, args, flags) and the
legacy form: kmem_cache_create(name, object_size, align, flags,
ctor)
- kmem_cache_destroy() waits for kfree_rcu()'s in flight (Vlastimil
Babka, Uladislau Rezki)
Since SLOB removal, kfree() is allowed for freeing objects
allocated by kmem_cache_create(). By extension kfree_rcu() as
allowed as well, which can allow converting simple call_rcu()
callbacks that only do kmem_cache_free(), as there was never a
kmem_cache_free_rcu() variant. However, for caches that can be
destroyed e.g. on module removal, the cache owners knew to issue
rcu_barrier() first to wait for the pending call_rcu()'s, and this
is not sufficient for pending kfree_rcu()'s due to its internal
batching optimizations. Ulad has provided a new
kvfree_rcu_barrier() and to make the usage less error-prone,
kmem_cache_destroy() calls it. Additionally, destroying
SLAB_TYPESAFE_BY_RCU caches now again issues rcu_barrier()
synchronously instead of using an async work, because the past
motivation for async work no longer applies. Users of custom
call_rcu() callbacks should however keep calling rcu_barrier()
before cache destruction.
- Debugging use-after-free in SLAB_TYPESAFE_BY_RCU caches (Jann Horn)
Currently, KASAN cannot catch UAFs in such caches as it is legal to
access them within a grace period, and we only track the grace
period when trying to free the underlying slab page. The new
CONFIG_SLUB_RCU_DEBUG option changes the freeing of individual
object to be RCU-delayed, after which KASAN can poison them.
- Delayed memcg charging (Shakeel Butt)
In some cases, the memcg is uknown at allocation time, such as
receiving network packets in softirq context. With
kmem_cache_charge() these may be now charged later when the user
and its memcg is known.
- Misc fixes and improvements (Pedro Falcato, Axel Rasmussen,
Christoph Lameter, Yan Zhen, Peng Fan, Xavier)"
* tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits)
mm, slab: restore kerneldoc for kmem_cache_create()
io_uring: port to struct kmem_cache_args
slab: make __kmem_cache_create() static inline
slab: make kmem_cache_create_usercopy() static inline
slab: remove kmem_cache_create_rcu()
file: port to struct kmem_cache_args
slab: create kmem_cache_create() compatibility layer
slab: port KMEM_CACHE_USERCOPY() to struct kmem_cache_args
slab: port KMEM_CACHE() to struct kmem_cache_args
slab: remove rcu_freeptr_offset from struct kmem_cache
slab: pass struct kmem_cache_args to do_kmem_cache_create()
slab: pull kmem_cache_open() into do_kmem_cache_create()
slab: pass struct kmem_cache_args to create_cache()
slab: port kmem_cache_create_usercopy() to struct kmem_cache_args
slab: port kmem_cache_create_rcu() to struct kmem_cache_args
slab: port kmem_cache_create() to struct kmem_cache_args
slab: add struct kmem_cache_args
slab: s/__kmem_cache_create/do_kmem_cache_create/g
memcg: add charging of already allocated slab objects
mm/slab: Optimize the code logic in find_mergeable()
...
Implement support for the new DSCP selector that allows IPv4 FIB rules
to match on the entire DSCP field, unlike the existing TOS selector that
only matches on the three lower DSCP bits.
Differentiate between both selectors by adding a new bit in the IPv4 FIB
rule structure (in an existing one byte hole) that is only set when the
'FRA_DSCP' attribute is specified by user space. Reject rules that use
both selectors.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240911093748.3662015-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZuH9UQAKCRDbK58LschI
g0/zAP99WOcCBp1M/jSTUOba230+eiol7l5RirDEA6wu7TqY2QEAuvMG0KfCCpTI
I0WqStrK1QMbhwKPodJC1k+17jArKgw=
=jfMU
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-09-11
We've added 12 non-merge commits during the last 16 day(s) which contain
a total of 20 files changed, 228 insertions(+), 30 deletions(-).
There's a minor merge conflict in drivers/net/netkit.c:
00d066a4d4 ("netdev_features: convert NETIF_F_LLTX to dev->lltx")
d966087948 ("netkit: Disable netpoll support")
The main changes are:
1) Enable bpf_dynptr_from_skb for tp_btf such that this can be used
to easily parse skbs in BPF programs attached to tracepoints,
from Philo Lu.
2) Add a cond_resched() point in BPF's sock_hash_free() as there have
been several syzbot soft lockup reports recently, from Eric Dumazet.
3) Fix xsk_buff_can_alloc() to account for queue_empty_descs which
got noticed when zero copy ice driver started to use it,
from Maciej Fijalkowski.
4) Move the xdp:xdp_cpumap_kthread tracepoint before cpumap pushes skbs
up via netif_receive_skb_list() to better measure latencies,
from Daniel Xu.
5) Follow-up to disable netpoll support from netkit, from Daniel Borkmann.
6) Improve xsk selftests to not assume a fixed MAX_SKB_FRAGS of 17 but
instead gather the actual value via /proc/sys/net/core/max_skb_frags,
also from Maciej Fijalkowski.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
sock_map: Add a cond_resched() in sock_hash_free()
selftests/bpf: Expand skb dynptr selftests for tp_btf
bpf: Allow bpf_dynptr_from_skb() for tp_btf
tcp: Use skb__nullable in trace_tcp_send_reset
selftests/bpf: Add test for __nullable suffix in tp_btf
bpf: Support __nullable argument suffix for tp_btf
bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcv
selftests/xsk: Read current MAX_SKB_FRAGS from sysctl knob
xsk: Bump xsk_queue::queue_empty_descs in xp_can_alloc()
tcp_bpf: Remove an unused parameter for bpf_tcp_ingress()
bpf, sockmap: Correct spelling skmsg.c
netkit: Disable netpoll support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Link: https://patch.msgid.link/20240911211525.13834-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
No conflicts (sort of) and no adjacent changes.
This merge reverts commit b3c9e65eb2 ("net: hsr: remove seqnr_lock")
from net, as it was superseded by
commit 430d67bdcb ("net: hsr: Use the seqnr lock for frames received via interlink port.")
in net-next.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In tcp_recvmsg_locked(), detect if the skb being received by the user
is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
flag - pass it to tcp_recvmsg_devmem() for custom handling.
tcp_recvmsg_devmem() copies any data in the skb header to the linear
buffer, and returns a cmsg to the user indicating the number of bytes
returned in the linear buffer.
tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags,
and returns to the user a cmsg_devmem indicating the location of the
data in the dmabuf device memory. cmsg_devmem contains this information:
1. the offset into the dmabuf where the payload starts. 'frag_offset'.
2. the size of the frag. 'frag_size'.
3. an opaque token 'frag_token' to return to the kernel when the buffer
is to be released.
The pages awaiting freeing are stored in the newly added
sk->sk_user_frags, and each page passed to userspace is get_page()'d.
This reference is dropped once the userspace indicates that it is
done reading this page. All pages are released when the socket is
destroyed.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240910171458.219195-10-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For device memory TCP, we expect the skb headers to be available in host
memory for access, and we expect the skb frags to be in device memory
and unaccessible to the host. We expect there to be no mixing and
matching of device memory frags (unaccessible) with host memory frags
(accessible) in the same skb.
Add a skb->devmem flag which indicates whether the frags in this skb
are device memory frags or not.
__skb_fill_netmem_desc() now checks frags added to skbs for net_iov,
and marks the skb as skb->devmem accordingly.
Add checks through the network stack to avoid accessing the frags of
devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-9-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Make skb_frag_page() fail in the case where the frag is not backed
by a page, and fix its relevant callers to handle this case.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-8-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
introduce a new flag SOF_TIMESTAMPING_OPT_RX_FILTER in the receive
path. User can set it with SOF_TIMESTAMPING_SOFTWARE to filter
out rx software timestamp report, especially after a process turns on
netstamp_needed_key which can time stamp every incoming skb.
Previously, we found out if an application starts first which turns on
netstamp_needed_key, then another one only passing SOF_TIMESTAMPING_SOFTWARE
could also get rx timestamp. Now we handle this case by introducing this
new flag without breaking users.
Quoting Willem to explain why we need the flag:
"why a process would want to request software timestamp reporting, but
not receive software timestamp generation. The only use I see is when
the application does request
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_TX_SOFTWARE."
Similarly, this new flag could also be used for hardware case where we
can set it with SOF_TIMESTAMPING_RAW_HARDWARE, then we won't receive
hardware receive timestamp.
Another thing about errqueue in this patch I have a few words to say:
In this case, we need to handle the egress path carefully, or else
reporting the tx timestamp will fail. Egress path and ingress path will
finally call sock_recv_timestamp(). We have to distinguish them.
Errqueue is a good indicator to reflect the flow direction.
Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240909015612.3856-2-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At the moment, the slab objects are charged to the memcg at the
allocation time. However there are cases where slab objects are
allocated at the time where the right target memcg to charge it to is
not known. One such case is the network sockets for the incoming
connection which are allocated in the softirq context.
Couple hundred thousand connections are very normal on large loaded
server and almost all of those sockets underlying those connections get
allocated in the softirq context and thus not charged to any memcg.
However later at the accept() time we know the right target memcg to
charge. Let's add new API to charge already allocated objects, so we can
have better accounting of the memory usage.
To measure the performance impact of this change, tcp_crr is used from
the neper [1] performance suite. Basically it is a network ping pong
test with new connection for each ping pong.
The server and the client are run inside 3 level of cgroup hierarchy
using the following commands:
Server:
$ tcp_crr -6
Client:
$ tcp_crr -6 -c -H ${server_ip}
If the client and server run on different machines with 50 GBPS NIC,
there is no visible impact of the change.
For the same machine experiment with v6.11-rc5 as base.
base (throughput) with-patch
tcp_crr 14545 (+- 80) 14463 (+- 56)
It seems like the performance impact is within the noise.
Link: https://github.com/google/neper [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com> # net
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
The grc must be initialize first. There can be a condition where if
fou is NULL, goto out will be executed and grc would be used
uninitialized.
Fixes: 7e41969350 ("fou: Fix null-ptr-deref in GRO.")
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240906102839.202798-1-usama.anjum@collabora.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Unmask the upper DSCP bits when calling ip_route_output_key() so that in
the future it could perform the FIB lookup according to the full DSCP
value.
Note that callers of udp_tunnel_dst_lookup() pass the entire DS field in
the 'tos' argument.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unmask the upper DSCP bits when calling ip_route_output_key() so that in
the future it could perform the FIB lookup according to the full DSCP
value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unmask the upper DSCP bits when calling ip_route_output_key() so that in
the future it could perform the FIB lookup according to the full DSCP
value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unmask the upper DSCP bits when initializing an IPv4 flow key via
ip_tunnel_init_flow() before passing it to ip_route_output_key() so that
in the future we could perform the FIB lookup according to the full DSCP
value.
Note that the 'tos' variable includes the full DS field. Either the one
specified as part of the tunnel parameters or the one inherited from the
inner packet.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unmask the upper DSCP bits when initializing an IPv4 flow key via
ip_tunnel_init_flow() before passing it to ip_route_output_key() so that
in the future we could perform the FIB lookup according to the full DSCP
value.
Note that the 'tos' variable includes the full DS field. Either the one
specified via the tunnel key or the one inherited from the inner packet.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unmask the upper DSCP bits when initializing an IPv4 flow key via
ip_tunnel_init_flow() before passing it to ip_route_output_key() so that
in the future we could perform the FIB lookup according to the full DSCP
value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unmask the upper DSCP bits when calling ip_route_output_key() so that in
the future it could perform the FIB lookup according to the full DSCP
value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unmask the upper DSCP bits when calling ip_route_output_gre() so that in
the future it could perform the FIB lookup according to the full DSCP
value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmbaOvIACgkQ1V2XiooU
IOT/oQ/+JTsIkXRn8XAOgjsbxOEOvrUPAzb72Atz/cCA0RPQkHXbdZtxLDPbcN1v
lQG6R+ZK+trS70fIMqnfSbEB/eaCWum+/kd9ZSp5RCFW4M9OVde+KTJj+IfEzsQZ
spZRR53VnAN5jSeI2U3w4iYnyCWn5Xtp2sGETrjh43yK3cirvo7sZd/+477gZiGp
qBDEgZrzcDzfm8IxJCCUeJdcNeM7ytoMhuyITT9YrvUt0Qo6+qPsx5hVFwMFly/M
WkvxCR/1DR+Unhp4a30STEPPxDR0f284WoaiuxEvNAN2yP7p7O35mcStzyfhlOh+
wB/Cc4ESBa3fPRhA+l3FDsdyrlHsi3c8VUwBWcXVryeD5e1mzyveXye9O2HtWmET
wBtukfdPORu8JBBHxf3kmv+ZLAJLjAwyO1G1DHFruL/yEAJIDq4gluxlR+71rg7n
qAZUvvV3MGQMCNIO3GlQ6ODtl0UcIUTHwW5//MEaxOC/aqWN/fr/keSz8xGE2Qkt
47TFbBiGC6UR0KD+wWGAWfOlWN4G9m7E4SG++vCkXJGio4bvyGl8TxorWsh99vCv
BMq59ZRtsS1xiEcWF48Q0Y5YtURIdCih/LcfDdbIQFzkNlHzzGpo68MHN/anqgu/
GE4JTdgjf79lfDqJDqdnQiio7P44NZqhkeUT8yQTE1xbIKsQRNY=
=Uxb1
-----END PGP SIGNATURE-----
Merge tag 'nf-next-24-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
Patch #1 adds ctnetlink support for kernel side filtering for
deletions, from Changliang Wu.
Patch #2 updates nft_counter support to Use u64_stats_t,
from Sebastian Andrzej Siewior.
Patch #3 uses kmemdup_array() in all xtables frontends,
from Yan Zhen.
Patch #4 is a oneliner to use ERR_CAST() in nf_conntrack instead
opencoded casting, from Shen Lichuan.
Patch #5 removes unused argument in nftables .validate interface,
from Florian Westphal.
Patch #6 is a oneliner to correct a typo in nftables kdoc,
from Simon Horman.
Patch #7 fixes missing kdoc in nftables, also from Simon.
Patch #8 updates nftables to handle timeout less than CONFIG_HZ.
Patch #9 rejects element expiration if timeout is zero,
otherwise it is silently ignored.
Patch #10 disallows element expiration larger than timeout.
Patch #11 removes unnecessary READ_ONCE annotation while mutex is held.
Patch #12 adds missing READ_ONCE/WRITE_ONCE annotation in dynset.
Patch #13 annotates data-races around element expiration.
Patch #14 allocates timeout and expiration in one single set element
extension, they are tighly couple, no reason to keep them
separated anymore.
Patch #15 updates nftables to interpret zero timeout element as never
times out. Note that it is already possible to declare sets
with elements that never time out but this generalizes to all
kind of set with timeouts.
Patch #16 supports for element timeout and expiration updates.
* tag 'nf-next-24-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: set element timeout update support
netfilter: nf_tables: zero timeout means element never times out
netfilter: nf_tables: consolidate timeout extension for elements
netfilter: nf_tables: annotate data-races around element expiration
netfilter: nft_dynset: annotate data-races around set timeout
netfilter: nf_tables: remove annotation to access set timeout while holding lock
netfilter: nf_tables: reject expiration higher than timeout
netfilter: nf_tables: reject element expiration with no timeout
netfilter: nf_tables: elements with timeout below CONFIG_HZ never expire
netfilter: nf_tables: Add missing Kernel doc
netfilter: nf_tables: Correct spelling in nf_tables.h
netfilter: nf_tables: drop unused 3rd argument from validate callback ops
netfilter: conntrack: Convert to use ERR_CAST()
netfilter: Use kmemdup_array instead of kmemdup for multiple allocation
netfilter: nft_counter: Use u64_stats_t for statistic.
netfilter: ctnetlink: support CTA_FILTER for flush
====================
Link: https://patch.msgid.link/20240905232920.5481-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
list_head can be initialized automatically with LIST_HEAD()
instead of calling INIT_LIST_HEAD(). Here we can simplify
the code.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://patch.msgid.link/20240904093243.3345012-2-lihongbo22@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Unmask the upper DSCP bits when calling ip_route_output_ports() so that
in the future it could perform the FIB lookup according to the full DSCP
value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240903135327.2810535-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The function is passed the full DS field in its 'tos' argument by its
two callers. It then masks the upper DSCP bits using RT_TOS() when
passing it to ip_route_output_ports().
Unmask the upper DSCP bits when passing 'tos' to ip_route_output_ports()
so that in the future it could perform the FIB lookup according to the
full DSCP value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240903135327.2810535-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
"Interface can't change network namespaces" is rather an attribute,
not a feature, and it can't be changed via Ethtool.
Make it a "cold" private flag instead of a netdev_feature and free
one more bit.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
NETIF_F_LLTX can't be changed via Ethtool and is not a feature,
rather an attribute, very similar to IFF_NO_QUEUE (and hot).
Free one netdev_features_t bit and make it a "hot" private flag.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When we are allocating an array, using kmemdup_array() to take care about
multiplication and possible overflows.
Also it makes auditing the code easier.
Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Parameter flags is not used in bpf_tcp_ingress().
Signed-off-by: Yaxin Chen <yaxin.chen1@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20240823224843.1985277-1-yaxin.chen1@bytedance.com
The function calls flowi4_init_output() to initialize an IPv4 flow key
with which it then performs a FIB lookup using ip_route_output_flow().
'arg->tos' with which the TOS value in the IPv4 flow key (flowi4_tos) is
initialized contains the full DS field. Unmask the upper DSCP bits so
that in the future the FIB lookup could be performed according to the
full DSCP value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
build_sk_flow_key() and __build_flow_key() are used to build an IPv4
flow key before calling one of the FIB lookup APIs.
Unmask the upper DSCP bits so that in the future the lookup could be
performed according to the full DSCP value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function is called to resolve a route for an ICMP message that is
sent in response to a situation. Based on the type of the generated ICMP
message, the function is either passed the DS field of the packet that
generated the ICMP message or a DS field that is derived from it.
Unmask the upper DSCP bits before resolving and output route via
ip_route_output_key_hash() so that in the future the lookup could be
performed according to the full DSCP value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unmask the upper DSCP bits so that in the future output routes could be
looked up according to the full DSCP value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unmask the upper DSCP bits when looking up an output route via the
RTM_GETROUTE netlink message so that in the future the lookup could be
performed according to the full DSCP value.
No functional changes intended since the upper DSCP bits are masked when
comparing against the TOS selectors in FIB rules and routes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous patch made ICMP rate limits per netns, it makes sense
to allow each netns to change the associated sysctl.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240829144641.3880376-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Host wide ICMP ratelimiter should be per netns, to provide better isolation.
Following patch in this series makes the sysctl per netns.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240829144641.3880376-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ICMP messages are ratelimited :
After the blamed commits, the two rate limiters are applied in this order:
1) host wide ratelimit (icmp_global_allow())
2) Per destination ratelimit (inetpeer based)
In order to avoid side-channels attacks, we need to apply
the per destination check first.
This patch makes the following change :
1) icmp_global_allow() checks if the host wide limit is reached.
But credits are not yet consumed. This is deferred to 3)
2) The per destination limit is checked/updated.
This might add a new node in inetpeer tree.
3) icmp_global_consume() consumes tokens if prior operations succeeded.
This means that host wide ratelimit is still effective
in keeping inetpeer tree small even under DDOS.
As a bonus, I removed icmp_global.lock as the fast path
can use a lock-free operation.
Fixes: c0303efeab ("net: reduce cycles spend on ICMP replies that gets rate limited")
Fixes: 4cdf507d54 ("icmp: add a global rate limitation")
Reported-by: Keyu Man <keyu.man@email.ucr.edu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20240829144641.3880376-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
No lock protects tcp tw fields.
tcptw->tw_rcv_nxt can be changed from twsk_rcv_nxt_update()
while other threads might read this field.
Add READ_ONCE()/WRITE_ONCE() annotations, and make sure
tcp_timewait_state_process() reads tcptw->tw_rcv_nxt only once.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20240827015250.3509197-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using a volatile qualifier for a specific struct field is unusual.
Use instead READ_ONCE()/WRITE_ONCE() where necessary.
tcp_timewait_state_process() can change tw_substate while other
threads are reading this field.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20240827015250.3509197-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have some problem closing zero-window fin-wait-1 tcp sockets in our
environment. This patch come from the investigation.
Previously tcp_abort only sends out reset and calls tcp_done when the
socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
purging the write queue, but not close the socket and left it to the
timer.
While purging the write queue, tp->packets_out and sk->sk_write_queue
is cleared along the way. However tcp_retransmit_timer have early
return based on !tp->packets_out and tcp_probe_timer have early
return based on !sk->sk_write_queue.
This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
and socket not being killed by the timers, converting a zero-windowed
orphan into a forever orphan.
This patch removes the SOCK_DEAD check in tcp_abort, making it send
reset to peer and close the socket accordingly. Preventing the
timer-less orphan from happening.
According to Lorenzo's email in the v1 thread, the check was there to
prevent force-closing the same socket twice. That situation is handled
by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
already closed.
The -ENOENT code comes from the associate patch Lorenzo made for
iproute2-ss; link attached below, which also conform to RFC 9293.
At the end of the patch, tcp_write_queue_purge(sk) is removed because it
was already called in tcp_done_with_error().
p.s. This is the same patch with v2. Resent due to mis-labeled "changes
requested" on patchwork.kernel.org.
Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
Fixes: c1e64e298b ("net: diag: Support destroying TCP sockets.")
Signed-off-by: Xueming Feng <kuro@kuroa.me>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240826102327.1461482-1-kuro@kuroa.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The macro sk_for_each_bound_bhash accepts a parameter
__sk, but it was not used, rather the sk2 is directly
used, so we replace the sk2 with __sk in macro.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://patch.msgid.link/20240823070453.3327832-1-lihongbo22@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We found that one close-wait socket was reset by the other side
due to a new connection reusing the same port which is beyond our
expectation, so we have to investigate the underlying reason.
The following experiment is conducted in the test environment. We
limit the port range from 40000 to 40010 and delay the time to close()
after receiving a fin from the active close side, which can help us
easily reproduce like what happened in production.
Here are three connections captured by tcpdump:
127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965525191
127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 2769915070
127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1
127.0.0.1.40002 > 127.0.0.1.9999: Flags [F.], seq 1, ack 1
// a few seconds later, within 60 seconds
127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730
127.0.0.1.9999 > 127.0.0.1.40002: Flags [.], ack 2
127.0.0.1.40002 > 127.0.0.1.9999: Flags [R], seq 2965525193
// later, very quickly
127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730
127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 3120990805
127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1
As we can see, the first flow is reset because:
1) client starts a new connection, I mean, the second one
2) client tries to find a suitable port which is a timewait socket
(its state is timewait, substate is fin_wait2)
3) client occupies that timewait port to send a SYN
4) server finds a corresponding close-wait socket in ehash table,
then replies with a challenge ack
5) client sends an RST to terminate this old close-wait socket.
I don't think the port selection algo can choose a FIN_WAIT2 socket
when we turn on tcp_tw_reuse because on the server side there
remain unread data. In some cases, if one side haven't call close() yet,
we should not consider it as expendable and treat it at will.
Even though, sometimes, the server isn't able to call close() as soon
as possible like what we expect, it can not be terminated easily,
especially due to a second unrelated connection happening.
After this patch, we can see the expected failure if we start a
connection when all the ports are occupied in fin_wait2 state:
"Ncat: Cannot assign requested address."
Reported-by: Jade Dong <jadedong@tencent.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240823001152.31004-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmbHtrsACgkQ1V2XiooU
IOTZ8g/+OW8W468NmBHA2zrTWei19irA1iCBLvXMPakice0+ADU51eqVp6uUrCeP
iBUZGMtCq4WzFBrAEBePK3UNxxMHquWvAsA0kO/XW95KVM++s9ykF62q89jugMb3
CADEv/TxgJrkzpLWclxHNTCWMKpURijlkjT+kCMR4fKbeQnB6e/jI+2sdl7l5iRG
tHHm8ieewNNKE+jlSUJUrPEIM3tXRaZh9+JmbClfsF6wUw7qLmT/7P92aHBX4Owp
tpqi/Xc5/2k+Ud96a8u1NYrLLG+L70uz3SaeE7PvhaRavFuYftk2XLB4L2umtEfb
ZCZO/lCadH23XrVAUs5EtCDk4Tu3rZdTDsKYm2qS66uBsh/e6hg+j/cIPSO8jsNq
5Zbs/XzPFJ1PUpXVy8Sfs9vxH+cDuiqhy9nfKrbQotsqtoW+z52UoFH4WAjfmpqb
XMI+yeSTXYl1KIo2LV408VFRFRGcstBvXE7bOn7ufSrltRZcFdx7wqQBgVbh1zvA
1NTzIguZ+Lf2hPcNLPQd/f2vghKRTI7gUwzlDRw6so6NOWUM6/yV5KyoiZKekHjC
S6+M8cdiyMH8DmSsvAb46YKtDxYuIHqLVxuVqjfBHrMo1hLIo5smMCeCRA1Vabd5
/E4DTwpN5tVWX+HZl1wcAtQpXhcTktWM0qGSPnlRS11gwbAGWvk=
=O8tu
-----END PGP SIGNATURE-----
Merge tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter updates for net-next:
Patch #1 fix checksum calculation in nfnetlink_queue with SCTP,
segment GSO packet since skb_zerocopy() does not support
GSO_BY_FRAGS, from Antonio Ojea.
Patch #2 extend nfnetlink_queue coverage to handle SCTP packets,
from Antonio Ojea.
Patch #3 uses consume_skb() instead of kfree_skb() in nfnetlink,
from Donald Hunter.
Patch #4 adds a dedicate commit list for sets to speed up
intra-transaction lookups, from Florian Westphal.
Patch #5 skips removal of element from abort path for the pipapo
backend, ditching the shadow copy of this datastructure
is sufficient.
Patch #6 moves nf_ct_netns_get() out of nf_conncount_init() to
let users of conncoiunt decide when to enable conntrack,
this is needed by openvswitch, from Xin Long.
Patch #7 pass context to all nft_parse_register_load() in
preparation for the next patch.
Patches #8 and #9 reject loads from uninitialized registers from
control plane to remove register initialization from
datapath. From Florian Westphal.
* tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: don't initialize registers in nft_do_chain()
netfilter: nf_tables: allow loads only when register is initialized
netfilter: nf_tables: pass context structure to nft_parse_register_load
netfilter: move nf_ct_netns_get out of nf_conncount_init
netfilter: nf_tables: do not remove elements if set backend implements .abort
netfilter: nf_tables: store new sets in dedicated list
netfilter: nfnetlink: convert kfree_skb to consume_skb
selftests: netfilter: nft_queue.sh: sctp coverage
netfilter: nfnetlink_queue: unbreak SCTP traffic
====================
Link: https://patch.msgid.link/20240822221939.157858-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The initial value of err is -ENOBUFS, and err is guaranteed to be
less than 0 before all goto errout. Therefore, on the error path
of errout, there is no need to repeatedly judge that err is less than 0,
and delete redundant judgments to make the code more concise.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/broadcom/bnxt/bnxt.h
c948c0973d ("bnxt_en: Don't clear ntuple filters and rss contexts during ethtool ops")
f2878cdeb7 ("bnxt_en: Add support to call FW to update a VNIC")
Link: https://patch.msgid.link/20240822210125.1542769-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Unmask the upper DSCP bits when performing source validation and routing
a packet using the same route from a previously processed packet (hint).
In the future, this will allow us to perform the FIB lookup that is
performed as part of source validation according to the full DSCP value.
No functional changes intended since the upper DSCP bits are masked when
comparing against the TOS selectors in FIB rules and routes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240821125251.1571445-13-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Unmask the upper DSCP bits when performing source validation for
multicast packets during early demux. In the future, this will allow us
to perform the FIB lookup which is performed as part of source
validation according to the full DSCP value.
No functional changes intended since the upper DSCP bits are masked when
comparing against the TOS selectors in FIB rules and routes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240821125251.1571445-12-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Align the ICMP code to other callers of ip_route_input() and pass the
full DS field. In the future this will allow us to perform a route
lookup according to the full DSCP value.
No functional changes intended since the upper DSCP bits are masked when
comparing against the TOS selectors in FIB rules and routes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240821125251.1571445-11-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Unmask the upper DSCP bits when looking up an input route via the
RTM_GETROUTE netlink message so that in the future the lookup could be
performed according to the full DSCP value.
No functional changes intended since the upper DSCP bits are masked when
comparing against the TOS selectors in FIB rules and routes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240821125251.1571445-10-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Unmask the upper DSCP bits in input route lookup so that in the future
the lookup could be performed according to the full DSCP value.
No functional changes intended since the upper DSCP bits are masked when
comparing against the TOS selectors in FIB rules and routes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240821125251.1571445-9-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As explained in commit 35ebf65e85 ("ipv4: Create and use
fib_compute_spec_dst() helper."), the function is used - for example -
to determine the source address for an ICMP reply. If we are responding
to a multicast or broadcast packet, the source address is set to the
source address that we would use if we were to send a packet to the
unicast source of the original packet. This address is determined by
performing a FIB lookup and using the preferred source address of the
resulting route.
Unmask the upper DSCP bits of the DS field of the packet that triggered
the reply so that in the future the FIB lookup could be performed
according to the full DSCP value.
No functional changes intended since the upper DSCP bits are masked when
comparing against the TOS selectors in FIB rules and routes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240821125251.1571445-8-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Unmask the upper DSCP bits when calling ipmr_fib_lookup() so that in the
future it could perform the FIB lookup according to the full DSCP value.
Note that ipmr_fib_lookup() performs a FIB rule lookup (returning the
relevant routing table) and that IPv4 multicast FIB rules do not support
matching on TOS / DSCP. However, it is still worth unmasking the upper
DSCP bits in case support for DSCP matching is ever added.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240821125251.1571445-7-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In a similar fashion to the iptables rpfilter match, unmask the upper
DSCP bits of the DS field of the currently tested packet so that in the
future the FIB lookup could be performed according to the full DSCP
value.
No functional changes intended since the upper DSCP bits are masked when
comparing against the TOS selectors in FIB rules and routes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240821125251.1571445-6-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The rpfilter match performs a reverse path filter test on a packet by
performing a FIB lookup with the source and destination addresses
swapped.
Unmask the upper DSCP bits of the DS field of the tested packet so that
in the future the FIB lookup could be performed according to the full
DSCP value.
No functional changes intended since the upper DSCP bits are masked when
comparing against the TOS selectors in FIB rules and routes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240821125251.1571445-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The Record Route IP option records the addresses of the routers that
routed the packet. In the case of forwarded packets, the kernel performs
a route lookup via fib_lookup() and fills in the preferred source
address of the matched route.
Unmask the upper DSCP bits when performing the lookup so that in the
future the lookup could be performed according to the full DSCP value.
No functional changes intended since the upper DSCP bits are masked when
comparing against the TOS selectors in FIB rules and routes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240821125251.1571445-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The NETLINK_FIB_LOOKUP netlink family can be used to perform a FIB
lookup according to user provided parameters and communicate the result
back to user space.
Unmask the upper DSCP bits of the user-provided DS field before invoking
the IPv4 FIB lookup API so that in the future the lookup could be
performed according to the full DSCP value.
No functional changes intended since the upper DSCP bits are masked when
comparing against the TOS selectors in FIB rules and routes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240821125251.1571445-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When assembling fraglist GSO packets, udp4_gro_complete does not set
skb->csum_start, which makes the extra validation in __udp_gso_segment fail.
Fixes: 89add40066 ("net: drop bad gso csum_start and offset in virtio_net_hdr")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240819150621.59833-1-nbd@nbd.name
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Recent commit ed8ebee6de ("l2tp: have l2tp_ip_destroy_sock use
ip_flush_pending_frames") was incorrect in that l2tp_ip does not use
socket cork and ip_flush_pending_frames is for sockets that do. Use
__skb_queue_purge instead and remove the unnecessary lock.
Also unexport ip_flush_pending_frames since it was originally exported
in commit 4ff8863419 ("ipv4: export ip_flush_pending_frames") for
l2tp and is not used by other modules.
Suggested-by: xiyou.wangcong@gmail.com
Signed-off-by: James Chapman <jchapman@katalix.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240819143333.3204957-1-jchapman@katalix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The TOS field in the IPv4 flow information structure ('flowi4_tos') is
matched by the kernel against the TOS selector in IPv4 rules and routes.
The field is initialized differently by different call sites. Some treat
it as DSCP (RFC 2474) and initialize all six DSCP bits, some treat it as
RFC 1349 TOS and initialize it using RT_TOS() and some treat it as RFC
791 TOS and initialize it using IPTOS_RT_MASK.
What is common to all these call sites is that they all initialize the
lower three DSCP bits, which fits the TOS definition in the initial IPv4
specification (RFC 791).
Therefore, the kernel only allows configuring IPv4 FIB rules that match
on the lower three DSCP bits which are always guaranteed to be
initialized by all call sites:
# ip -4 rule add tos 0x1c table 100
# ip -4 rule add tos 0x3c table 100
Error: Invalid tos.
While this works, it is unlikely to be very useful. RFC 791 that
initially defined the TOS and IP precedence fields was updated by RFC
2474 over twenty five years ago where these fields were replaced by a
single six bits DSCP field.
Extending FIB rules to match on DSCP can be done by adding a new DSCP
selector while maintaining the existing semantics of the TOS selector
for applications that rely on that.
A prerequisite for allowing FIB rules to match on DSCP is to adjust all
the call sites to initialize the high order DSCP bits and remove their
masking along the path to the core where the field is matched on.
However, making this change alone will result in a behavior change. For
example, a forwarded IPv4 packet with a DS field of 0xfc will no longer
match a FIB rule that was configured with 'tos 0x1c'.
This behavior change can be avoided by masking the upper three DSCP bits
in 'flowi4_tos' before comparing it against the TOS selectors in FIB
rules and routes.
Implement the above by adding a new function that checks whether a given
DSCP value matches the one specified in the IPv4 flow information
structure and invoke it from the three places that currently match on
'flowi4_tos'.
Use RT_TOS() for the masking of 'flowi4_tos' instead of IPTOS_RT_MASK
since the latter is not uAPI and we should be able to remove it at some
point.
Include <linux/ip.h> in <linux/in_route.h> since the former defines
IPTOS_TOS_MASK which is used in the definition of RT_TOS() in
<linux/in_route.h>.
No regressions in FIB tests:
# ./fib_tests.sh
[...]
Tests passed: 218
Tests failed: 0
And FIB rule tests:
# ./fib_rule_tests.sh
[...]
Tests passed: 116
Tests failed: 0
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As part of its functionality, the nftables FIB expression module
performs a FIB lookup, but unlike other users of the FIB lookup API, it
does so without masking the upper DSCP bits. In particular, this differs
from the equivalent iptables match ("rpfilter") that does mask the upper
DSCP bits before the FIB lookup.
Align the module to other users of the FIB lookup API and mask the upper
DSCP bits using IPTOS_RT_MASK before the lookup.
No regressions in nft_fib.sh:
# ./nft_fib.sh
PASS: fib expression did not cause unwanted packet drops
PASS: fib expression did drop packets for 1.1.1.1
PASS: fib expression did drop packets for 1c3::c01d
PASS: fib expression forward check with policy based routing
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The NETLINK_FIB_LOOKUP netlink family can be used to perform a FIB
lookup according to user provided parameters and communicate the result
back to user space.
However, unlike other users of the FIB lookup API, the upper DSCP bits
and the ECN bits of the DS field are not masked, which can result in the
wrong result being returned.
Solve this by masking the upper DSCP bits and the ECN bits using
IPTOS_RT_MASK.
The structure that communicates the request and the response is not
exported to user space, so it is unlikely that this netlink family is
actually in use [1].
[1] https://lore.kernel.org/netdev/ZpqpB8vJU%2FQ6LSqa@debian/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Use the netlink policy to validate IPv6 address length.
Destination address currently has policy for max len set,
and source has no policy validation. In both cases
the code does the real check. With correct policy
check the code can be removed.
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240816212245.467745-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Its possible that two threads call tcp_sk_exit_batch() concurrently,
once from the cleanup_net workqueue, once from a task that failed to clone
a new netns. In the latter case, error unwinding calls the exit handlers
in reverse order for the 'failed' netns.
tcp_sk_exit_batch() calls tcp_twsk_purge().
Problem is that since commit b099ce2602 ("net: Batch inet_twsk_purge"),
this function picks up twsk in any dying netns, not just the one passed
in via exit_batch list.
This means that the error unwind of setup_net() can "steal" and destroy
timewait sockets belonging to the exiting netns.
This allows the netns exit worker to proceed to call
WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
without the expected 1 -> 0 transition, which then splats.
At same time, error unwind path that is also running inet_twsk_purge()
will splat as well:
WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210
...
refcount_dec include/linux/refcount.h:351 [inline]
inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221
inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
setup_net+0x714/0xb40 net/core/net_namespace.c:375
copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
This doesn't seem like an actual bug (no tw sockets got lost and I don't
see a use-after-free) but as erroneous trigger of debug check.
Add a mutex to force strict ordering: the task that calls tcp_twsk_purge()
blocks other task from doing final _dec_and_test before mutex-owner has
removed all tw sockets of dying netns.
Fixes: e9bd0cca09 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.")
Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/
Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240812222857.29837-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
INFINITY_LIFE_TIME is the common value used in IPv4 and IPv6 but defined
in both .c files.
Also, 0xffffffff used in addrconf_timeout_fixup() is INFINITY_LIFE_TIME.
Let's move INFINITY_LIFE_TIME's definition to addrconf.h
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240809235406.50187-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Whenever ifa is allocated, we call INIT_HLIST_NODE(&ifa->hash).
Let's move it to inet_alloc_ifa().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240809235406.50187-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now, ifa_dev is only set in inet_alloc_ifa() and never
NULL after ifa gets visible.
Let's remove the unneeded NULL check for ifa->ifa_dev.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240809235406.50187-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a new IPv4 address is assigned via ioctl(SIOCSIFADDR),
inet_set_ifa() sets ifa->ifa_dev if it's different from in_dev
passed as an argument.
In this case, ifa is always a newly allocated object, and
ifa->ifa_dev is NULL.
inet_set_ifa() can be called for an existing reused ifa, then,
this check is always false.
Let's set ifa_dev in inet_alloc_ifa() and remove the check
in inet_set_ifa().
Now, inet_alloc_ifa() is symmetric with inet_rcu_free_ifa().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240809235406.50187-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dev->ip_ptr could be NULL if we set an invalid MTU.
Even then, if we issue ioctl(SIOCSIFADDR) for a new IPv4 address,
devinet_ioctl() allocates struct in_ifaddr and fails later in
inet_set_ifa() because in_dev is NULL.
Let's move the check earlier.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240809235406.50187-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch is based on the discussions between Neal Cardwell and
Eric Dumazet in the link
https://lore.kernel.org/netdev/20240726204105.1466841-1-quic_subashab@quicinc.com/
It was correctly pointed out that tp->window_clamp would not be
updated in cases where net.ipv4.tcp_moderate_rcvbuf=0 or if
(copied <= tp->rcvq_space.space). While it is expected for most
setups to leave the sysctl enabled, the latter condition may
not end up hitting depending on the TCP receive queue size and
the pattern of arriving data.
The updated check should be hit only on initial MSS update from
TCP_MIN_MSS to measured MSS value and subsequently if there was
an update to a larger value.
Fixes: 05f76b2d63 ("tcp: Adjust clamping window for applications specifying SO_RCVBUF")
Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com>
Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In CLOS networks, as link failures occur at various points in the network,
ECMP weights of the involved nodes are adjusted to compensate. With high
fan-out of the involved nodes, and overall high number of nodes,
a (non-)ECMP weight ratio that we would like to configure does not fit into
8 bits. Instead of, say, 255:254, we might like to configure something like
1000:999. For these deployments, the 8-bit weight may not be enough.
To that end, in this patch increase the next hop weight from u8 to u16.
Increasing the width of an integral type can be tricky, because while the
code still compiles, the types may not check out anymore, and numerical
errors come up. To prevent this, the conversion was done in two steps.
First the type was changed from u8 to a single-member structure, which
invalidated all uses of the field. This allowed going through them one by
one and audit for type correctness. Then the structure was replaced with a
vanilla u16 again. This should ensure that no place was missed.
The UAPI for configuring nexthop group members is that an attribute
NHA_GROUP carries an array of struct nexthop_grp entries:
struct nexthop_grp {
__u32 id; /* nexthop id - must exist */
__u8 weight; /* weight of this nexthop */
__u8 resvd1;
__u16 resvd2;
};
The field resvd1 is currently validated and required to be zero. We can
lift this requirement and carry high-order bits of the weight in the
reserved field:
struct nexthop_grp {
__u32 id; /* nexthop id - must exist */
__u8 weight; /* weight of this nexthop */
__u8 weight_high;
__u16 resvd2;
};
Keeping the fields split this way was chosen in case an existing userspace
makes assumptions about the width of the weight field, and to sidestep any
endianness issues.
The weight field is currently encoded as the weight value minus one,
because weight of 0 is invalid. This same trick is impossible for the new
weight_high field, because zero must mean actual zero. With this in place:
- Old userspace is guaranteed to carry weight_high of 0, therefore
configuring 8-bit weights as appropriate. When dumping nexthops with
16-bit weight, it would only show the lower 8 bits. But configuring such
nexthops implies existence of userspace aware of the extension in the
first place.
- New userspace talking to an old kernel will work as long as it only
attempts to configure 8-bit weights, where the high-order bits are zero.
Old kernel will bounce attempts at configuring >8-bit weights.
Renaming reserved fields as they are allocated for some purpose is commonly
done in Linux. Whoever touches a reserved field is doing so at their own
risk. nexthop_grp::resvd1 in particular is currently used by at least
strace, however they carry an own copy of UAPI headers, and the conversion
should be trivial. A helper is provided for decoding the weight out of the
two fields. Forcing a conversion seems preferable to bending backwards and
introducing anonymous unions or whatever.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/483e2fcf4beb0d9135d62e7d27b46fa2685479d4.1723036486.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are many unpatched kernel versions out there that do not initialize
the reserved fields of struct nexthop_grp. The issue with that is that if
those fields were to be used for some end (i.e. stop being reserved), old
kernels would still keep sending random data through the field, and a new
userspace could not rely on the value.
In this patch, use the existing NHA_OP_FLAGS, which is currently inbound
only, to carry flags back to the userspace. Add a flag to indicate that the
reserved fields in struct nexthop_grp are zeroed before dumping. This is
reliant on the actual fix from commit 6d745cd0e9 ("net: nexthop:
Initialize all fields in dumped nexthops").
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/21037748d4f9d8ff486151f4c09083bcf12d5df8.1723036486.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In commit 10154dbded ("udp: Allow GSO transmit from devices with no
checksum offload") we have intentionally allowed UDP GSO packets marked
CHECKSUM_NONE to pass to the GSO stack, so that they can be segmented and
checksummed by a software fallback when the egress device lacks these
features.
What was not taken into consideration is that a CHECKSUM_NONE skb can be
handed over to the GSO stack also when the egress device advertises the
tx-udp-segmentation / NETIF_F_GSO_UDP_L4 feature.
This will happen when there are IPv6 extension headers present, which we
check for in __ip6_append_data(). Syzbot has discovered this scenario,
producing a warning as below:
ip6tnl0: caps=(0x00000006401d7869, 0x00000006401d7869)
WARNING: CPU: 0 PID: 5112 at net/core/dev.c:3293 skb_warn_bad_offload+0x166/0x1a0 net/core/dev.c:3291
Modules linked in:
CPU: 0 PID: 5112 Comm: syz-executor391 Not tainted 6.10.0-rc7-syzkaller-01603-g80ab5445da62 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/07/2024
RIP: 0010:skb_warn_bad_offload+0x166/0x1a0 net/core/dev.c:3291
[...]
Call Trace:
<TASK>
__skb_gso_segment+0x3be/0x4c0 net/core/gso.c:127
skb_gso_segment include/net/gso.h:83 [inline]
validate_xmit_skb+0x585/0x1120 net/core/dev.c:3661
__dev_queue_xmit+0x17a4/0x3e90 net/core/dev.c:4415
neigh_output include/net/neighbour.h:542 [inline]
ip6_finish_output2+0xffa/0x1680 net/ipv6/ip6_output.c:137
ip6_finish_output+0x41e/0x810 net/ipv6/ip6_output.c:222
ip6_send_skb+0x112/0x230 net/ipv6/ip6_output.c:1958
udp_v6_send_skb+0xbf5/0x1870 net/ipv6/udp.c:1292
udpv6_sendmsg+0x23b3/0x3270 net/ipv6/udp.c:1588
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0xef/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2585
___sys_sendmsg net/socket.c:2639 [inline]
__sys_sendmmsg+0x3b2/0x740 net/socket.c:2725
__do_sys_sendmmsg net/socket.c:2754 [inline]
__se_sys_sendmmsg net/socket.c:2751 [inline]
__x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2751
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
[...]
</TASK>
We are hitting the bad offload warning because when an egress device is
capable of handling segmentation offload requested by
skb_shinfo(skb)->gso_type, the chain of gso_segment callbacks won't produce
any segment skbs and return NULL. See the skb_gso_ok() branch in
{__udp,tcp,sctp}_gso_segment helpers.
To fix it, force a fallback to software USO when processing a packet with
IPv6 extension headers, since we don't know if these can checksummed by
all devices which offer USO.
Fixes: 10154dbded ("udp: Allow GSO transmit from devices with no checksum offload")
Reported-by: syzbot+e15b7e15b8a751a91d9a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000e1609a061d5330ce@google.com/
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://patch.msgid.link/20240808-udp-gso-egress-from-tunnel-v4-2-f5c5b4149ab9@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now it's time to let it work by using the 'reason' parameter in
the trace world :)
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When user tries to disconnect a socket and there are more data written
into tcp write queue, we should tell users about this reset reason.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introducing this to show the users the reason of keepalive timeout.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introducing a new type TCP_STATE to handle some reset conditions
appearing in RFC 793 due to its socket state. Actually, we can look
into RFC 9293 which has no discrepancy about this part.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introducing a new type TCP_ABORT_ON_MEMORY for tcp reset reason to handle
out of memory case.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introducing a new type TCP_ABORT_ON_LINGER for tcp reset reason to handle
negative linger value case.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introducing a new type TCP_ABORT_ON_CLOSE for tcp reset reason to handle
the case where more data is unread in closing phase.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following helpers do not touch their 'struct net' argument.
- udp_sk_bound_dev_eq()
- udp4_lib_lookup()
- __udp4_lib_lookup()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240802134029.3748005-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Following helpers do not touch their struct net argument:
- bpf_sk_lookup_run_v4()
- inet_lookup_reuseport()
- inet_lhash2_lookup()
- inet_lookup_run_sk_lookup()
- __inet_lookup_listener()
- __inet_lookup_established()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240802134029.3748005-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet_sk_bound_dev_eq() and its callers do not modify the net structure.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240802134029.3748005-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The lifetime of TCP-AO static_key is the same as the last
tcp_ao_info. On the socket destruction tcp_ao_info ceases to be
with RCU grace period, while tcp-ao static branch is currently deferred
destructed. The static key definition is
: DEFINE_STATIC_KEY_DEFERRED_FALSE(tcp_ao_needed, HZ);
which means that if RCU grace period is delayed by more than a second
and tcp_ao_needed is in the process of disablement, other CPUs may
yet see tcp_ao_info which atent dead, but soon-to-be.
And that breaks the assumption of static_key_fast_inc_not_disabled().
See the comment near the definition:
> * The caller must make sure that the static key can't get disabled while
> * in this function. It doesn't patch jump labels, only adds a user to
> * an already enabled static key.
Originally it was introduced in commit eb8c507296 ("jump_label:
Prevent key->enabled int overflow"), which is needed for the atomic
contexts, one of which would be the creation of a full socket from a
request socket. In that atomic context, it's known by the presence
of the key (md5/ao) that the static branch is already enabled.
So, the ref counter for that static branch is just incremented
instead of holding the proper mutex.
static_key_fast_inc_not_disabled() is just a helper for such usage
case. But it must not be used if the static branch could get disabled
in parallel as it's not protected by jump_label_mutex and as a result,
races with jump_label_update() implementation details.
Happened on netdev test-bot[1], so not a theoretical issue:
[] jump_label: Fatal kernel bug, unexpected op at tcp_inbound_hash+0x1a7/0x870 [ffffffffa8c4e9b7] (eb 50 0f 1f 44 != 66 90 0f 1f 00)) size:2 type:1
[] ------------[ cut here ]------------
[] kernel BUG at arch/x86/kernel/jump_label.c:73!
[] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[] CPU: 3 PID: 243 Comm: kworker/3:3 Not tainted 6.10.0-virtme #1
[] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[] Workqueue: events jump_label_update_timeout
[] RIP: 0010:__jump_label_patch+0x2f6/0x350
...
[] Call Trace:
[] <TASK>
[] arch_jump_label_transform_queue+0x6c/0x110
[] __jump_label_update+0xef/0x350
[] __static_key_slow_dec_cpuslocked.part.0+0x3c/0x60
[] jump_label_update_timeout+0x2c/0x40
[] process_one_work+0xe3b/0x1670
[] worker_thread+0x587/0xce0
[] kthread+0x28a/0x350
[] ret_from_fork+0x31/0x70
[] ret_from_fork_asm+0x1a/0x30
[] </TASK>
[] Modules linked in: veth
[] ---[ end trace 0000000000000000 ]---
[] RIP: 0010:__jump_label_patch+0x2f6/0x350
[1]: https://netdev-3.bots.linux.dev/vmksft-tcp-ao-dbg/results/696681/5-connect-deny-ipv6/stderr
Cc: stable@kernel.org
Fixes: 67fa83f7c8 ("net/tcp: Add static_key for TCP-AO")
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current release - regressions:
- core: drop bad gso csum_start and offset in virtio_net_hdr
- wifi: mt76: fix null pointer access in mt792x_mac_link_bss_remove
- eth: tun: add missing bpf_net_ctx_clear() in do_xdp_generic()
- phy: aquantia: only poll GLOBAL_CFG regs on aqr113, aqr113c and aqr115c
Current release - new code bugs:
- smc: prevent UAF in inet_create()
- bluetooth: btmtk: fix kernel crash when entering btmtk_usb_suspend
- eth: bnxt: reject unsupported hash functions
Previous releases - regressions:
- sched: act_ct: take care of padding in struct zones_ht_key
- netfilter: fix null-ptr-deref in iptable_nat_table_init().
- tcp: adjust clamping window for applications specifying SO_RCVBUF
Previous releases - always broken:
- ethtool: rss: small fixes to spec and GET
- mptcp:
- fix signal endpoint re-add
- pm: fix backup support in signal endpoints
- wifi: ath12k: fix soft lockup on suspend
- eth: bnxt_en: fix RSS logic in __bnxt_reserve_rings()
- eth: ice: fix AF_XDP ZC timeout and concurrency issues
- eth: mlx5:
- fix missing lock on sync reset reload
- fix error handling in irq_pool_request_irq
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmarelYSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkdPwP/2lxh5Cc/SK/mJjBvyBdO2+cuNR0M4Kf
UV2PA4oOLREYXEPgmOtJQ/VcsmOLa1pEPAdJarZwB5ztalgRKkIogHzzjfY43Fmx
rAgZqGnIJrWRtepDM8jAaEJC0bEKywH5Wo6eh+oi0GCS07B48lpYATI/1gQdwBjV
CgcZTQd/04PVx69Bi8LiQyfbwppAsIQQa9YaGmqGuQa74Hp9gz+4VyeRFg54h3CP
6fWwRHNVO8GsGNA1UgWbeXXajhUU+AG/gDThqIcgxs3KmrREzU9EvcQ70XCzphOA
JoUy9yykWRGen7aFGrggfY4NzjQmL6g+/rCvbIMfidRsJKBaQYBeMUkbQRnAh34V
Pe3aSBEnv1aBKaQA7yntdqYGRJ2Sz56a1kjCvI86eDjExt4UshbZi+TfuQSj6zAY
/ejOawhEYPFZw2FlvkBetyck7iroG1404DoBPghoRu9dG2e3p0eJOZfXiEzfS2qB
PsJtMPiexSdEcY3sxVKOMh4hx0Zjkqest7laitb1Lrbg5pLhEiHvDkyhoUPGI2oa
a3N4rsBc6sgSTQfJsx4nXFfKfNQsNu2Nr308BDk16XOHZ4J7Hgt6xR6STDo9ACz1
Gy5munCN2AhGSdhR5niFI3ocNpDM5oWkztBfjz7YmIQv18NcU5nO8ByYytXMbglq
sSsnR+VbYeCu
=z53B
-----END PGP SIGNATURE-----
Merge tag 'net-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from wireless, bleutooth, BPF and netfilter.
Current release - regressions:
- core: drop bad gso csum_start and offset in virtio_net_hdr
- wifi: mt76: fix null pointer access in mt792x_mac_link_bss_remove
- eth: tun: add missing bpf_net_ctx_clear() in do_xdp_generic()
- phy: aquantia: only poll GLOBAL_CFG regs on aqr113, aqr113c and
aqr115c
Current release - new code bugs:
- smc: prevent UAF in inet_create()
- bluetooth: btmtk: fix kernel crash when entering btmtk_usb_suspend
- eth: bnxt: reject unsupported hash functions
Previous releases - regressions:
- sched: act_ct: take care of padding in struct zones_ht_key
- netfilter: fix null-ptr-deref in iptable_nat_table_init().
- tcp: adjust clamping window for applications specifying SO_RCVBUF
Previous releases - always broken:
- ethtool: rss: small fixes to spec and GET
- mptcp:
- fix signal endpoint re-add
- pm: fix backup support in signal endpoints
- wifi: ath12k: fix soft lockup on suspend
- eth: bnxt_en: fix RSS logic in __bnxt_reserve_rings()
- eth: ice: fix AF_XDP ZC timeout and concurrency issues
- eth: mlx5:
- fix missing lock on sync reset reload
- fix error handling in irq_pool_request_irq"
* tag 'net-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
mptcp: fix duplicate data handling
mptcp: fix bad RCVPRUNED mib accounting
ipv6: fix ndisc_is_useropt() handling for PIO
igc: Fix double reset adapter triggered from a single taprio cmd
net: MAINTAINERS: Demote Qualcomm IPA to "maintained"
net: wan: fsl_qmc_hdlc: Discard received CRC
net: wan: fsl_qmc_hdlc: Convert carrier_lock spinlock to a mutex
net/mlx5e: Add a check for the return value from mlx5_port_set_eth_ptys
net/mlx5e: Fix CT entry update leaks of modify header context
net/mlx5e: Require mlx5 tc classifier action support for IPsec prio capability
net/mlx5: Fix missing lock on sync reset reload
net/mlx5: Lag, don't use the hardcoded value of the first port
net/mlx5: DR, Fix 'stack guard page was hit' error in dr_rule
net/mlx5: Fix error handling in irq_pool_request_irq
net/mlx5: Always drain health in shutdown callback
net: Add skbuff.h to MAINTAINERS
r8169: don't increment tx_dropped in case of NETDEV_TX_BUSY
netfilter: iptables: Fix potential null-ptr-deref in ip6table_nat_table_init().
netfilter: iptables: Fix null-ptr-deref in iptable_nat_table_init().
net: drop bad gso csum_start and offset in virtio_net_hdr
...
To avoid protocol modules implementing their own, export
ip_flush_pending_frames.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The goo.gl URL shortener is deprecated and is due to stop
expanding existing links in 2025.
Expand the link in Kconfig.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240729205337.48058-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tighten csum_start and csum_offset checks in virtio_net_hdr_to_skb
for GSO packets.
The function already checks that a checksum requested with
VIRTIO_NET_HDR_F_NEEDS_CSUM is in skb linear. But for GSO packets
this might not hold for segs after segmentation.
Syzkaller demonstrated to reach this warning in skb_checksum_help
offset = skb_checksum_start_offset(skb);
ret = -EINVAL;
if (WARN_ON_ONCE(offset >= skb_headlen(skb)))
By injecting a TSO packet:
WARNING: CPU: 1 PID: 3539 at net/core/dev.c:3284 skb_checksum_help+0x3d0/0x5b0
ip_do_fragment+0x209/0x1b20 net/ipv4/ip_output.c:774
ip_finish_output_gso net/ipv4/ip_output.c:279 [inline]
__ip_finish_output+0x2bd/0x4b0 net/ipv4/ip_output.c:301
iptunnel_xmit+0x50c/0x930 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x2296/0x2c70 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x759/0xa60 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4850 [inline]
netdev_start_xmit include/linux/netdevice.h:4864 [inline]
xmit_one net/core/dev.c:3595 [inline]
dev_hard_start_xmit+0x261/0x8c0 net/core/dev.c:3611
__dev_queue_xmit+0x1b97/0x3c90 net/core/dev.c:4261
packet_snd net/packet/af_packet.c:3073 [inline]
The geometry of the bad input packet at tcp_gso_segment:
[ 52.003050][ T8403] skb len=12202 headroom=244 headlen=12093 tailroom=0
[ 52.003050][ T8403] mac=(168,24) mac_len=24 net=(192,52) trans=244
[ 52.003050][ T8403] shinfo(txflags=0 nr_frags=1 gso(size=1552 type=3 segs=0))
[ 52.003050][ T8403] csum(0x60000c7 start=199 offset=1536
ip_summed=3 complete_sw=0 valid=0 level=0)
Mitigate with stricter input validation.
csum_offset: for GSO packets, deduce the correct value from gso_type.
This is already done for USO. Extend it to TSO. Let UFO be:
udp[46]_ufo_fragment ignores these fields and always computes the
checksum in software.
csum_start: finding the real offset requires parsing to the transport
header. Do not add a parser, use existing segmentation parsing. Thanks
to SKB_GSO_DODGY, that also catches bad packets that are hw offloaded.
Again test both TSO and USO. Do not test UFO for the above reason, and
do not test UDP tunnel offload.
GSO packet are almost always CHECKSUM_PARTIAL. USO packets may be
CHECKSUM_NONE since commit 10154dbded ("udp: Allow GSO transmit
from devices with no checksum offload"), but then still these fields
are initialized correctly in udp4_hwcsum/udp6_hwcsum_outgoing. So no
need to test for ip_summed == CHECKSUM_PARTIAL first.
This revises an existing fix mentioned in the Fixes tag, which broke
small packets with GSO offload, as detected by kselftests.
Link: https://syzkaller.appspot.com/bug?extid=e1db31216c789f552871
Link: https://lore.kernel.org/netdev/20240723223109.2196886-1-kuba@kernel.org
Fixes: e269d79c7d ("net: missing check virtio")
Cc: stable@vger.kernel.org
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240729201108.1615114-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The bpf_tcp_ca struct_ops currently uses a "u32 unsupported_ops[]"
array to track which ops is not supported.
After cfi_stubs had been added, the function pointer in cfi_stubs is
also NULL for the unsupported ops. Thus, the "u32 unsupported_ops[]"
becomes redundant. This observation was originally brought up in the
bpf/cfi discussion:
https://lore.kernel.org/bpf/CAADnVQJoEkdjyCEJRPASjBw1QGsKYrF33QdMGc1RZa9b88bAEA@mail.gmail.com/
The recent bpf qdisc patch (https://lore.kernel.org/bpf/20240714175130.4051012-6-amery.hung@bytedance.com/)
also needs to specify quite many unsupported ops. It is a good time
to clean it up.
This patch removes the need of "u32 unsupported_ops[]" and tests for null-ness
in the cfi_stubs instead.
Testing the cfi_stubs is done in a new function bpf_struct_ops_supported().
The verifier will call bpf_struct_ops_supported() when loading the
struct_ops program. The ".check_member" is removed from the bpf_tcp_ca
in this patch. ".check_member" could still be useful for other subsytems
to enforce other restrictions (e.g. sched_ext checks for prog->sleepable).
To keep the same error return, ENOTSUPP is used.
Cc: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240722183049.2254692-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
tp->scaling_ratio is not updated based on skb->len/skb->truesize once
SO_RCVBUF is set leading to the maximum window scaling to be 25% of
rcvbuf after
commit dfa2f04833 ("tcp: get rid of sysctl_tcp_adv_win_scale")
and 50% of rcvbuf after
commit 697a6c8cec ("tcp: increase the default TCP scaling ratio").
50% tries to emulate the behavior of older kernels using
sysctl_tcp_adv_win_scale with default value.
Systems which were using a different values of sysctl_tcp_adv_win_scale
in older kernels ended up seeing reduced download speeds in certain
cases as covered in https://lists.openwall.net/netdev/2024/05/15/13
While the sysctl scheme is no longer acceptable, the value of 50% is
a bit conservative when the skb->len/skb->truesize ratio is later
determined to be ~0.66.
Applications not specifying SO_RCVBUF update the window scaling and
the receiver buffer every time data is copied to userspace. This
computation is now used for applications setting SO_RCVBUF to update
the maximum window scaling while ensuring that the receive buffer
is within the application specified limit.
Fixes: dfa2f04833 ("tcp: get rid of sysctl_tcp_adv_win_scale")
Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com>
Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 3a7e02c040 ("minmax: avoid overly complicated constant
expressions in VM code") added the simpler MIN_T/MAX_T macros in order
to avoid some excessive expansion from the rather complicated regular
min/max macros.
The complexity of those macros stems from two issues:
(a) trying to use them in situations that require a C constant
expression (in static initializers and for array sizes)
(b) the type sanity checking
and MIN_T/MAX_T avoids both of these issues.
Now, in the whole (long) discussion about all this, it was pointed out
that the whole type sanity checking is entirely unnecessary for
min_t/max_t which get a fixed type that the comparison is done in.
But that still leaves min_t/max_t unnecessarily complicated due to
worries about the C constant expression case.
However, it turns out that there really aren't very many cases that use
min_t/max_t for this, and we can just force-convert those.
This does exactly that.
Which in turn will then allow for much simpler implementations of
min_t()/max_t(). All the usual "macros in all upper case will evaluate
the arguments multiple times" rules apply.
We should do all the same things for the regular min/max() vs MIN/MAX()
cases, but that has the added complexity of various drivers defining
their own local versions of MIN/MAX, so that needs another level of
fixes first.
Link: https://lore.kernel.org/all/b47fad1d0cf8449886ad148f8c013dae@AcuMS.aculab.com/
Cc: David Laight <David.Laight@aculab.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
catching COVID, so relatively short PR. Including fixes from bpf
and netfilter.
Current release - regressions:
- tcp: process the 3rd ACK with sk_socket for TFO and MPTCP
Current release - new code bugs:
- l2tp: protect session IDR and tunnel session list with one lock,
make sure the state is coherent to avoid a warning
- eth: bnxt_en: update xdp_rxq_info in queue restart logic
- eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field
Previous releases - regressions:
- xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len,
the field reuses previously un-validated pad
Previous releases - always broken:
- tap/tun: drop short frames to prevent crashes later in the stack
- eth: ice: add a per-VF limit on number of FDIR filters
- af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmaibxAACgkQMUZtbf5S
IruuIRAAu96TiN/urPwmKznyb/Sk8x7p8iUzn6OvPS/TUlFUkURQtOh6M9uvbpN4
x/L//EWkMR0hY4SkBegoiXfb1GS0PjBdWTWUiROm5X9nVHqp5KRZAxWXhjFiS1BO
BIYOT+JfCl7mQiPs90Mys/cEtYOggMBsCZQVIGw/iYoJLFREqxFSONwa0dG+tGMX
jn9WNu4yCVDhJ/jtl2MaTsCNtYUaBUgYrKHJBfNGfJ2Lz/7rH9yFui2WSMlmOd/U
QGeCb1DWURlShlCqY37wNinbFsxWkI5JN00ukTtwFAXLIaqc+zgHcIjrDjTJwK43
F4tKbJT3+bmehMU/h3Uo3c7DhXl7n9zDGiDtbCxnkykp0sFGJpjhDrWydo51c+YB
qW5HaNrII2LiDicOVN8L29ylvKp7AEkClxgivEhZVGGk2f/szJRXfp9u3WBn5kAx
3paH55YN0DEsKbYbb1ZENEI1Vnc/4ff4PxZJCUNKwzcS8wCn1awqwcriK9TjS/cp
fjilNFT4J3/uFrodHWTkx0jJT6UJFT0aF03qPLUH/J5kG+EVukOf1jBPInNdf1si
1j47SpblHUe86HiHphFMt32KZ210lJzWxh8uGma57Y2sB9makdLiK4etrFjkiMJJ
Z8A3kGp3KpFjbuK4tHY25rp+5oxLNNOBNpay29lQrWtCL/NDcaQ=
=9OsH
-----END PGP SIGNATURE-----
Merge tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and netfilter.
A lot of networking people were at a conference last week, busy
catching COVID, so relatively short PR.
Current release - regressions:
- tcp: process the 3rd ACK with sk_socket for TFO and MPTCP
Current release - new code bugs:
- l2tp: protect session IDR and tunnel session list with one lock,
make sure the state is coherent to avoid a warning
- eth: bnxt_en: update xdp_rxq_info in queue restart logic
- eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field
Previous releases - regressions:
- xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len,
the field reuses previously un-validated pad
Previous releases - always broken:
- tap/tun: drop short frames to prevent crashes later in the stack
- eth: ice: add a per-VF limit on number of FDIR filters
- af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash"
* tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
tun: add missing verification for short frame
tap: add missing verification for short frame
mISDN: Fix a use after free in hfcmulti_tx()
gve: Fix an edge case for TSO skb validity check
bnxt_en: update xdp_rxq_info in queue restart logic
tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
bpf: Fix a segment issue when downgrading gso_size
net: mediatek: Fix potential NULL pointer dereference in dummy net_device handling
MAINTAINERS: make Breno the netconsole maintainer
MAINTAINERS: Update bonding entry
net: nexthop: Initialize all fields in dumped nexthops
net: stmmac: Correct byte order of perfect_match
selftests: forwarding: skip if kernel not support setting bridge fdb learning limit
tipc: Return non-zero value from tipc_udp_addr2str() on error
netfilter: nft_set_pipapo_avx2: disable softinterrupts
ice: Fix recipe read procedure
ice: Add a per-VF limit on number of FDIR filters
net: bonding: correctly annotate RCU in bond_should_notify_peers()
...
The 'Fixes' commit recently changed the behaviour of TCP by skipping the
processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
unnecessary ACK in case of simultaneous connect(). Unfortunately, that
had an impact on TFO and MPTCP.
I started to look at the impact on MPTCP, because the MPTCP CI found
some issues with the MPTCP Packetdrill tests [1]. Then Paolo Abeni
suggested me to look at the impact on TFO with "plain" TCP.
For MPTCP, when receiving the 3rd ACK of a request adding a new path
(MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
has been created when the MPTCP connection got established before with
the first path. The newly added 'goto' will then skip the processing of
the segment text (step 7) and not go through tcp_data_queue() where the
MPTCP options are validated, and some actions are triggered, e.g.
sending the MPJ 4th ACK [2] as demonstrated by the new errors when
running a packetdrill test [3] establishing a second subflow.
This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
delayed. Still, we don't want to have this behaviour as it delays the
switch to the fully established mode, and invalid MPTCP options in this
3rd ACK will not be caught any more. This modification also affects the
MPTCP + TFO feature as well, and being the reason why the selftests
started to be unstable the last few days [4].
For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
passing: if the 3rd ACK contains data, and the connection is accept()ed
before receiving them, these data would no longer be processed, and thus
not ACKed.
One last thing about MPTCP, in case of simultaneous connect(), a
fallback to TCP will be done, which seems fine:
`../common/defaults.sh`
0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
+0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
+0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
+0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
+0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
Simultaneous SYN-data crossing is also not supported by TFO, see [6].
Kuniyuki Iwashima suggested to restrict the processing to SYN+ACK only:
that's a more generic solution than the one initially proposed, and
also enough to fix the issues described above.
Later on, Eric Dumazet mentioned that an ACK should still be sent in
reaction to the second SYN+ACK that is received: not sending a DUPACK
here seems wrong and could hurt:
0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <mss 1460, sackOK, TS val 1000 ecr 0,nop,wscale 8>
+0 < S 0:0(0) win 1000 <mss 1000, sackOK, nop, nop>
+0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 3308134035 ecr 0,nop,wscale 8>
+0 < S. 0:0(0) ack 1 win 1000 <mss 1000, sackOK, nop, nop>
+0 > . 1:1(0) ack 1 <nop, nop, sack 0:1> // <== Here
So in this version, the 'goto consume' is dropped, to always send an ACK
when switching from TCP_SYN_RECV to TCP_ESTABLISHED. This ACK will be
seen as a DUPACK -- with DSACK if SACK has been negotiated -- in case of
simultaneous SYN crossing: that's what is expected here.
Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
Fixes: 23e89e8ee7 ("tcp: Don't drop SYN+ACK for simultaneous connect().")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240724-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-v3-1-d48339764ce9@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.
This patch has been generated by the following coccinelle script:
```
virtual patch
@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos);
@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }
@r3@
identifier func;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int , void *, size_t *, loff_t *);
@r4@
identifier func, ctl;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int , void *, size_t *, loff_t *);
@r5@
identifier func, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code
conventions. The xfs_stats_clear_proc_handler,
xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified.
This is called from a proc_handler itself and is calling back into
another proc_handler, making it necessary to change it as part of the
proc_handler migration.
Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
struct nexthop_grp contains two reserved fields that are not initialized by
nla_put_nh_group(), and carry garbage. This can be observed e.g. with
strace (edited for clarity):
# ip nexthop add id 1 dev lo
# ip nexthop add id 101 group 1
# strace -e recvmsg ip nexthop get id 101
...
recvmsg(... [{nla_len=12, nla_type=NHA_GROUP},
[{id=1, weight=0, resvd1=0x69, resvd2=0x67}]] ...) = 52
The fields are reserved and therefore not currently used. But as they are, they
leak kernel memory, and the fact they are not just zero complicates repurposing
of the fields for new ends. Initialize the full structure.
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Record Route IP option records the addresses of the routers that
routed the packet. In the case of forwarded packets, the kernel performs
a route lookup via fib_lookup() and fills in the preferred source
address of the matched route.
The lookup is performed with the DS field of the forwarded packet, but
using the RT_TOS() macro which only masks one of the two ECN bits. If
the packet is ECT(0) or CE, the matched route might be different than
the route via which the packet was forwarded as the input path masks
both of the ECN bits, resulting in the wrong address being filled in the
Record Route option.
Fix by masking both of the ECN bits.
Fixes: 8e36360ae8 ("ipv4: Remove route key identity dependencies in ip_rt_get_source().")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20240718123407.434778-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The TOS value that is returned to user space in the route get reply is
the one with which the lookup was performed ('fl4->flowi4_tos'). This is
fine when the matched route is configured with a TOS as it would not
match if its TOS value did not match the one with which the lookup was
performed.
However, matching on TOS is only performed when the route's TOS is not
zero. It is therefore possible to have the kernel incorrectly return a
non-zero TOS:
# ip link add name dummy1 up type dummy
# ip address add 192.0.2.1/24 dev dummy1
# ip route get fibmatch 192.0.2.2 tos 0xfc
192.0.2.0/24 tos 0x1c dev dummy1 proto kernel scope link src 192.0.2.1
Fix by instead returning the DSCP field from the FIB result structure
which was populated during the route lookup.
Output after the patch:
# ip link add name dummy1 up type dummy
# ip address add 192.0.2.1/24 dev dummy1
# ip route get fibmatch 192.0.2.2 tos 0xfc
192.0.2.0/24 dev dummy1 proto kernel scope link src 192.0.2.1
Extend the existing selftests to not only verify that the correct route
is returned, but that it is also returned with correct "tos" value (or
without it).
Fixes: b61798130f ("net: ipv4: RTM_GETROUTE: return matched fib result when requested")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The TOS value that is returned to user space in the route get reply is
the one with which the lookup was performed ('fl4->flowi4_tos'). This is
fine when the matched route is configured with a TOS as it would not
match if its TOS value did not match the one with which the lookup was
performed.
However, matching on TOS is only performed when the route's TOS is not
zero. It is therefore possible to have the kernel incorrectly return a
non-zero TOS:
# ip link add name dummy1 up type dummy
# ip address add 192.0.2.1/24 dev dummy1
# ip route get 192.0.2.2 tos 0xfc
192.0.2.2 tos 0x1c dev dummy1 src 192.0.2.1 uid 0
cache
Fix by adding a DSCP field to the FIB result structure (inside an
existing 4 bytes hole), populating it in the route lookup and using it
when filling the route get reply.
Output after the patch:
# ip link add name dummy1 up type dummy
# ip address add 192.0.2.1/24 dev dummy1
# ip route get 192.0.2.2 tos 0xfc
192.0.2.2 dev dummy1 src 192.0.2.1 uid 0
cache
Fixes: 1a00fee4ff ("ipv4: Remove rt_key_{src,dst,tos} from struct rtable.")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Replace the deprecated[1] uses of strncpy() in tcp_ca_get_name_by_key()
and tcp_get_default_congestion_control(). The callers use the results as
standard C strings (via nla_put_string() and proc handlers respectively),
so trailing padding is not needed.
Since passing the destination buffer arguments decays it to a pointer,
the size can't be trivially determined by the compiler. ca->name is
the same length in both cases, so strscpy() won't fail (when ca->name
is NUL-terminated). Include the length explicitly instead of using the
2-argument strscpy().
Link: https://github.com/KSPP/linux/issues/90 [1]
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240714041111.it.918-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge in late fixes to prepare for the 6.11 net-next PR.
Conflicts:
93c3a96c30 ("net: pse-pd: Do not return EOPNOSUPP if config is null")
4cddb0f15e ("net: ethtool: pse-pd: Fix possible null-deref")
30d7b67277 ("net: ethtool: Add new power limit get and set features")
https://lore.kernel.org/20240715123204.623520bb@canb.auug.org.au/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmaSU/QACgkQrB3Eaf9P
W7etjA/+I8bWTjMCCGFT7AXIisXWQhHbrRuaU6hpROxWUTAyjUuM4qhdXHYUyG6i
2mcg7Ppqn0etEnrvCDJqgWGPonSJuxKRMpRNiB2uRYZAKDK2X7d5gCVVK+xGyuYn
rXjAw3yQ9W6oV8lQvm7GqLYOFL5vj9UA5q8QEhyTxH11HDDRBjlHSgzgWovzGsjO
2qLHSh3wuBuuoWS6jhN5n0pA1mFiKxhzPRRvTV2Q8CEBt+JML0gGd08g0s6tSGMJ
qlEGdTHIkIGi/QsbOoRm14X5gYYrDz1EEATISZTA9/Pbb03MsQfxUp6EUZNZIM4O
/K9XO7LLXOYWXBcI3BDCHCOT1cJPw1WVvYwlwWzu4DpxelPAc+pk2/QZk9wV2cWd
MzScbhHKmZ5GnYnlfQAyOnC5tvQXUBG2OntyXMBGh9seh+H5Lcl1RJAflIwRvBx5
7cnR6HiTmLUlbBxKjSJF+xFPnTucp0J637DkY/ONtAA7qNHnOKh3LWqkIH80q/FI
7Ua0EpgTtzAzN6iR2ujMHusfAjJs4yhMGY5KFGcEHwqS2axYq+mpnaShYzNebzl6
9kOmj6UAVP0tivH2Ahmsz2HaNhZaJ3hXftZeF3zwcoN6XTc3jrQ4JuNyiDcsUdnf
ggyLMZ7VI6Jf38ep8LEnfpqQm5qFTVfto62goWWLlGgr4wsy66c=
=KyYL
-----END PGP SIGNATURE-----
Merge tag 'ipsec-next-2024-07-13' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2024-07-13
1) Support sending NAT keepalives in ESP in UDP states.
Userspace IKE daemon had to do this before, but the
kernel can better keep track of it.
From Eyal Birger.
2) Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated
ESP data paths. Currently, IPsec crypto offload is enabled for GRO
code path only. This patchset support UDP encapsulation for the non
GRO path. From Mike Yu.
* tag 'ipsec-next-2024-07-13' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: Support crypto offload for outbound IPv4 UDP-encapsulated ESP packet
xfrm: Support crypto offload for inbound IPv4 UDP-encapsulated ESP packet
xfrm: Allow UDP encapsulation in crypto offload control path
xfrm: Support crypto offload for inbound IPv6 ESP packets not in GRO path
xfrm: support sending NAT keepalives in ESP in UDP states
====================
Link: https://patch.msgid.link/20240713102416.3272997-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
By default, an address assigned to the output interface is selected when
the source address is not specified. This is problematic when a route,
configured in a vrf, uses an interface from another vrf (aka route leak).
The original vrf does not own the selected source address.
Let's add a check against the output interface and call the appropriate
function to select the source address.
CC: stable@vger.kernel.org
Fixes: 8cbb512c92 ("net: Add source address lookup op for VRF")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240710081521.3809742-2-nicolas.dichtel@6wind.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmaPqSEACgkQrB3Eaf9P
W7eOmQ//YVp6OL+oS5lRzLMvhKLXh42qGbaOPAZl/k0cOACsOnNhubTQHUToIMYt
FXLVCDrXHU3F4JVGdgzwJb+/2wqElP+3Wlw48WCnycAlB8NpFc24qKwZHWzo04Mv
uutWG5oVXXMYsnLEQhsQCMj+rCjDnSJG2bmsQCHS8GFB4PKP/SSGm/H0UFUbYjIE
leZ6rPmqmHf/FShqSmm0VTbXyeLE3bIJQ5zfDLzKW9/nO5h/VyZcZCEzEENF5i2i
bKaEGSNrK4evyj+9j/B8FDdujEfVbNyanTAkChJgx3Wug6rIy1QdsG2xDpPn3zm+
pdDvSLPAjjLHrCr7yPPnHEdtOYBvnvjW035VBG/q7pNZfHUaKcutvQJESiNVjsV0
hqmL8XhKgdT/0dPrevXVSXcLOXT25EkzLoN8W4P3qOY4OSFQPC8V+ELCOhWGlZwB
rKA8/NfEwV2yIlxhEzSYUTaGT3YZVLJsAVuEfR8Y3tq/j7X5G6h4lCKddxNKhLn+
jJroKlKQEHsC7HCMOW9kJijiXWxNjT4cAPRXMSIxf3cL29UwU9zPE1wx1oq1Pr97
FZiGg9IapcK5nKslaim+nwn6PtEJzVzCWtZ5gddtS4qOrZKuveql/B2P1I8EL9S6
LUqOE9gUeQpSdG/M5FqkLJnUE1knHYRZhQw682fA1zvZFj+G9lo=
=xFmH
-----END PGP SIGNATURE-----
Merge tag 'ipsec-2024-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2024-07-11
1) Fix esp_output_tail_tcp() on unsupported ESPINTCP.
From Hagar Hemdan.
2) Fix two bugs in the recently introduced SA direction separation.
From Antony Antony.
3) Fix unregister netdevice hang on hardware offload. We had to add another
list where skbs linked to that are unlinked from the lists (deleted)
but not yet freed.
4) Fix netdev reference count imbalance in xfrm_state_find.
From Jianbo Liu.
5) Call xfrm_dev_policy_delete when killingi them on offloaded policies.
Jianbo Liu.
* tag 'ipsec-2024-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: call xfrm_dev_policy_delete when kill policy
xfrm: fix netdev reference count imbalance
xfrm: Export symbol xfrm_dev_state_delete.
xfrm: Fix unregister netdevice hang on hardware offload.
xfrm: Log input direction mismatch error in one place
xfrm: Fix input error path memory access
net: esp: cleanup esp_output_tail_tcp() in case of unsupported ESPINTCP
====================
Link: https://patch.msgid.link/20240711100025.1949454-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RFC 9293 states that in the case of simultaneous connect(), the connection
gets established when SYN+ACK is received. [0]
TCP Peer A TCP Peer B
1. CLOSED CLOSED
2. SYN-SENT --> <SEQ=100><CTL=SYN> ...
3. SYN-RECEIVED <-- <SEQ=300><CTL=SYN> <-- SYN-SENT
4. ... <SEQ=100><CTL=SYN> --> SYN-RECEIVED
5. SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...
6. ESTABLISHED <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED
7. ... <SEQ=100><ACK=301><CTL=SYN,ACK> --> ESTABLISHED
However, since commit 0c24604b68 ("tcp: implement RFC 5961 4.2"), such a
SYN+ACK is dropped in tcp_validate_incoming() and responded with Challenge
ACK.
For example, the write() syscall in the following packetdrill script fails
with -EAGAIN, and wrong SNMP stats get incremented.
0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
+0 < S 0:0(0) win 1000 <mss 1000>
+0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
+0 < S. 0:0(0) ack 1 win 1000
+0 write(3, ..., 100) = 100
+0 > P. 1:101(100) ack 1
--
# packetdrill cross-synack.pkt
cross-synack.pkt:13: runtime error in write call: Expected result 100 but got -1 with errno 11 (Resource temporarily unavailable)
# nstat
...
TcpExtTCPChallengeACK 1 0.0
TcpExtTCPSYNChallenge 1 0.0
The problem is that bpf_skops_established() is triggered by the Challenge
ACK instead of SYN+ACK. This causes the bpf prog to miss the chance to
check if the peer supports a TCP option that is expected to be exchanged
in SYN and SYN+ACK.
Let's accept a bare SYN+ACK for active-open TCP_SYN_RECV sockets to avoid
such a situation.
Note that tcp_ack_snd_check() in tcp_rcv_state_process() is skipped not to
send an unnecessary ACK, but this could be a bit risky for net.git, so this
targets for net-next.
Link: https://www.rfc-editor.org/rfc/rfc9293.html#section-3.5-7 [0]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240710171246.87533-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
esp_xmit() is already able to handle UDP encapsulation through the call to
esp_output_head(). However, the ESP header and the outer IP header
are not correct and need to be corrected.
Test: Enabled both dir=in/out IPsec crypto offload, and verified IPv4
UDP-encapsulated ESP packets on both wifi/cellular network
Signed-off-by: Mike Yu <yumike@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Cross-merge networking fixes after downstream PR.
Conflicts:
net/sched/act_ct.c
26488172b0 ("net/sched: Fix UAF when resolving a clash")
3abbd7ed8b ("act_ct: prepare for stolen verdict coming from conntrack and nat engine")
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If a TCP socket is using TCP_USER_TIMEOUT, and the other peer
retracted its window to zero, tcp_retransmit_timer() can
retransmit a packet every two jiffies (2 ms for HZ=1000),
for about 4 minutes after TCP_USER_TIMEOUT has 'expired'.
The fix is to make sure tcp_rtx_probe0_timed_out() takes
icsk->icsk_user_timeout into account.
Before blamed commit, the socket would not timeout after
icsk->icsk_user_timeout, but would use standard exponential
backoff for the retransmits.
Also worth noting that before commit e89688e3e9 ("net: tcp:
fix unexcepted socket die when snd_wnd is 0"), the issue
would last 2 minutes instead of 4.
Fixes: b701a99e43 ("tcp: Add tcp_clamp_rto_to_user_timeout() helper to improve accuracy")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Jon Maxwell <jmaxwell37@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240710001402.2758273-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Loss recovery undo_retrans bookkeeping had a long-standing bug where a
DSACK from a spurious TLP retransmit packet could cause an erroneous
undo of a fast recovery or RTO recovery that repaired a single
really-lost packet (in a sequence range outside that of the TLP
retransmit). Basically, because the loss recovery state machine didn't
account for the fact that it sent a TLP retransmit, the DSACK for the
TLP retransmit could erroneously be implicitly be interpreted as
corresponding to the normal fast recovery or RTO recovery retransmit
that plugged a real hole, thus resulting in an improper undo.
For example, consider the following buggy scenario where there is a
real packet loss but the congestion control response is improperly
undone because of this bug:
+ send packets P1, P2, P3, P4
+ P1 is really lost
+ send TLP retransmit of P4
+ receive SACK for original P2, P3, P4
+ enter fast recovery, fast-retransmit P1, increment undo_retrans to 1
+ receive DSACK for TLP P4, decrement undo_retrans to 0, undo (bug!)
+ receive cumulative ACK for P1-P4 (fast retransmit plugged real hole)
The fix: when we initialize undo machinery in tcp_init_undo(), if
there is a TLP retransmit in flight, then increment tp->undo_retrans
so that we make sure that we receive a DSACK corresponding to the TLP
retransmit, as well as DSACKs for all later normal retransmits, before
triggering a loss recovery undo. Note that we also have to move the
line that clears tp->tlp_high_seq for RTO recovery, so that upon RTO
we remember the tp->tlp_high_seq value until tcp_init_undo() and clear
it only afterward.
Also note that the bug dates back to the original 2013 TLP
implementation, commit 6ba8a3b19e ("tcp: Tail loss probe (TLP)").
However, this patch will only compile and work correctly with kernels
that have tp->tlp_retrans, which was added only in v5.8 in 2020 in
commit 76be93fc07 ("tcp: allow at most one TLP probe per flight").
So we associate this fix with that later commit.
Fixes: 76be93fc07 ("tcp: allow at most one TLP probe per flight")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Kevin Yang <yyd@google.com>
Link: https://patch.msgid.link/20240703171246.1739561-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
KMSAN reported uninit-value access in raw_lookup() [1]. Diag for raw
sockets uses the pad field in struct inet_diag_req_v2 for the
underlying protocol. This field corresponds to the sdiag_raw_protocol
field in struct inet_diag_req_raw.
inet_diag_get_exact_compat() converts inet_diag_req to
inet_diag_req_v2, but leaves the pad field uninitialized. So the issue
occurs when raw_lookup() accesses the sdiag_raw_protocol field.
Fix this by initializing the pad field in
inet_diag_get_exact_compat(). Also, do the same fix in
inet_diag_dump_compat() to avoid the similar issue in the future.
[1]
BUG: KMSAN: uninit-value in raw_lookup net/ipv4/raw_diag.c:49 [inline]
BUG: KMSAN: uninit-value in raw_sock_get+0x657/0x800 net/ipv4/raw_diag.c:71
raw_lookup net/ipv4/raw_diag.c:49 [inline]
raw_sock_get+0x657/0x800 net/ipv4/raw_diag.c:71
raw_diag_dump_one+0xa1/0x660 net/ipv4/raw_diag.c:99
inet_diag_cmd_exact+0x7d9/0x980
inet_diag_get_exact_compat net/ipv4/inet_diag.c:1404 [inline]
inet_diag_rcv_msg_compat+0x469/0x530 net/ipv4/inet_diag.c:1426
sock_diag_rcv_msg+0x23d/0x740 net/core/sock_diag.c:282
netlink_rcv_skb+0x537/0x670 net/netlink/af_netlink.c:2564
sock_diag_rcv+0x35/0x40 net/core/sock_diag.c:297
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0xe74/0x1240 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x10c6/0x1260 net/netlink/af_netlink.c:1905
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x332/0x3d0 net/socket.c:745
____sys_sendmsg+0x7f0/0xb70 net/socket.c:2585
___sys_sendmsg+0x271/0x3b0 net/socket.c:2639
__sys_sendmsg net/socket.c:2668 [inline]
__do_sys_sendmsg net/socket.c:2677 [inline]
__se_sys_sendmsg net/socket.c:2675 [inline]
__x64_sys_sendmsg+0x27e/0x4a0 net/socket.c:2675
x64_sys_call+0x135e/0x3ce0 arch/x86/include/generated/asm/syscalls_64.h:47
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xd9/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Uninit was stored to memory at:
raw_sock_get+0x650/0x800 net/ipv4/raw_diag.c:71
raw_diag_dump_one+0xa1/0x660 net/ipv4/raw_diag.c:99
inet_diag_cmd_exact+0x7d9/0x980
inet_diag_get_exact_compat net/ipv4/inet_diag.c:1404 [inline]
inet_diag_rcv_msg_compat+0x469/0x530 net/ipv4/inet_diag.c:1426
sock_diag_rcv_msg+0x23d/0x740 net/core/sock_diag.c:282
netlink_rcv_skb+0x537/0x670 net/netlink/af_netlink.c:2564
sock_diag_rcv+0x35/0x40 net/core/sock_diag.c:297
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0xe74/0x1240 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x10c6/0x1260 net/netlink/af_netlink.c:1905
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x332/0x3d0 net/socket.c:745
____sys_sendmsg+0x7f0/0xb70 net/socket.c:2585
___sys_sendmsg+0x271/0x3b0 net/socket.c:2639
__sys_sendmsg net/socket.c:2668 [inline]
__do_sys_sendmsg net/socket.c:2677 [inline]
__se_sys_sendmsg net/socket.c:2675 [inline]
__x64_sys_sendmsg+0x27e/0x4a0 net/socket.c:2675
x64_sys_call+0x135e/0x3ce0 arch/x86/include/generated/asm/syscalls_64.h:47
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xd9/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Local variable req.i created at:
inet_diag_get_exact_compat net/ipv4/inet_diag.c:1396 [inline]
inet_diag_rcv_msg_compat+0x2a6/0x530 net/ipv4/inet_diag.c:1426
sock_diag_rcv_msg+0x23d/0x740 net/core/sock_diag.c:282
CPU: 1 PID: 8888 Comm: syz-executor.6 Not tainted 6.10.0-rc4-00217-g35bb670d65fc #32
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Fixes: 432490f9d4 ("net: ip, diag -- Add diag interface for raw sockets")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240703091649.111773-1-syoshida@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When we process segments with TCP AO, we don't check it in
tcp_parse_options(). Thus, opt_rx->saw_unknown is set to 1,
which unconditionally triggers the BPF TCP option parser.
Let's avoid the unnecessary BPF invocation.
Fixes: 0a3a809089 ("net/tcp: Verify inbound TCP-AO signed segments")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://patch.msgid.link/20240703033508.6321-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
I don't see anything checking that TCP_METRICS_ATTR_SADDR_IPV4
is at least 4 bytes long, and the policy doesn't have an entry
for this attribute at all (neither does it for IPv6 but v6 is
manually validated).
Reviewed-by: Eric Dumazet <edumazet@google.com>
Fixes: 3e7013ddf5 ("tcp: metrics: Allow selective get/del of tcp-metrics based on src IP")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Today sending a UDP GSO packet from a TUN device results in an EIO error:
import fcntl, os, struct
from socket import *
TUNSETIFF = 0x400454CA
IFF_TUN = 0x0001
IFF_NO_PI = 0x1000
UDP_SEGMENT = 103
tun_fd = os.open("/dev/net/tun", os.O_RDWR)
ifr = struct.pack("16sH", b"tun0", IFF_TUN | IFF_NO_PI)
fcntl.ioctl(tun_fd, TUNSETIFF, ifr)
os.system("ip addr add 192.0.2.1/24 dev tun0")
os.system("ip link set dev tun0 up")
s = socket(AF_INET, SOCK_DGRAM)
s.setsockopt(SOL_UDP, UDP_SEGMENT, 1200)
s.sendto(b"x" * 3000, ("192.0.2.2", 9)) # EIO
This is due to a check in the udp stack if the egress device offers
checksum offload. While TUN/TAP devices, by default, don't advertise this
capability because it requires support from the TUN/TAP reader.
However, the GSO stack has a software fallback for checksum calculation,
which we can use. This way we don't force UDP_SEGMENT users to handle the
EIO error and implement a segmentation fallback.
Lift the restriction so that UDP_SEGMENT can be used with any egress
device. We also need to adjust the UDP GSO code to match the GSO stack
expectation about ip_summed field, as set in commit 8d63bee643 ("net:
avoid skb_warn_bad_offload false positives on UFO"). Otherwise we will hit
the bad offload check.
Users should, however, expect a potential performance impact when
batch-sending packets with UDP_SEGMENT without checksum offload on the
egress device. In such case the packet payload is read twice: first during
the sendmsg syscall when copying data from user memory, and then in the GSO
stack for checksum computation. This double memory read can be less
efficient than a regular sendmsg where the checksum is calculated during
the initial data copy from user memory.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240626-linux-udpgso-v2-1-422dfcbd6b48@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In some production workloads we noticed that connections could
sometimes close extremely prematurely with ETIMEDOUT after
transmitting only 1 TLP and RTO retransmission (when we would normally
expect roughly tcp_retries2 = TCP_RETR2 = 15 RTOs before a connection
closes with ETIMEDOUT).
From tracing we determined that these workloads can suffer from a
scenario where in fast recovery, after some retransmits, a DSACK undo
can happen at a point where the scoreboard is totally clear (we have
retrans_out == sacked_out == lost_out == 0). In such cases, calling
tcp_try_keep_open() means that we do not execute any code path that
clears tp->retrans_stamp to 0. That means that tp->retrans_stamp can
remain erroneously set to the start time of the undone fast recovery,
even after the fast recovery is undone. If minutes or hours elapse,
and then a TLP/RTO/RTO sequence occurs, then the start_ts value in
retransmits_timed_out() (which is from tp->retrans_stamp) will be
erroneously ancient (left over from the fast recovery undone via
DSACKs). Thus this ancient tp->retrans_stamp value can cause the
connection to die very prematurely with ETIMEDOUT via
tcp_write_err().
The fix: we change DSACK undo in fast recovery (TCP_CA_Recovery) to
call tcp_try_to_open() instead of tcp_try_keep_open(). This ensures
that if no retransmits are in flight at the time of DSACK undo in fast
recovery then we properly zero retrans_stamp. Note that calling
tcp_try_to_open() is more consistent with other loss recovery
behavior, since normal fast recovery (CA_Recovery) and RTO recovery
(CA_Loss) both normally end when tp->snd_una meets or exceeds
tp->high_seq and then in tcp_fastretrans_alert() the "default" switch
case executes tcp_try_to_open(). Also note that by inspection this
change to call tcp_try_to_open() implies at least one other nice bug
fix, where now an ECE-marked DSACK that causes an undo will properly
invoke tcp_enter_cwr() rather than ignoring the ECE mark.
Fixes: c7d9d6a185 ("tcp: undo on DSACK during recovery")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
e3f02f32a0 ("ionic: fix kernel panic due to multi-buffer handling")
d9c0420999 ("ionic: Mark error paths in the data path as unlikely")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Testing determined that the recent commit 9e046bb111 ("tcp: clear
tp->retrans_stamp in tcp_rcv_fastopen_synack()") has a race, and does
not always ensure retrans_stamp is 0 after a TFO payload retransmit.
If transmit completion for the SYN+data skb happens after the client
TCP stack receives the SYNACK (which sometimes happens), then
retrans_stamp can erroneously remain non-zero for the lifetime of the
connection, causing a premature ETIMEDOUT later.
Testing and tracing showed that the buggy scenario is the following
somewhat tricky sequence:
+ Client attempts a TFO handshake. tcp_send_syn_data() sends SYN + TFO
cookie + data in a single packet in the syn_data skb. It hands the
syn_data skb to tcp_transmit_skb(), which makes a clone. Crucially,
it then reuses the same original (non-clone) syn_data skb,
transforming it by advancing the seq by one byte and removing the
FIN bit, and enques the resulting payload-only skb in the
sk->tcp_rtx_queue.
+ Client sets retrans_stamp to the start time of the three-way
handshake.
+ Cookie mismatches or server has TFO disabled, and server only ACKs
SYN.
+ tcp_ack() sees SYN is acked, tcp_clean_rtx_queue() clears
retrans_stamp.
+ Since the client SYN was acked but not the payload, the TFO failure
code path in tcp_rcv_fastopen_synack() tries to retransmit the
payload skb. However, in some cases the transmit completion for the
clone of the syn_data (which had SYN + TFO cookie + data) hasn't
happened. In those cases, skb_still_in_host_queue() returns true
for the retransmitted TFO payload, because the clone of the syn_data
skb has not had its tx completetion.
+ Because skb_still_in_host_queue() finds skb_fclone_busy() is true,
it sets the TSQ_THROTTLED bit and the retransmit does not happen in
the tcp_rcv_fastopen_synack() call chain.
+ The tcp_rcv_fastopen_synack() code next implicitly assumes the
retransmit process is finished, and sets retrans_stamp to 0 to clear
it, but this is later overwritten (see below).
+ Later, upon tx completion, tcp_tsq_write() calls
tcp_xmit_retransmit_queue(), which puts the retransmit in flight and
sets retrans_stamp to a non-zero value.
+ The client receives an ACK for the retransmitted TFO payload data.
+ Since we're in CA_Open and there are no dupacks/SACKs/DSACKs/ECN to
make tcp_ack_is_dubious() true and make us call
tcp_fastretrans_alert() and reach a code path that clears
retrans_stamp, retrans_stamp stays nonzero.
+ Later, if there is a TLP, RTO, RTO sequence, then the connection
will suffer an early ETIMEDOUT due to the erroneously ancient
retrans_stamp.
The fix: this commit refactors the code to have
tcp_rcv_fastopen_synack() retransmit by reusing the relevant parts of
tcp_simple_retransmit() that enter CA_Loss (without changing cwnd) and
call tcp_xmit_retransmit_queue(). We have tcp_simple_retransmit() and
tcp_rcv_fastopen_synack() share code in this way because in both cases
we get a packet indicating non-congestion loss (MTU reduction or TFO
failure) and thus in both cases we want to retransmit as many packets
as cwnd allows, without reducing cwnd. And given that retransmits will
set retrans_stamp to a non-zero value (and may do so in a later
calling context due to TSQ), we also want to enter CA_Loss so that we
track when all retransmitted packets are ACked and clear retrans_stamp
when that happens (to ensure later recurring RTOs are using the
correct retrans_stamp and don't declare ETIMEDOUT prematurely).
Fixes: 9e046bb111 ("tcp: clear tp->retrans_stamp in tcp_rcv_fastopen_synack()")
Fixes: a7abf3cd76 ("tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack()")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Link: https://patch.msgid.link/20240624144323.2371403-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
SYN packets are received at the same time and processed on different CPUs,
it can potentially create the same sk (sock) but two different reqsk
(request_sock) in tcp_conn_request().
These two different reqsk will respond with two SYNACK packets, and since
the generation of the seq (ISN) incorporates a timestamp, the final two
SYNACK packets will have different seq values.
The consequence is that when the Client receives and replies with an ACK
to the earlier SYNACK packet, we will reset(RST) it.
========================================================================
This behavior is consistently reproducible in my local setup,
which comprises:
| NETA1 ------ NETB1 |
PC_A --- bond --- | | --- bond --- PC_B
| NETA2 ------ NETB2 |
- PC_A is the Server and has two network cards, NETA1 and NETA2. I have
bonded these two cards using BOND_MODE_BROADCAST mode and configured
them to be handled by different CPU.
- PC_B is the Client, also equipped with two network cards, NETB1 and
NETB2, which are also bonded and configured in BOND_MODE_BROADCAST mode.
If the client attempts a TCP connection to the server, it might encounter
a failure. Capturing packets from the server side reveals:
10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855116,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855123, <==
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117, <==
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117,
Two SYNACKs with different seq numbers are sent by localhost,
resulting in an anomaly.
========================================================================
The attempted solution is as follows:
Add a return value to inet_csk_reqsk_queue_hash_add() to confirm if the
ehash insertion is successful (Up to now, the reason for unsuccessful
insertion is that a reqsk for the same connection has already been
inserted). If the insertion fails, release the reqsk.
Due to the refcnt, Kuniyuki suggests also adding a return value check
for the DCCP module; if ehash insertion fails, indicating a successful
insertion of the same connection, simply release the reqsk as well.
Simultaneously, In the reqsk_queue_hash_req(), the start of the
req->rsk_timer is adjusted to be after successful insertion.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: luoxuanqiang <luoxuanqiang@kylinos.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240621013929.1386815-1-luoxuanqiang@kylinos.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ipv4_tcp_sk is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.
Make a struct with a sock member (original ipv4_tcp_sk) and a
local_lock_t and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.
Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-7-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sigpool_scratch is a per-CPU variable and relies on disabled BH for its
locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT
this data structure requires explicit locking.
Make a struct with a pad member (original sigpool_scratch) and a
local_lock_t and use local_lock_nested_bh() for locking. This change
adds only lockdep coverage and does not alter the functional behaviour
for !PREEMPT_RT.
Cc: David Ahern <dsahern@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-6-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
1e7962114c ("bnxt_en: Restore PTP tx_avail count in case of skb_pad() error")
165f87691a ("bnxt_en: add timestamping statistics support")
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202406011751.NpVN0sSk-lkp@intel.com/
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202406011539.jhwBd7DX-lkp@intel.com/
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some applications were reporting ETIMEDOUT errors on apparently
good looking flows, according to packet dumps.
We were able to root cause the issue to an accidental setting
of tp->retrans_stamp in the following scenario:
- client sends TFO SYN with data.
- server has TFO disabled, ACKs only SYN but not payload.
- client receives SYNACK covering only SYN.
- tcp_ack() eats SYN and sets tp->retrans_stamp to 0.
- tcp_rcv_fastopen_synack() calls tcp_xmit_retransmit_queue()
to retransmit TFO payload w/o SYN, sets tp->retrans_stamp to "now",
but we are not in any loss recovery state.
- TFO payload is ACKed.
- we are not in any loss recovery state, and don't see any dupacks,
so we don't get to any code path that clears tp->retrans_stamp.
- tp->retrans_stamp stays non-zero for the lifetime of the connection.
- after first RTO, tcp_clamp_rto_to_user_timeout() clamps second RTO
to 1 jiffy due to bogus tp->retrans_stamp.
- on clamped RTO with non-zero icsk_retransmits, retransmits_timed_out()
sets start_ts from tp->retrans_stamp from TFO payload retransmit
hours/days ago, and computes bogus long elapsed time for loss recovery,
and suffers ETIMEDOUT early.
Fixes: a7abf3cd76 ("tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack()")
CC: stable@vger.kernel.org
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240614130615.396837-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Drop the WARN_ON_ONCE inn gue_gro_receive if the encapsulated type is
not known or does not have a GRO handler.
Such a packet is easily constructed. Syzbot generates them and sets
off this warning.
Remove the warning as it is expected and not actionable.
The warning was previously reduced from WARN_ON to WARN_ON_ONCE in
commit 270136613b ("fou: Do WARN_ON_ONCE in gue_gro_receive for bad
proto callbacks").
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240614122552.1649044-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Previously, the offload data path decrypted the packet before checking
the direction, leading to error logging and packet dropping. However,
dropped packets wouldn't be visible in tcpdump or audit log.
With this fix, the offload path, upon noticing SA direction mismatch,
will pass the packet to the stack without decrypting it. The L3 layer
will then log the error, audit, and drop ESP without decrypting or
decapsulating it.
This also ensures that the slow path records the error and audit log,
making dropped packets visible in tcpdump.
Fixes: 304b44f0d5 ("xfrm: Add dir validation to "in" data path lookup")
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
As the comment in this function says, the code currently just clears the
CIPSO part with IPOPT_NOP, rather than removing it completely and
trimming the packet. The other cipso_v4_*_delattr() functions, however,
do the proper removal and also calipso_skbuff_delattr() makes an effort
to remove the CALIPSO options instead of replacing them with padding.
Some routers treat IPv4 packets with anything (even NOPs) in the option
header as a special case and take them through a slower processing path.
Consequently, hardening guides such as STIG recommend to configure such
routers to drop packets with non-empty IP option headers [1][2]. Thus,
users might expect NetLabel to produce packets with minimal padding (or
at least with no padding when no actual options are present).
Implement the proper option removal to address this and to be closer to
what the peer functions do.
[1] https://www.stigviewer.com/stig/juniper_router_rtr/2019-09-27/finding/V-90937
[2] https://www.stigviewer.com/stig/cisco_ios_xe_router_rtr/2021-03-26/finding/V-217001
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As evident from the definition of ip_options_get(), the IP option
IPOPT_END is used to pad the IP option data array, not IPOPT_NOP. Yet
the loop that walks the IP options to determine the total IP options
length in cipso_v4_delopt() doesn't take IPOPT_END into account.
Fix it by recognizing the IPOPT_END value as the end of actual options.
Fixes: 014ab19a69 ("selinux: Set socket NetLabel based on connection endpoint")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When calculating hashes for the purpose of multipath forwarding, both IPv4
and IPv6 code currently fall back on flow_hash_from_keys(). That uses a
randomly-generated seed. That's a fine choice by default, but unfortunately
some deployments may need a tighter control over the seed used.
In this patch, make the seed configurable by adding a new sysctl key,
net.ipv4.fib_multipath_hash_seed to control the seed. This seed is used
specifically for multipath forwarding and not for the other concerns that
flow_hash_from_keys() is used for, such as queue selection. Expose the knob
as sysctl because other such settings, such as headers to hash, are also
handled that way. Like those, the multipath hash seed is a per-netns
variable.
Despite being placed in the net.ipv4 namespace, the multipath seed sysctl
is used for both IPv4 and IPv6, similarly to e.g. a number of TCP
variables.
The seed used by flow_hash_from_keys() is a 128-bit quantity. However it
seems that usually the seed is a much more modest value. 32 bits seem
typical (Cisco, Cumulus), some systems go even lower. For that reason, and
to decouple the user interface from implementation details, go with a
32-bit quantity, which is then quadruplicated to form the siphash key.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240607151357.421181-3-petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The following patches will add a sysctl to control multipath hash
seed. In order to centralize the hash computation, add a helper,
fib_multipath_hash_from_keys(), and have all IPv4 and IPv6 route.c
invocations of flow_hash_from_keys() go through this helper instead.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240607151357.421181-2-petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now there are tracepoints, that cover all functionality of
tcp_hash_fail(), but also wire up missing places
They are also faster, can be disabled and provide filtering.
This potentially may create a regression if a userspace depends on dmesg
logs. Fingers crossed, let's see if anyone complains in reality.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of forcing userspace to parse dmesg (that's what currently is
happening, at least in codebase of my current company), provide a better
way, that can be enabled/disabled in runtime.
Currently, there are already tcp events, add hashing related ones there,
too. Rasdaemon currently exercises net_dev_xmit_timeout,
devlink_health_report, but it'll be trivial to teach it to deal with
failed hashes. Otherwise, BGP may trace/log them itself. Especially
exciting for possible investigations is key rotation (RNext_key
requests).
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two reasons:
1. It's grown up enough
2. In order to not do header spaghetti by including
<trace/events/tcp.h>, which is necessary for TCP tracepoints.
While at it, unexport and make static tcp_inbound_ao_hash().
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's going to be used more in TCP-AO tracepoints.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's possible to clean-up some ifdefs by hiding that
tcp_{md5,ao}_needed static branch is defined and compiled only
under related configs, since commit 4c8530dc7d ("net/tcp: Only produce
AO/MD5 logs if there are any keys").
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With commit 34d21de99c ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core instead
of this driver.
With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.
Move ip_tunnel driver to leverage the core allocation.
All the ip_tunnel_init() users call ip_tunnel_init() as part of their
.ndo_init callback. The .ndo_init callback is called before the stats
allocation in netdev_register(), thus, the allocation will happen before
the netdev is visible.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240607084420.3932875-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Due to timer wheel implementation, a timer will usually fire
after its schedule.
For instance, for HZ=1000, a timeout between 512ms and 4s
has a granularity of 64ms.
For this range of values, the extra delay could be up to 63ms.
For TCP, this means that tp->rcv_tstamp may be after
inet_csk(sk)->icsk_timeout whenever the timer interrupt
finally triggers, if one packet came during the extra delay.
We need to make sure tcp_rtx_probe0_timed_out() handles this case.
Fixes: e89688e3e9 ("net: tcp: fix unexcepted socket die when snd_wnd is 0")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Menglong Dong <imagedong@tencent.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20240607125652.1472540-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZmIsRAAKCRDbK58LschI
g4SSAP0bkl6rPMn7zp1h+/l7hlvpp2aVOmasBTe8hIhAGUbluwD/TGq4sNsGgXFI
i4tUtFRhw8pOjy2guy6526qyJvBs8wY=
=WMhY
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-06-06
We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 50 files changed, 1887 insertions(+), 527 deletions(-).
The main changes are:
1) Add a user space notification mechanism via epoll when a struct_ops
object is getting detached/unregistered, from Kui-Feng Lee.
2) Big batch of BPF selftest refactoring for sockmap and BPF congctl
tests, from Geliang Tang.
3) Add BTF field (type and string fields, right now) iterator support
to libbpf instead of using existing callback-based approaches,
from Andrii Nakryiko.
4) Extend BPF selftests for the latter with a new btf_field_iter
selftest, from Alan Maguire.
5) Add new kfuncs for a generic, open-coded bits iterator,
from Yafang Shao.
6) Fix BPF selftests' kallsyms_find() helper under kernels configured
with CONFIG_LTO_CLANG_THIN, from Yonghong Song.
7) Remove a bunch of unused structs in BPF selftests,
from David Alan Gilbert.
8) Convert test_sockmap section names into names understood by libbpf
so it can deduce program type and attach type, from Jakub Sitnicki.
9) Extend libbpf with the ability to configure log verbosity
via LIBBPF_LOG_LEVEL environment variable, from Mykyta Yatsenko.
10) Fix BPF selftests with regards to bpf_cookie and find_vma flakiness
in nested VMs, from Song Liu.
11) Extend riscv32/64 JITs to introduce shift/add helpers to generate Zba
optimization, from Xiao Wang.
12) Enable BPF programs to declare arrays and struct fields with kptr,
bpf_rb_root, and bpf_list_head, from Kui-Feng Lee.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
selftests/bpf: Drop useless arguments of do_test in bpf_tcp_ca
selftests/bpf: Use start_test in test_dctcp in bpf_tcp_ca
selftests/bpf: Use start_test in test_dctcp_fallback in bpf_tcp_ca
selftests/bpf: Add start_test helper in bpf_tcp_ca
selftests/bpf: Use connect_to_fd_opts in do_test in bpf_tcp_ca
libbpf: Auto-attach struct_ops BPF maps in BPF skeleton
selftests/bpf: Add btf_field_iter selftests
selftests/bpf: Fix send_signal test with nested CONFIG_PARAVIRT
libbpf: Remove callback-based type/string BTF field visitor helpers
bpftool: Use BTF field iterator in btfgen
libbpf: Make use of BTF field iterator in BTF handling code
libbpf: Make use of BTF field iterator in BPF linker code
libbpf: Add BTF field iterator
selftests/bpf: Ignore .llvm.<hash> suffix in kallsyms_find()
selftests/bpf: Fix bpf_cookie and find_vma in nested VM
selftests/bpf: Test global bpf_list_head arrays.
selftests/bpf: Test global bpf_rb_root arrays and fields in nested struct types.
selftests/bpf: Test kptr arrays and kptrs in nested struct fields.
bpf: limit the number of levels of a nested struct type.
bpf: look into the types of the fields of a struct type recursively.
...
====================
Link: https://lore.kernel.org/r/20240606223146.23020-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Its no longer used outside inet_timewait_sock.c, so move it there.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
After previous patch, even if timer fires immediately on another CPU,
context that schedules the timer now holds the ehash spinlock, so timer
cannot reap tw socket until ehash lock is released.
BH disable is moved into hashdance_schedule.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TCP timewait timer is proving to be problematic for setups where
scheduler CPU isolation is achieved at runtime via cpusets (as opposed to
statically via isolcpus=domains).
What happens there is a CPU goes through tcp_time_wait(), arming the
time_wait timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer
fires, causing interference for the now-isolated CPU. This is conceptually
similar to the issue described in commit e02b931248 ("workqueue: Unbind
kworkers before sending them to exit()")
Move inet_twsk_schedule() to within inet_twsk_hashdance(), with the ehash
lock held. Expand the lock's critical section from inet_twsk_kill() to
inet_twsk_deschedule_put(), serializing the scheduling vs descheduling of
the timer. IOW, this prevents the following race:
tcp_time_wait()
inet_twsk_hashdance()
inet_twsk_deschedule_put()
del_timer_sync()
inet_twsk_schedule()
Thanks to Paolo Abeni for suggesting to leverage the ehash lock.
This also restores a comment from commit ec94c2696f ("tcp/dccp: avoid
one atomic operation for timewait hashdance") as inet_twsk_hashdance() had
a "Step 1" and "Step 3" comment, but the "Step 2" had gone missing.
inet_twsk_deschedule_put() now acquires the ehash spinlock to synchronize
with inet_twsk_hashdance_schedule().
To ease possible regression search, actual un-pin is done in next patch.
Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/pensando/ionic/ionic_txrx.c
d9c0420999 ("ionic: Mark error paths in the data path as unlikely")
491aee894a ("ionic: fix kernel panic in XDP_TX action")
net/ipv6/ip6_fib.c
b4cb4a1391 ("net: use unrcu_pointer() helper")
b01e1c0307 ("ipv6: fix possible race in __fib6_drop_pcpu_from()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
reqsk_alloc() has a single caller, no need to expose it
in include/net/request_sock.h.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
inet_reqsk_alloc() does not belong to tcp_input.c,
move it to inet_connection_sock.c instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This list is used to tranfert dst that are handled by
rt_flush_dev() and rt6_uncached_list_flush_dev() out
of the per-cpu lists.
But quarantine list is not used later.
If we simply use list_del_init(&rt->dst.rt_uncached),
this also removes the dst from per-cpu list.
This patch also makes the future calls to rt_del_uncached_list()
and rt6_uncached_list_del() faster, because no spinlock
acquisition is needed anymore.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240604165150.726382-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Toke mentioned unrcu_pointer() existence, allowing
to remove some of the ugly casts we have when using
xchg() for rcu protected pointers.
Also make inet_rcv_compat const.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20240604111603.45871-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Adding a sysctl knob to allow user to specify a default
rto_min at socket init time, other than using the hard
coded 200ms default rto_min.
Note that the rto_min route option has the highest precedence
for configuring this setting, followed by the TCP_BPF_RTO_MIN
socket option, followed by the tcp_rto_min_us sysctl.
Signed-off-by: Kevin Yang <yyd@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rto_min now has multiple sources, ordered by preprecedence high to
low: ip route option rto_min, icsk->icsk_rto_min.
When derive delack_max from rto_min, we should not only use ip
route option, but should use tcp_rto_min helper to get the correct
rto_min.
Signed-off-by: Kevin Yang <yyd@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jaroslav reports Dell's OMSA Systems Management Data Engine
expects NLM_DONE in a separate recvmsg(), both for rtnl_dump_ifinfo()
and inet_dump_ifaddr(). We already added a similar fix previously in
commit 460b0d33cf ("inet: bring NLM_DONE out to a separate recv() again")
Instead of modifying all the dump handlers, and making them look
different than modern for_each_netdev_dump()-based dump handlers -
put the workaround in rtnetlink code. This will also help us move
the custom rtnl-locking from af_netlink in the future (in net-next).
Note that this change is not touching rtnl_dump_all(). rtnl_dump_all()
is different kettle of fish and a potential problem. We now mix families
in a single recvmsg(), but NLM_DONE is not coalesced.
Tested:
./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_addr.yaml \
--dump getaddr --json '{"ifa-family": 2}'
./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \
--dump getroute --json '{"rtm-family": 2}'
./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_link.yaml \
--dump getlink
Fixes: 3e41af9076 ("rtnetlink: use xarray iterator to implement rtnl_dump_ifinfo()")
Fixes: cdb2f80f1c ("inet: use xa_array iterator to implement inet_dump_ifaddr()")
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Link: https://lore.kernel.org/all/CAK8fFZ7MKoFSEzMBDAOjoUt+vTZRRQgLDNXEOfdCCXSoXXKE0g@mail.gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to RFC 1213, we should also take CLOSE-WAIT sockets into
consideration:
"tcpCurrEstab OBJECT-TYPE
...
The number of TCP connections for which the current state
is either ESTABLISHED or CLOSE- WAIT."
After this, CurrEstab counter will display the total number of
ESTABLISHED and CLOSE-WAIT sockets.
The logic of counting
When we increment the counter?
a) if we change the state to ESTABLISHED.
b) if we change the state from SYN-RECEIVED to CLOSE-WAIT.
When we decrement the counter?
a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT,
say, on the client side, changing from ESTABLISHED to FIN-WAIT-1.
b) if the socket leaves CLOSE-WAIT, say, on the server side, changing
from CLOSE-WAIT to LAST-ACK.
Please note: there are two chances that old state of socket can be changed
to CLOSE-WAIT in tcp_fin(). One is SYN-RECV, the other is ESTABLISHED.
So we have to take care of the former case.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These fields can be read and written locklessly, add annotations
around these minor races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I was doing some experiments, I found that when using the first
parameter, namely, struct net, in ip_metrics_convert() always triggers NULL
pointer crash. Then I digged into this part, realizing that we can remove
this one due to its uselessness.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_skb_can_collapse() checks for conditions which don't make
sense on input. Because of this we ended up sprinkling a few
pairs of mptcp_skb_can_collapse() and skb_cmp_decrypted() calls
on the input path. Group them in a new helper. This should make
it less likely that someone will check mptcp and not decrypted
or vice versa when adding new code.
This implicitly adds a decrypted check early in tcp_collapse().
AFAIU this will very slightly increase our ability to collapse
packets under memory pressure, not a real bug.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
For TLS offload we mark packets with skb->decrypted to make sure
they don't escape the host without getting encrypted first.
The crypto state lives in the socket, so it may get detached
by a call to skb_orphan(). As a safety check - the egress path
drops all packets with skb->decrypted and no "crypto-safe" socket.
The skb marking was added to sendpage only (and not sendmsg),
because tls_device injected data into the TCP stack using sendpage.
This special case was missed when sendpage got folded into sendmsg.
Fixes: c5c37af6ec ("tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240530232607.82686-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
TCP_CLOSE may or may not have current/rnext keys and should not be
considered "established". The fast-path for TCP_CLOSE is
SKB_DROP_REASON_TCP_CLOSE. This is what tcp_rcv_state_process() does
anyways. Add an early drop path to not spend any time verifying
segment signatures for sockets in TCP_CLOSE state.
Cc: stable@vger.kernel.org # v6.7
Fixes: 0a3a809089 ("net/tcp: Verify inbound TCP-AO signed segments")
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://lore.kernel.org/r/20240529-tcp_ao-sk_state-v1-1-d69b5d323c52@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/ti/icssg/icssg_classifier.c
abd5576b9c ("net: ti: icssg-prueth: Add support for ICSSG switch firmware")
56a5cf538c ("net: ti: icssg-prueth: Fix start counter for ft1 filter")
https://lore.kernel.org/all/20240531123822.3bb7eadf@canb.auug.org.au/
No other adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pass an additional pointer of bpf_struct_ops_link to callback function reg,
unreg, and update provided by subsystems defined in bpf_struct_ops. A
bpf_struct_ops_map can be registered for multiple links. Passing a pointer
of bpf_struct_ops_link helps subsystems to distinguish them.
This pointer will be used in the later patches to let the subsystem
initiate a detachment on a link that was registered to it previously.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240530065946.979330-2-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmZWXboACgkQ1V2XiooU
IOQW/BAAld3NcyYFzwhUmiAcig7jvYkIZwGE7wBVPz82G+gC4n8k+cIt50XlhTQt
lctjKHI7y96hWqlAPbu96lRn/KFgFCCRF3sACpJ/rR5rLn+hLBgIiXKZohLoAkkA
zoSfpHXQo8u8tGmdNJiweKNZbJWXj4Wmaj3SLqbf/0mMyIQUigZ+4EHDHWGP8JvQ
6ilcEndEy9IBD3M5jIE3ubPwbrXwBQjaGRgT+e7yLgNDP8FEg3IeWcPW5hMtPEW3
mizkxamzdNzhaBH2U33hKvoQv0q9QmtcBa8angvx9OCEDRa9c/RpfIYwj8SsdQW1
kdzSAMNpYho+lRmq708G02vYQya5+wfof9NlJHrRqPj4BwFpj53sGm11ksfrYkId
WSJj4WAAcv4zNavS3jQD8SeGrA7bxyvyfO6JUSTEThyMYWvN0s8erGNZYa3Mk1AD
PH3WqMp8WV6EtG4+jQymvLcJ6HrzncdNeBdmbvVhGa9xPGaDIQFs3c4WRRjRu9jd
0cPPITOtb0cOGbt/w9jK1ot7wcm00uwQ1xcFlJO88I7dkgqHxYEsQ8gUC+cJoqL6
s2oPVKMvYzJnyhkXumP1/w3hqkk9EJq5pZZiu47UHbtfT3HhQRBPNydwTvB3Mj30
X8rczN2sFFcDp1KkMQF6C8tQZSXIEEAf5XTUm+q69MmuiT0lrn8=
=cexP
-----END PGP SIGNATURE-----
Merge tag 'nf-24-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
Patch #1 syzbot reports that nf_reinject() could be called without
rcu_read_lock() when flushing pending packets at nfnetlink
queue removal, from Eric Dumazet.
Patch #2 flushes ipset list:set when canceling garbage collection to
reference to other lists to fix a race, from Jozsef Kadlecsik.
Patch #3 restores q-in-q matching with nft_payload by reverting
f6ae9f120d ("netfilter: nft_payload: add C-VLAN support").
Patch #4 fixes vlan mangling in skbuff when vlan offload is present
in skbuff, without this patch nft_payload corrupts packets
in this case.
Patch #5 fixes possible nul-deref in tproxy no IP address is found in
netdevice, reported by syzbot and patch from Florian Westphal.
Patch #6 removes a superfluous restriction which prevents loose fib
lookups from input and forward hooks, from Eric Garver.
My assessment is that patches #1, #2 and #5 address possible kernel
crash, anything else in this batch fixes broken features.
netfilter pull request 24-05-29
* tag 'nf-24-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_fib: allow from forward/input without iif selector
netfilter: tproxy: bail out if IP has been disabled on the device
netfilter: nft_payload: skbuff vlan metadata mangle support
netfilter: nft_payload: restore vlan q-in-q match support
netfilter: ipset: Add list flush to cancel_gc
netfilter: nfnetlink_queue: acquire rcu_read_lock() in instance_destroy_rcu()
====================
Link: https://lore.kernel.org/r/20240528225519.1155786-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
A recent change to inet_dump_ifaddr had the function incorrectly iterate
over net rather than tgt_net, resulting in the data coming for the
incorrect network namespace.
Fixes: cdb2f80f1c ("inet: use xa_array iterator to implement inet_dump_ifaddr()")
Reported-by: Stéphane Graber <stgraber@stgraber.org>
Closes: https://github.com/lxc/incus/issues/892
Bisected-by: Stéphane Graber <stgraber@stgraber.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Tested-by: Stéphane Graber <stgraber@stgraber.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240528203030.10839-1-aleksandr.mikhalitsyn@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__dst_negative_advice() does not enforce proper RCU rules when
sk->dst_cache must be cleared, leading to possible UAF.
RCU rules are that we must first clear sk->sk_dst_cache,
then call dst_release(old_dst).
Note that sk_dst_reset(sk) is implementing this protocol correctly,
while __dst_negative_advice() uses the wrong order.
Given that ip6_negative_advice() has special logic
against RTF_CACHE, this means each of the three ->negative_advice()
existing methods must perform the sk_dst_reset() themselves.
Note the check against NULL dst is centralized in
__dst_negative_advice(), there is no need to duplicate
it in various callbacks.
Many thanks to Clement Lecigne for tracking this issue.
This old bug became visible after the blamed commit, using UDP sockets.
Fixes: a87cb3e48e ("net: Facility to report route quality of connected sockets")
Reported-by: Clement Lecigne <clecigne@google.com>
Diagnosed-by: Clement Lecigne <clecigne@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240528114353.1794151-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These functions have races when they:
1) Write sk->sk_err
2) call sk_error_report(sk)
3) call tcp_done(sk)
As described in prior patches in this series:
An smp_wmb() is missing.
We should call tcp_done() before sk_error_report(sk)
to have consistent tcp_poll() results on SMP hosts.
Use tcp_done_with_error() where we centralized the
correct sequence.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240528125253.1966136-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_abort() has the same issue than the one fixed in the prior patch
in tcp_write_err().
In order to get consistent results from tcp_poll(), we must call
sk_error_report() after tcp_done().
We can use tcp_done_with_error() to centralize this logic.
Fixes: c1e64e298b ("net: diag: Support destroying TCP sockets.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240528125253.1966136-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I noticed flakes in a packetdrill test, expecting an epoll_wait()
to return EPOLLERR | EPOLLHUP on a failed connect() attempt,
after multiple SYN retransmits. It sometimes return EPOLLERR only.
The issue is that tcp_write_err():
1) writes an error in sk->sk_err,
2) calls sk_error_report(),
3) then calls tcp_done().
tcp_done() is writing SHUTDOWN_MASK into sk->sk_shutdown,
among other things.
Problem is that the awaken user thread (from 2) sk_error_report())
might call tcp_poll() before tcp_done() has written sk->sk_shutdown.
tcp_poll() only sees a non zero sk->sk_err and returns EPOLLERR.
This patch fixes the issue by making sure to call sk_error_report()
after tcp_done().
tcp_write_err() also lacks an smp_wmb().
We can reuse tcp_done_with_error() to factor out the details,
as Neal suggested.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240528125253.1966136-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_reset() ends with a sequence that is carefuly ordered.
We need to fix [e]poll bugs in the following patches,
it makes sense to use a common helper.
Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240528125253.1966136-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The sysctl core is preparing to only expose instances of
struct ctl_table as "const".
This will also affect the ctl_table argument of sysctl handlers.
As the function prototype of all sysctl handlers throughout the tree
needs to stay consistent that change will be done in one commit.
To reduce the size of that final commit, switch utility functions which
are not bound by "typedef proc_handler" to "const struct ctl_table".
No functional change.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20240527-sysctl-const-handler-net-v1-2-16523767d0b2@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot reports:
general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
[..]
RIP: 0010:nf_tproxy_laddr4+0xb7/0x340 net/ipv4/netfilter/nf_tproxy_ipv4.c:62
Call Trace:
nft_tproxy_eval_v4 net/netfilter/nft_tproxy.c:56 [inline]
nft_tproxy_eval+0xa9a/0x1a00 net/netfilter/nft_tproxy.c:168
__in_dev_get_rcu() can return NULL, so check for this.
Reported-and-tested-by: syzbot+b94a6818504ea90d7661@syzkaller.appspotmail.com
Fixes: cc6eb43385 ("tproxy: use the interface primary IP address as a default value for --on-ip")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZlWtmQAKCRDbK58LschI
g0TUAQDT76jx7Rq1DShCtZ3eqiBMNkYczK8b+GqNsSG8YGduaAEA1jn/GN+H65Rh
atQZ/pYAfLZflMV04+XE0GyBr5q1uQg=
=NczG
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-05-28
We've added 23 non-merge commits during the last 11 day(s) which contain
a total of 45 files changed, 696 insertions(+), 277 deletions(-).
The main changes are:
1) Rename skb's mono_delivery_time to tstamp_type for extensibility
and add SKB_CLOCK_TAI type support to bpf_skb_set_tstamp(),
from Abhishek Chauhan.
2) Add netfilter CT zone ID and direction to bpf_ct_opts so that arbitrary
CT zones can be used from XDP/tc BPF netfilter CT helper functions,
from Brad Cowie.
3) Several tweaks to the instruction-set.rst IETF doc to address
the Last Call review comments, from Dave Thaler.
4) Small batch of riscv64 BPF JIT optimizations in order to emit more
compressed instructions to the JITed image for better icache efficiency,
from Xiao Wang.
5) Sort bpftool C dump output from BTF, aiming to simplify vmlinux.h
diffing and forcing more natural type definitions ordering,
from Mykyta Yatsenko.
6) Use DEV_STATS_INC() macro in BPF redirect helpers to silence
a syzbot/KCSAN race report for the tx_errors counter,
from Jiang Yunshui.
7) Un-constify bpf_func_info in bpftool to fix compilation with LLVM 17+
which started treating const structs as constants and thus breaking
full BTF program name resolution, from Ivan Babrou.
8) Fix up BPF program numbers in test_sockmap selftest in order to reduce
some of the test-internal array sizes, from Geliang Tang.
9) Small cleanup in Makefile.btf script to use test-ge check for v1.25-only
pahole, from Alan Maguire.
10) Fix bpftool's make dependencies for vmlinux.h in order to avoid needless
rebuilds in some corner cases, from Artem Savkov.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits)
bpf, net: Use DEV_STAT_INC()
bpf, docs: Fix instruction.rst indentation
bpf, docs: Clarify call local offset
bpf, docs: Add table captions
bpf, docs: clarify sign extension of 64-bit use of 32-bit imm
bpf, docs: Use RFC 2119 language for ISA requirements
bpf, docs: Move sentence about returning R0 to abi.rst
bpf: constify member bpf_sysctl_kern:: Table
riscv, bpf: Try RVC for reg move within BPF_CMPXCHG JIT
riscv, bpf: Use STACK_ALIGN macro for size rounding up
riscv, bpf: Optimize zextw insn with Zba extension
selftests/bpf: Handle forwarding of UDP CLOCK_TAI packets
net: Add additional bit to support clockid_t timestamp type
net: Rename mono_delivery_time to tstamp_type for scalabilty
selftests/bpf: Update tests for new ct zone opts for nf_conntrack kfuncs
net: netfilter: Make ct zone opts configurable for bpf ct helpers
selftests/bpf: Fix prog numbers in test_sockmap
bpf: Remove unused variable "prev_state"
bpftool: Un-const bpf_func_info to fix it for llvm 17 and newer
bpf: Fix order of args in call to bpf_map_kvcalloc
...
====================
Link: https://lore.kernel.org/r/20240528105924.30905-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason commit made checks against ACK sequence less strict
and can be exploited by attackers to establish spoofed flows
with less probes.
Innocent users might use tcp_rmem[1] == 1,000,000,000,
or something more reasonable.
An attacker can use a regular TCP connection to learn the server
initial tp->rcv_wnd, and use it to optimize the attack.
If we make sure that only the announced window (smaller than 65535)
is used for ACK validation, we force an attacker to use
65537 packets to complete the 3WHS (assuming server ISN is unknown)
Fixes: 378979e94e ("tcp: remove 64 KByte limit for initial tp->rcv_wnd value")
Link: https://datatracker.ietf.org/meeting/119/materials/slides-119-tcpm-ghost-acks-00
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20240523130528.60376-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller was able to trigger
kernel BUG at net/core/gro.c:424 !
RIP: 0010:gro_pull_from_frag0 net/core/gro.c:424 [inline]
RIP: 0010:gro_try_pull_from_frag0 net/core/gro.c:446 [inline]
RIP: 0010:dev_gro_receive+0x242f/0x24b0 net/core/gro.c:571
Due to using an incorrect NAPI_GRO_CB(skb)->network_offset.
The referenced commit sets this offset to 0 in skb_gro_reset_offset.
That matches the expected case in dev_gro_receive:
pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
ipv6_gro_receive, inet_gro_receive,
&gro_list->list, skb);
But syzkaller injected an skb with protocol ETH_P_TEB into an ip6gre
device (by writing the IP6GRE encapsulated version to a TAP device).
The result was a first call to eth_gro_receive, and thus an extra
ETH_HLEN in network_offset that should not be there. First issue hit
is when computing offset from network header in ipv6_gro_pull_exthdrs.
Initialize both offsets in the network layer gro_receive.
This pairs with all reads in gro_receive, which use
skb_gro_receive_network_offset().
Fixes: 186b1ea73a ("net: gro: use cb instead of skb->network_header")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
CC: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240523141434.1752483-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cited commit started returning an error when user space requests to dump
the interface's IPv4 addresses and IPv4 is disabled on the interface.
Restore the previous behavior and do not return an error.
Before cited commit:
# ip address show dev dummy1
10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether e2:40:68:98:d0:18 brd ff:ff:ff:ff:ff:ff
inet6 fe80::e040:68ff:fe98:d018/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
# ip link set dev dummy1 mtu 67
# ip address show dev dummy1
10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 67 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether e2:40:68:98:d0:18 brd ff:ff:ff:ff:ff:ff
After cited commit:
# ip address show dev dummy1
10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 32:2d:69:f2:9c:99 brd ff:ff:ff:ff:ff:ff
inet6 fe80::302d:69ff:fef2:9c99/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
# ip link set dev dummy1 mtu 67
# ip address show dev dummy1
RTNETLINK answers: No such device
Dump terminated
With this patch:
# ip address show dev dummy1
10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether de:17:56:bb:57:c0 brd ff:ff:ff:ff:ff:ff
inet6 fe80::dc17:56ff:febb:57c0/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
# ip link set dev dummy1 mtu 67
# ip address show dev dummy1
10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 67 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether de:17:56:bb:57:c0 brd ff:ff:ff:ff:ff:ff
I fixed the exact same issue for IPv6 in commit c04f7dfe6e ("ipv6: Fix
address dump when IPv6 is disabled on an interface"), but noted [1] that
I am not doing the change for IPv4 because I am not aware of a way to
disable IPv4 on an interface other than unregistering it. I clearly
missed the above case.
[1] https://lore.kernel.org/netdev/20240321173042.2151756-1-idosch@nvidia.com/
Fixes: cdb2f80f1c ("inet: use xa_array iterator to implement inet_dump_ifaddr()")
Reported-by: Carolina Jubran <cjubran@nvidia.com>
Reported-by: Yamen Safadi <ysafadi@nvidia.com>
Tested-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240523110257.334315-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tstamp_type is now set based on actual clockid_t compressed
into 2 bits.
To make the design scalable for future needs this commit bring in
the change to extend the tstamp_type:1 to tstamp_type:2 to support
other clockid_t timestamp.
We now support CLOCK_TAI as part of tstamp_type as part of this
commit with existing support CLOCK_MONOTONIC and CLOCK_REALTIME.
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240509211834.3235191-3-quic_abchauha@quicinc.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
mono_delivery_time was added to check if skb->tstamp has delivery
time in mono clock base (i.e. EDT) otherwise skb->tstamp has
timestamp in ingress and delivery_time at egress.
Renaming the bitfield from mono_delivery_time to tstamp_type is for
extensibilty for other timestamps such as userspace timestamp
(i.e. SO_TXTIME) set via sock opts.
As we are renaming the mono_delivery_time to tstamp_type, it makes
sense to start assigning tstamp_type based on enum defined
in this commit.
Earlier we used bool arg flag to check if the tstamp is mono in
function skb_set_delivery_time, Now the signature of the functions
accepts tstamp_type to distinguish between mono and real time.
Also skb_set_delivery_type_by_clockid is a new function which accepts
clockid to determine the tstamp_type.
In future tstamp_type:1 can be extended to support userspace timestamp
by increasing the bitfield.
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240509211834.3235191-2-quic_abchauha@quicinc.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
regression you have been notified of in the past weeks.
The TCP window fix will require some follow-up, already queued.
Current release - regressions:
- af_unix: fix garbage collection of embryos
Previous releases - regressions:
- af_unix: fix race between GC and receive path
- ipv6: sr: fix missing sk_buff release in seg6_input_core
- tcp: remove 64 KByte limit for initial tp->rcv_wnd value
- eth: r8169: fix rx hangup
- eth: lan966x: remove ptp traps in case the ptp is not enabled.
- eth: ixgbe: fix link breakage vs cisco switches.
- eth: ice: prevent ethtool from corrupting the channels.
Previous releases - always broken:
- openvswitch: set the skbuff pkt_type for proper pmtud support.
- tcp: Fix shift-out-of-bounds in dctcp_update_alpha().
Misc:
- a bunch of selftests stabilization patches.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmZPXmUSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOk/o4QAJTA/LcQmHkObgQWyJ7vSykhRFmxSsfR
Qc/DstWuNkM+xDbasdjlxaM+BPgf0RduyB/bsPOr8UvGw0S0NUwQBC9V9bgQ0p67
D9qrZH6gEDRbzG+mkbF49SXksJMSdNSygWc4YnYaCW+eufpCaZwN15q+4pAgAWfW
UmSra9wCkgl9nRc7N4+UEJbhhi0Lso/yaRlHUUUooHOP0ENDe3JSKidUyS3UuhYc
Ah75gKIMm9BygUhg/+mrsRyeb1kfXMfJ54ku/uEIimErG4rTntCJCAc+dBoRXtob
pImg4xfgr1OBL1wQKTHM+nvhE+DThLAJOSguX2RYvTvklx/l00tL1PQkA/kn6XNM
HdQGnDoN1JpUs3xw90hxWp0gzOwJ1XCjbXT/Dx2kp+ltFj0A1EZViTNNTgh6y2E0
B5oo8NFD0y02ilMdaGW/KOpceglO82p2P4DEc0kBAYvCICQ8MKMdtThuubQeB0FK
EO7Xs7lKbDXLJUDtmN4EiE1sofvLVD+1htGt5FG2jtizyQ5Ho/b2aTk2uq0kRN3F
mZgaXcNR3sOJGBdaTvzquALZ2Dt69w0D3EHGv/30tD5zwQO8j71W5OoWTnjknWUp
Nh7ytL/YlqvwJI47UuuTeDBh95jb/KpTWFv8EYsQLI0JOTfa1VXsoDxidg6rnHuX
mvLdIOtzTZqU
=zd2T
-----END PGP SIGNATURE-----
Merge tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Quite smaller than usual. Notably it includes the fix for the unix
regression from the past weeks. The TCP window fix will require some
follow-up, already queued.
Current release - regressions:
- af_unix: fix garbage collection of embryos
Previous releases - regressions:
- af_unix: fix race between GC and receive path
- ipv6: sr: fix missing sk_buff release in seg6_input_core
- tcp: remove 64 KByte limit for initial tp->rcv_wnd value
- eth: r8169: fix rx hangup
- eth: lan966x: remove ptp traps in case the ptp is not enabled
- eth: ixgbe: fix link breakage vs cisco switches
- eth: ice: prevent ethtool from corrupting the channels
Previous releases - always broken:
- openvswitch: set the skbuff pkt_type for proper pmtud support
- tcp: Fix shift-out-of-bounds in dctcp_update_alpha()
Misc:
- a bunch of selftests stabilization patches"
* tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (25 commits)
r8169: Fix possible ring buffer corruption on fragmented Tx packets.
idpf: Interpret .set_channels() input differently
ice: Interpret .set_channels() input differently
nfc: nci: Fix handling of zero-length payload packets in nci_rx_work()
net: relax socket state check at accept time.
tcp: remove 64 KByte limit for initial tp->rcv_wnd value
net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe()
tls: fix missing memory barrier in tls_init
net: fec: avoid lock evasion when reading pps_enable
Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI"
testing: net-drv: use stats64 for testing
net: mana: Fix the extra HZ in mana_hwc_send_request
net: lan966x: Remove ptp traps in case the ptp is not enabled.
openvswitch: Set the skbuff pkt_type for proper pmtud support.
selftest: af_unix: Make SCM_RIGHTS into OOB data.
af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS
tcp: Fix shift-out-of-bounds in dctcp_update_alpha().
selftests/net: use tc rule to filter the na packet
ipv6: sr: fix memleak in seg6_hmac_init_algo
af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
...
Recently, we had some servers upgraded to the latest kernel and noticed
the indicator from the user side showed worse results than before. It is
caused by the limitation of tp->rcv_wnd.
In 2018 commit a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin
to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most
CDN teams would not benefit from this change because they cannot have a
large window to receive a big packet, which will be slowed down especially
in long RTT. Small rcv_wnd means slow transfer speed, to some extent. It's
the side effect for the latency/time-sensitive users.
To avoid future confusion, current change doesn't affect the initial
receive window on the wire in a SYN or SYN+ACK packet which are set within
65535 bytes according to RFC 7323 also due to the limit in
__tcp_transmit_skb():
th->window = htons(min(tp->rcv_wnd, 65535U));
In one word, __tcp_transmit_skb() already ensures that constraint is
respected, no matter how large tp->rcv_wnd is. The change doesn't violate
RFC.
Let me provide one example if with or without the patch:
Before:
client --- SYN: rwindow=65535 ---> server
client <--- SYN+ACK: rwindow=65535 ---- server
client --- ACK: rwindow=65536 ---> server
Note: for the last ACK, the calculation is 512 << 7.
After:
client --- SYN: rwindow=65535 ---> server
client <--- SYN+ACK: rwindow=65535 ---- server
client --- ACK: rwindow=175232 ---> server
Note: I use the following command to make it work:
ip route change default via [ip] dev eth0 metric 100 initrwnd 120
For the last ACK, the calculation is 1369 << 7.
When we apply such a patch, having a large rcv_wnd if the user tweak this
knob can help transfer data more rapidly and save some rtts.
Fixes: a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240521134220.12510-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
xmit() functions should consume skb or return error codes in error
paths.
When the configuration "CONFIG_INET_ESPINTCP" is not set, the
implementation of the function "esp_output_tail_tcp" violates this rule.
The function frees the skb and returns the error code.
This change removes the kfree_skb from both functions, for both
esp4 and esp6.
WARN_ON is added because esp_output_tail_tcp() should never be called if
CONFIG_INET_ESPINTCP is not set.
This bug was discovered and resolved using Coverity Static Analysis
Security Testing (SAST) by Synopsys, Inc.
Fixes: e27cca96cd ("xfrm: add espintcp (RFC 8229)")
Signed-off-by: Hagar Hemdan <hagarhem@amazon.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmZFcFwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuP1EADKJOJRtcvO/av2cUR+HZFDC+s/jBwHIJK+
4UY633zQlxjqc7dt4rX7zk/uk4mkhZnsGY+wS6xH08kB3VO9YksrwREVt6Ur9lP8
UXNVPpPcZ7fFcIp41rYkZX9pTDp2N8z2qsVg7V8wcXJ7EeTXd6L4ZLjhfCiHvs2s
i6yIEwLrW+voYuqSFV7vWBIM3mSXSRTIiO2DqRAOtT2lsj374DOthvP2lOSSb5wq
6TF4s4z3HMGs+HF3rjP5kJ6ic6RdC6i31lzEivUMhwCiKN1AZXdp96KXaC+NVPRV
t5//EdS+pSenQgkg6XH7d5kzFoCUFJfVZt05w0GCqMA081Q9ySjUymN1zedJbGd9
8CDlW01N8XLqG6+F9yakJLSFY+mUFGduPuueTNiUJWP8kTkQCtYIRzZDeyjxQrE5
c17NW5S1uWkf26Ucyi1r+gxw9N4kGkuB3+NitC6DOc7BW5CocEIoqLWi/UH7cEZe
0v6loTakqBAdgh03RCDMUj9Rt/37pQs2KFT9/CazVpbkvkKsue4xK4K2CUFsxqOj
qcoc/LD62at4S3AUWwhUIs3YaQ7v/6AY5hIktqAwsFHmDffUbPdRrXWY1keKIprJ
4qS/sY0M+kvKGnp+80fPVHab9l6/fMLfabIyFuh0M3W/M4eHGt2YfKWreoGEy/1x
xLq2iq+ehw==
=S6Xt
-----END PGP SIGNATURE-----
Merge tag 'net-accept-more-20240515' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
"This adds support for IORING_CQE_F_SOCK_NONEMPTY for io_uring accept
requests.
This is very similar to previous work that enabled the same hint for
doing receives on sockets. By far the majority of the work here is
refactoring to enable the networking side to pass back whether or not
the socket had more pending requests after accepting the current one,
the last patch just wires it up for io_uring.
Not only does this enable applications to know whether there are more
connections to accept right now, it also enables smarter logic for
io_uring multishot accept on whether to retry immediately or wait for
a poll trigger"
* tag 'net-accept-more-20240515' of git://git.kernel.dk/linux:
io_uring/net: wire up IORING_CQE_F_SOCK_NONEMPTY for accept
net: pass back whether socket was empty post accept
net: have do_accept() take a struct proto_accept_arg argument
net: change proto and proto_ops accept type
We're going to send an RST due to invalid syn packet which is already
checked whether 1) it is in sequence, 2) it is a retransmitted skb.
As RFC 793 says, if the state of socket is not CLOSED/LISTEN/SYN-SENT,
then we should send an RST when receiving bad syn packet:
"fourth, check the SYN bit,...If the SYN is in the window it is an
error, send a reset"
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Link: https://lore.kernel.org/r/20240510122502.27850-6-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are two possible cases where TCP layer can send an RST. Since they
happen in the same place, I think using one independent reason is enough
to identify this special situation.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Link: https://lore.kernel.org/r/20240510122502.27850-5-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This adds an 'is_empty' argument to struct proto_accept_arg, which can
be used to pass back information on whether or not the given socket has
more connections to accept post the one just accepted.
To utilize this information, the caller should initialize the 'is_empty'
field to, eg, -1 and then check for 0/1 after the accept. If the field
has been set, the caller knows whether there are more pending connections
or not. If the field remains -1 after the accept call, the protocol
doesn't support passing back this information.
This patch wires it up for ipv4/6 TCP.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than pass in flags, error pointer, and whether this is a kernel
invocation or not, add a struct proto_accept_arg struct as the argument.
This then holds all of these arguments, and prepares accept for being
able to pass back more information.
No functional changes in this patch.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZkGcZAAKCRDbK58LschI
g6o6APwLsqhrM2w71VUN5ciCxu4H5VDtZp6wkdqtVbxxU4qNxQEApKgYgKt8ZLF3
Kily5c7m+S4ZXhMX21rb8JhSAz0dfQk=
=5Dk7
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-05-13
We've added 119 non-merge commits during the last 14 day(s) which contain
a total of 134 files changed, 9462 insertions(+), 4742 deletions(-).
The main changes are:
1) Add BPF JIT support for 32-bit ARCv2 processors, from Shahab Vahedi.
2) Add BPF range computation improvements to the verifier in particular
around XOR and OR operators, refactoring of checks for range computation
and relaxing MUL range computation so that src_reg can also be an unknown
scalar, from Cupertino Miranda.
3) Add support to attach kprobe BPF programs through kprobe_multi link in
a session mode, meaning, a BPF program is attached to both function entry
and return, the entry program can decide if the return program gets
executed and the entry program can share u64 cookie value with return
program. Session mode is a common use-case for tetragon and bpftrace,
from Jiri Olsa.
4) Fix a potential overflow in libbpf's ring__consume_n() and improve libbpf
as well as BPF selftest's struct_ops handling, from Andrii Nakryiko.
5) Improvements to BPF selftests in context of BPF gcc backend,
from Jose E. Marchesi & David Faust.
6) Migrate remaining BPF selftest tests from test_sock_addr.c to prog_test-
-style in order to retire the old test, run it in BPF CI and additionally
expand test coverage, from Jordan Rife.
7) Big batch for BPF selftest refactoring in order to remove duplicate code
around common network helpers, from Geliang Tang.
8) Another batch of improvements to BPF selftests to retire obsolete
bpf_tcp_helpers.h as everything is available vmlinux.h,
from Martin KaFai Lau.
9) Fix BPF map tear-down to not walk the map twice on free when both timer
and wq is used, from Benjamin Tissoires.
10) Fix BPF verifier assumptions about socket->sk that it can be non-NULL,
from Alexei Starovoitov.
11) Change BTF build scripts to using --btf_features for pahole v1.26+,
from Alan Maguire.
12) Small improvements to BPF reusing struct_size() and krealloc_array(),
from Andy Shevchenko.
13) Fix s390 JIT to emit a barrier for BPF_FETCH instructions,
from Ilya Leoshkevich.
14) Extend TCP ->cong_control() callback in order to feed in ack and
flag parameters and allow write-access to tp->snd_cwnd_stamp
from BPF program, from Miao Xu.
15) Add support for internal-only per-CPU instructions to inline
bpf_get_smp_processor_id() helper call for arm64 and riscv64 BPF JITs,
from Puranjay Mohan.
16) Follow-up to remove the redundant ethtool.h from tooling infrastructure,
from Tushar Vyavahare.
17) Extend libbpf to support "module:<function>" syntax for tracing
programs, from Viktor Malik.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
bpf: make list_for_each_entry portable
bpf: ignore expected GCC warning in test_global_func10.c
bpf: disable strict aliasing in test_global_func9.c
selftests/bpf: Free strdup memory in xdp_hw_metadata
selftests/bpf: Fix a few tests for GCC related warnings.
bpf: avoid gcc overflow warning in test_xdp_vlan.c
tools: remove redundant ethtool.h from tooling infra
selftests/bpf: Expand ATTACH_REJECT tests
selftests/bpf: Expand getsockname and getpeername tests
sefltests/bpf: Expand sockaddr hook deny tests
selftests/bpf: Expand sockaddr program return value tests
selftests/bpf: Retire test_sock_addr.(c|sh)
selftests/bpf: Remove redundant sendmsg test cases
selftests/bpf: Migrate ATTACH_REJECT test cases
selftests/bpf: Migrate expected_attach_type tests
selftests/bpf: Migrate wildcard destination rewrite test
selftests/bpf: Migrate sendmsg6 v4 mapped address tests
selftests/bpf: Migrate sendmsg deny test cases
selftests/bpf: Migrate WILDCARD_IP test
selftests/bpf: Handle SYSCALL_EPERM and SYSCALL_ENOTSUPP test cases
...
====================
Link: https://lore.kernel.org/r/20240513134114.17575-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A way for an application to know if an MPTCP connection fell back to TCP
is to use getsockopt(MPTCP_INFO) and look for errors. The issue with
this technique is that the same errors -- EOPNOTSUPP (IPv4) and
ENOPROTOOPT (IPv6) -- are returned if there was a fallback, *or* if the
kernel doesn't support this socket option. The userspace then has to
look at the kernel version to understand what the errors mean.
It is not clean, and it doesn't take into account older kernels where
the socket option has been backported. A cleaner way would be to expose
this info to the TCP socket level. In case of MPTCP socket where no
fallback happened, the socket options for the TCP level will be handled
in MPTCP code, in mptcp_getsockopt_sol_tcp(). If not, that will be in
TCP code, in do_tcp_getsockopt(). So MPTCP simply has to set the value
1, while TCP has to set 0.
If the socket option is not supported, one of these two errors will be
reported:
- EOPNOTSUPP (95 - Operation not supported) for MPTCP sockets
- ENOPROTOOPT (92 - Protocol not available) for TCP sockets, e.g. on the
socket received after an 'accept()', when the client didn't request to
use MPTCP: this socket will be a TCP one, even if the listen socket
was an MPTCP one.
With this new option, the kernel can return a clear answer to both "Is
this kernel new enough to tell me the fallback status?" and "If it is
new enough, is it currently a TCP or MPTCP socket?" questions, while not
breaking the previous method.
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240509-upstream-net-next-20240509-mptcp-tcp_is_mptcp-v1-1-f846df999202@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
iph->id, ...) against all packets in a loop. These flush checks are used in
all merging UDP and TCP flows.
These checks need to be done only once and only against the found p skb,
since they only affect flush and not same_flow.
This patch leverages correct network header offsets from the cb for both
outer and inner network headers - allowing these checks to be done only
once, in tcp_gro_receive and udp_gro_receive_segment. As a result,
NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are
more declarative and contained in inet_gro_flush, thus removing the need
for flush_id in napi_gro_cb.
This results in less parsing code for non-loop flush tests for TCP and UDP
flows.
To make sure results are not within noise range - I've made netfilter drop
all TCP packets, and measured CPU performance in GRO (in this case GRO is
responsible for about 50% of the CPU utilization).
perf top while replaying 64 parallel IP/TCP streams merging in GRO:
(gro_receive_network_flush is compiled inline to tcp_gro_receive)
net-next:
6.94% [kernel] [k] inet_gro_receive
3.02% [kernel] [k] tcp_gro_receive
patch applied:
4.27% [kernel] [k] tcp_gro_receive
4.22% [kernel] [k] inet_gro_receive
perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same
results for any encapsulation, in this case inet_gro_receive is top
offender in net-next)
net-next:
10.09% [kernel] [k] inet_gro_receive
2.08% [kernel] [k] tcp_gro_receive
patch applied:
6.97% [kernel] [k] inet_gro_receive
3.68% [kernel] [k] tcp_gro_receive
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240509190819.2985-3-richardbgobert@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch converts references of skb->network_header to napi_gro_cb's
network_offset and inner_network_offset.
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240509190819.2985-2-richardbgobert@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I missed that (struct ifaddrmsg)->ifa_flags was only 8bits,
while (struct in_ifaddr)->ifa_flags is 32bits.
Use a temporary 32bit variable as I did in set_ifa_lifetime()
and check_lifetime().
Fixes: 3ddc2231c8 ("inet: annotate data-races around ifa->ifa_flags")
Reported-by: Yu Watanabe <watanabe.yu@gmail.com>
Dianosed-by: Yu Watanabe <watanabe.yu@gmail.com>
Closes: https://github.com/systemd/systemd/pull/32666#issuecomment-2103977928
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240510072932.2678952-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmZA578ACgkQ1V2XiooU
IOS18g//Zyuv+23GcUM7+FEXrlMN658xJyWiYvKjOaZZx5ZiV0QdZc4cbfPFD44p
qZBMmVC/WVoC89SLdwH1W47KoJU0xsK3/OdqGHHeNJ69111wIQMpOLfAetS2K0mb
F+Ue2vyWg1GQDICGsCdenHX7ihtVvnJJkomxc+3ObxtLCNsb2Dsr6JM5hMVP5Bil
4UZnPsrgfWy3A8O92burlPVE1sTWDFfFUGIf8geJc4QadwkgufkzxMhXNO7xHlpG
EZ99s8FPyD3R6tRPjf4gwdjr7JjinrdrYjZDuS4d3Uv8pKlUqcx8PgXG51/unr/y
qlynLXtEc1QU6SO2jENosHAG2/LQG2zsYEiiLFCP+a1JOtOxevZQKx8MyAFW8xDX
+RQhcBpTBocIyJ/tCDoM9lp69iYTR196Ct48v6pSGMNhZcddT4K4BkUL47GEs8T3
IA5x8h5gV2Q9ECMgqSaycdUsfLNgE/6fWx0ROs/wo3tMsgWrXCSJi8RFtN1sNbIO
rfuNnQiETIFBkQxBi7um8jadxdfIHm65cjgZBCyVbNNml3JwjYvXLxCXt2G7LxC4
Sg4nZvIbqWIifoMc1aQKypvFZjzzsWtFmYCuEUVLrnpj2SFTyh5CNzNo3MlHf7LG
sRb/XubdY6e0spLzd5VDjwH5qOT3poWAccatRr5BVUarxXCD5Vs=
=RD9p
-----END PGP SIGNATURE-----
Merge tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
Patch #1 skips transaction if object type provides no .update interface.
Patch #2 skips NETDEV_CHANGENAME which is unused.
Patch #3 enables conntrack to handle Multicast Router Advertisements and
Multicast Router Solicitations from the Multicast Router Discovery
protocol (RFC4286) as untracked opposed to invalid packets.
From Linus Luessing.
Patch #4 updates DCCP conntracker to mark invalid as invalid, instead of
dropping them, from Jason Xing.
Patch #5 uses NF_DROP instead of -NF_DROP since NF_DROP is 0,
also from Jason.
Patch #6 removes reference in netfilter's sysctl documentation on pickup
entries which were already removed by Florian Westphal.
Patch #7 removes check for IPS_OFFLOAD flag to disable early drop which
allows to evict entries from the conntrack table,
also from Florian.
Patches #8 to #16 updates nf_tables pipapo set backend to allocate
the datastructure copy on-demand from preparation phase,
to better deal with OOM situations where .commit step is too late
to fail. Series from Florian Westphal.
Patch #17 adds a selftest with packetdrill to cover conntrack TCP state
transitions, also from Florian.
Patch #18 use GFP_KERNEL to clone elements from control plane to avoid
quick atomic reserves exhaustion with large sets, reporter refers
to million entries magnitude.
* tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: allow clone callbacks to sleep
selftests: netfilter: add packetdrill based conntrack tests
netfilter: nft_set_pipapo: remove dirty flag
netfilter: nft_set_pipapo: move cloning of match info to insert/removal path
netfilter: nft_set_pipapo: prepare pipapo_get helper for on-demand clone
netfilter: nft_set_pipapo: merge deactivate helper into caller
netfilter: nft_set_pipapo: prepare walk function for on-demand clone
netfilter: nft_set_pipapo: prepare destroy function for on-demand clone
netfilter: nft_set_pipapo: make pipapo_clone helper return NULL
netfilter: nft_set_pipapo: move prove_locking helper around
netfilter: conntrack: remove flowtable early-drop test
netfilter: conntrack: documentation: remove reference to non-existent sysctl
netfilter: use NF_DROP instead of -NF_DROP
netfilter: conntrack: dccp: try not to drop skb in conntrack
netfilter: conntrack: fix ct-state for ICMPv6 Multicast Router Discovery
netfilter: nf_tables: remove NETDEV_CHANGENAME from netdev chain event handler
netfilter: nf_tables: skip transaction if update object is not implemented
====================
Link: https://lore.kernel.org/r/20240512161436.168973-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
DCCP is going away soon, and had no twsk_unique() method.
We can directly call tcp_twsk_unique() for TCP sockets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240507164140.940547-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
35d92abfba ("net: hns3: fix kernel crash when devlink reload during initialization")
2a1a1a7b5f ("net: hns3: add command queue trace for hns3")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce a tracepoint for icmp_send, which can help users to get more
detail information conveniently when icmp abnormal events happen.
1. Giving an usecase example:
=============================
When an application experiences packet loss due to an unreachable UDP
destination port, the kernel will send an exception message through the
icmp_send function. By adding a trace point for icmp_send, developers or
system administrators can obtain detailed information about the UDP
packet loss, including the type, code, source address, destination address,
source port, and destination port. This facilitates the trouble-shooting
of UDP packet loss issues especially for those network-service
applications.
2. Operation Instructions:
==========================
Switch to the tracing directory.
cd /sys/kernel/tracing
Filter for destination port unreachable.
echo "type==3 && code==3" > events/icmp/icmp_send/filter
Enable trace event.
echo 1 > events/icmp/icmp_send/enable
3. Result View:
================
udp_client_erro-11370 [002] ...s.12 124.728002:
icmp_send: icmp_send: type=3, code=3.
From 127.0.0.1:41895 to 127.0.0.1:6666 ulen=23
skbaddr=00000000589b167a
Signed-off-by: Peilin He <he.peilin@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: Liu Chun <liu.chun2@zte.com.cn>
Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon reported that ndo_change_mtu() methods were never
updated to use WRITE_ONCE(dev->mtu, new_mtu) as hinted
in commit 501a90c945 ("inet: protect against too small
mtu values.")
We read dev->mtu without holding RTNL in many places,
with READ_ONCE() annotations.
It is time to take care of ndo_change_mtu() methods
to use corresponding WRITE_ONCE()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Simon Horman <horms@kernel.org>
Closes: https://lore.kernel.org/netdev/20240505144608.GB67882@kernel.org/
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20240506102812.3025432-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We want to be able to run rtnl_fill_ifinfo() under RCU protection
instead of RTNL in the future.
All rtnl_link_ops->get_link_net() methods already using dev_net()
are ready. I added READ_ONCE() annotations on others.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmY0mEIACgkQrB3Eaf9P
W7fU7g//bQyydwei/4Vo+cNCPp82k8wL/qhDY3IjN10PfJOSNmeCAcSgkuuHTRSx
g/hoxZEVzLrQT5bt+Sb38JxADFiL787GjdGEUy1gzF7CnDKcGT5KnydYNjDqDVGt
nOv9kAGfWIkMKdNqrhHifddPMWd+ZqvpUcFz5olvqIE2mpNgMwy2i3NID9bNAV31
v5AEvNINa1LKOhX9cEka8iPQXwp+I6yTLqyOd4VciOuFr8dPg0FQqFaYR+OtMsV0
kIxdGTVfmRWaNgq/Tsg4z/2rXEwmEjTWzAhNVGu8o8L3JozXOvbjIrDG7Ws6qB3V
XTFl8ueRMk0UCTlY/QAfip5H7IlAo+H0FUBC45FNP1UhHeWisXT4D5rqAEqQTlZR
bddtuueLZyKclFpXRNi+/8vdDrXhhEzeNINkc52Ef33rUTtZJR8bXrEUKzaYCIuF
ldub0PA3+e5wvwIxq5/Chc/+MIaIHnXBMUmbCJSPnMrupBQtO+i6arPQcbtaBAgS
YyVGTRk9YN0UAjSriIuiViLlgUCMsvsWgfSz9rd0PE54MFBrvcLPeCtPxKZ+sTVG
Y2iSZ8d3ThvsMiQVNU8gj3SlTY1oTvuaijDDGjnR0nWkxV9LMJHCPKfIzsbOKLJe
+ee5hKP4TOFygnV58BkqdGK/LavNpouTIbrM43hgmJ0IX9kSt4o=
=QiGZ
-----END PGP SIGNATURE-----
Merge tag 'ipsec-next-2024-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2024-05-03
1) Remove Obsolete UDP_ENCAP_ESPINUDP_NON_IKE Support.
This was defined by an early version of an IETF draft
that did not make it to a standard.
2) Introduce direction attribute for xfrm states.
xfrm states have a direction, a stsate can be used
either for input or output packet processing.
Add a direction to xfrm states to make it clear
for what a xfrm state is used.
* tag 'ipsec-next-2024-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: Restrict SA direction attribute to specific netlink message types
xfrm: Add dir validation to "in" data path lookup
xfrm: Add dir validation to "out" data path lookup
xfrm: Add Direction to the SA in or out
udpencap: Remove Obsolete UDP_ENCAP_ESPINUDP_NON_IKE Support
====================
Link: https://lore.kernel.org/r/20240503082732.2835810-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At the beginning in 2009 one patch [1] introduced collecting drop
counter in nf_conntrack_in() by returning -NF_DROP. Later, another
patch [2] changed the return value of tcp_packet() which now is
renamed to nf_conntrack_tcp_packet() from -NF_DROP to NF_DROP. As
we can see, that -NF_DROP should be corrected.
Similarly, there are other two points where the -NF_DROP is used.
Well, as NF_DROP is equal to 0, inverting NF_DROP makes no sense
as patch [2] said many years ago.
[1]
commit 7d1e04598e ("netfilter: nf_conntrack: account packets drop by tcp_packet()")
[2]
commit ec8d540969 ("netfilter: conntrack: fix dropping packet after l4proto->packet()")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When forwarding TCP after GRO, software segmentation is very expensive,
especially when the checksum needs to be recalculated.
One case where that's currently unavoidable is when routing packets over
PPPoE. Performance improves significantly when using fraglist GRO
implemented in the same way as for UDP.
When NETIF_F_GRO_FRAGLIST is enabled, perform a lookup for an established
socket in the same netns as the receiving device. While this may not
cover all relevant use cases in multi-netns configurations, it should be
good enough for most configurations that need this.
Here's a measurement of running 2 TCP streams through a MediaTek MT7622
device (2-core Cortex-A53), which runs NAT with flow offload enabled from
one ethernet port to PPPoE on another ethernet port + cake qdisc set to
1Gbps.
rx-gro-list off: 630 Mbit/s, CPU 35% idle
rx-gro-list on: 770 Mbit/s, CPU 40% idle
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pull the code out of tcp_gro_receive in order to access the tcp header
from tcp4/6_gro_receive.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This pulls the flow port matching out of tcp_gro_receive, so that it can be
reused for the next change, which adds the TCP fraglist GRO heuristic.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This implements fraglist GRO similar to how it's handled in UDP, however
no functional changes are added yet. The next change adds a heuristic for
using fraglist GRO instead of regular GRO.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Preparation for adding TCP fraglist GRO support. It expects packets to be
combined in a similar way as UDP fraglist GSO packets.
For IPv4 packets, NAT is handled in the same way as UDP fraglist GSO.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This helper function will be used for TCP fraglist GRO support
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmYzUX4ACgkQrB3Eaf9P
W7e7qQ/+LgkDkL/LyXv3kAPN8b2SapIiIajarlRfgdPYdM6PP+kzGJxC/t5NZ2HE
Q1N6K0hIL042rna1/grkUKHeQn4PXUlfT6y8YgjiuCvpFDVNb2ofyl3AmxjJnH1A
iwMWf6EhwGoxbVs3DbDJ554U8T0nBJeZ+MXLF/4BI13bNdj7stbcKRqj6KHC5sQO
JgtFVX+ip6LLGL7rR4YMv2h2p1sSu3Vp6bMcfM85I4ENec0UIjgsAF9P0buPl4gr
2oKtMxga86CQWcymKo6DI+MsBBk91wvM+5/T9zQtpdxDuNEQNrotCoCc0Kd03xmP
EGzJagwVGFj08kYJ7qICDwpXWCpLDVumoxWFNBWmAW9uNEkUW8Tiqmm8eW2Azs3d
VAUFcyzHr7mkAaqSDDdE4J+L276Z+dS+BHPnoF6Sp+ctuvSmmeS6lyY9mGnFGH7H
OiqFKonjBEC5iNAMIXF3WRKueMDdbbDFwHK4NEiTIUSeAMqETUP2sBC1GNTaN8YJ
soKYtwUtiag2P44ZYy5UYeKJlaBnT1FOZHLs24iCOY1XjqJerwjefQuBO6HDBz/I
vkaSY6ak6uRsAdfst45uQNPfxlJkFDbwRDowFCdhu5qG7bifqnXstQmNta2U1109
4e3vt5jPowN/9bCtMx7Z+ftmmTsapxYCu5ZYRVAq82WahsXFPtE=
=aeD1
-----END PGP SIGNATURE-----
Merge tag 'ipsec-2024-05-02' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2024-05-02
1) Fix an error pointer dereference in xfrm_in_fwd_icmp.
From Antony Antony.
2) Preserve vlan tags for ESP transport mode software GRO.
From Paul Davey.
3) Fix a spelling mistake in an uapi xfrm.h comment.
From Anotny Antony.
* tag 'ipsec-2024-05-02' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: Correct spelling mistake in xfrm.h comment
xfrm: Preserve vlan tags for transport mode software GRO
xfrm: fix possible derferencing in error path
====================
Link: https://lore.kernel.org/r/20240502084838.2269355-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)
* Remove sentinel element from ctl_table structs.
* Remove the zeroing out of an array element (to make it look like a
sentinel) in sysctl_route_net_init And ipv6_route_sysctl_init.
This is not longer needed and is safe after commit c899710fe7
("networking: Update to register_net_sysctl_sz") added the array size
to the ctl_table registration.
* Remove extra sentinel element in the declaration of devinet_vars.
* Removed the "-1" in __devinet_sysctl_register, sysctl_route_net_init,
ipv6_sysctl_net_init and ipv4_sysctl_init_net that adjusted for having
an extra empty element when looping over ctl_table arrays
* Replace the for loop stop condition in __addrconf_sysctl_register that
tests for procname == NULL with one that depends on array size
* Removing the unprivileged user check in ipv6_route_sysctl_init is
safe as it is replaced by calling ipv6_route_sysctl_table_size;
introduced in commit c899710fe7 ("networking: Update to
register_net_sysctl_sz")
* Use a table_size variable to keep the value of ARRAY_SIZE
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows the write of tp->snd_cwnd_stamp in a bpf tcp
ca program. An use case of writing this field is to keep track
of the time whenever tp->snd_cwnd is raised or reduced inside
the `cong_control` callback.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Miao Xu <miaxu@meta.com>
Link: https://lore.kernel.org/r/20240502042318.801932-3-miaxu@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This patch adds two new arguments for cong_control of struct
tcp_congestion_ops:
- ack
- flag
These two arguments are inherited from the caller tcp_cong_control in
tcp_intput.c. One use case of them is to update cwnd and pacing rate
inside cong_control based on the info they provide. For example, the
flag can be used to decide if it is the right time to raise or reduce a
sender's cwnd.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Miao Xu <miaxu@meta.com>
Link: https://lore.kernel.org/r/20240502042318.801932-2-miaxu@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Cross-merge networking fixes after downstream PR.
Conflicts:
include/linux/filter.h
kernel/bpf/core.c
66e13b615a ("bpf: verifier: prevent userspace memory access")
d503a04f8b ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
GRO-GSO path is supposed to be transparent and as such L3 flush checks are
relevant to all UDP flows merging in GRO. This patch uses the same logic
and code from tcp_gro_receive, terminating merge if flush is non zero.
Fixes: e20cf8d3f1 ("udp: implement GRO for plain UDP sockets.")
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commits a602456 ("udp: Add GRO functions to UDP socket") and 57c67ff ("udp:
additional GRO support") introduce incorrect usage of {ip,ipv6}_hdr in the
complete phase of gro. The functions always return skb->network_header,
which in the case of encapsulated packets at the gro complete phase, is
always set to the innermost L3 of the packet. That means that calling
{ip,ipv6}_hdr for skbs which completed the GRO receive phase (both in
gro_list and *_gro_complete) when parsing an encapsulated packet's _outer_
L3/L4 may return an unexpected value.
This incorrect usage leads to a bug in GRO's UDP socket lookup.
udp{4,6}_lib_lookup_skb functions use ip_hdr/ipv6_hdr respectively. These
*_hdr functions return network_header which will point to the innermost L3,
resulting in the wrong offset being used in __udp{4,6}_lib_lookup with
encapsulated packets.
This patch adds network_offset and inner_network_offset to napi_gro_cb, and
makes sure both are set correctly.
To fix the issue, network_offsets union is used inside napi_gro_cb, in
which both the outer and the inner network offsets are saved.
Reproduction example:
Endpoint configuration example (fou + local address bind)
# ip fou add port 6666 ipproto 4
# ip link add name tun1 type ipip remote 2.2.2.1 local 2.2.2.2 encap fou encap-dport 5555 encap-sport 6666 mode ipip
# ip link set tun1 up
# ip a add 1.1.1.2/24 dev tun1
Netperf TCP_STREAM result on net-next before patch is applied:
net-next main, GRO enabled:
$ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
131072 16384 16384 5.28 2.37
net-next main, GRO disabled:
$ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
131072 16384 16384 5.01 2745.06
patch applied, GRO enabled:
$ netperf -H 1.1.1.2 -t TCP_STREAM -l 5
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
131072 16384 16384 5.01 2877.38
Fixes: a6024562ff ("udp: Add GRO functions to UDP socket")
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
KMSAN reported uninit-value access in __ip_make_skb() [1]. __ip_make_skb()
tests HDRINCL to know if the skb has icmphdr. However, HDRINCL can cause a
race condition. If calling setsockopt(2) with IP_HDRINCL changes HDRINCL
while __ip_make_skb() is running, the function will access icmphdr in the
skb even if it is not included. This causes the issue reported by KMSAN.
Check FLOWI_FLAG_KNOWN_NH on fl4->flowi4_flags instead of testing HDRINCL
on the socket.
Also, fl4->fl4_icmp_type and fl4->fl4_icmp_code are not initialized. These
are union in struct flowi4 and are implicitly initialized by
flowi4_init_output(), but we should not rely on specific union layout.
Initialize these explicitly in raw_sendmsg().
[1]
BUG: KMSAN: uninit-value in __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
__ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
ip_finish_skb include/net/ip.h:243 [inline]
ip_push_pending_frames+0x4c/0x5c0 net/ipv4/ip_output.c:1508
raw_sendmsg+0x2381/0x2690 net/ipv4/raw.c:654
inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x274/0x3c0 net/socket.c:745
__sys_sendto+0x62c/0x7b0 net/socket.c:2191
__do_sys_sendto net/socket.c:2203 [inline]
__se_sys_sendto net/socket.c:2199 [inline]
__x64_sys_sendto+0x130/0x200 net/socket.c:2199
do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x6d/0x75
Uninit was created at:
slab_post_alloc_hook mm/slub.c:3804 [inline]
slab_alloc_node mm/slub.c:3845 [inline]
kmem_cache_alloc_node+0x5f6/0xc50 mm/slub.c:3888
kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:577
__alloc_skb+0x35a/0x7c0 net/core/skbuff.c:668
alloc_skb include/linux/skbuff.h:1318 [inline]
__ip_append_data+0x49ab/0x68c0 net/ipv4/ip_output.c:1128
ip_append_data+0x1e7/0x260 net/ipv4/ip_output.c:1365
raw_sendmsg+0x22b1/0x2690 net/ipv4/raw.c:648
inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x274/0x3c0 net/socket.c:745
__sys_sendto+0x62c/0x7b0 net/socket.c:2191
__do_sys_sendto net/socket.c:2203 [inline]
__se_sys_sendto net/socket.c:2199 [inline]
__x64_sys_sendto+0x130/0x200 net/socket.c:2199
do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x6d/0x75
CPU: 1 PID: 15709 Comm: syz-executor.7 Not tainted 6.8.0-11567-gb3603fcb79b1 #25
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
Fixes: 99e5acae19 ("ipv4: Fix potential uninit variable access bug in __ip_make_skb()")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Link: https://lore.kernel.org/r/20240430123945.2057348-1-syoshida@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ioctl(SIOCGARP) holds rtnl_lock() to get netdev by __dev_get_by_name()
and copy dev->name safely and calls neigh_lookup() later, which looks
up a neighbour entry under RCU.
Let's replace __dev_get_by_name() with dev_get_by_name_rcu() and strscpy()
with netdev_copy_name() to avoid locking rtnl_lock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
arp_ioctl() holds rtnl_lock() first regardless of cmd (SIOCDARP,
SIOCSARP, and SIOCGARP) to get net_device by __dev_get_by_name()
and copy dev->name safely.
In the SIOCGARP path, arp_req_get() calls neigh_lookup(), which
looks up a neighbour entry under RCU.
We will extend the RCU section not to take rtnl_lock() and instead
use dev_get_by_name_rcu() for SIOCGARP.
As a preparation, let's move __dev_get_by_name() into another
function and call it from arp_req_delete(), arp_req_set(), and
arp_req_get().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When ioctl(SIOCDARP/SIOCSARP) is issued for non-proxy entry (no ATF_COM)
without arpreq.arp_dev[] set, arp_req_set() and arp_req_delete() looks up
dev based on IPv4 address by ip_route_output().
Let's factorise the same code as arp_req_dev().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When ioctl(SIOCDARP/SIOCSARP) is issued with ATF_PUBL, r.arp_netmask
must be 0.0.0.0 or 255.255.255.255.
Currently, the netmask is validated in arp_req_delete_public() or
arp_req_set_public() under rtnl_lock().
We have ATF_NETMASK test in arp_ioctl() before holding rtnl_lock(),
so let's move the netmask validation there.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In arp_req_set(), if ATF_PERM is set in arpreq.arp_flags,
ATF_COM is set automatically.
The flag will be used later for neigh_update() only when
a neighbour entry is found.
Let's set ATF_COM just before calling neigh_update().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240430015813.71143-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move some proto memory definitions out of <net/sock.h>
Very few files need them, and following patch
will include <net/hotdata.h> from <net/proto_memory.h>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240429134025.1233626-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_out_of_memory() has a single caller: tcp_check_oom().
Following patch will also make sk_memory_allocated()
not anymore visible from <net/sock.h> and <net/tcp.h>
Add const qualifier to sock argument of tcp_out_of_memory()
and tcp_check_oom().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240429134025.1233626-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sysctl_max_skb_frags is used in TCP and MPTCP fast paths,
move it to net_hodata for better cache locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240429134025.1233626-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I added dst_rt6_info() in commit
e8dfd42c17 ("ipv6: introduce dst_rt6_info() helper")
This patch does a similar change for IPv4.
Instead of (struct rtable *)dst casts, we can use :
#define dst_rtable(_ptr) \
container_of_const(_ptr, struct rtable, dst)
Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/20240429133009.1227754-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZi9+AAAKCRDbK58LschI
g0nEAP487m7L0nLVriC2oIOWsi29tklW3etm6DO7gmGRGIHgrgEAnMyV1xBj3bGj
v6jJwDcybCym1hLx+1x1JCZ4eoAFswE=
=xbna
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-04-29
We've added 147 non-merge commits during the last 32 day(s) which contain
a total of 158 files changed, 9400 insertions(+), 2213 deletions(-).
The main changes are:
1) Add an internal-only BPF per-CPU instruction for resolving per-CPU
memory addresses and implement support in x86 BPF JIT. This allows
inlining per-CPU array and hashmap lookups
and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko.
2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song.
3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
atomics in bpf_arena which can be JITed as a single x86 instruction,
from Alexei Starovoitov.
4) Add support for passing mark with bpf_fib_lookup helper,
from Anton Protopopov.
5) Add a new bpf_wq API for deferring events and refactor sleepable
bpf_timer code to keep common code where possible,
from Benjamin Tissoires.
6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs
to check when NULL is passed for non-NULLable parameters,
from Eduard Zingerman.
7) Harden the BPF verifier's and/or/xor value tracking,
from Harishankar Vishwanathan.
8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel
crypto subsystem, from Vadim Fedorenko.
9) Various improvements to the BPF instruction set standardization doc,
from Dave Thaler.
10) Extend libbpf APIs to partially consume items from the BPF ringbuffer,
from Andrea Righi.
11) Bigger batch of BPF selftests refactoring to use common network helpers
and to drop duplicate code, from Geliang Tang.
12) Support bpf_tail_call_static() helper for BPF programs with GCC 13,
from Jose E. Marchesi.
13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
program to have code sections where preemption is disabled,
from Kumar Kartikeya Dwivedi.
14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs,
from David Vernet.
15) Extend the BPF verifier to allow different input maps for a given
bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu.
16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions
for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan.
17) Shut up a false-positive KMSAN splat in interpreter mode by unpoison
the stack memory, from Martin KaFai Lau.
18) Improve xsk selftest coverage with new tests on maximum and minimum
hardware ring size configurations, from Tushar Vyavahare.
19) Various ReST man pages fixes as well as documentation and bash completion
improvements for bpftool, from Rameez Rehman & Quentin Monnet.
20) Fix libbpf with regards to dumping subsequent char arrays,
from Quentin Deslandes.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits)
bpf, docs: Clarify PC use in instruction-set.rst
bpf_helpers.h: Define bpf_tail_call_static when building with GCC
bpf, docs: Add introduction for use in the ISA Internet Draft
selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us
bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args
selftests/bpf: dummy_st_ops should reject 0 for non-nullable params
bpf: check bpf_dummy_struct_ops program params for test runs
selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops
selftests/bpf: adjust dummy_st_ops_success to detect additional error
bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable
selftests/bpf: Add ring_buffer__consume_n test.
bpf: Add bpf_guard_preempt() convenience macro
selftests: bpf: crypto: add benchmark for crypto functions
selftests: bpf: crypto skcipher algo selftests
bpf: crypto: add skcipher to bpf crypto
bpf: make common crypto API for TC/XDP programs
bpf: update the comment for BTF_FIELDS_MAX
selftests/bpf: Fix wq test.
selftests/bpf: Use make_sockaddr in test_sock_addr
selftests/bpf: Use connect_to_addr in test_sock_addr
...
====================
Link: https://lore.kernel.org/r/20240429131657.19423-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of (struct rt6_info *)dst casts, we can use :
#define dst_rt6_info(_ptr) \
container_of_const(_ptr, struct rt6_info, dst)
Some places needed missing const qualifiers :
ip6_confirm_neigh(), ipv6_anycast_destination(),
ipv6_unicast_destination(), has_gateway()
v2: added missing parts (David Ahern)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a followup of commit c4e86b4363 ("net: add two more
call_rcu_hurry()")
Our reference to ifa->ifa_dev must be freed ASAP
to release the reference to the netdev the same way.
inet_rcu_free_ifa()
in_dev_put()
-> in_dev_finish_destroy()
-> netdev_put()
This should speedup device/netns dismantles when CONFIG_RCU_LAZY=y
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I forgot to call tcp_skb_collapse_tstamp() in the
case we consume the second skb in write queue.
Neal suggested to create a common helper used by tcp_mtu_probe()
and tcp_grow_skb().
Fixes: 8ee602c635 ("tcp: try to send bigger TSO packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240425193450.411640-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At last, we should let it work by introducing this reset reason in
trace world.
One of the possible expected outputs is:
... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
state=TCP_ESTABLISHED reason=NOT_SPECIFIED
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reuse the dropreason logic to show the exact reason of tcp reset,
so we can finally display the corresponding item in enum sk_reset_reason
instead of reinventing new reset reasons. This patch replaces all
the prior NOT_SPECIFIED reasons.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Like what we did to passive reset:
only passing possible reset reason in each active reset path.
No functional changes.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Adjust the parameter and support passing reason of reset which
is for now NOT_SPECIFIED. No functional changes.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The software GRO path for esp transport mode uses skb_mac_header_rebuild
prior to re-injecting the packet via the xfrm_napi_dev. This only
copies skb->mac_len bytes of header which may not be sufficient if the
packet contains 802.1Q tags or other VLAN tags. Worse copying only the
initial header will leave a packet marked as being VLAN tagged but
without the corresponding tag leading to mangling when it is later
untagged.
The VLAN tags are important when receiving the decrypted esp transport
mode packet after GRO processing to ensure it is received on the correct
interface.
Therefore record the full mac header length in xfrm*_transport_input for
later use in corresponding xfrm*_transport_finish to copy the entire mac
header when rebuilding the mac header for GRO. The skb->data pointer is
left pointing skb->mac_header bytes after the start of the mac header as
is expected by the network stack and network and transport header
offsets reset to this location.
Fixes: 7785bba299 ("esp: Add a software GRO codepath")
Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Two important arguments in RTT estimation, mrtt and srtt, are passed to
tcp_bpf_rtt(), so that bpf programs get more information about RTT
computation in BPF_SOCK_OPS_RTT_CB.
The difference between bpf_sock_ops->srtt_us and the srtt here is: the
former is an old rtt before update, while srtt passed by tcp_bpf_rtt()
is that after update.
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240425161724.73707-2-lulie@linux.alibaba.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
While testing TCP performance with latest trees,
I saw suspect SOCKET_BACKLOG drops.
tcp_add_backlog() computes its limit with :
limit = (u32)READ_ONCE(sk->sk_rcvbuf) +
(u32)(READ_ONCE(sk->sk_sndbuf) >> 1);
limit += 64 * 1024;
This does not take into account that sk->sk_backlog.len
is reset only at the very end of __release_sock().
Both sk->sk_backlog.len and sk->sk_rmem_alloc could reach
sk_rcvbuf in normal conditions.
We should double sk->sk_rcvbuf contribution in the formula
to absorb bubbles in the backlog, which happen more often
for very fast flows.
This change maintains decent protection against abuses.
Fixes: c377411f24 ("net: sk_add_backlog() take rmem_alloc into account")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240423125620.3309458-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Marking TCP_SKB_CB(skb)->sacked with TCPCB_EVER_RETRANS after the
traceopint (trace_tcp_retransmit_skb), then we can get the
retransmission efficiency by counting skbs w/ and w/o TCPCB_EVER_RETRANS
mark in this tracepoint.
We have discussed to achieve this with BPF_SOCK_OPS in [0], and using
tracepoint is thought to be a better solution.
[0]
https://lore.kernel.org/all/20240417124622.35333-1-lulie@linux.alibaba.com/
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since call_rcu, which is called in the hlist_for_each_entry_rcu traversal
of tcp_ao_connect_init, is not part of the RCU read critical section, it
is possible that the RCU grace period will pass during the traversal and
the key will be free.
To prevent this, it should be changed to hlist_for_each_entry_safe.
Fixes: 7c2ffaf21b ("net/tcp: Calculate TCP-AO traffic keys")
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://lore.kernel.org/r/ZiYu9NJ/ClR8uSkH@v4bel-B760M-AORUS-ELITE-AX
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While investigating TCP performance, I found that TCP would
sometimes send big skbs followed by a single MSS skb,
in a 'locked' pattern.
For instance, BIG TCP is enabled, MSS is set to have 4096 bytes
of payload per segment. gso_max_size is set to 181000.
This means that an optimal TCP packet size should contain
44 * 4096 = 180224 bytes of payload,
However, I was seeing packets sizes interleaved in this pattern:
172032, 8192, 172032, 8192, 172032, 8192, <repeat>
tcp_tso_should_defer() heuristic is defeated, because after a split of
a packet in write queue for whatever reason (this might be a too small
CWND or a small enough pacing_rate),
the leftover packet in the queue is smaller than the optimal size.
It is time to try to make 'leftover packets' bigger so that
tcp_tso_should_defer() can give its full potential.
After this patch, we can see the following output:
14:13:34.009273 IP6 sender > receiver: Flags [P.], seq 4048380:4098360, ack 1, win 256, options [nop,nop,TS val 3425678144 ecr 1561784500], length 49980
14:13:34.010272 IP6 sender > receiver: Flags [P.], seq 4098360:4148340, ack 1, win 256, options [nop,nop,TS val 3425678145 ecr 1561784501], length 49980
14:13:34.011271 IP6 sender > receiver: Flags [P.], seq 4148340:4198320, ack 1, win 256, options [nop,nop,TS val 3425678146 ecr 1561784502], length 49980
14:13:34.012271 IP6 sender > receiver: Flags [P.], seq 4198320:4248300, ack 1, win 256, options [nop,nop,TS val 3425678147 ecr 1561784503], length 49980
14:13:34.013272 IP6 sender > receiver: Flags [P.], seq 4248300:4298280, ack 1, win 256, options [nop,nop,TS val 3425678148 ecr 1561784504], length 49980
14:13:34.014271 IP6 sender > receiver: Flags [P.], seq 4298280:4348260, ack 1, win 256, options [nop,nop,TS val 3425678149 ecr 1561784505], length 49980
14:13:34.015272 IP6 sender > receiver: Flags [P.], seq 4348260:4398240, ack 1, win 256, options [nop,nop,TS val 3425678150 ecr 1561784506], length 49980
14:13:34.016270 IP6 sender > receiver: Flags [P.], seq 4398240:4448220, ack 1, win 256, options [nop,nop,TS val 3425678151 ecr 1561784507], length 49980
14:13:34.017269 IP6 sender > receiver: Flags [P.], seq 4448220:4498200, ack 1, win 256, options [nop,nop,TS val 3425678152 ecr 1561784508], length 49980
14:13:34.018276 IP6 sender > receiver: Flags [P.], seq 4498200:4548180, ack 1, win 256, options [nop,nop,TS val 3425678153 ecr 1561784509], length 49980
14:13:34.019259 IP6 sender > receiver: Flags [P.], seq 4548180:4598160, ack 1, win 256, options [nop,nop,TS val 3425678154 ecr 1561784510], length 49980
With 200 concurrent flows on a 100Gbit NIC, we can see a reduction
of TSO packets (and ACK packets) of about 30 %.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240418214600.1291486-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_write_xmit() calls tcp_init_tso_segs()
to set gso_size and gso_segs on the packet.
tcp_init_tso_segs() requires the stack to maintain
an up to date tcp_skb_pcount(), and this makes sense
for packets in rtx queue. Not so much for packets
still in the write queue.
In the following patch, we don't want to deal with
tcp_skb_pcount() when moving payload from 2nd
skb to 1st skb in the write queue.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240418214600.1291486-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_cwnd_test() has a special handing for the last packet in
the write queue if it is smaller than one MSS and has the FIN flag.
This is in violation of TCP RFC, and seems quite dubious.
This packet can be sent only if the current CWND is bigger
than the number of packets in flight.
Making tcp_cwnd_test() result independent of the first skb
in the write queue is needed for the last patch of the series.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240418214600.1291486-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit 1eeb504357 ("tcp/dccp: do not care about
families in inet_twsk_purge()") tcp_twsk_purge() is
no longer potentially called from a module.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
First problem is a double call to __in_dev_get_rcu(), because
the second one could return NULL.
if (__in_dev_get_rcu(dev) && __in_dev_get_rcu(dev)->ifa_list)
Second problem is a read from dev->ip6_ptr with no NULL check:
if (!list_empty(&rcu_dereference(dev->ip6_ptr)->addr_list))
Use the correct RCU API to fix these.
v2: add missing include <net/addrconf.h>
Fixes: d329ea5bd8 ("icmp: add response to RFC 8335 PROBE messages")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be able to constify instances of struct ctl_tables it is necessary to
remove ways through which non-const versions are exposed from the
sysctl core.
One of these is the ctl_table_arg member of struct ctl_table_header.
Constify this reference as a prerequisite for the full constification of
struct ctl_table instances.
No functional change.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If "udp_cmsg_send()" returned 0 (i.e. only UDP cmsg),
"connected" should not be set to 0. Otherwise it stops
the connected socket from using the cached route.
Fixes: 2e8de85763 ("udp: add gso segment cmsg")
Signed-off-by: Yick Xie <yick.xie@gmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240418170610.867084-1-yick.xie@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The UDP_ENCAP_ESPINUDP_NON_IKE mode, introduced into the Linux kernel
in 2004 [2], has remained inactive and obsolete for an extended period.
This mode was originally defined in an early version of an IETF draft
[1] from 2001. By the time it was integrated into the kernel in 2004 [2],
it had already been replaced by UDP_ENCAP_ESPINUDP [3] in later
versions of draft-ietf-ipsec-udp-encaps, particularly in version 06.
Over time, UDP_ENCAP_ESPINUDP_NON_IKE has lost its relevance, with no
known use cases.
With this commit, we remove support for UDP_ENCAP_ESPINUDP_NON_IKE,
simplifying the codebase and eliminating unnecessary complexity.
Kernel will return an error -ENOPROTOOPT if the userspace tries to set
this option.
References:
[1] https://datatracker.ietf.org/doc/html/draft-ietf-ipsec-udp-encaps-00.txt
[2] Commit that added UDP_ENCAP_ESPINUDP_NON_IKE to the Linux historic
repository.
Author: Andreas Gruenbacher <agruen@suse.de>
Date: Fri Apr 9 01:47:47 2004 -0700
[IPSEC]: Support draft-ietf-ipsec-udp-encaps-00/01, some ipec impls need it.
[3] Commit that added UDP_ENCAP_ESPINUDP to the Linux historic
repository.
Author: Derek Atkins <derek@ihtfp.com>
Date: Wed Apr 2 13:21:02 2003 -0800
[IPSEC]: Implement UDP Encapsulation framework.
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
TCP_METRICS_CMD_GET and TCP_METRICS_CMD_DEL use their
own locking (tcp_metrics_lock and RCU),
they do not need genl_mutex protection.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240416162025.1251547-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Change tcp_metrics_nl_dump() to return 0 at the end
of a dump so that NLMSG_DONE can be appended
to the current skb, saving one recvmsg() system call.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240416161112.1199265-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrew Oates reported that some macOS hosts could repeatedly
send FIN packets even if the remote peer drops them and
send back DUP ACK RWIN 0 packets.
<quoting Andrew>
20:27:16.968254 gif0 In IP macos > victim: Flags [SEW], seq 1950399762, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 501897188 ecr 0,sackOK,eol], length 0
20:27:16.968339 gif0 Out IP victim > macos: Flags [S.E], seq 2995489058, ack 1950399763, win 1448, options [mss 1460,sackOK,TS val 3829877593 ecr 501897188,nop,wscale 0], length 0
20:27:16.968833 gif0 In IP macos > victim: Flags [.], ack 1, win 2058, options [nop,nop,TS val 501897188 ecr 3829877593], length 0
20:27:16.968885 gif0 In IP macos > victim: Flags [P.], seq 1:1449, ack 1, win 2058, options [nop,nop,TS val 501897188 ecr 3829877593], length 1448
20:27:16.968896 gif0 Out IP victim > macos: Flags [.], ack 1449, win 0, options [nop,nop,TS val 3829877593 ecr 501897188], length 0
20:27:19.454593 gif0 In IP macos > victim: Flags [F.], seq 1449, ack 1, win 2058, options [nop,nop,TS val 501899674 ecr 3829877593], length 0
20:27:19.454675 gif0 Out IP victim > macos: Flags [.], ack 1449, win 0, options [nop,nop,TS val 3829880079 ecr 501899674], length 0
20:27:19.455116 gif0 In IP macos > victim: Flags [F.], seq 1449, ack 1, win 2058, options [nop,nop,TS val 501899674 ecr 3829880079], length 0
The retransmits/dup-ACKs then repeat in a tight loop.
</quoting Andrew>
RFC 9293 3.4. Sequence Numbers states :
Note that when the receive window is zero no segments should be
acceptable except ACK segments. Thus, it is be possible for a TCP to
maintain a zero receive window while transmitting data and receiving
ACKs. However, even when the receive window is zero, a TCP must
process the RST and URG fields of all incoming segments.
Even if we could consider a bare FIN.ACK packet to be an ACK in RFC terms,
the retransmits should use exponential backoff.
Accepting the FIN in linux does not add extra memory costs,
because the FIN flag will simply be merged to the tail skb in
the receive queue, and incoming packet is freed.
Reported-by: Andrew Oates <aoates@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Christoph Paasch <cpaasch@apple.com>
Cc: Vidhi Goel <vidhi_goel@apple.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
commit f0ea27e7bf ("udp: re-score reuseport groups when connected
sockets are present"). The failing tests were those that would spawn
UDP sockets per-cpu on systems that have a high number of cpus.
Unsurprisingly, it is not caused by the extra re-scoring of the reused
socket, but due to the compiler no longer inlining compute_score, once
it has the extra call site in udp4_lib_lookup2. This is augmented by
the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
We could just explicitly inline it, but compute_score() is quite a large
function, around 300b. Inlining in two sites would almost double
udp4_lib_lookup2, which is a silly thing to do just to workaround a
mitigation. Instead, this patch shuffles the code a bit to avoid the
multiple calls to compute_score. Since it is a static function used in
one spot, the compiler can safely fold it in, as it did before, without
increasing the text size.
With this patch applied I ran my original iperf3 testcases. The failing
cases all looked like this (ipv4):
iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
where $R is either 1G/10G/0 (max, unlimited). I ran 3 times each.
baseline is v6.9-rc3. harmean == harmonic mean; CV == coefficient of
variation.
ipv4:
1G 10G MAX
HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
baseline 1743852.66(0.0208) 1725933.02(0.0167) 1705203.78(0.0386)
patched 1968727.61(0.0035) 1962283.22(0.0195) 1923853.50(0.0256)
ipv6:
1G 10G MAX
HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
baseline 1729020.03(0.0028) 1691704.49(0.0243) 1692251.34(0.0083)
patched 1900422.19(0.0067) 1900968.01(0.0067) 1568532.72(0.1519)
This restores the performance we had before the change above with this
benchmark. We obviously don't expect any real impact when mitigations
are disabled, but just to be sure it also doesn't regresses:
mitigations=off ipv4:
1G 10G MAX
HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
patched 3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
Cc: Lorenz Bauer <lmb@isovalent.com>
Fixes: f0ea27e7bf ("udp: re-score reuseport groups when connected sockets are present")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RTO_ONLINK was a flag used in ->flowi4_tos that allowed to alter the
scope of an IPv4 route lookup. Setting this flag was equivalent to
specifying RT_SCOPE_LINK in ->flowi4_scope.
With commit ec20b28300 ("ipv4: Set scope explicitly in
ip_route_output()."), the last users of RTO_ONLINK have been removed.
Therefore, we can now drop the code that checked this bit and stop
modifying ->flowi4_scope in ip_route_output_key_hash().
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/57de760565cab55df7b129f523530ac6475865b2.1712754146.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When reading received messages from a socket with MSG_PEEK, we may want
to read the contents with an offset, like we can do with pread/preadv()
when reading files. Currently, it is not possible to do that.
In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
in a similar way it is done for Unix Domain sockets.
In the iperf3 log examples shown below, we can observe a throughput
improvement of 15-20 % in the direction host->namespace when using the
protocol splicer 'pasta' (https://passt.top).
This is a consistent result.
pasta(1) and passt(1) implement user-mode networking for network
namespaces (containers) and virtual machines by means of a translation
layer between Layer-2 network interface and native Layer-4 sockets
(TCP, UDP, ICMP/ICMPv6 echo).
Received, pending TCP data to the container/guest is kept in kernel
buffers until acknowledged, so the tool routinely needs to fetch new
data from socket, skipping data that was already sent.
At the moment this is implemented using a dummy buffer passed to
recvmsg(). With this change, we don't need a dummy buffer and the
related buffer copy (copy_to_user()) anymore.
passt and pasta are supported in KubeVirt and libvirt/qemu.
jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF not supported by kernel.
jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 44822
[ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.02 GBytes 8.78 Gbits/sec
[ 5] 1.00-2.00 sec 1.06 GBytes 9.08 Gbits/sec
[ 5] 2.00-3.00 sec 1.07 GBytes 9.15 Gbits/sec
[ 5] 3.00-4.00 sec 1.10 GBytes 9.46 Gbits/sec
[ 5] 4.00-5.00 sec 1.03 GBytes 8.85 Gbits/sec
[ 5] 5.00-6.00 sec 1.10 GBytes 9.44 Gbits/sec
[ 5] 6.00-7.00 sec 1.11 GBytes 9.56 Gbits/sec
[ 5] 7.00-8.00 sec 1.07 GBytes 9.20 Gbits/sec
[ 5] 8.00-9.00 sec 667 MBytes 5.59 Gbits/sec
[ 5] 9.00-10.00 sec 1.03 GBytes 8.83 Gbits/sec
[ 5] 10.00-10.04 sec 30.1 MBytes 6.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.04 sec 10.3 GBytes 8.78 Gbits/sec receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
jmaloy@freyr:~/passt#
logout
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
jmaloy@freyr:~/passt$
jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF supported by kernel.
jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 52084
[ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.32 GBytes 11.3 Gbits/sec
[ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec
[ 5] 2.00-3.00 sec 1.26 GBytes 10.8 Gbits/sec
[ 5] 3.00-4.00 sec 1.36 GBytes 11.7 Gbits/sec
[ 5] 4.00-5.00 sec 1.33 GBytes 11.4 Gbits/sec
[ 5] 5.00-6.00 sec 1.21 GBytes 10.4 Gbits/sec
[ 5] 6.00-7.00 sec 1.31 GBytes 11.2 Gbits/sec
[ 5] 7.00-8.00 sec 1.25 GBytes 10.7 Gbits/sec
[ 5] 8.00-9.00 sec 1.33 GBytes 11.5 Gbits/sec
[ 5] 9.00-10.00 sec 1.24 GBytes 10.7 Gbits/sec
[ 5] 10.00-10.04 sec 56.0 MBytes 12.1 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.04 sec 12.9 GBytes 11.0 Gbits/sec receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
logout
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
jmaloy@freyr:~/passt$
The perf record confirms this result. Below, we can observe that the
CPU spends significantly less time in the function ____sys_recvmsg()
when we have offset support.
Without offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
-p ____sys_recvmsg -x --stdio -i perf.data | head -1
46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
With offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
-p ____sys_recvmsg -x --stdio -i perf.data | head -1
28.12% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240409152805.913891-1-jmaloy@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new header, linux/skbuff_ref.h, which contains all the skb_*_ref()
helpers. Many of the consumers of skbuff.h do not actually use any of
the skb ref helpers, and we can speed up compilation a bit by minimizing
this header file.
Additionally in the later patch in the series we add page_pool support
to skb_frag_ref(), which requires some page_pool dependencies. We can
now add these dependencies to skbuff_ref.h instead of a very ubiquitous
skbuff.h
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20240410190505.1225848-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In my recent commit, I missed that do_replace() handlers
use copy_from_sockptr() (which I fixed), followed
by unsafe copy_from_sockptr_offset() calls.
In all functions, we can perform the @optlen validation
before even calling xt_alloc_table_info() with the following
check:
if ((u64)optlen < (u64)tmp.size + sizeof(tmp))
return -EINVAL;
Fixes: 0c83842df4 ("netfilter: validate user input for expected length")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/20240409120741.3538135-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I forgot 32bit arches might have 64bit alignment for u64
fields.
tcp_sock_write_txrx group does not contain pointers,
but two u64 fields. It is possible that on 32bit kernel,
a 32bit hole is before tp->tcp_clock_cache.
I will try to remember a group can be bigger on 32bit
kernels in the future.
With help from Vladimir Oltean.
Fixes: d2c3a7eb1a ("tcp: more struct tcp_sock adjustments")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404082207.HCEdQhUO-lkp@intel.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20240409140914.4105429-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The implementations of these 2 functions are almost identical. Remove
the implementation of napi_frag_unref, and make it a call into
skb_page_unref so we don't duplicate the implementation.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240408153000.2152844-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The log_martians variable is only used in an #ifdef, causing a 'make W=1'
warning with gcc:
net/ipv4/route.c: In function 'ip_rt_send_redirect':
net/ipv4/route.c:880:13: error: variable 'log_martians' set but not used [-Werror=unused-but-set-variable]
Change the #ifdef to an equivalent IS_ENABLED() to let the compiler
see where the variable is used.
Fixes: 30038fc61a ("net: ip_rt_send_redirect() optimization")
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240408074219.3030256-2-arnd@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
TCP can transform a TIMEWAIT socket into a SYN_RECV one from
a SYN packet, and the ISN of the SYNACK packet is normally
generated using TIMEWAIT tw_snd_nxt :
tcp_timewait_state_process()
...
u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
if (isn == 0)
isn++;
TCP_SKB_CB(skb)->tcp_tw_isn = isn;
return TCP_TW_SYN;
This SYN packet also bypasses normal checks against listen queue
being full or not.
tcp_conn_request()
...
__u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
...
/* TW buckets are converted to open requests without
* limitations, they conserve resources and peer is
* evidently real one.
*/
if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
if (!want_cookie)
goto drop;
}
This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.
Unfortunately this field has been accidentally cleared
after the call to tcp_timewait_state_process() returning
TCP_TW_SYN.
Using a field in TCP_SKB_CB(skb) for a temporary state
is overkill.
Switch instead to a per-cpu variable.
As a bonus, we do not have to clear tcp_tw_isn in TCP receive
fast path.
It is temporarily set then cleared only in the TCP_TW_SYN dance.
Fixes: 4ad19de877 ("net: tcp6: fix double call of tcp_v6_fill_cb()")
Fixes: eeea10b83a ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
tcp_v6_init_req() reads TCP_SKB_CB(skb)->tcp_tw_isn to find
out if the request socket is created by a SYN hitting a TIMEWAIT socket.
This has been buggy for a decade, lets directly pass the information
from tcp_conn_request().
This is a preparatory patch to make the following one easier to review.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add a "scope" parameter to ip_route_output() so that callers don't have
to override the tos parameter with the RTO_ONLINK flag if they want a
local scope.
This will allow converting flowi4_tos to dscp_t in the future, thus
allowing static analysers to flag invalid interactions between
"tos" (the DSCP bits) and ECN.
Only three users ask for local scope (bonding, arp and atm). The others
continue to use RT_SCOPE_UNIVERSE. While there, add a comment to warn
users about the limitations of ip_route_output().
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com> # infiniband
Signed-off-by: David S. Miller <davem@davemloft.net>
tp->recvmsg_inq is used from tcp recvmsg() thus should
be in tcp_sock_read_rx group.
tp->tcp_clock_cache and tp->tcp_mstamp are written
both in rx and tx paths, thus are better placed
in tcp_sock_write_txrx group.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Structures which are about to be copied to userspace shouldn't have
uninitialized fields or paddings.
memset() the whole &ip_tunnel_parm in ip_tunnel_parm_to_user() before
filling it with the kernel data. The compilers will hopefully combine
writes to it.
Fixes: 117aef12a7 ("ip_tunnel: use a separate struct to store tunnel params in the kernel")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/netdev/5f63dd25-de94-4ca3-84e6-14095953db13@moroto.mountain
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
fqdir_free_fn() is using very expensive rcu_barrier()
When one netns is dismantled, we often call fqdir_exit()
multiple times, typically lauching fqdir_free_fn() twice.
Delaying by one second fqdir_free_fn() helps to reduce
the number of rcu_barrier() calls, and lock contention
on rcu_state.barrier_mutex.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ->decrypted bit can be reused for other crypto protocols.
Remove the direct dependency on TLS, add helpers to clean up
the ifdefs leaking out everywhere.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tp->window_clamp can be read locklessly, add READ_ONCE()
and WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20240404114231.2195171-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Prior to this patch, what we can see by enabling trace_tcp_send is
only happening under two circumstances:
1) active rst mode
2) non-active rst mode and based on the full socket
That means the inconsistency occurs if we use tcpdump and trace
simultaneously to see how rst happens.
It's necessary that we should take into other cases into considerations,
say:
1) time-wait socket
2) no socket
...
By parsing the incoming skb and reversing its 4-tuple can
we know the exact 'flow' which might not exist.
Samples after applied this patch:
1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port
state=TCP_ESTABLISHED
2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port
state=UNKNOWN
Note:
1) UNKNOWN means we cannot extract the right information from skb.
2) skbaddr/skaddr could be 0
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Link: https://lore.kernel.org/r/20240401073605.37335-3-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since we have pool->p.napi (Jakub) and pool->cpuid (Lorenzo) to check
whether it's safe to use direct recycling, we can use both globally for
each page instead of relying solely on @allow_direct argument.
Let's assume that @allow_direct means "I'm sure it's local, don't waste
time rechecking this" and when it's false, try the mentioned params to
still recycle the page directly. If neither is true, we'll lose some
CPU cycles, but then it surely won't be hotpath. On the other hand,
paths where it's possible to use direct cache, but not possible to
safely set @allow_direct, will benefit from this move.
The whole propagation of @napi_safe through a dozen of skb freeing
functions can now go away, which saves us some stack space.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20240329165507.3240110-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We can change inet_csk() to propagate its argument const qualifier,
thanks to container_of_const().
We have to fix few places that had mistakes, like tcp_bound_rto().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240329144931.295800-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Historically, tunnel flags like TUNNEL_CSUM or TUNNEL_ERSPAN_OPT
have been defined as __be16. Now all of those 16 bits are occupied
and there's no more free space for new flags.
It can't be simply switched to a bigger container with no
adjustments to the values, since it's an explicit Endian storage,
and on LE systems (__be16)0x0001 equals to
(__be64)0x0001000000000000.
We could probably define new 64-bit flags depending on the
Endianness, i.e. (__be64)0x0001 on BE and (__be64)0x00010000... on
LE, but that would introduce an Endianness dependency and spawn a
ton of Sparse warnings. To mitigate them, all of those places which
were adjusted with this change would be touched anyway, so why not
define stuff properly if there's no choice.
Define IP_TUNNEL_*_BIT counterparts as a bit number instead of the
value already coded and a fistful of <16 <-> bitmap> converters and
helpers. The two flags which have a different bit position are
SIT_ISATAP_BIT and VTI_ISVTI_BIT, as they were defined not as
__cpu_to_be16(), but as (__force __be16), i.e. had different
positions on LE and BE. Now they both have strongly defined places.
Change all __be16 fields which were used to store those flags, to
IP_TUNNEL_DECLARE_FLAGS() -> DECLARE_BITMAP(__IP_TUNNEL_FLAG_NUM) ->
unsigned long[1] for now, and replace all TUNNEL_* occurrences to
their bitmap counterparts. Use the converters in the places which talk
to the userspace, hardware (NFP) or other hosts (GRE header). The rest
must explicitly use the new flags only. This must be done at once,
otherwise there will be too many conversions throughout the code in
the intermediate commits.
Finally, disable the old __be16 flags for use in the kernel code
(except for the two 'irregular' flags mentioned above), to prevent
any accidental (mis)use of them. For the userspace, nothing is
changed, only additions were made.
Most noticeable bloat-o-meter difference (.text):
vmlinux: 307/-1 (306)
gre.ko: 62/0 (62)
ip_gre.ko: 941/-217 (724) [*]
ip_tunnel.ko: 390/-900 (-510) [**]
ip_vti.ko: 138/0 (138)
ip6_gre.ko: 534/-18 (516) [*]
ip6_tunnel.ko: 118/-10 (108)
[*] gre_flags_to_tnl_flags() grew, but still is inlined
[**] ip_tunnel_find() got uninlined, hence such decrease
The average code size increase in non-extreme case is 100-200 bytes
per module, mostly due to sizeof(long) > sizeof(__be16), as
%__IP_TUNNEL_FLAG_NUM is less than %BITS_PER_LONG and the compilers
are able to expand the majority of bitmap_*() calls here into direct
operations on scalars.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlike IPv6 tunnels which use purely-kernel __ip6_tnl_parm structure
to store params inside the kernel, IPv4 tunnel code uses the same
ip_tunnel_parm which is being used to talk with the userspace.
This makes it difficult to alter or add any fields or use a
different format for whatever data.
Define struct ip_tunnel_parm_kern, a 1:1 copy of ip_tunnel_parm for
now, and use it throughout the code. Define the pieces, where the copy
user <-> kernel happens, as standalone functions, and copy the data
there field-by-field, so that the kernel-side structure could be easily
modified later on and the users wouldn't have to care about this.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking at UDP receive performance, I saw sk_wake_async()
was no longer inlined.
This matters at least on AMD Zen1-4 platforms (see SRSO)
This might be because rcu_read_lock() and rcu_read_unlock()
are no longer nops in recent kernels ?
Add sk_wake_async_rcu() variant, which must be called from
contexts already holding rcu lock.
As SOCK_FASYNC is deprecated in modern days, use unlikely()
to give a hint to the compiler.
sk_wake_async_rcu() is properly inlined from
__udp_enqueue_schedule_skb() and sock_def_readable().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240328144032.1864988-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sock_def_readable() is quite expensive (particularly
when ep_poll_callback() is in the picture).
We must call sk->sk_data_ready() when :
- receive queue was empty, or
- SO_PEEK_OFF is enabled on the socket, or
- sk->sk_data_ready is not sock_def_readable.
We still need to call sk_wake_async().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240328144032.1864988-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk->sk_rcvbuf is read locklessly twice, while other threads
could change its value.
Use a READ_ONCE() to annotate the race.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240328144032.1864988-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianguo Wu reported another bind() regression introduced by bhash2.
Calling bind() for the following 3 addresses on the same port, the
3rd one should fail but now succeeds.
1. 0.0.0.0 or ::ffff:0.0.0.0
2. [::] w/ IPV6_V6ONLY
3. IPv4 non-wildcard address or v4-mapped-v6 non-wildcard address
The first two bind() create tb2 like this:
bhash2 -> tb2(:: w/ IPV6_V6ONLY) -> tb2(0.0.0.0)
The 3rd bind() will match with the IPv6 only wildcard address bucket
in inet_bind2_bucket_match_addr_any(), however, no conflicting socket
exists in the bucket. So, inet_bhash2_conflict() will returns false,
and thus, inet_bhash2_addr_any_conflict() returns false consequently.
As a result, the 3rd bind() bypasses conflict check, which should be
done against the IPv4 wildcard address bucket.
So, in inet_bhash2_addr_any_conflict(), we must iterate over all buckets.
Note that we cannot add ipv6_only flag for inet_bind2_bucket as it
would confuse the following patetrn.
1. [::] w/ SO_REUSE{ADDR,PORT} and IPV6_V6ONLY
2. [::] w/ SO_REUSE{ADDR,PORT}
3. IPv4 non-wildcard address or v4-mapped-v6 non-wildcard address
The first bind() would create a bucket with ipv6_only flag true,
the second bind() would add the [::] socket into the same bucket,
and the third bind() could succeed based on the wrong assumption
that ipv6_only bucket would not conflict with v4(-mapped-v6) address.
Fixes: 28044fc1d4 ("net: Add a bhash2 table hashed by port and address")
Diagnosed-by: Jianguo Wu <wujianguo106@163.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240326204251.51301-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 5e07e67241 ("tcp: Use bhash2 for v4-mapped-v6 non-wildcard
address.") introduced bind() regression for v4-mapped-v6 address.
When we bind() the following two addresses on the same port, the 2nd
bind() should succeed but fails now.
1. [::] w/ IPV6_ONLY
2. ::ffff:127.0.0.1
After the chagne, v4-mapped-v6 uses bhash2 instead of bhash to
detect conflict faster, but I forgot to add a necessary change.
During the 2nd bind(), inet_bind2_bucket_match_addr_any() returns
the tb2 bucket of [::], and inet_bhash2_conflict() finally calls
inet_bind_conflict(), which returns true, meaning conflict.
inet_bhash2_addr_any_conflict
|- inet_bind2_bucket_match_addr_any <-- return [::] bucket
`- inet_bhash2_conflict
`- __inet_bhash2_conflict <-- checks IPV6_ONLY for AF_INET
| but not for v4-mapped-v6 address
`- inet_bind_conflict <-- does not check address
inet_bind_conflict() does not check socket addresses because
__inet_bhash2_conflict() is expected to do so.
However, it checks IPV6_V6ONLY attribute only against AF_INET
socket, and not for v4-mapped-v6 address.
As a result, v4-mapped-v6 address conflicts with v6-only wildcard
address.
To avoid that, let's add the missing test to use bhash2 for
v4-mapped-v6 address.
Fixes: 5e07e67241 ("tcp: Use bhash2 for v4-mapped-v6 non-wildcard address.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240326204251.51301-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TCP ehash table is often sparsely populated.
inet_twsk_purge() spends too much time calling cond_resched().
This patch can reduce time spent in inet_twsk_purge() by 20x.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240327191206.508114-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
GRO has a fundamental issue with UDP tunnel packets as it can't detect
those in a foolproof way and GRO could happen before they reach the
tunnel endpoint. Previous commits have fixed issues when UDP tunnel
packets come from a remote host, but if those packets are issued locally
they could run into checksum issues.
If the inner packet has a partial checksum the information will be lost
in the GRO logic, either in udp4/6_gro_complete or in
udp_gro_complete_segment and packets will have an invalid checksum when
leaving the host.
Prevent local UDP tunnel packets from ever being GROed at the outer UDP
level.
Due to skb->encapsulation being wrongly used in some drivers this is
actually only preventing UDP tunnel packets with a partial checksum to
be GROed (see iptunnel_handle_offloads) but those were also the packets
triggering issues so in practice this should be sufficient.
Fixes: 9fd1ff5d2a ("udp: Support UDP fraglist GRO/GSO.")
Fixes: 36707061d6 ("udp: allow forwarding of plain (non-fraglisted) UDP GRO packets")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP GRO validates checksums and in udp4/6_gro_complete fraglist packets
are converted to CHECKSUM_UNNECESSARY to avoid later checks. However
this is an issue for CHECKSUM_PARTIAL packets as they can be looped in
an egress path and then their partial checksums are not fixed.
Different issues can be observed, from invalid checksum on packets to
traces like:
gen01: hw csum failure
skb len=3008 headroom=160 headlen=1376 tailroom=0
mac=(106,14) net=(120,40) trans=160
shinfo(txflags=0 nr_frags=0 gso(size=0 type=0 segs=0))
csum(0xffff232e ip_summed=2 complete_sw=0 valid=0 level=0)
hash(0x77e3d716 sw=1 l4=1) proto=0x86dd pkttype=0 iif=12
...
Fix this by only converting CHECKSUM_NONE packets to
CHECKSUM_UNNECESSARY by reusing __skb_incr_checksum_unnecessary. All
other checksum types are kept as-is, including CHECKSUM_COMPLETE as
fraglist packets being segmented back would have their skb->csum valid.
Fixes: 9fd1ff5d2a ("udp: Support UDP fraglist GRO/GSO.")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If packets are GROed with fraglist they might be segmented later on and
continue their journey in the stack. In skb_segment_list those skbs can
be reused as-is. This is an issue as their destructor was removed in
skb_gro_receive_list but not the reference to their socket, and then
they can't be orphaned. Fix this by also removing the reference to the
socket.
For example this could be observed,
kernel BUG at include/linux/skbuff.h:3131! (skb_orphan)
RIP: 0010:ip6_rcv_core+0x11bc/0x19a0
Call Trace:
ipv6_list_rcv+0x250/0x3f0
__netif_receive_skb_list_core+0x49d/0x8f0
netif_receive_skb_list_internal+0x634/0xd40
napi_complete_done+0x1d2/0x7d0
gro_cell_poll+0x118/0x1f0
A similar construction is found in skb_gro_receive, apply the same
change there.
Fixes: 5e10da5385 ("skbuff: allow 'slow_gro' for skb carring sock reference")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When rx-udp-gro-forwarding is enabled UDP packets might be GROed when
being forwarded. If such packets might land in a tunnel this can cause
various issues and udp_gro_receive makes sure this isn't the case by
looking for a matching socket. This is performed in
udp4/6_gro_lookup_skb but only in the current netns. This is an issue
with tunneled packets when the endpoint is in another netns. In such
cases the packets will be GROed at the UDP level, which leads to various
issues later on. The same thing can happen with rx-gro-list.
We saw this with geneve packets being GROed at the UDP level. In such
case gso_size is set; later the packet goes through the geneve rx path,
the geneve header is pulled, the offset are adjusted and frag_list skbs
are not adjusted with regard to geneve. When those skbs hit
skb_fragment, it will misbehave. Different outcomes are possible
depending on what the GROed skbs look like; from corrupted packets to
kernel crashes.
One example is a BUG_ON[1] triggered in skb_segment while processing the
frag_list. Because gso_size is wrong (geneve header was pulled)
skb_segment thinks there is "geneve header size" of data in frag_list,
although it's in fact the next packet. The BUG_ON itself has nothing to
do with the issue. This is only one of the potential issues.
Looking up for a matching socket in udp_gro_receive is fragile: the
lookup could be extended to all netns (not speaking about performances)
but nothing prevents those packets from being modified in between and we
could still not find a matching socket. It's OK to keep the current
logic there as it should cover most cases but we also need to make sure
we handle tunnel packets being GROed too early.
This is done by extending the checks in udp_unexpected_gso: GSO packets
lacking the SKB_GSO_UDP_TUNNEL/_CSUM bits and landing in a tunnel must
be segmented.
[1] kernel BUG at net/core/skbuff.c:4408!
RIP: 0010:skb_segment+0xd2a/0xf70
__udp_gso_segment+0xaa/0x560
Fixes: 9fd1ff5d2a ("udp: Support UDP fraglist GRO/GSO.")
Fixes: 36707061d6 ("udp: allow forwarding of plain (non-fraglisted) UDP GRO packets")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 7aae231ac9 ("bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE")
added CONFIG_DYNAMIC_FTRACE guard because pahole was only generating
btf for ftrace-able functions. The ftrace filter had already been
removed from pahole, so the CONFIG_DYNAMIC_FTRACE guard can be
removed.
The commit 569c484f99 ("bpf: Limit static tcp-cc functions in the .BTF_ids list to x86")
has added CONFIG_X86 guard because it failed the powerpc arch which
prepended a "." to the local static function, so "cubictcp_init" becomes
".cubictcp_init". "__bpf_kfunc" has been added to kfunc
since then and it uses the __unused compiler attribute.
There is an existing
"__bpf_kfunc static u32 bpf_kfunc_call_test_static_unused_arg(u32 arg, u32 unused)"
test in bpf_testmod.c to cover the static kfunc case.
cross compile on ppc64 with CONFIG_DYNAMIC_FTRACE disabled:
> readelf -s vmlinux | grep cubictcp_
56938: c00000000144fd00 184 FUNC LOCAL DEFAULT 2 cubictcp_cwnd_event [<localentry>: 8]
56939: c00000000144fdb8 200 FUNC LOCAL DEFAULT 2 cubictcp_recalc_[...] [<localentry>: 8]
56940: c00000000144fe80 296 FUNC LOCAL DEFAULT 2 cubictcp_init [<localentry>: 8]
56941: c00000000144ffa8 228 FUNC LOCAL DEFAULT 2 cubictcp_state [<localentry>: 8]
56942: c00000000145008c 1908 FUNC LOCAL DEFAULT 2 cubictcp_cong_avoid [<localentry>: 8]
56943: c000000001450800 1644 FUNC LOCAL DEFAULT 2 cubictcp_acked [<localentry>: 8]
> bpftool btf dump file vmlinux | grep cubictcp_
[51540] FUNC 'cubictcp_acked' type_id=38137 linkage=static
[51541] FUNC 'cubictcp_cong_avoid' type_id=38122 linkage=static
[51543] FUNC 'cubictcp_cwnd_event' type_id=51542 linkage=static
[51544] FUNC 'cubictcp_init' type_id=9186 linkage=static
[51545] FUNC 'cubictcp_recalc_ssthresh' type_id=35021 linkage=static
[51547] FUNC 'cubictcp_state' type_id=38141 linkage=static
The patch removed both config guards.
Cc: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240322191433.4133280-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
ip_local_out() and other functions can pass skb->sk as function argument.
If the skb is a fragment and reassembly happens before such function call
returns, the sk must not be released.
This affects skb fragments reassembled via netfilter or similar
modules, e.g. openvswitch or ct_act.c, when run as part of tx pipeline.
Eric Dumazet made an initial analysis of this bug. Quoting Eric:
Calling ip_defrag() in output path is also implying skb_orphan(),
which is buggy because output path relies on sk not disappearing.
A relevant old patch about the issue was :
8282f27449 ("inet: frag: Always orphan skbs inside ip_defrag()")
[..]
net/ipv4/ip_output.c depends on skb->sk being set, and probably to an
inet socket, not an arbitrary one.
If we orphan the packet in ipvlan, then downstream things like FQ
packet scheduler will not work properly.
We need to change ip_defrag() to only use skb_orphan() when really
needed, ie whenever frag_list is going to be used.
Eric suggested to stash sk in fragment queue and made an initial patch.
However there is a problem with this:
If skb is refragmented again right after, ip_do_fragment() will copy
head->sk to the new fragments, and sets up destructor to sock_wfree.
IOW, we have no choice but to fix up sk_wmem accouting to reflect the
fully reassembled skb, else wmem will underflow.
This change moves the orphan down into the core, to last possible moment.
As ip_defrag_offset is aliased with sk_buff->sk member, we must move the
offset into the FRAG_CB, else skb->sk gets clobbered.
This allows to delay the orphaning long enough to learn if the skb has
to be queued or if the skb is completing the reasm queue.
In the former case, things work as before, skb is orphaned. This is
safe because skb gets queued/stolen and won't continue past reasm engine.
In the latter case, we will steal the skb->sk reference, reattach it to
the head skb, and fix up wmem accouting when inet_frag inflates truesize.
Fixes: 7026b1ddb6 ("netfilter: Pass socket pointer down through okfn().")
Diagnosed-by: Eric Dumazet <edumazet@google.com>
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yue sun <samsun1006219@gmail.com>
Reported-by: syzbot+e5167d7144a62715044c@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240326101845.30836-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmYE3H8ACgkQ1V2XiooU
IOR/2xAAiN+XhXMFiplQvR/6CEVirEriFIUUT8IR+fwIBUsFc6QAMpdNwCHo7j49
2gFxHNER8oD2IynVCvqYM/O2Ukz8S3FclMicgOb03HZ3cenCUjW1l0el8dYhb8Do
twVs2Trt217QyiwNkM0B68+5lo3wDo3uB70p2abdmcJ1tgCPncpDPR2Pl3DBPswP
kMrO1aohYBTn4SFyaVbCLzkOlS1T6Bf4yqQMcL+zgIdd9+kLkYRqHlvMiwM/vgwp
JJk7mnQx9h73y8sx9EAgaaf+63rNJ1JcDsKAhAAqJa9lJMVOPFTChaDOGp4aInvD
qYBUIqCRC/FWN7BEnq4Hj6NvLmbUPm+9YkMnE7nCfZXJVCVUyfwBFtv1FVKvD5YU
Ybi7Nf66bh9kYhy23TmhdQXEjzO9rIrJQEADz7qmGYydx27c5rwWUu8u7c0SRQ/V
l/1rmT39Dr3vyZIBguIXaeU8hSwVvtlhavEVQ73wKGdMGgdg6QUB2zUrYvb/IU74
v9PSmIoXKdcG4wwQj/ijRGgsZx556ifzehwbHhvFpCOn2THprMLnIaXC/mLKyiJb
TwBhTxyYBqGecEa214VQaxUndzwT9txZ7ExO6ZzEZEip0RERX6LoSvlvy4xo715U
ndq2s/yCSjRprmvEZSkgBwva0LEXjrFl9gormfkM+ejnDC4eAcQ=
=sQm5
-----END PGP SIGNATURE-----
Merge tag 'nf-24-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
Patch #1 reject destroy chain command to delete device hooks in netdev
family, hence, only delchain commands are allowed.
Patch #2 reject table flag update interference with netdev basechain
hook updates, this can leave hooks in inconsistent
registration/unregistration state.
Patch #3 do not unregister netdev basechain hooks if table is dormant.
Otherwise, splat with double unregistration is possible.
Patch #4 fixes Kconfig to allow to restore IP_NF_ARPTABLES,
from Kuniyuki Iwashima.
There are a more fixes still in progress on my side that need more work.
* tag 'nf-24-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: arptables: Select NETFILTER_FAMILY_ARP when building arp_tables.c
netfilter: nf_tables: skip netdev hook unregistration if table is dormant
netfilter: nf_tables: reject table flag and netdev basechain updates
netfilter: nf_tables: reject destroy command to remove basechain hooks
====================
Link: https://lore.kernel.org/r/20240328031855.2063-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We had various syzbot reports about tcp timers firing after
the corresponding netns has been dismantled.
Fortunately Josef Bacik could trigger the issue more often,
and could test a patch I wrote two years ago.
When TCP sockets are closed, we call inet_csk_clear_xmit_timers()
to 'stop' the timers.
inet_csk_clear_xmit_timers() can be called from any context,
including when socket lock is held.
This is the reason it uses sk_stop_timer(), aka del_timer().
This means that ongoing timers might finish much later.
For user sockets, this is fine because each running timer
holds a reference on the socket, and the user socket holds
a reference on the netns.
For kernel sockets, we risk that the netns is freed before
timer can complete, because kernel sockets do not hold
reference on the netns.
This patch adds inet_csk_clear_xmit_timers_sync() function
that using sk_stop_timer_sync() to make sure all timers
are terminated before the kernel socket is released.
Modules using kernel sockets close them in their netns exit()
handler.
Also add sock_not_owned_by_me() helper to get LOCKDEP
support : inet_csk_clear_xmit_timers_sync() must not be called
while socket lock is held.
It is very possible we can revert in the future commit
3a58f13a88 ("net: rds: acquire refcount on TCP sockets")
which attempted to solve the issue in rds only.
(net/smc/af_smc.c and net/mptcp/subflow.c have similar code)
We probably can remove the check_net() tests from
tcp_out_of_resources() and __tcp_close() in the future.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Closes: https://lore.kernel.org/netdev/20240314210740.GA2823176@perftesting/
Fixes: 26abe14379 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
Fixes: 8a68173691 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
Link: https://lore.kernel.org/bpf/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Josef Bacik <josef@toxicpanda.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/20240322135732.1535772-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The "*hw_stats_used" value needs to be set on the success paths to prevent
an uninitialized variable bug in the caller, nla_put_nh_group_stats().
Fixes: 5072ae00ae ("net: nexthop: Expose nexthop group HW stats to user space")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/f08ac289-d57f-4a1a-830f-cf9a0563cb9c@moroto.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current release - regressions:
- rxrpc: fix use of page_frag_alloc_align(), it changed semantics
and we added a new caller in a different subtree
- xfrm: allow UDP encapsulation only in offload modes
Current release - new code bugs:
- tcp: fix refcnt handling in __inet_hash_connect()
- Revert "net: Re-use and set mono_delivery_time bit for userspace tstamp
packets", conflicted with some expectations in BPF uAPI
Previous releases - regressions:
- ipv4: raw: fix sending packets from raw sockets via IPsec tunnels
- devlink: fix devlink's parallel command processing
- veth: do not manipulate GRO when using XDP
- esp: fix bad handling of pages from page_pool
Previous releases - always broken:
- report RCU QS for busy network kthreads (with Paul McK's blessing)
- tcp/rds: fix use-after-free on netns with kernel TCP reqsk
- virt: vmxnet3: fix missing reserved tailroom with XDP
Misc:
- couple of build fixes for Documentation
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmX8bXsACgkQMUZtbf5S
IrsfBg/+KzrEx0tB/Af57ZZGZ5PMjPy+XFDox4iFfHm338UFuGXVvZrXd7G+6YkH
ZwWeF5YDPKzwIEiZ5D3hewZPlkLH0Eg88q74chlE0gUv7t1jhuQHUdIVeFnPcLbN
t/8AcCZCJ2fENbr1iNnzZON1RW0fVOl+SDxhiSYFeFqii6FywDfqWL/h0u86H/AF
KRktgb0LzH0waH6IiefVV1NZyjnZwmQ6+UVQerTzUnQmWhV1xQKoO3MQpZuFRvr6
O+kPZMkrqnTCCy7RO1BexS5cefqc80i5Z25FLGcaHgpnYd2pDNDMMxqrhqO9Y0Pv
6u/tLgRxzVUDXWouzREIRe50Z9GJswkg78zilAhpqYiHRjd8jaBH6y+9mhGFc7F8
iVAx02WfJhlk0aynFf2qZmR7PQIb9XjtFJ7OAeJrno9UD7zAubtikGM/6m6IZfRV
TD1mze95RVnNjbHZMeg6oNLFUMJXVTobtvtqk5pTQvsNsmSYGFvkvWC5/P6ycyYt
pMx6E0PA/ZCnQAlThCOCzFa5BO+It3RJHcQJhgbOzHrlWKwmrjBKcKJcLLcxFSUt
4wwjdEcG1Bo2wdnsjwsQwJDHQW+M9TSLdLM3YVptM9jbqOMizoqr6/xSykg3H4wZ
t/dSiYSsEr06z7lvwbAjUXJ/mfszZ+JsVAFXAN7ahcM4OZb5WTQ=
=gpLl
-----END PGP SIGNATURE-----
Merge tag 'net-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from CAN, netfilter, wireguard and IPsec.
I'd like to highlight [ lowlight? - Linus ] Florian W stepping down as
a netfilter maintainer due to constant stream of bug reports. Not sure
what we can do but IIUC this is not the first such case.
Current release - regressions:
- rxrpc: fix use of page_frag_alloc_align(), it changed semantics and
we added a new caller in a different subtree
- xfrm: allow UDP encapsulation only in offload modes
Current release - new code bugs:
- tcp: fix refcnt handling in __inet_hash_connect()
- Revert "net: Re-use and set mono_delivery_time bit for userspace
tstamp packets", conflicted with some expectations in BPF uAPI
Previous releases - regressions:
- ipv4: raw: fix sending packets from raw sockets via IPsec tunnels
- devlink: fix devlink's parallel command processing
- veth: do not manipulate GRO when using XDP
- esp: fix bad handling of pages from page_pool
Previous releases - always broken:
- report RCU QS for busy network kthreads (with Paul McK's blessing)
- tcp/rds: fix use-after-free on netns with kernel TCP reqsk
- virt: vmxnet3: fix missing reserved tailroom with XDP
Misc:
- couple of build fixes for Documentation"
* tag 'net-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits)
selftests: forwarding: Fix ping failure due to short timeout
MAINTAINERS: step down as netfilter maintainer
netfilter: nf_tables: Fix a memory leak in nf_tables_updchain
net: dsa: mt7530: fix handling of all link-local frames
net: dsa: mt7530: fix link-local frames that ingress vlan filtering ports
bpf: report RCU QS in cpumap kthread
net: report RCU QS on threaded NAPI repolling
rcu: add a helper to report consolidated flavor QS
ionic: update documentation for XDP support
lib/bitmap: Fix bitmap_scatter() and bitmap_gather() kernel doc
netfilter: nf_tables: do not compare internal table flags on updates
netfilter: nft_set_pipapo: release elements in clone only from destroy path
octeontx2-af: Use separate handlers for interrupts
octeontx2-pf: Send UP messages to VF only when VF is up.
octeontx2-pf: Use default max_active works instead of one
octeontx2-pf: Wait till detach_resources msg is complete
octeontx2: Detect the mbox up or down message via register
devlink: fix port new reply cmd type
tcp: Clear req->syncookie in reqsk_alloc().
net/bnx2x: Prevent access to a freed page in page_pool
...
Since the referenced commit, the xfrm_inner_extract_output() function
uses the protocol field to determine the address family. So not setting
it for IPv4 raw sockets meant that such packets couldn't be tunneled via
IPsec anymore.
IPv6 raw sockets are not affected as they already set the protocol since
9c9c9ad5fa ("ipv6: set skb->protocol on tcp, raw and ip6_append_data
genereated skbs").
Fixes: f4796398f2 ("xfrm: Remove inner/outer modes from output path")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/c5d9a947-eb19-4164-ac99-468ea814ce20@strongswan.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This reverts commit 885c36e59f.
The patch currently broke the bpf selftest test_tc_dtime because
uapi field __sk_buff->tstamp_type depends on skb->mono_delivery_time which
does not necessarily mean mono with the original fix as the bit was re-used
for userspace timestamp as well to avoid tstamp reset in the forwarding
path. To solve this we need to keep mono_delivery_time as is and
introduce another bit called user_delivery_time and fall back to the
initial proposal of setting the user_delivery_time bit based on
sk_clockid set from userspace.
Fixes: 885c36e59f ("net: Re-use and set mono_delivery_time bit for userspace tstamp packets")
Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the skb is reorganized during esp_output (!esp->inline), the pages
coming from the original skb fragments are supposed to be released back
to the system through put_page. But if the skb fragment pages are
originating from a page_pool, calling put_page on them will trigger a
page_pool leak which will eventually result in a crash.
This leak can be easily observed when using CONFIG_DEBUG_VM and doing
ipsec + gre (non offloaded) forwarding:
BUG: Bad page state in process ksoftirqd/16 pfn:1451b6
page:00000000de2b8d32 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1451b6000 pfn:0x1451b6
flags: 0x200000000000000(node=0|zone=2)
page_type: 0xffffffff()
raw: 0200000000000000 dead000000000040 ffff88810d23c000 0000000000000000
raw: 00000001451b6000 0000000000000001 00000000ffffffff 0000000000000000
page dumped because: page_pool leak
Modules linked in: ip_gre gre mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat xt_addrtype br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core overlay zram zsmalloc fuse [last unloaded: mlx5_core]
CPU: 16 PID: 96 Comm: ksoftirqd/16 Not tainted 6.8.0-rc4+ #22
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x36/0x50
bad_page+0x70/0xf0
free_unref_page_prepare+0x27a/0x460
free_unref_page+0x38/0x120
esp_ssg_unref.isra.0+0x15f/0x200
esp_output_tail+0x66d/0x780
esp_xmit+0x2c5/0x360
validate_xmit_xfrm+0x313/0x370
? validate_xmit_skb+0x1d/0x330
validate_xmit_skb_list+0x4c/0x70
sch_direct_xmit+0x23e/0x350
__dev_queue_xmit+0x337/0xba0
? nf_hook_slow+0x3f/0xd0
ip_finish_output2+0x25e/0x580
iptunnel_xmit+0x19b/0x240
ip_tunnel_xmit+0x5fb/0xb60
ipgre_xmit+0x14d/0x280 [ip_gre]
dev_hard_start_xmit+0xc3/0x1c0
__dev_queue_xmit+0x208/0xba0
? nf_hook_slow+0x3f/0xd0
ip_finish_output2+0x1ca/0x580
ip_sublist_rcv_finish+0x32/0x40
ip_sublist_rcv+0x1b2/0x1f0
? ip_rcv_finish_core.constprop.0+0x460/0x460
ip_list_rcv+0x103/0x130
__netif_receive_skb_list_core+0x181/0x1e0
netif_receive_skb_list_internal+0x1b3/0x2c0
napi_gro_receive+0xc8/0x200
gro_cell_poll+0x52/0x90
__napi_poll+0x25/0x1a0
net_rx_action+0x28e/0x300
__do_softirq+0xc3/0x276
? sort_range+0x20/0x20
run_ksoftirqd+0x1e/0x30
smpboot_thread_fn+0xa6/0x130
kthread+0xcd/0x100
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x31/0x50
? kthread_complete_and_exit+0x20/0x20
ret_from_fork_asm+0x11/0x20
</TASK>
The suggested fix is to introduce a new wrapper (skb_page_unref) that
covers page refcounting for page_pool pages as well.
Cc: stable@vger.kernel.org
Fixes: 6a5bcd84e8 ("page_pool: Allow drivers to hint on SKB recycling")
Reported-and-tested-by: Anatoli N.Chechelnickiy <Anatoli.Chechelnickiy@m.interpipe.biz>
Reported-by: Ian Kumlien <ian.kumlien@gmail.com>
Link: https://lore.kernel.org/netdev/CAA85sZvvHtrpTQRqdaOx6gd55zPAVsqMYk_Lwh4Md5knTq7AyA@mail.gmail.com
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
heap optimizations".
- Kuan-Wei Chiu has also sped up the library sorting code in the series
"lib/sort: Optimize the number of swaps and comparisons".
- Alexey Gladkov has added the ability for code running within an IPC
namespace to alter its IPC and MQ limits. The series is "Allow to
change ipc/mq sysctls inside ipc namespace".
- Geert Uytterhoeven has contributed some dhrystone maintenance work in
the series "lib: dhry: miscellaneous cleanups".
- Ryusuke Konishi continues nilfs2 maintenance work in the series
"nilfs2: eliminate kmap and kmap_atomic calls"
"nilfs2: fix kernel bug at submit_bh_wbc()"
- Nathan Chancellor has updated our build tools requirements in the
series "Bump the minimum supported version of LLVM to 13.0.1".
- Muhammad Usama Anjum continues with the selftests maintenance work in
the series "selftests/mm: Improve run_vmtests.sh".
- Oleg Nesterov has done some maintenance work against the signal code
in the series "get_signal: minor cleanups and fix".
Plus the usual shower of singleton patches in various parts of the tree.
Please see the individual changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfMnvgAKCRDdBJ7gKXxA
jjKMAP4/Upq07D4wjkMVPb+QrkipbbLpdcgJ++q3z6rba4zhPQD+M3SFriIJk/Xh
tKVmvihFxfAhdDthseXcIf1nBjMALwY=
=8rVc
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min
heap optimizations".
- Kuan-Wei Chiu has also sped up the library sorting code in the series
"lib/sort: Optimize the number of swaps and comparisons".
- Alexey Gladkov has added the ability for code running within an IPC
namespace to alter its IPC and MQ limits. The series is "Allow to
change ipc/mq sysctls inside ipc namespace".
- Geert Uytterhoeven has contributed some dhrystone maintenance work in
the series "lib: dhry: miscellaneous cleanups".
- Ryusuke Konishi continues nilfs2 maintenance work in the series
"nilfs2: eliminate kmap and kmap_atomic calls"
"nilfs2: fix kernel bug at submit_bh_wbc()"
- Nathan Chancellor has updated our build tools requirements in the
series "Bump the minimum supported version of LLVM to 13.0.1".
- Muhammad Usama Anjum continues with the selftests maintenance work in
the series "selftests/mm: Improve run_vmtests.sh".
- Oleg Nesterov has done some maintenance work against the signal code
in the series "get_signal: minor cleanups and fix".
Plus the usual shower of singleton patches in various parts of the tree.
Please see the individual changelogs for details.
* tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
nilfs2: prevent kernel bug at submit_bh_wbc()
nilfs2: fix failure to detect DAT corruption in btree and direct mappings
ocfs2: enable ocfs2_listxattr for special files
ocfs2: remove SLAB_MEM_SPREAD flag usage
assoc_array: fix the return value in assoc_array_insert_mid_shortcut()
buildid: use kmap_local_page()
watchdog/core: remove sysctl handlers from public header
nilfs2: use div64_ul() instead of do_div()
mul_u64_u64_div_u64: increase precision by conditionally swapping a and b
kexec: copy only happens before uchunk goes to zero
get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task
get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig
get_signal: don't abuse ksig->info.si_signo and ksig->sig
const_structs.checkpatch: add device_type
Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>"
dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace()
list: leverage list_is_head() for list_entry_is_head()
nilfs2: MAINTAINERS: drop unreachable project mirror site
smp: make __smp_processor_id() 0-argument macro
fat: fix uninitialized field in nostale filehandles
...
syzkaller reported a warning of netns tracker [0] followed by KASAN
splat [1] and another ref tracker warning [1].
syzkaller could not find a repro, but in the log, the only suspicious
sequence was as follows:
18:26:22 executing program 1:
r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
...
connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0, @loopback}, 0x1c) (async)
The notable thing here is 0x4001 in connect(), which is RDS_TCP_PORT.
So, the scenario would be:
1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
rds_tcp_listen_init().
2. syz-executor connect()s to it and creates a reqsk.
3. syz-executor exit()s immediately.
4. netns is dismantled. [0]
5. reqsk timer is fired, and UAF happens while freeing reqsk. [1]
6. listener is freed after RCU grace period. [2]
Basically, reqsk assumes that the listener guarantees netns safety
until all reqsk timers are expired by holding the listener's refcount.
However, this was not the case for kernel sockets.
Commit 740ea3c4a0 ("tcp: Clean up kernel listener's reqsk in
inet_twsk_purge()") fixed this issue only for per-netns ehash.
Let's apply the same fix for the global ehash.
[0]:
ref_tracker: net notrefcnt@0000000065449cc3 has 1/1 users at
sk_alloc (./include/net/net_namespace.h:337 net/core/sock.c:2146)
inet6_create (net/ipv6/af_inet6.c:192 net/ipv6/af_inet6.c:119)
__sock_create (net/socket.c:1572)
rds_tcp_listen_init (net/rds/tcp_listen.c:279)
rds_tcp_init_net (net/rds/tcp.c:577)
ops_init (net/core/net_namespace.c:137)
setup_net (net/core/net_namespace.c:340)
copy_net_ns (net/core/net_namespace.c:497)
create_new_namespaces (kernel/nsproxy.c:110)
unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
ksys_unshare (kernel/fork.c:3429)
__x64_sys_unshare (kernel/fork.c:3496)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
...
WARNING: CPU: 0 PID: 27 at lib/ref_tracker.c:179 ref_tracker_dir_exit (lib/ref_tracker.c:179)
[1]:
BUG: KASAN: slab-use-after-free in inet_csk_reqsk_queue_drop (./include/net/inet_hashtables.h:180 net/ipv4/inet_connection_sock.c:952 net/ipv4/inet_connection_sock.c:966)
Read of size 8 at addr ffff88801b370400 by task swapper/0/0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
print_report (mm/kasan/report.c:378 mm/kasan/report.c:488)
kasan_report (mm/kasan/report.c:603)
inet_csk_reqsk_queue_drop (./include/net/inet_hashtables.h:180 net/ipv4/inet_connection_sock.c:952 net/ipv4/inet_connection_sock.c:966)
reqsk_timer_handler (net/ipv4/inet_connection_sock.c:979 net/ipv4/inet_connection_sock.c:1092)
call_timer_fn (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/timer.h:127 kernel/time/timer.c:1701)
__run_timers.part.0 (kernel/time/timer.c:1752 kernel/time/timer.c:2038)
run_timer_softirq (kernel/time/timer.c:2053)
__do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:554)
irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:644)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14))
</IRQ>
Allocated by task 258 on cpu 0 at 83.612050s:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:68)
__kasan_slab_alloc (mm/kasan/common.c:343)
kmem_cache_alloc (mm/slub.c:3813 mm/slub.c:3860 mm/slub.c:3867)
copy_net_ns (./include/linux/slab.h:701 net/core/net_namespace.c:421 net/core/net_namespace.c:480)
create_new_namespaces (kernel/nsproxy.c:110)
unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
ksys_unshare (kernel/fork.c:3429)
__x64_sys_unshare (kernel/fork.c:3496)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
Freed by task 27 on cpu 0 at 329.158864s:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:68)
kasan_save_free_info (mm/kasan/generic.c:643)
__kasan_slab_free (mm/kasan/common.c:265)
kmem_cache_free (mm/slub.c:4299 mm/slub.c:4363)
cleanup_net (net/core/net_namespace.c:456 net/core/net_namespace.c:446 net/core/net_namespace.c:639)
process_one_work (kernel/workqueue.c:2638)
worker_thread (kernel/workqueue.c:2700 kernel/workqueue.c:2787)
kthread (kernel/kthread.c:388)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_64.S:250)
The buggy address belongs to the object at ffff88801b370000
which belongs to the cache net_namespace of size 4352
The buggy address is located 1024 bytes inside of
freed 4352-byte region [ffff88801b370000, ffff88801b371100)
[2]:
WARNING: CPU: 0 PID: 95 at lib/ref_tracker.c:228 ref_tracker_free (lib/ref_tracker.c:228 (discriminator 1))
Modules linked in:
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:ref_tracker_free (lib/ref_tracker.c:228 (discriminator 1))
...
Call Trace:
<IRQ>
__sk_destruct (./include/net/net_namespace.h:353 net/core/sock.c:2204)
rcu_core (./arch/x86/include/asm/preempt.h:26 kernel/rcu/tree.c:2165 kernel/rcu/tree.c:2433)
__do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:554)
irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:644)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14))
</IRQ>
Reported-by: syzkaller <syzkaller@googlegroups.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: 467fa15356 ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240308200122.64357-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet_twsk_purge() uses rcu to find TIME_WAIT and NEW_SYN_RECV
objects to purge.
These objects use SLAB_TYPESAFE_BY_RCU semantic and need special
care. We need to use refcount_inc_not_zero(&sk->sk_refcnt).
Reuse the existing correct logic I wrote for TIME_WAIT,
because both structures have common locations for
sk_state, sk_family, and netns pointer.
If after the refcount_inc_not_zero() the object fields longer match
the keys, use sock_gen_put(sk) to release the refcount.
Then we can call inet_twsk_deschedule_put() for TIME_WAIT,
inet_csk_reqsk_queue_drop_and_put() for NEW_SYN_RECV sockets,
with BH disabled.
Then we need to restart the loop because we had drop rcu_read_lock().
Fixes: 740ea3c4a0 ("tcp: Clean up kernel listener's reqsk in inet_twsk_purge()")
Link: https://lore.kernel.org/netdev/CANn89iLvFuuihCtt9PME2uS1WJATnf5fKjDToa1WzVnRzHnPfg@mail.gmail.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240308200122.64357-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Locally generated packets can increment the new nexthop statistics from
process context, resulting in the following splat [1] due to preemption
being enabled. Fix by using get_cpu_ptr() / put_cpu_ptr() which will
which take care of disabling / enabling preemption.
BUG: using smp_processor_id() in preemptible [00000000] code: ping/949
caller is nexthop_select_path+0xcf8/0x1e30
CPU: 12 PID: 949 Comm: ping Not tainted 6.8.0-rc7-custom-gcb450f605fae #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xbd/0xe0
check_preemption_disabled+0xce/0xe0
nexthop_select_path+0xcf8/0x1e30
fib_select_multipath+0x865/0x18b0
fib_select_path+0x311/0x1160
ip_route_output_key_hash_rcu+0xe54/0x2720
ip_route_output_key_hash+0x193/0x380
ip_route_output_flow+0x25/0x130
raw_sendmsg+0xbab/0x34a0
inet_sendmsg+0xa2/0xe0
__sys_sendto+0x2ad/0x430
__x64_sys_sendto+0xe5/0x1c0
do_syscall_64+0xc5/0x1d0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
[...]
Fixes: f4676ea74b ("net: nexthop: Add nexthop group entry stats")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240311162307.545385-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Passing a maximum attribute type to nlmsg_parse() that is larger than
the size of the passed policy will result in an out-of-bounds access [1]
when the attribute type is used as an index into the policy array.
Fix by setting the maximum attribute type according to the policy size,
as is already done for RTM_NEWNEXTHOP messages. Add a test case that
triggers the bug.
No regressions in fib nexthops tests:
# ./fib_nexthops.sh
[...]
Tests passed: 236
Tests failed: 0
[1]
BUG: KASAN: global-out-of-bounds in __nla_validate_parse+0x1e53/0x2940
Read of size 1 at addr ffffffff99ab4d20 by task ip/610
CPU: 3 PID: 610 Comm: ip Not tainted 6.8.0-rc7-custom-gd435d6e3e161 #9
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x8f/0xe0
print_report+0xcf/0x670
kasan_report+0xd8/0x110
__nla_validate_parse+0x1e53/0x2940
__nla_parse+0x40/0x50
rtm_del_nexthop+0x1bd/0x400
rtnetlink_rcv_msg+0x3cc/0xf20
netlink_rcv_skb+0x170/0x440
netlink_unicast+0x540/0x820
netlink_sendmsg+0x8d3/0xdb0
____sys_sendmsg+0x31f/0xa60
___sys_sendmsg+0x13a/0x1e0
__sys_sendmsg+0x11c/0x1f0
do_syscall_64+0xc5/0x1d0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
[...]
The buggy address belongs to the variable:
rtm_nh_policy_del+0x20/0x40
Fixes: 2118f9390d ("net: nexthop: Adjust netlink policy parsing for a new attribute")
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89i+UNcG0PJMW5X7gOMunF38ryMh=L1aeZUKH3kL4UdUqag@mail.gmail.com/
Reported-by: syzbot+65bb09a7208ce3d4a633@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/00000000000088981b06133bc07b@google.com/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240311162307.545385-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The attribute is parsed in __nh_valid_dump_req() which is called by the
dump handlers of RTM_GETNEXTHOP and RTM_GETNEXTHOPBUCKET although it is
only used by the former and rejected by the policy of the latter.
Move the parsing to nh_valid_dump_req() which is only called by the dump
handler of RTM_GETNEXTHOP.
This is a preparation for a subsequent patch.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240311162307.545385-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The attribute is parsed into 'op_flags' in nh_valid_get_del_req() which
is called from the handlers of three message types: RTM_DELNEXTHOP,
RTM_GETNEXTHOPBUCKET and RTM_GETNEXTHOP. The attribute is only used by
the latter and rejected by the policies of the other two.
Pass 'op_flags' as NULL from the handlers of the other two and only
parse the attribute when the argument is not NULL.
This is a preparation for a subsequent patch.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240311162307.545385-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmXvm7IACgkQ6rmadz2v
bTqdMA//VMHNHVLb4oROoXyQD9fw2mCmIUEKzP88RXfqcxsfEX7HF+k8B5ZTk0ro
CHXTAnc79+Qqg0j24bkQKxup/fKBQVw9D+Ia4b3ytlm1I2MtyU/16xNEzVhAPU2D
iKk6mVBsEdCbt/GjpWORy/VVnZlZpC7BOpZLxsbbxgXOndnCegyjXzSnLGJGxdvi
zkrQTn2SrFzLi6aNpVLqrv6Nks6HJusfCKsIrtlbkQ85dulasHOtwK9s6GF60nte
aaho+MPx3L+lWEgapsm8rR779pHaYIB/GbZUgEPxE/xUJ/V8BzDgFNLMzEiIBRMN
a0zZam11BkBzCfcO9gkvDRByaei/dZz2jdqfU4GlHklFj1WFfz8Q7fRLEPINksvj
WXLgJADGY5mtGbjG21FScThxzj+Ruqwx0a13ddlyI/W+P3y5yzSWsLwJG5F9p0oU
6nlkJ4U8yg+9E1ie5ae0TibqvRJzXPjfOERZGwYDSVvfQGzv1z+DGSOPMmgNcWYM
dIaO+A/+NS3zdbk8+1PP2SBbhHPk6kWyCUByWc7wMzCPTiwriFGY/DD2sN+Fsufo
zorzfikUQOlTfzzD5jbmT49U8hUQUf6QIWsu7BijSiHaaC7am4S8QB2O6ibJMqdv
yNiwvuX+ThgVIY3QKrLLqL0KPGeKMR5mtfq6rrwSpfp/b4g27FE=
=eFgA
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2024-03-11
We've added 59 non-merge commits during the last 9 day(s) which contain
a total of 88 files changed, 4181 insertions(+), 590 deletions(-).
The main changes are:
1) Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce
VM_SPARSE kind and vm_area_[un]map_pages to be used in bpf_arena,
from Alexei.
2) Introduce bpf_arena which is sparse shared memory region between bpf
program and user space where structures inside the arena can have
pointers to other areas of the arena, and pointers work seamlessly for
both user-space programs and bpf programs, from Alexei and Andrii.
3) Introduce may_goto instruction that is a contract between the verifier
and the program. The verifier allows the program to loop assuming it's
behaving well, but reserves the right to terminate it, from Alexei.
4) Use IETF format for field definitions in the BPF standard
document, from Dave.
5) Extend struct_ops libbpf APIs to allow specify version suffixes for
stuct_ops map types, share the same BPF program between several map
definitions, and other improvements, from Eduard.
6) Enable struct_ops support for more than one page in trampolines,
from Kui-Feng.
7) Support kCFI + BPF on riscv64, from Puranjay.
8) Use bpf_prog_pack for arm64 bpf trampoline, from Puranjay.
9) Fix roundup_pow_of_two undefined behavior on 32-bit archs, from Toke.
====================
Link: https://lore.kernel.org/r/20240312003646.8692-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When sending the notifications to collect NH statistics for resilient
groups, the driver will need to know the nexthop IDs in individual buckets
to look up the right counter. To that end, move the nexthop ID from struct
nh_notifier_grp_entry_info to nh_notifier_single_info.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/8f964cd50b1a56d3606ce7ab4c50354ae019c43b.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE notifier currently keeps the group
ID unset. That makes it impossible to look up the group for which the
notifier is intended. This is not an issue at the moment, because the only
client is netdevsim, and that just so that it veto replacements, which is a
static property not tied to a particular group. But for any practical use,
the ID is necessary. Set it.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/025fef095dcfb408042568bb5439da014d47239e.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When reading wmem[0], it could be changed concurrently without
READ_ONCE() protection. So add one annotation here.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commits ca065d0cf8 ("udp: no longer use SLAB_DESTROY_BY_RCU")
and 7ae215d23c ("bpf: Don't refcount LISTEN sockets in sk_assign()")
UDP early demux no longer need to grab a refcount on the UDP socket.
This save two atomic operations per incoming packet for connected
sockets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Joe Stringer <joe@wand.net.nz>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.
To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'olr' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.
To fix the logic, check 'olr' as read from 'optlen',
where the types of relevant variables are (signed) int.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.
To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no point cloning an skb and having to free the clone
if the receive queue of the raw socket is full.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240307163020.2524409-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The only error that can happen during a nexthop dump is insufficient
space in the skb caring the netlink messages (EMSGSIZE). If this happens
and some messages were already filled in, the nexthop code returns the
skb length to signal the netlink core that more objects need to be
dumped.
After commit b5a899154a ("netlink: handle EMSGSIZE errors in the
core") there is no need to handle this error in the nexthop code as it
is now handled in the core.
Simplify the code and simply return the error to the core.
No regressions in nexthop tests:
# ./fib_nexthops.sh
Tests passed: 234
Tests failed: 0
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240307154727.3555462-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add netlink support for reading NH group hardware stats.
Stats collection is done through a new notifier,
NEXTHOP_EVENT_HW_STATS_REPORT_DELTA. Drivers that implement HW counters for
a given NH group are thereby asked to collect the stats and report back to
core by calling nh_grp_hw_stats_report_delta(). This is similar to what
netdevice L3 stats do.
Besides exposing number of packets that passed in the HW datapath, also
include information on whether any driver actually realizes the counters.
The core can tell based on whether it got any _report_delta() reports from
the drivers. This allows enabling the statistics at the group at any time,
with drivers opting into supporting them. This is also in line with what
netdevice L3 stats are doing.
So as not to waste time and space, tie the collection and reporting of HW
stats with a new op flag, NHA_OP_FLAG_DUMP_HW_STATS.
Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Kees Cook <keescook@chromium.org> # For the __counted_by bits
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add netlink support for enabling collection of HW statistics on nexthop
groups.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add hw_stats field to several notifier structures to communicate to the
drivers that HW statistics should be configured for nexthops within a given
group.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add netlink support for reading NH group stats.
This data is only for statistics of the traffic in the SW datapath. HW
nexthop group statistics will be added in the following patches.
Emission of the stats is keyed to a new op_stats flag to avoid cluttering
the netlink message with stats if the user doesn't need them:
NHA_OP_FLAG_DUMP_STATS.
Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add nexthop group entry stats to count the number of packets forwarded
via each nexthop in the group. The stats will be exposed to user space
for better data path observability in the next patch.
The per-CPU stats pointer is placed at the beginning of 'struct
nh_grp_entry', so that all the fields accessed for the data path reside
on the same cache line:
struct nh_grp_entry {
struct nexthop * nh; /* 0 8 */
struct nh_grp_entry_stats * stats; /* 8 8 */
u8 weight; /* 16 1 */
/* XXX 7 bytes hole, try to pack */
union {
struct {
atomic_t upper_bound; /* 24 4 */
} hthr; /* 24 4 */
struct {
struct list_head uw_nh_entry; /* 24 16 */
u16 count_buckets; /* 40 2 */
u16 wants_buckets; /* 42 2 */
} res; /* 24 24 */
}; /* 24 24 */
struct list_head nh_list; /* 48 16 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct nexthop * nh_parent; /* 64 8 */
/* size: 72, cachelines: 2, members: 6 */
/* sum members: 65, holes: 1, sum holes: 7 */
/* last cacheline: 8 bytes */
};
Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to add per-nexthop statistics, but still not increase netlink
message size for consumers that do not care about them, there needs to be a
toggle through which the user indicates their desire to get the statistics.
To that end, add a new attribute, NHA_OP_FLAGS. The idea is to be able to
use the attribute for carrying of arbitrary operation-specific flags, i.e.
not make it specific for get / dump.
Add the new attribute to get and dump policies, but do not actually allow
any flags yet -- those will come later as the flags themselves are defined.
Add the necessary parsing code.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A following patch will introduce a new attribute, op-specific flags to
adjust the behavior of an operation. Different operations will recognize
different flags.
- To make the differentiation possible, stop sharing the policies for get
and del operations.
- To allow querying for presence of the attribute, have all the attribute
arrays sized to NHA_MAX, regardless of what is permitted by policy, and
pass the corresponding value to nlmsg_parse() as well.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move RPS related structures and helpers from include/linux/netdevice.h
and include/net/sock.h to a new include file.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
"struct net_protocol" has a 32bit hole in 32bit arches.
Use it to store the 32bit secret used by UDP and TCP,
to increase cache locality in rx path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-15-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These structures are read in rx path, move them to net_hotdata
for better cache locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-14-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These structures are used in GRO and GSO paths.
Move them to net_hodata for better cache locality.
v2: udpv6_offload definition depends on CONFIG_INET=y
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-12-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These are used in TCP fast paths.
Move them into net_hotdata for better cache locality.
v2: tcpv6_offload definition depends on CONFIG_INET
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These structures are used in GRO and GSO paths.
v2: ipv6_packet_offload definition depends on CONFIG_INET
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit b5a899154a ("netlink: handle EMSGSIZE errors
in the core"), we can remove some code that was not 100 % correct
anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306102426.245689-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Found with git grep 'MODULE_AUTHOR(".*([^)]*@'
Fixed with
sed -i '/MODULE_AUTHOR(".*([^)]*@/{s/ (/ </g;s/)"/>"/;s/)and/> and/}' \
$(git grep -l 'MODULE_AUTHOR(".*([^)]*@')
Also:
in drivers/media/usb/siano/smsusb.c normalise ", INC" to ", Inc";
this is what every other MODULE_AUTHOR for this company says,
and it's what the header says
in drivers/sbus/char/openprom.c normalise a double-spaced separator;
this is clearly copied from the copyright header,
where the names are aligned on consecutive lines thusly:
* Linux/SPARC PROM Configuration Driver
* Copyright (C) 1996 Thomas K. Dyas (tdyas@noc.rutgers.edu)
* Copyright (C) 1996 Eddie C. Dost (ecd@skynet.be)
but the authorship branding is single-line
Link: https://lkml.kernel.org/r/mk3geln4azm5binjjlfsgjepow4o73domjv6ajybws3tz22vb3@tarta.nabijaczleweli.xyz
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently getsockopt does not support IP_ROUTER_ALERT and
IPV6_ROUTER_ALERT, and we are unable to get the values of these two
socket options through getsockopt.
This patch adds getsockopt support for IP_ROUTER_ALERT and
IPV6_ROUTER_ALERT.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing for places where zero-sized destinations were still showing
up in the kernel, sock_copy() and inet_reqsk_clone() were found, which
are using very specific memcpy() offsets for both avoiding a portion of
struct sock, and copying beyond the end of it (since struct sock is really
just a common header before the protocol-specific allocation). Instead
of trying to unravel this historical lack of container_of(), just switch
to unsafe_memcpy(), since that's effectively what was happening already
(memcpy() wasn't checking 0-sized destinations while the code base was
being converted away from fake flexible arrays).
Avoid the following false positive warning with future changes to
CONFIG_FORTIFY_SOURCE:
memcpy: detected field-spanning write (size 3068) of destination "&nsk->__sk_common.skc_dontcopy_end" at net/core/sock.c:2057 (size 0)
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240304212928.make.772-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bridge driver today has no support to forward the userspace timestamp
packets and ends up resetting the timestamp. ETF qdisc checks the
packet coming from userspace and encounters to be 0 thereby dropping
time sensitive packets. These changes will allow userspace timestamps
packets to be forwarded from the bridge to NIC drivers.
Setting the same bit (mono_delivery_time) to avoid dropping of
userspace tstamp packets in the forwarding path.
Existing functionality of mono_delivery_time remains unaltered here,
instead just extended with userspace tstamp support for bridge
forwarding path.
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240301201348.2815102-1-quic_abchauha@quicinc.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In tcp_gro_complete() :
Moving the skb->inner_transport_header setting
allows the compiler to reuse the previously loaded value
of skb->transport_header.
Caching skb_shinfo() avoids duplications as well.
In tcp4_gro_complete(), doing a single change on
skb_shinfo(skb)->gso_type also generates better code.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
skb_gro_header_hard() is renamed to skb_gro_may_pull() to match
the convention used by common helpers like pskb_may_pull().
This means the condition is inverted:
if (skb_gro_header_hard(skb, hlen))
slow_path();
becomes:
if (!skb_gro_may_pull(skb, hlen))
slow_path();
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Stephen Rothwell and kernel test robot reported that some arches
(parisc, hexagon) and/or compilers would not like blamed commit.
Lets make sure tcp_sock_write_rx group does not start with a hole.
While we are at it, correct tcp_sock_write_tx CACHELINE_ASSERT_GROUP_SIZE()
since after the blamed commit, we went to 105 bytes.
Fixes: 9912362205 ("tcp: remove some holes in struct tcp_sock")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/netdev/20240301121108.5d39e4f9@canb.auug.org.au/
Closes: https://lore.kernel.org/oe-kbuild-all/202403011451.csPYOS3C-lkp@intel.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Link: https://lore.kernel.org/r/20240301171945.2958176-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Perform all validations when updating values of struct_ops maps. Doing
validation in st_ops->reg() and st_ops->update() is not necessary anymore.
However, tcp_register_congestion_control() has been called in various
places. It still needs to do validations.
Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240224223418.526631-2-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This is a cleanup patch, making code a bit more concise.
1) Use skb_network_offset(skb) in place of
(skb_network_header(skb) - skb->data)
2) Use -skb_network_offset(skb) in place of
(skb->data - skb_network_header(skb))
3) Use skb_transport_offset(skb) in place of
(skb_transport_header(skb) - skb->data)
4) Use skb_inner_transport_offset(skb) in place of
(skb_inner_transport_header(skb) - skb->data)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZeEKVAAKCRDbK58LschI
g7oYAQD5Jlv4fIVTvxvfZrTTZ2tU+OsPa75mc8SDKwpash3YygEA8kvESy8+t6pg
D6QmSf1DIZdFoSp/bV+pfkNWMeR8gwg=
=mTAj
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-02-29
We've added 119 non-merge commits during the last 32 day(s) which contain
a total of 150 files changed, 3589 insertions(+), 995 deletions(-).
The main changes are:
1) Extend the BPF verifier to enable static subprog calls in spin lock
critical sections, from Kumar Kartikeya Dwivedi.
2) Fix confusing and incorrect inference of PTR_TO_CTX argument type
in BPF global subprogs, from Andrii Nakryiko.
3) Larger batch of riscv BPF JIT improvements and enabling inlining
of the bpf_kptr_xchg() for RV64, from Pu Lehui.
4) Allow skeleton users to change the values of the fields in struct_ops
maps at runtime, from Kui-Feng Lee.
5) Extend the verifier's capabilities of tracking scalars when they
are spilled to stack, especially when the spill or fill is narrowing,
from Maxim Mikityanskiy & Eduard Zingerman.
6) Various BPF selftest improvements to fix errors under gcc BPF backend,
from Jose E. Marchesi.
7) Avoid module loading failure when the module trying to register
a struct_ops has its BTF section stripped, from Geliang Tang.
8) Annotate all kfuncs in .BTF_ids section which eventually allows
for automatic kfunc prototype generation from bpftool, from Daniel Xu.
9) Several updates to the instruction-set.rst IETF standardization
document, from Dave Thaler.
10) Shrink the size of struct bpf_map resp. bpf_array,
from Alexei Starovoitov.
11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer,
from Benjamin Tissoires.
12) Fix bpftool to be more portable to musl libc by using POSIX's
basename(), from Arnaldo Carvalho de Melo.
13) Add libbpf support to gcc in CORE macro definitions,
from Cupertino Miranda.
14) Remove a duplicate type check in perf_event_bpf_event,
from Florian Lehner.
15) Fix bpf_spin_{un,}lock BPF helpers to actually annotate them
with notrace correctly, from Yonghong Song.
16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible
array to fix build warnings, from Kees Cook.
17) Fix resolve_btfids cross-compilation to non host-native endianness,
from Viktor Malik.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
selftests/bpf: Test if shadow types work correctly.
bpftool: Add an example for struct_ops map and shadow type.
bpftool: Generated shadow variables for struct_ops maps.
libbpf: Convert st_ops->data to shadow type.
libbpf: Set btf_value_type_id of struct bpf_map for struct_ops.
bpf: Replace bpf_lpm_trie_key 0-length array with flexible array
bpf, arm64: use bpf_prog_pack for memory management
arm64: patching: implement text_poke API
bpf, arm64: support exceptions
arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT
bpf: add is_async_callback_calling_insn() helper
bpf: introduce in_sleepable() helper
bpf: allow more maps in sleepable bpf programs
selftests/bpf: Test case for lacking CFI stub functions.
bpf: Check cfi_stubs before registering a struct_ops type.
bpf: Clarify batch lookup/lookup_and_delete semantics
bpf, docs: specify which BPF_ABS and BPF_IND fields were zero
bpf, docs: Fix typos in instruction-set.rst
selftests/bpf: update tcp_custom_syncookie to use scalar packet offset
bpf: Shrink size of struct bpf_map/bpf_array.
...
====================
Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1) inet_dump_ifaddr() can can run under RCU protection
instead of RTNL.
2) properly return 0 at the end of a dump, avoiding an
an extra recvmsg() system call.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the following patch, inet_base_seq() will no longer be called
with RTNL held.
Add READ_ONCE()/WRITE_ONCE() annotations in dev_base_seq_inc()
and inet_base_seq().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ifa->ifa_flags can be read locklessly.
Add appropriate READ_ONCE()/WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ifa->ifa_preferred_lft can be read locklessly.
Add appropriate READ_ONCE()/WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ifa->ifa_valid_lft can be read locklessly.
Add appropriate READ_ONCE()/WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ifa->ifa_tstamp can be read locklessly.
Add appropriate READ_ONCE()/WRITE_ONCE() annotations.
Do the same for ifa->ifa_cstamp to prepare upcoming changes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The input parameter 'level' in do_raw_set/getsockopt() is not used.
Therefore, remove it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240228072505.640550-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Since commit 446fda4f26 ("[NetLabel]: CIPSOv4 engine"), *bitmap_walk
function only returns -1. Nearly 18 years have passed, -2 scenes never
come up, so there's no need to consider it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/20240227093604.3574241-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1) inet_netconf_dump_devconf() can run under RCU protection
instead of RTNL.
2) properly return 0 at the end of a dump, avoiding an
an extra recvmsg() system call.
3) Do not use inet_base_seq() anymore, for_each_netdev_dump()
has nice properties. Restarting a GETDEVCONF dump if a device has
been added/removed or if net->ipv4.dev_addr_genid has changed is moot.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227092411.2315725-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
"ip -4 netconf show dev XXXX" no longer acquires RTNL.
Return -ENODEV instead of -EINVAL if no netdev or idev can be found.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227092411.2315725-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add READ_ONCE() in ipv4_devconf_get() and corresponding
WRITE_ONCE() in ipv4_devconf_set()
Add IPV4_DEVCONF_RO() and IPV4_DEVCONF_ALL_RO() macros,
and use them when reading devconf fields.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227092411.2315725-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It's time to let it work right now. We've already prepared for this:)
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update three callers including both ipv4 and ipv6 and let the dropreason
mechanism work in reality.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this patch, I equipped this function with more dropreasons, but
it still doesn't work yet, which I will do later.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does two things:
1) add two more new reasons
2) only change the return value(1) to various drop reason values
for the future use
For now, we still cannot trace those two reasons. We'll implement the full
function in the subsequent patch in this series.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now it's time to use the prepared definitions to refine this part.
Four reasons used might enough for now, I think.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only move the skb drop from tcp_v4_do_rcv() to cookie_v4_check() itself,
no other changes made. It can help us refine the specific drop reasons
later.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer hold RTNL while calling inet_dump_fib().
Also change return value for a completed dump:
Returning 0 instead of skb->len allows NLMSG_DONE
to be appended to the skb. User space does not have
to call us again to get a standalone NLMSG_DONE marker.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new field into struct fib_dump_filter, to let callers
tell if they use RTNL locking or RCU.
This is used in the following patch, when inet_dump_fib()
no longer holds RTNL.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller triggered following kasan splat:
BUG: KASAN: use-after-free in __skb_flow_dissect+0x19d1/0x7a50 net/core/flow_dissector.c:1170
Read of size 1 at addr ffff88812fb4000e by task syz-executor183/5191
[..]
kasan_report+0xda/0x110 mm/kasan/report.c:588
__skb_flow_dissect+0x19d1/0x7a50 net/core/flow_dissector.c:1170
skb_flow_dissect_flow_keys include/linux/skbuff.h:1514 [inline]
___skb_get_hash net/core/flow_dissector.c:1791 [inline]
__skb_get_hash+0xc7/0x540 net/core/flow_dissector.c:1856
skb_get_hash include/linux/skbuff.h:1556 [inline]
ip_tunnel_xmit+0x1855/0x33c0 net/ipv4/ip_tunnel.c:748
ipip_tunnel_xmit+0x3cc/0x4e0 net/ipv4/ipip.c:308
__netdev_start_xmit include/linux/netdevice.h:4940 [inline]
netdev_start_xmit include/linux/netdevice.h:4954 [inline]
xmit_one net/core/dev.c:3548 [inline]
dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3564
__dev_queue_xmit+0x7c1/0x3d60 net/core/dev.c:4349
dev_queue_xmit include/linux/netdevice.h:3134 [inline]
neigh_connected_output+0x42c/0x5d0 net/core/neighbour.c:1592
...
ip_finish_output2+0x833/0x2550 net/ipv4/ip_output.c:235
ip_finish_output+0x31/0x310 net/ipv4/ip_output.c:323
..
iptunnel_xmit+0x5b4/0x9b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1dbc/0x33c0 net/ipv4/ip_tunnel.c:831
ipgre_xmit+0x4a1/0x980 net/ipv4/ip_gre.c:665
__netdev_start_xmit include/linux/netdevice.h:4940 [inline]
netdev_start_xmit include/linux/netdevice.h:4954 [inline]
xmit_one net/core/dev.c:3548 [inline]
dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3564
...
The splat occurs because skb->data points past skb->head allocated area.
This is because neigh layer does:
__skb_pull(skb, skb_network_offset(skb));
... but skb_network_offset() returns a negative offset and __skb_pull()
arg is unsigned. IOW, we skb->data gets "adjusted" by a huge value.
The negative value is returned because skb->head and skb->data distance is
more than 64k and skb->network_header (u16) has wrapped around.
The bug is in the ip_tunnel infrastructure, which can cause
dev->needed_headroom to increment ad infinitum.
The syzkaller reproducer consists of packets getting routed via a gre
tunnel, and route of gre encapsulated packets pointing at another (ipip)
tunnel. The ipip encapsulation finds gre0 as next output device.
This results in the following pattern:
1). First packet is to be sent out via gre0.
Route lookup found an output device, ipip0.
2).
ip_tunnel_xmit for gre0 bumps gre0->needed_headroom based on the future
output device, rt.dev->needed_headroom (ipip0).
3).
ip output / start_xmit moves skb on to ipip0. which runs the same
code path again (xmit recursion).
4).
Routing step for the post-gre0-encap packet finds gre0 as output device
to use for ipip0 encapsulated packet.
tunl0->needed_headroom is then incremented based on the (already bumped)
gre0 device headroom.
This repeats for every future packet:
gre0->needed_headroom gets inflated because previous packets' ipip0 step
incremented rt->dev (gre0) headroom, and ipip0 incremented because gre0
needed_headroom was increased.
For each subsequent packet, gre/ipip0->needed_headroom grows until
post-expand-head reallocations result in a skb->head/data distance of
more than 64k.
Once that happens, skb->network_header (u16) wraps around when
pskb_expand_head tries to make sure that skb_network_offset() is unchanged
after the headroom expansion/reallocation.
After this skb_network_offset(skb) returns a different (and negative)
result post headroom expansion.
The next trip to neigh layer (or anything else that would __skb_pull the
network header) makes skb->data point to a memory location outside
skb->head area.
v2: Cap the needed_headroom update to an arbitarily chosen upperlimit to
prevent perpetual increase instead of dropping the headroom increment
completely.
Reported-and-tested-by: syzbot+bfde3bef047a81b8fde6@syzkaller.appspotmail.com
Closes: https://groups.google.com/g/syzkaller-bugs/c/fL9G6GtWskY/m/VKk_PR5FBAAJ
Fixes: 243aad830e ("ip_gre: include route header_len in max_headroom calculation")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240220135606.4939-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmXV2OYNHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2AHCRD/9sHoOd4QCVVgcDr3SjpaVWikM0Zdkge65At/uY
bFENWgcDsSfsH7kAQm+nwzseT+QtTk9OOv9wqWzdEYROD7sqjVK2Zv/CUs24odGj
7Wj35OLYLgUIEMlHF/G9kOuWqW61URXwXcHvoFWkew1WweAVDqi648osLWUP9qkL
IFJ5729/1upq9XJc+pMxIy2Oe2zhMc4XNHsy1OCOg4fUQtDM81jgoJz0137ohCIh
PW4aaSno8ZeRuFe1RKfya5+suv3WgMui/fOBmpnnhjWVxHRJvYZ926wsy/jC7xRJ
E7/TdmymbzijRBEHh+IxQYZkE55XXc0E1Lj1ic653AzUWJ3tQRfD+HWg+GYj/WCu
sWy1e7eRJIjYVbeB5m6ao3g47Zq1XIRXo7E2Rvt3E2beM6t9aMIMuuajBHAOEV2O
pCfG4zBlEYw1SuuuoqzcXTVLKDf6WZjx1xtUAJCTks8JFTjPEwPwOQhGCv1cc/BC
qox7MejeDH/L+ZreeTYnWlQr1GGokNgrmpdDx0G8GBBRUDPoP8D4GTxvNEz44XOO
SfL2yl5v82GBBmsFHzC2J8BGN8KC4JyzDGupU+bcdMWCs8tSvMK0KVeankRvpdBl
x4VLmdoNo6zvtOYlPOxdphhsd6xA0dFiLMgSr9f5WsIgepaC+Umxp59IfCEH/bfl
1Kcg9g==
=GYgG
-----END PGP SIGNATURE-----
Merge tag 'nf-next-24-02-21' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter updates for net-next
1. Prefer KMEM_CACHE() macro to create kmem caches, from Kunwu Chan.
Patches 2 and 3 consolidate nf_log NULL checks and introduces
extra boundary checks on family and type to make it clear that no out
of bounds access will happen. No in-tree user currently passes such
values, but thats not clear from looking at the function.
From Pablo Neira Ayuso.
Patch 4, also from Pablo, gets rid of unneeded conditional in
nft_osf init function.
Patch 5, from myself, fixes erroneous Kconfig dependencies that
came in an earlier net-next pull request. This should get rid
of the xtables related build failure reports.
Patches 6 to 10 are an update to nftables' concatenated-ranges
set type to speed up element insertions. This series also
compacts a few data structures and cleans up a few oddities such
as reliance on ZERO_SIZE_PTR when asking to allocate a set with
no elements. From myself.
Patches 11 moves the nf_reinject function from the netfilter core
(vmlinux) into the nfnetlink_queue backend, the only location where
this is called from. Also from myself.
Patch 12, from Kees Cook, switches xtables' compat layer to use
unsafe_memcpy because xt_entry_target cannot easily get converted
to a real flexible array (its UAPI and used inside other structs).
* tag 'nf-next-24-02-21' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: x_tables: Use unsafe_memcpy() for 0-sized destination
netfilter: move nf_reinject into nfnetlink_queue modules
netfilter: nft_set_pipapo: use GFP_KERNEL for insertions
netfilter: nft_set_pipapo: speed up bulk element insertions
netfilter: nft_set_pipapo: shrink data structures
netfilter: nft_set_pipapo: do not rely on ZERO_SIZE_PTR
netfilter: nft_set_pipapo: constify lookup fn args where possible
netfilter: xtables: fix up kconfig dependencies
netfilter: nft_osf: simplify init path
netfilter: nf_log: validate nf_logger_find_get()
netfilter: nf_log: consolidate check for NULL logger in lookup function
netfilter: expect: Simplify the allocation of slab caches in nf_conntrack_expect_init
====================
Link: https://lore.kernel.org/r/20240221112637.5396-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We want to re-organize the struct sock layout. The sk_peek_off
field location is problematic, as most protocols want it in the
RX read area, while UDP wants it on a cacheline different from
sk_receive_queue.
Create a local (inside udp_sock) copy of the 'peek offset is enabled'
flag and place it inside the same cacheline of reader_queue.
Check such flag before reading sk_peek_off. This will save potential
false sharing and cache misses in the fast-path.
Tested under UDP flood with small packets. The struct sock layout
update causes a 4% performance drop, and this patch restores completely
the original tput.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/67ab679c15fbf49fa05b3ffe05d91c47ab84f147.1708426665.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
And change cache name from 'ip_dst_cache' to 'rtable'.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
And change cache name from 'ip_mrt_cache' to 'mfc_cache'.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap reports arptables build failure:
arp_tables.c:(.text+0x20): undefined reference to `xt_find_table'
... because recent change removed a 'select' on the xtables core.
Add a "depends" clause on arptables to resolve this.
Kernel test robot reports another build breakage:
iptable_nat.c:(.text+0x8): undefined reference to `ipt_unregister_table_exit'
... because of a typo, the nat table selected ip6tables.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Closes: https://lore.kernel.org/netfilter-devel/d0dfbaef-046a-4c42-9daa-53636664bf6d@infradead.org/
Fixes: a9525c7f62 ("netfilter: xtables: allow xtables-nft only builds")
Fixes: 4654467dc7 ("netfilter: arptables: allow xtables-nft only builds")
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Florian Westphal <fw@strlen.de>
The variable len being initialized with a value that is never read, an
if statement is initializing it in both paths of the if statement.
The initialization is redundant and can be removed.
Cleans up clang scan build warning:
net/ipv4/tcp_ao.c:512:11: warning: Value stored to 'len' during its
initialization is never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://lore.kernel.org/r/20240216125443.2107244-1-colin.i.king@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>