Commit Graph

12829 Commits

Author SHA1 Message Date
Guillaume Nault
65a55aa7e6 ipv4: Prepare inet_rtm_getroute() to .flowi4_tos conversion.
Store rtm->rtm_tos in a dscp_t variable, which can then be used for
setting fl4.flowi4_tos and also be passed as parameter of
ip_route_input_rcu().

The .flowi4_tos field is going to be converted to dscp_t to ensure ECN
bits aren't erroneously taken into account during route lookups. Having
a dscp_t variable available will simplify that conversion, as we'll
just have to drop the inet_dscp_to_dsfield() call.

Note that we can't just convert rtm->rtm_tos to dscp_t because this
structure is exported to user space.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/7bc1c7dc47ad1393569095d334521fae59af5bc7.1736944951.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16 16:52:29 -08:00
Guillaume Nault
2c77bcb344 gre: Prepare ipgre_open() to .flowi4_tos conversion.
Use ip4h_dscp() to get the tunnel DSCP option as dscp_t, instead of
manually masking the raw tos field with INET_DSCP_MASK. This will ease
the conversion of fl4->flowi4_tos to dscp_t, which just becomes a
matter of dropping the inet_dscp_to_dsfield() call.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/6c05a11afdc61530f1a4505147e0909ad51feb15.1736941806.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16 16:51:57 -08:00
Jakub Kicinski
2ee738e90e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc8).

Conflicts:

drivers/net/ethernet/realtek/r8169_main.c
  1f691a1fc4 ("r8169: remove redundant hwmon support")
  152d00a913 ("r8169: simplify setting hwmon attribute visibility")
https://lore.kernel.org/20250115122152.760b4e8d@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  152f4da05a ("bnxt_en: add support for rx-copybreak ethtool command")
  f0aa6a37a3 ("eth: bnxt: always recalculate features after XDP clearing, fix null-deref")

drivers/net/ethernet/intel/ice/ice_type.h
  50327223a8 ("ice: add lock to protect low latency interface")
  dc26548d72 ("ice: Fix quad registers read on E825")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16 10:34:59 -08:00
Eric Dumazet
3440fa34ad inet: ipmr: fix data-races
Following fields of 'struct mr_mfc' can be updated
concurrently (no lock protection) from ip_mr_forward()
and ip6_mr_forward()

- bytes
- pkt
- wrong_if
- lastuse

They also can be read from other functions.

Convert bytes, pkt and wrong_if to atomic_long_t,
and use READ_ONCE()/WRITE_ONCE() for lastuse.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250114221049.1190631-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 15:07:23 -08:00
Eric Dumazet
d16b344790 tcp: add LINUX_MIB_PAWS_OLD_ACK SNMP counter
Prior patch in the series added TCP_RFC7323_PAWS_ACK drop reason.

This patch adds the corresponding SNMP counter, for folks
using nstat instead of tracing for TCP diagnostics.

nstat -az | grep PAWSOldAck

Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250113135558.3180360-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-14 13:28:13 -08:00
Eric Dumazet
124c4c32e9 tcp: add TCP_RFC7323_PAWS_ACK drop reason
XPS can cause reorders because of the relaxed OOO
conditions for pure ACK packets.

For hosts not using RFS, what can happpen is that ACK
packets are sent on behalf of the cpu processing NIC
interrupts, selecting TX queue A for ACK packet P1.

Then a subsequent sendmsg() can run on another cpu.
TX queue selection uses the socket hash and can choose
another queue B for packets P2 (with payload).

If queue A is more congested than queue B,
the ACK packet P1 could be sent on the wire after
P2.

A linux receiver when processing P1 (after P2) currently increments
LINUX_MIB_PAWSESTABREJECTED (TcpExtPAWSEstab)
and use TCP_RFC7323_PAWS drop reason.
It might also send a DUPACK if not rate limited.

In order to better understand this pattern, this
patch adds a new drop_reason : TCP_RFC7323_PAWS_ACK.

For old ACKS like these, we no longer increment
LINUX_MIB_PAWSESTABREJECTED and no longer sends a DUPACK,
keeping credit for other more interesting DUPACK.

perf record -e skb:kfree_skb -a
perf script
...
         swapper       0 [148] 27475.438637: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
         swapper       0 [208] 27475.438706: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
         swapper       0 [208] 27475.438908: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
         swapper       0 [148] 27475.439010: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
         swapper       0 [148] 27475.439214: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
         swapper       0 [208] 27475.439286: skb:kfree_skb: ... location=tcp_validate_incoming+0x4f0 reason: TCP_RFC7323_PAWS_ACK
...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250113135558.3180360-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-14 13:28:13 -08:00
Eric Dumazet
ea98b61bdd tcp: add drop_reason support to tcp_disordered_ack()
Following patch is adding a new drop_reason to tcp_validate_incoming().

Change tcp_disordered_ack() to not return a boolean anymore,
but a drop reason.

Change its name to tcp_disordered_ack_check()

Refactor tcp_validate_incoming() to ease the code
review of the following patch, and reduce indentation
level.

This patch is a refactor, with no functional change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250113135558.3180360-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-14 13:28:13 -08:00
Philo Lu
644f9108f3 udp: Make rehash4 independent in udp_lib_rehash()
As discussed in [0], rehash4 could be missed in udp_lib_rehash() when
udp hash4 changes while hash2 doesn't change. This patch fixes this by
moving rehash4 codes out of rehash2 checking, and then rehash2 and
rehash4 are done separately.

By doing this, we no longer need to call rehash4 explicitly in
udp_lib_hash4(), as the rehash callback in __ip4_datagram_connect takes
it. Thus, now udp_lib_hash4() returns directly if the sk is already
hashed.

Note that uhash4 may fail to work under consecutive connect(<dst
address>) calls because rehash() is not called with every connect(). To
overcome this, connect(<AF_UNSPEC>) needs to be called after the next
connect to a new destination.

[0]
https://lore.kernel.org/all/4761e466ab9f7542c68cdc95f248987d127044d2.1733499715.git.pabeni@redhat.com/

Fixes: 78c91ae2c6 ("ipv4/udp: Add 4-tuple hash for connected socket")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Link: https://patch.msgid.link/20250110010810.107145-1-lulie@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-14 10:44:10 +01:00
David S. Miller
7b24f164cf ipsec-next-2025-01-09
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmd/mNkACgkQrB3Eaf9P
 W7cQww/9Hnv7+wosuBxFW2o2ptgZThiwj/Ovz0oiWGWvctpN7VlpwmrDOsXQ+XMn
 xF6JBFEjnJanYoDBb78D0dMffJcMcUKZopTUU/ZMitSNr8aIHYiuB4SWPG1tqxl4
 Ete7Mr2m3tS96YePQNnAaRZzEuGsx3BQb28VLTWl9So81MByD2OK4fsAbYz22Gg8
 7A6tDHn1mUd9b2VG+LeeBZaDDFG8C0O2x4E/8Z3DX3z1N8y3LABPwZ38jcgTviKO
 1ZldGrJT+PownBydu23bWDassKE2TuVvGH9e/SOPeQj8DJ4Lmd0bafMZTY6xwfNT
 RJCwhlzZUpYRXFzvcf3+U3egsqEWEemV7/LzAapdT0V9OqfLWMUh3b1jMA4KcblZ
 qmlm/MhZyXutDoeuASwtM4jgM3wGwovOofrKKsb13hD9VLBs5jFZmFSw5TlbmwE3
 sjv7V4pFwNyfJnwQtyMmfuuHiy8w+fzqAA2GCg8mF3OosHABH/FOvEBP8xg1Vqu1
 iKlLkByfyaCFD+GxTPzqSDvSB8nDzeZBgBM/ILGuwH0OWr+0gxBzl1sxrhAkw9hC
 Gf+4J3wg7EknTxfrJk4LyqfyS50GvUIzpLSkHYxDtQRd4zv75bxRzMvoZvuapTtN
 GGpWYftKO8n2kgLORQ0dTtFee3c4w/No+KXWsFdtRVNpyk3N0Yw=
 =U79q
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-next-2025-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
ipsec-next-2025-01-09

1) Implement the AGGFRAG protocol and basic IP-TFS (RFC9347) functionality.
   From Christian Hopps.

2) Support ESN context update to hardware for TX.
   From Jianbo Liu.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-10 09:15:17 +00:00
Antoine Tenart
8c7a6efc01 ipv4: route: fix drop reason being overridden in ip_route_input_slow
When jumping to 'martian_destination' a drop reason is always set but
that label falls-through the 'e_nobufs' one, overriding the value.

The behavior was introduced by the mentioned commit. The logic went
from,

	goto martian_destination;
	...
  martian_destination:
	...
  e_inval:
	err = -EINVAL;
	goto out;
  e_nobufs:
	err = -ENOBUFS;
	goto out;

to,

	reason = ...;
	goto martian_destination;
	...
  martian_destination:
	...
  e_nobufs:
	reason = SKB_DROP_REASON_NOMEM;
	goto out;

A 'goto out' is clearly missing now after 'martian_destination' to avoid
overriding the drop reason.

Fixes: 5b92112acd ("net: ip: make ip_route_input_slow() return drop reasons")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Cc: Menglong Dong <menglong8.dong@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250108165725.404564-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-09 18:03:24 -08:00
Jakub Kicinski
14ea4cd1b1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc7).

Conflicts:
  a42d71e322 ("net_sched: sch_cake: Add drop reasons")
  737d4d91d3 ("sched: sch_cake: add bounds checks to host bulk flow fairness counts")

Adjacent changes:

drivers/net/ethernet/meta/fbnic/fbnic.h
  3a856ab347 ("eth: fbnic: add IRQ reuse support")
  95978931d5 ("eth: fbnic: Revert "eth: fbnic: Add hardware monitoring support via HWMON interface"")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-09 16:11:47 -08:00
Daniel Borkmann
80fb40baba tcp: Annotate data-race around sk->sk_mark in tcp_v4_send_reset
This is a follow-up to 3c5b4d69c3 ("net: annotate data-races around
sk->sk_mark"). sk->sk_mark can be read and written without holding
the socket lock. IPv6 equivalent is already covered with READ_ONCE()
annotation in tcp_v6_send_response().

Fixes: 3c5b4d69c3 ("net: annotate data-races around sk->sk_mark")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/f459d1fc44f205e13f6d8bdca2c8bfb9902ffac9.1736244569.git.daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-08 10:22:02 -08:00
Jakub Kicinski
385f186aba Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc6).

No conflicts.

Adjacent changes:

include/linux/if_vlan.h
  f91a5b8089 ("af_packet: fix vlan_get_protocol_dgram() vs MSG_PEEK")
  3f330db306 ("net: reformat kdoc return statements")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-03 16:29:29 -08:00
Linus Torvalds
aba74e639f Nothing major here. Over the last two weeks we gathered only around
two-thirds of our normal weekly fix count, but delaying sending these
 until -rc7 seemed like a really bad idea.
 
 AFAIK we have no bugs under investigation. One or two reverts for
 stuff for which we haven't gotten a proper fix will likely come in
 the next PR.
 
 Including fixes from wireles and netfilter.
 
 Current release - fix to a fix:
 
  - netfilter: nft_set_hash: unaligned atomic read on struct nft_set_ext
 
  - eth: gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup
 
 Previous releases - regressions:
 
  - net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets
 
  - mptcp:
    - fix sleeping rcvmsg sleeping forever after bad recvbuffer adjust
    - fix TCP options overflow
    - prevent excessive coalescing on receive, fix throughput
 
  - net: fix memory leak in tcp_conn_request() if map insertion fails
 
  - wifi: cw1200: fix potential NULL dereference after conversion
    to GPIO descriptors
 
  - phy: micrel: dynamically control external clock of KSZ PHY,
    fix suspend behavior
 
 Previous releases - always broken:
 
  - af_packet: fix VLAN handling with MSG_PEEK
 
  - net: restrict SO_REUSEPORT to inet sockets
 
  - netdev-genl: avoid empty messages in NAPI get
 
  - dsa: microchip: fix set_ageing_time function on KSZ9477 and LAN937X
 
  - eth: gve: XDP fixes around transmit, queue wakeup etc.
 
  - eth: ti: icssg-prueth: fix firmware load sequence to prevent time
    jump which breaks timesync related operations
 
 Misc:
 
  - netlink: specs: mptcp: add missing attr and improve documentation
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmd4FCgACgkQMUZtbf5S
 IruB7hAAkppr4+4kywDq/PIqgkisjYzpNIQIrphebUeIHIcAizWbJEL9CiHXnCQD
 w95+nVhzDmsWXdzE2AvwM/2YAByLP6CpgKEa+7rI3K5BQZ86lbzm+ftOOFYiF3eV
 2v5yLfjgMVbzwqzuP3ivkNrtG95IGQFPobLl7tI9Wpgm3yD0+H8BCqtZNAwA4N6g
 mEdZD1jMOzkdINoJBg35O70B2GDq/WSS1N8dgxtj1F7EPDuQdO1kijJmqFjYT3fw
 30/NdjlfGtbFro6zp+8ZY6P9RG3CLNhYXu+nUqdOHJOJw65aXUxOPjWCRp3o6m12
 HIIp22imDTvNTSCfn0wlbGJfC+v9HdIgB9gqApAaxNOS02bQgEy5tuxxXNTo2B1I
 bqH7GJ+6a7axsQ6BnpcOcDhu2fnbtguj1W/zC/MlGQmZZj2g+UNCf4QqPir/XkcG
 r43/iy2/L8cgZWYXzH1skfUMgbymI5uLytO+n+CLsb6P/N3WLa7wxHNV5Lni+Xng
 qDyd0TwUccCr2KBfZqjgFeGbvQ7qub+eeQCYx9kUAaFhKUsQB/eZ5i9C7SsfutMa
 xzwITqCkcvmdyU3gG3v/oABtQkxdTcRqWCfYDGcXm0zb1jzMf5ZGN4E/gABYM7p9
 EewL1kCAt4ULmBc8Hqrj+2Erlzh+xsodvvfpgpJandMVXaJW44k=
 =YKS4
 -----END PGP SIGNATURE-----

Merge tag 'net-6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from wireles and netfilter.

  Nothing major here. Over the last two weeks we gathered only around
  two-thirds of our normal weekly fix count, but delaying sending these
  until -rc7 seemed like a really bad idea.

  AFAIK we have no bugs under investigation. One or two reverts for
  stuff for which we haven't gotten a proper fix will likely come in the
  next PR.

  Current release - fix to a fix:

   - netfilter: nft_set_hash: unaligned atomic read on struct
     nft_set_ext

   - eth: gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup

  Previous releases - regressions:

   - net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets

   - mptcp:
      - fix sleeping rcvmsg sleeping forever after bad recvbuffer adjust
      - fix TCP options overflow
      - prevent excessive coalescing on receive, fix throughput

   - net: fix memory leak in tcp_conn_request() if map insertion fails

   - wifi: cw1200: fix potential NULL dereference after conversion to
     GPIO descriptors

   - phy: micrel: dynamically control external clock of KSZ PHY, fix
     suspend behavior

  Previous releases - always broken:

   - af_packet: fix VLAN handling with MSG_PEEK

   - net: restrict SO_REUSEPORT to inet sockets

   - netdev-genl: avoid empty messages in NAPI get

   - dsa: microchip: fix set_ageing_time function on KSZ9477 and LAN937X

   - eth:
      - gve: XDP fixes around transmit, queue wakeup etc.
      - ti: icssg-prueth: fix firmware load sequence to prevent time
        jump which breaks timesync related operations

  Misc:

   - netlink: specs: mptcp: add missing attr and improve documentation"

* tag 'net-6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
  net: ti: icssg-prueth: Fix clearing of IEP_CMP_CFG registers during iep_init
  net: ti: icssg-prueth: Fix firmware load sequence.
  mptcp: prevent excessive coalescing on receive
  mptcp: don't always assume copied data in mptcp_cleanup_rbuf()
  mptcp: fix recvbuffer adjust on sleeping rcvmsg
  ila: serialize calls to nf_register_net_hooks()
  af_packet: fix vlan_get_protocol_dgram() vs MSG_PEEK
  af_packet: fix vlan_get_tci() vs MSG_PEEK
  net: wwan: iosm: Properly check for valid exec stage in ipc_mmio_init()
  net: restrict SO_REUSEPORT to inet sockets
  net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets
  net: sfc: Correct key_len for efx_tc_ct_zone_ht_params
  net: wwan: t7xx: Fix FSM command timeout issue
  sky2: Add device ID 11ab:4373 for Marvell 88E8075
  mptcp: fix TCP options overflow.
  net: mv643xx_eth: fix an OF node reference leak
  gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup
  eth: bcmsysport: fix call balance of priv->clk handling routines
  net: llc: reset skb->transport_header
  netlink: specs: mptcp: fix missing doc
  ...
2025-01-03 14:36:54 -08:00
Yu Tian
5df7ca0b82 ipv4: remove useless arg
The "struct sock *sk" parameter in ip_rcv_finish_core is unused, which
leads the compiler to optimize it out. As a result, the
"struct sk_buff *skb" parameter is passed using x1. And this make kprobe
hard to use.

Signed-off-by: Yu Tian <tianyu2@kernelsoft.com>
Link: https://patch.msgid.link/20241231023610.1657926-1-tianyu2@kernelsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-02 17:17:40 -08:00
Yuyang Huang
aa4ad7c3f2 netlink: correct nlmsg size for multicast notifications
Corrected the netlink message size calculation for multicast group
join/leave notifications. The previous calculation did not account for
the inclusion of both IPv4/IPv6 addresses and ifa_cacheinfo in the
payload. This fix ensures that the allocated message size is
sufficient to hold all necessary information.

This patch also includes the following improvements:
* Uses GFP_KERNEL instead of GFP_ATOMIC when holding the RTNL mutex.
* Uses nla_total_size(sizeof(struct in6_addr)) instead of
  nla_total_size(16).
* Removes unnecessary EXPORT_SYMBOL().

Fixes: 2c2b61d213 ("netlink: add IGMP/MLD join/leave notifications")
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Link: https://patch.msgid.link/20241221100007.1910089-1-yuyanghuang@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-23 10:26:43 -08:00
Xiao Liang
b5a7b661a0 net: Fix netns for ip_tunnel_init_flow()
The device denoted by tunnel->parms.link resides in the underlay net
namespace. Therefore pass tunnel->net to ip_tunnel_init_flow().

Fixes: db53cd3d88 ("net: Handle l3mdev in ip_tunnel_init_flow")
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20241219130336.103839-1-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-23 10:04:41 -08:00
Wang Liang
4f4aa4aa28 net: fix memory leak in tcp_conn_request()
If inet_csk_reqsk_queue_hash_add() return false, tcp_conn_request() will
return without free the dst memory, which allocated in af_ops->route_req.

Here is the kmemleak stack:

unreferenced object 0xffff8881198631c0 (size 240):
  comm "softirq", pid 0, jiffies 4299266571 (age 1802.392s)
  hex dump (first 32 bytes):
    00 10 9b 03 81 88 ff ff 80 98 da bc ff ff ff ff  ................
    81 55 18 bb ff ff ff ff 00 00 00 00 00 00 00 00  .U..............
  backtrace:
    [<ffffffffb93e8d4c>] kmem_cache_alloc+0x60c/0xa80
    [<ffffffffba11b4c5>] dst_alloc+0x55/0x250
    [<ffffffffba227bf6>] rt_dst_alloc+0x46/0x1d0
    [<ffffffffba23050a>] __mkroute_output+0x29a/0xa50
    [<ffffffffba23456b>] ip_route_output_key_hash+0x10b/0x240
    [<ffffffffba2346bd>] ip_route_output_flow+0x1d/0x90
    [<ffffffffba254855>] inet_csk_route_req+0x2c5/0x500
    [<ffffffffba26b331>] tcp_conn_request+0x691/0x12c0
    [<ffffffffba27bd08>] tcp_rcv_state_process+0x3c8/0x11b0
    [<ffffffffba2965c6>] tcp_v4_do_rcv+0x156/0x3b0
    [<ffffffffba299c98>] tcp_v4_rcv+0x1cf8/0x1d80
    [<ffffffffba239656>] ip_protocol_deliver_rcu+0xf6/0x360
    [<ffffffffba2399a6>] ip_local_deliver_finish+0xe6/0x1e0
    [<ffffffffba239b8e>] ip_local_deliver+0xee/0x360
    [<ffffffffba239ead>] ip_rcv+0xad/0x2f0
    [<ffffffffba110943>] __netif_receive_skb_one_core+0x123/0x140

Call dst_release() to free the dst memory when
inet_csk_reqsk_queue_hash_add() return false in tcp_conn_request().

Fixes: ff46e3b442 ("Fix race for duplicate reqsk on identical SYN")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Link: https://patch.msgid.link/20241219072859.3783576-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-23 10:03:12 -08:00
Stefano Brivio
a502ea6fa9 udp: Deal with race between UDP socket address change and rehash
If a UDP socket changes its local address while it's receiving
datagrams, as a result of connect(), there is a period during which
a lookup operation might fail to find it, after the address is changed
but before the secondary hash (port and address) and the four-tuple
hash (local and remote ports and addresses) are updated.

Secondary hash chains were introduced by commit 30fff9231f ("udp:
bind() optimisation") and, as a result, a rehash operation became
needed to make a bound socket reachable again after a connect().

This operation was introduced by commit 719f835853 ("udp: add
rehash on connect()") which isn't however a complete fix: the
socket will be found once the rehashing completes, but not while
it's pending.

This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
client sending datagrams to it. After the server receives the first
datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
the address of the sender, in order to set up a directed flow.

Now, if the client, running on a different CPU thread, happens to
send a (subsequent) datagram while the server's socket changes its
address, but is not rehashed yet, this will result in a failed
lookup and a port unreachable error delivered to the client, as
apparent from the following reproducer:

  LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
  dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in

  while :; do
  	taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.1 || sleep 1
  	taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
  	wait
  done

where the client will eventually get ECONNREFUSED on a write()
(typically the second or third one of a given iteration):

  2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused

This issue was first observed as a seldom failure in Podman's tests
checking UDP functionality while using pasta(1) to connect the
container's network namespace, which leads us to a reproducer with
the lookup error resulting in an ICMP packet on a tap device:

  LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"

  while :; do
  	./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.2 || sleep 1
  	socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
  	wait
  	cmp tmp.in tmp.out
  done

Once this fails:

  tmp.in tmp.out differ: char 8193, line 29

we can finally have a look at what's going on:

  $ tshark -r pasta.pcap
      1   0.000000           :: ? ff02::16     ICMPv6 110 Multicast Listener Report Message v2
      2   0.168690 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
      3   0.168767 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
      4   0.168806 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
      5   0.168827 c6:47:05:8d:dc:04 ? Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
      6   0.168851 9a:55:9a:55:9a:55 ? c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
      7   0.168875 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
      8   0.168896 88.198.0.164 ? 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
      9   0.168926 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
     10   0.168959 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192
     11   0.168989 88.198.0.161 ? 88.198.0.164 UDP 4138 60260 ? 1337 Len=4096
     12   0.169010 88.198.0.161 ? 88.198.0.164 UDP 42 60260 ? 1337 Len=0

On the third datagram received, the network namespace of the container
initiates an ARP lookup to deliver the ICMP message.

In another variant of this reproducer, starting the client with:

  strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &

and connecting to the socat server using a loopback address:

  socat OPEN:tmp.in UDP4:localhost:1337,shut-null

we can more clearly observe a sendmmsg() call failing after the
first datagram is delivered:

  [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
  [...]
  [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
  [...]
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)

and, somewhat confusingly, after a connect() on the same socket
succeeded.

Until commit 4cdeeee925 ("net: udp: prefer listeners bound to an
address"), the race between receive address change and lookup didn't
actually cause visible issues, because, once the lookup based on the
secondary hash chain failed, we would still attempt a lookup based on
the primary hash (destination port only), and find the socket with the
outdated secondary hash.

That change, however, dropped port-only lookups altogether, as side
effect, making the race visible.

To fix this, while avoiding the need to make address changes and
rehash atomic against lookups, reintroduce primary hash lookups as
fallback, if lookups based on four-tuple and secondary hashes fail.

To this end, introduce a simplified lookup implementation, which
doesn't take care of SO_REUSEPORT groups: if we have one, there are
multiple sockets that would match the four-tuple or secondary hash,
meaning that we can't run into this race at all.

v2:
  - instead of synchronising lookup operations against address change
    plus rehash, reintroduce a simplified version of the original
    primary hash lookup as fallback

v1:
  - fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr
    usage (Kuniyuki Iwashima)
  - directly use sk_rcv_saddr for IPv4 receive addresses instead of
    fetching inet_rcv_saddr (Kuniyuki Iwashima)
  - move inet_update_saddr() to inet_hashtables.h and use that
    to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
  - rebase onto net-next, update commit message accordingly

Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/24147
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 30fff9231f ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-23 11:39:55 +00:00
Levi Zim
5153a75ef3 tcp_bpf: Fix copied value in tcp_bpf_sendmsg
bpf kselftest sockhash::test_txmsg_cork_hangs in test_sockmap.c triggers a
kernel NULL pointer dereference:

BUG: kernel NULL pointer dereference, address: 0000000000000008
 ? __die_body+0x6e/0xb0
 ? __die+0x8b/0xa0
 ? page_fault_oops+0x358/0x3c0
 ? local_clock+0x19/0x30
 ? lock_release+0x11b/0x440
 ? kernelmode_fixup_or_oops+0x54/0x60
 ? __bad_area_nosemaphore+0x4f/0x210
 ? mmap_read_unlock+0x13/0x30
 ? bad_area_nosemaphore+0x16/0x20
 ? do_user_addr_fault+0x6fd/0x740
 ? prb_read_valid+0x1d/0x30
 ? exc_page_fault+0x55/0xd0
 ? asm_exc_page_fault+0x2b/0x30
 ? splice_to_socket+0x52e/0x630
 ? shmem_file_splice_read+0x2b1/0x310
 direct_splice_actor+0x47/0x70
 splice_direct_to_actor+0x133/0x300
 ? do_splice_direct+0x90/0x90
 do_splice_direct+0x64/0x90
 ? __ia32_sys_tee+0x30/0x30
 do_sendfile+0x214/0x300
 __se_sys_sendfile64+0x8e/0xb0
 __x64_sys_sendfile64+0x25/0x30
 x64_sys_call+0xb82/0x2840
 do_syscall_64+0x75/0x110
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

This is caused by tcp_bpf_sendmsg() returning a larger value(12289) than
size (8192), which causes the while loop in splice_to_socket() to release
an uninitialized pipe buf.

The underlying cause is that this code assumes sk_msg_memcopy_from_iter()
will copy all bytes upon success but it actually might only copy part of
it.

This commit changes it to use the real copied bytes.

Signed-off-by: Levi Zim <rsworktech@outlook.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@kernel.org>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241130-tcp-bpf-sendmsg-v1-2-bae583d014f3@outlook.com
2024-12-20 22:53:36 +01:00
Guillaume Nault
148721f8e0 ipv4: Use inet_sk_init_flowi4() in __ip_queue_xmit().
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in __ip_queue_xmit() instead of passing parameters manually
to ip_route_output_ports().

Override ->flowi4_tos with the value passed as parameter since that's
required by SCTP.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/37e64ffbd9adac187b14aa9097b095f5c86e85be.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:50:09 -08:00
Guillaume Nault
42e5ffc385 ipv4: Use inet_sk_init_flowi4() in inet_csk_rebuild_route().
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in inet_csk_rebuild_route() instead of passing parameters
manually to ip_route_output_ports().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/b270931636effa1095508e0f0a3e8c3a0e6d357f.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:50:09 -08:00
Guillaume Nault
5be1323b50 ipv4: Use inet_sk_init_flowi4() in ip4_datagram_release_cb().
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in ip4_datagram_release_cb() instead of passing parameters
manually to ip_route_output_ports().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/9c326b8d9e919478f7952b21473d31da07eba2dd.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:50:09 -08:00
Guillaume Nault
1dbdce30f0 ipv4: Define inet_sk_init_flowi4() and use it in inet_sk_rebuild_header().
IPv4 code commonly has to initialise a flowi4 structure from an IPv4
socket. This requires looking at potential IPv4 options to set the
proper destination address, call flowi4_init_output() with the correct
set of parameters and run the sk_classify_flow security hook.

Instead of reimplementing these operations in different parts of the
stack, let's define inet_sk_init_flowi4() which does all these
operations.

The first user is inet_sk_rebuild_header(), where inet_sk_init_flowi4()
replaces ip_route_output_ports(). Unlike ip_route_output_ports(), which
sets the flowi4 structure and performs the route lookup in one go,
inet_sk_init_flowi4() only initialises the flow. The route lookup is
then done by ip_route_output_flow(). Decoupling flow initialisation
from route lookup makes this new interface applicable more broadly as
it will allow some users to overwrite specific struct flowi4 members
before the route lookup.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/fd416275262b1f518d5abfcef740ce4f4a1a6522.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:50:09 -08:00
Eric Dumazet
05dd04b218 inetpeer: avoid false sharing in inet_peer_xrlim_allow()
Under DOS, inet_peer_xrlim_allow() might be called millions
of times per second from different cpus.

Make sure to write over peer->rate_tokens and peer->rate_last
only when really needed.

Note the inherent races of this function are still there,
we do not care of precise ICMP rate limiting.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241219150330.3159027-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20 13:04:40 -08:00
Zijian Zhang
d888b7af7c tcp_bpf: Add sk_rmem_alloc related logic for tcp_bpf ingress redirection
When we do sk_psock_verdict_apply->sk_psock_skb_ingress, an sk_msg will
be created out of the skb, and the rmem accounting of the sk_msg will be
handled by the skb.

For skmsgs in __SK_REDIRECT case of tcp_bpf_send_verdict, when redirecting
to the ingress of a socket, although we sk_rmem_schedule and add sk_msg to
the ingress_msg of sk_redir, we do not update sk_rmem_alloc. As a result,
except for the global memory limit, the rmem of sk_redir is nearly
unlimited. Thus, add sk_rmem_alloc related logic to limit the recv buffer.

Since the function sk_msg_recvmsg and __sk_psock_purge_ingress_msg are
used in these two paths. We use "msg->skb" to test whether the sk_msg is
skb backed up. If it's not, we shall do the memory accounting explicitly.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241210012039.1669389-3-zijianzhang@bytedance.com
2024-12-20 17:59:47 +01:00
Cong Wang
54f89b3178 tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress()
When bpf_tcp_ingress() is called, the skmsg is being redirected to the
ingress of the destination socket. Therefore, we should charge its
receive socket buffer, instead of sending socket buffer.

Because sk_rmem_schedule() tests pfmemalloc of skb, we need to
introduce a wrapper and call it for skmsg.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241210012039.1669389-2-zijianzhang@bytedance.com
2024-12-20 17:59:47 +01:00
Guillaume Nault
29b540795b gre: Drop ip_route_output_gre().
We already have enough variants of ip_route_output*() functions. We
don't need a GRE specific one in the generic route.h header file.

Furthermore, ip_route_output_gre() is only used once, in ipgre_open(),
where it can be easily replaced by a simple call to
ip_route_output_key().

While there, and for clarity, explicitly set .flowi4_scope to
RT_SCOPE_UNIVERSE instead of relying on the implicit zero
initialisation.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/ab7cba47b8558cd4bfe2dc843c38b622a95ee48e.1734527729.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:24:47 -08:00
Ido Schimmel
f0c898d8c2 ipv4: fib_rules: Reject flow label attributes
IPv4 FIB rules cannot match on flow label so reject requests that try to
add such rules. Do that in the IPv4 configure callback as the netlink
policy resides in the core and used by both IPv4 and IPv6.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-19 16:02:21 +01:00
Eric Dumazet
a853c60950 inetpeer: do not get a refcount in inet_getpeer()
All inet_getpeer() callers except ip4_frag_init() don't need
to acquire a permanent refcount on the inetpeer.

They can switch to full RCU protection.

Move the refcount_inc_not_zero() into ip4_frag_init(),
so that all the other callers no longer have to
perform a pair of expensive atomic operations on
a possibly contended cache line.

inet_putpeer() no longer needs to be exported.

After this patch, my DUT can receive 8,400,000 UDP packets
per second targeting closed ports, using 50% less cpu cycles
than before.

Also change two calls to l3mdev_master_ifindex() by
l3mdev_master_ifindex_rcu() (Ido ideas)

Fixes: 8c2bd38b95 ("icmp: change the order of rate limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241215175629.1248773-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17 19:37:48 -08:00
Eric Dumazet
50b362f21d inetpeer: update inetpeer timestamp in inet_getpeer()
inet_putpeer() will be removed in the following patch,
because we will no longer use refcounts.

Update inetpeer timestamp (p->dtime) at lookup time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241215175629.1248773-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17 19:37:00 -08:00
Eric Dumazet
7a596a50c4 inetpeer: remove create argument of inet_getpeer()
All callers of inet_getpeer() want to create an inetpeer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241215175629.1248773-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17 19:37:00 -08:00
Eric Dumazet
661cd8fc8e inetpeer: remove create argument of inet_getpeer_v[46]()
All callers of inet_getpeer_v4() and inet_getpeer_v6()
want to create an inetpeer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241215175629.1248773-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17 19:37:00 -08:00
Anna Emese Nyiri
a32f3e9d1e sock: support SO_PRIORITY cmsg
The Linux socket API currently allows setting SO_PRIORITY at the
socket level, applying a uniform priority to all packets sent through
that socket. The exception to this is IP_TOS, when the priority value
is calculated during the handling of
ancillary data, as implemented in commit f02db315b8 ("ipv4: IP_TOS
and IP_TTL can be specified as ancillary data").
However, this is a computed
value, and there is currently no mechanism to set a custom priority
via control messages prior to this patch.

According to this patch, if SO_PRIORITY is specified as ancillary data,
the packet is sent with the priority value set through
sockc->priority, overriding the socket-level values
set via the traditional setsockopt() method. This is analogous to
the existing support for SO_MARK, as implemented in
commit c6af0c227a ("ip: support SO_MARK cmsg").

If both cmsg SO_PRIORITY and IP_TOS are passed, then the one that
takes precedence is the last one in the cmsg list.

This patch has the side effect that raw_send_hdrinc now interprets cmsg
IP_TOS.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Suggested-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com>
Link: https://patch.msgid.link/20241213084457.45120-3-annaemesenyiri@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-16 18:13:44 -08:00
Maximilian Güntner
329365dc46 ipv4: output metric as unsigned int
adding a route metric greater than 0x7fff_ffff leads to an
unintended wrap when printing the underlying u32 as an
unsigned int (`%d`) thus incorrectly rendering the metric
as negative.  Formatting using `%u` corrects the issue.

Signed-off-by: Maximilian Güntner <code@mguentner.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241212161911.51598-1-code@mguentner.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-15 13:13:40 -08:00
Yuyang Huang
2c2b61d213 netlink: add IGMP/MLD join/leave notifications
This change introduces netlink notifications for multicast address
changes. The following features are included:
* Addition and deletion of multicast addresses are reported using
  RTM_NEWMULTICAST and RTM_DELMULTICAST messages with AF_INET and
  AF_INET6.
* Two new notification groups: RTNLGRP_IPV4_MCADDR and
  RTNLGRP_IPV6_MCADDR are introduced for receiving these events.

This change allows user space applications (e.g., ip monitor) to
efficiently track multicast group memberships by listening for netlink
events. Previously, applications relied on inefficient polling of
procfs, introducing delays. With netlink notifications, applications
receive realtime updates on multicast group membership changes,
enabling more precise metrics collection and system monitoring. 

This change also unlocks the potential for implementing a wide range
of sophisticated multicast related features in user space by allowing
applications to combine kernel provided multicast address information
with user space data and communicate decisions back to the kernel for
more fine grained control. This mechanism can be used for various
purposes, including multicast filtering, IGMP/MLD offload, and
IGMP/MLD snooping.

Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Co-developed-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
Link: https://lore.kernel.org/r/20180906091056.21109-1-pruddy@vyatta.att-mail.com
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-15 12:31:35 +00:00
Jakub Kicinski
5098462fba Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc3).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-12 14:19:05 -08:00
Linus Torvalds
150b567e0d Including fixes from bluetooth, netfilter and wireless.
Current release - fix to a fix:
 
  - rtnetlink: fix error code in rtnl_newlink()
 
  - tipc: fix NULL deref in cleanup_bearer()
 
 Current release - regressions:
 
  - ip: fix warning about invalid return from in ip_route_input_rcu()
 
 Current release - new code bugs:
 
  - udp: fix L4 hash after reconnect
 
  - eth: lan969x: fix cyclic dependency between modules
 
  - eth: bnxt_en: fix potential crash when dumping FW log coredump
 
 Previous releases - regressions:
 
  - wifi: mac80211:
    - fix a queue stall in certain cases of channel switch
    - wake the queues in case of failure in resume
 
  - splice: do not checksum AF_UNIX sockets
 
  - virtio_net: fix BUG()s in BQL support due to incorrect accounting
    of purged packets during interface stop
 
  - eth: stmmac: fix TSO DMA API mis-usage causing oops
 
  - eth: bnxt_en: fixes for HW GRO: GSO type on 5750X chips and
    oops due to incorrect aggregation ID mask on 5760X chips
 
 Previous releases - always broken:
 
  - Bluetooth: improve setsockopt() handling of malformed user input
 
  - eth: ocelot: fix PTP timestamping in presence of packet loss
 
  - ptp: kvm: x86: avoid "fail to initialize ptp_kvm" when simply
    not supported
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmdbCXgACgkQMUZtbf5S
 Irtizg/9GVtTtu0OeQpHlxOpXOdqRciHDHBcjc0+rYihHazA47wtOszPg2BDiebV
 1D+uTaPoxuJUZo9jDAGMerUpy6gmC8+4h9gp72oSU9uGNHTrDWsylsn16foFkmpg
 hMsq+bzYr9ayekIXoI4T//PQ8MO8fqLFPdJmFPIKjkTtsrCzzARck9R4uDlWzrJj
 v5cQY+q/6qnwZTvvto67ahjdKUw8k3XIRZxLDqrDiW+zUzdk9XRwK46AdP3eybcx
 OCMHvXmWx6DTbjeEbzhq5YwDGAnBOE9rP4vJmpV9y+PcPDCmPzt7IDNWACcEPHY4
 3vuZv3JJP/5MIqGHidDn1JYgWl/Y3iv5ZfKInG585XH+5VWemq3WL1JOS2ua6Xmu
 hoGhwNTGea4KtCeutE8xSwMSBTxswkdPb93ZFPt28zKAN118chBvGLRv2jepSvQR
 3AQhJ9bgGuErHMYh5vdiluRVj/4bwSIFqEH6vr6w9+DUDFiTSKERLXSJ8dc8S+9K
 ghd/I8POb4VTfjZIyHzo1DJOulPXe84KGMcOuAfh0AV7o5HcuP+oNdR3+qS2Lf+G
 EByIX8osZsHjqaVr5ba+KnZz2XrdO7mbE54fCKa9ZUwkNIbcCEqOJBqcMlPWxvtK
 whrGDOS8ifYYK6fL6IFO5CtxBvWmQgMOYV6Sjp9J27PD4jiMrms=
 =TDKt
 -----END PGP SIGNATURE-----

Merge tag 'net-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth, netfilter and wireless.

  Current release - fix to a fix:

   - rtnetlink: fix error code in rtnl_newlink()

   - tipc: fix NULL deref in cleanup_bearer()

  Current release - regressions:

   - ip: fix warning about invalid return from in ip_route_input_rcu()

  Current release - new code bugs:

   - udp: fix L4 hash after reconnect

   - eth: lan969x: fix cyclic dependency between modules

   - eth: bnxt_en: fix potential crash when dumping FW log coredump

  Previous releases - regressions:

   - wifi: mac80211:
      - fix a queue stall in certain cases of channel switch
      - wake the queues in case of failure in resume

   - splice: do not checksum AF_UNIX sockets

   - virtio_net: fix BUG()s in BQL support due to incorrect accounting
     of purged packets during interface stop

   - eth:
      - stmmac: fix TSO DMA API mis-usage causing oops
      - bnxt_en: fixes for HW GRO: GSO type on 5750X chips and oops
        due to incorrect aggregation ID mask on 5760X chips

  Previous releases - always broken:

   - Bluetooth: improve setsockopt() handling of malformed user input

   - eth: ocelot: fix PTP timestamping in presence of packet loss

   - ptp: kvm: x86: avoid "fail to initialize ptp_kvm" when simply not
     supported"

* tag 'net-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  net: dsa: tag_ocelot_8021q: fix broken reception
  net: dsa: microchip: KSZ9896 register regmap alignment to 32 bit boundaries
  net: renesas: rswitch: fix initial MPIC register setting
  Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
  Bluetooth: iso: Fix circular lock in iso_conn_big_sync
  Bluetooth: iso: Fix circular lock in iso_listen_bis
  Bluetooth: SCO: Add support for 16 bits transparent voice setting
  Bluetooth: iso: Fix recursive locking warning
  Bluetooth: iso: Always release hdev at the end of iso_listen_bis
  Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating
  Bluetooth: hci_core: Fix sleeping function called from invalid context
  team: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
  team: Fix initial vlan_feature set in __team_compute_features
  bonding: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
  bonding: Fix initial {vlan,mpls}_feature set in bond_compute_features
  net, team, bonding: Add netdev_base_features helper
  net/sched: netem: account for backlog updates from child qdisc
  net: dsa: felix: fix stuck CPU-injected packets with short taprio windows
  splice: do not checksum AF_UNIX sockets
  net: usb: qmi_wwan: add Telit FE910C04 compositions
  ...
2024-12-12 11:28:05 -08:00
Jakub Sitnicki
ca6a6f9386 tcp: Add sysctl to configure TIME-WAIT reuse delay
Today we have a hardcoded delay of 1 sec before a TIME-WAIT socket can be
reused by reopening a connection. This is a safe choice based on an
assumption that the other TCP timestamp clock frequency, which is unknown
to us, may be as low as 1 Hz (RFC 7323, section 5.4).

However, this means that in the presence of short lived connections with an
RTT of couple of milliseconds, the time during which a 4-tuple is blocked
from reuse can be orders of magnitude longer that the connection lifetime.
Combined with a reduced pool of ephemeral ports, when using
IP_LOCAL_PORT_RANGE to share an egress IP address between hosts [1], the
long TIME-WAIT reuse delay can lead to port exhaustion, where all available
4-tuples are tied up in TIME-WAIT state.

Turn the reuse delay into a per-netns setting so that sysadmins can make
more aggressive assumptions about remote TCP timestamp clock frequency and
shorten the delay in order to allow connections to reincarnate faster.

Note that applications can completely bypass the TIME-WAIT delay protection
already today by locking the local port with bind() before connecting. Such
immediate connection reuse may result in PAWS failing to detect old
duplicate segments, leaving us with just the sequence number check as a
safety net.

This new configurable offers a trade off where the sysadmin can balance
between the risk of PAWS detection failing to act versus exhausting ports
by having sockets tied up in TIME-WAIT state for too long.

[1] https://lpc.events/event/16/contributions/1349/

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20241209-jakub-krn-909-poc-msec-tw-tstamp-v2-2-66aca0eed03e@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-11 20:17:33 -08:00
Jakub Sitnicki
19ce8cd304 tcp: Measure TIME-WAIT reuse delay with millisecond precision
Prepare ground for TIME-WAIT socket reuse with subsecond delay.

Today the last TS.Recent update timestamp, recorded in seconds and stored
tp->ts_recent_stamp and tw->tw_ts_recent_stamp fields, has two purposes.

Firstly, it is used to track the age of the last recorded TS.Recent value
to detect when that value becomes outdated due to potential wrap-around of
the other TCP timestamp clock (RFC 7323, section 5.5).

For this purpose a second-based timestamp is completely sufficient as even
in the worst case scenario of a peer using a high resolution microsecond
timestamp, the wrap-around interval is ~36 minutes long.

Secondly, it serves as a threshold value for allowing TIME-WAIT socket
reuse. A TIME-WAIT socket can be reused only once the virtual 1 Hz clock,
ktime_get_seconds, is past the TS.Recent update timestamp.

The purpose behind delaying the TIME-WAIT socket reuse is to wait for the
other TCP timestamp clock to tick at least once before reusing the
connection. It is only then that the PAWS mechanism for the reopened
connection can detect old duplicate segments from the previous connection
incarnation (RFC 7323, appendix B.2).

In this case using a timestamp with second resolution not only blocks the
way toward allowing faster TIME-WAIT reuse after shorter subsecond delay,
but also makes it impossible to reliably delay TW reuse by one second.

As Eric Dumazet has pointed out [1], due to timestamp rounding, the TW
reuse delay will actually be between (0, 1] seconds, and 0.5 seconds on
average. We delay TW reuse for one full second only when last TS.Recent
update coincides with our virtual 1 Hz clock tick.

Considering the above, introduce a dedicated field to store a millisecond
timestamp of transition into the TIME-WAIT state. Place it in an existing
4-byte hole inside inet_timewait_sock structure to avoid an additional
memory cost.

Use the new timestamp to (i) reliably delay TIME-WAIT reuse by one second,
and (ii) prepare for configurable subsecond reuse delay in the subsequent
change.

We assume here that a full one second delay was the original intention in
[2] because it accounts for the worst case scenario of the other TCP using
the slowest recommended 1 Hz timestamp clock.

A more involved alternative would be to change the resolution of the last
TS.Recent update timestamp, tw->tw_ts_recent_stamp, to milliseconds.

[1] https://lore.kernel.org/netdev/CANn89iKB4GFd8sVzCbRttqw_96o3i2wDhX-3DraQtsceNGYwug@mail.gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20241209-jakub-krn-909-poc-msec-tw-tstamp-v2-1-66aca0eed03e@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-11 20:17:33 -08:00
MoYuanhao
06d64ab46f tcp: check space before adding MPTCP SYN options
Ensure there is enough space before adding MPTCP options in
tcp_syn_options().

Without this check, 'remaining' could underflow, and causes issues. If
there is not enough space, MPTCP should not be used.

Signed-off-by: MoYuanhao <moyuanhao3676@163.com>
Fixes: cec37a6e41 ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Cc: stable@vger.kernel.org
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
[ Matt: Add Fixes, cc Stable, update Description ]
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241209-net-mptcp-check-space-syn-v1-1-2da992bb6f74@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-10 18:26:52 -08:00
Paolo Abeni
51a00be6a0 udp: fix l4 hash after reconnect
After the blamed commit below, udp_rehash() is supposed to be called
with both local and remote addresses set.

Currently that is already the case for IPv6 sockets, but for IPv4 the
destination address is updated after rehashing.

Address the issue moving the destination address and port initialization
before rehashing.

Fixes: 1b29a730ef ("ipv6/udp: Add 4-tuple hash for connected socket")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/4761e466ab9f7542c68cdc95f248987d127044d2.1733499715.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-10 15:30:36 +01:00
Kuniyuki Iwashima
cdd0b9132d ip: Return drop reason if in_dev is NULL in ip_route_input_rcu().
syzkaller reported a warning in __sk_skb_reason_drop().

Commit 61b95c70f3 ("net: ip: make ip_route_input_rcu() return
drop reasons") missed a path where -EINVAL is returned.

Then, the cited commit started to trigger the warning with the
invalid error.

Let's fix it by returning SKB_DROP_REASON_NOT_SPECIFIED.

[0]:
WARNING: CPU: 0 PID: 10 at net/core/skbuff.c:1216 __sk_skb_reason_drop net/core/skbuff.c:1216 [inline]
WARNING: CPU: 0 PID: 10 at net/core/skbuff.c:1216 sk_skb_reason_drop+0x97/0x1b0 net/core/skbuff.c:1241
Modules linked in:
CPU: 0 UID: 0 PID: 10 Comm: kworker/0:1 Not tainted 6.12.0-10686-gbb18265c3aba #10 1c308307628619808b5a4a0495c4aab5637b0551
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: wg-crypt-wg2 wg_packet_decrypt_worker
RIP: 0010:__sk_skb_reason_drop net/core/skbuff.c:1216 [inline]
RIP: 0010:sk_skb_reason_drop+0x97/0x1b0 net/core/skbuff.c:1241
Code: 5d 41 5c 41 5d 41 5e e9 e7 9e 95 fd e8 e2 9e 95 fd 31 ff 44 89 e6 e8 58 a1 95 fd 45 85 e4 0f 85 a2 00 00 00 e8 ca 9e 95 fd 90 <0f> 0b 90 e8 c1 9e 95 fd 44 89 e6 bf 01 00 00 00 e8 34 a1 95 fd 41
RSP: 0018:ffa0000000007650 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 000000000000ffff RCX: ffffffff83bc3592
RDX: ff110001002a0000 RSI: ffffffff83bc34d6 RDI: 0000000000000007
RBP: ff11000109ee85f0 R08: 0000000000000001 R09: ffe21c00213dd0da
R10: 000000000000ffff R11: 0000000000000000 R12: 00000000ffffffea
R13: 0000000000000000 R14: ff11000109ee86d4 R15: ff11000109ee8648
FS:  0000000000000000(0000) GS:ff1100011a000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020177000 CR3: 0000000108a3d006 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000600
PKRU: 55555554
Call Trace:
 <IRQ>
 kfree_skb_reason include/linux/skbuff.h:1263 [inline]
 ip_rcv_finish_core.constprop.0+0x896/0x2320 net/ipv4/ip_input.c:424
 ip_list_rcv_finish.constprop.0+0x1b2/0x710 net/ipv4/ip_input.c:610
 ip_sublist_rcv net/ipv4/ip_input.c:636 [inline]
 ip_list_rcv+0x34a/0x460 net/ipv4/ip_input.c:670
 __netif_receive_skb_list_ptype net/core/dev.c:5715 [inline]
 __netif_receive_skb_list_core+0x536/0x900 net/core/dev.c:5762
 __netif_receive_skb_list net/core/dev.c:5814 [inline]
 netif_receive_skb_list_internal+0x77c/0xdc0 net/core/dev.c:5905
 gro_normal_list include/net/gro.h:515 [inline]
 gro_normal_list include/net/gro.h:511 [inline]
 napi_complete_done+0x219/0x8c0 net/core/dev.c:6256
 wg_packet_rx_poll+0xbff/0x1e40 drivers/net/wireguard/receive.c:488
 __napi_poll.constprop.0+0xb3/0x530 net/core/dev.c:6877
 napi_poll net/core/dev.c:6946 [inline]
 net_rx_action+0x9eb/0xe30 net/core/dev.c:7068
 handle_softirqs+0x1ac/0x740 kernel/softirq.c:554
 do_softirq kernel/softirq.c:455 [inline]
 do_softirq+0x48/0x80 kernel/softirq.c:442
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0xed/0x110 kernel/softirq.c:382
 spin_unlock_bh include/linux/spinlock.h:396 [inline]
 ptr_ring_consume_bh include/linux/ptr_ring.h:367 [inline]
 wg_packet_decrypt_worker+0x3ba/0x580 drivers/net/wireguard/receive.c:499
 process_one_work+0x940/0x1a70 kernel/workqueue.c:3229
 process_scheduled_works kernel/workqueue.c:3310 [inline]
 worker_thread+0x639/0xe30 kernel/workqueue.c:3391
 kthread+0x283/0x350 kernel/kthread.c:389
 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:244
 </TASK>

Fixes: 82d9983ebe ("net: ip: make ip_route_input_noref() return drop reasons")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241206020715.80207-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-07 17:55:37 -08:00
Linus Torvalds
b5f217084a BPF fixes:
- Fix several issues for BPF LPM trie map which were found by
   syzbot and during addition of new test cases (Hou Tao)
 
 - Fix a missing process_iter_arg register type check in the
   BPF verifier (Kumar Kartikeya Dwivedi, Tao Lyu)
 
 - Fix several correctness gaps in the BPF verifier when
   interacting with the BPF stack without CAP_PERFMON
   (Kumar Kartikeya Dwivedi, Eduard Zingerman, Tao Lyu)
 
 - Fix OOB BPF map writes when deleting elements for the case of
   xsk map as well as devmap (Maciej Fijalkowski)
 
 - Fix xsk sockets to always clear DMA mapping information when
   unmapping the pool (Larysa Zaremba)
 
 - Fix sk_mem_uncharge logic in tcp_bpf_sendmsg to only uncharge
   after sent bytes have been finalized (Zijian Zhang)
 
 - Fix BPF sockmap with vsocks which was missing a queue check
   in poll and sockmap cleanup on close (Michal Luczaj)
 
 - Fix tools infra to override makefile ARCH variable if defined
   but empty, which addresses cross-building tools. (Björn Töpel)
 
 - Fix two resolve_btfids build warnings on unresolved bpf_lsm
   symbols (Thomas Weißschuh)
 
 - Fix a NULL pointer dereference in bpftool (Amir Mohammadi)
 
 - Fix BPF selftests to check for CONFIG_PREEMPTION instead of
   CONFIG_PREEMPT (Sebastian Andrzej Siewior)
 
 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ1N8bhUcZGFuaWVsQGlv
 Z2VhcmJveC5uZXQACgkQ2yufC7HISIO6ZAD+ITpujJgxvFGC0R7E9o3XJ7V1SpmR
 SlW0lGpj6vOHTUAA/2MRoZurJSTbdT3fbWiCUgU1rMcwkoErkyxUaPuBci0D
 =kgXL
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Daniel Borkmann::

 - Fix several issues for BPF LPM trie map which were found by syzbot
   and during addition of new test cases (Hou Tao)

 - Fix a missing process_iter_arg register type check in the BPF
   verifier (Kumar Kartikeya Dwivedi, Tao Lyu)

 - Fix several correctness gaps in the BPF verifier when interacting
   with the BPF stack without CAP_PERFMON (Kumar Kartikeya Dwivedi,
   Eduard Zingerman, Tao Lyu)

 - Fix OOB BPF map writes when deleting elements for the case of xsk map
   as well as devmap (Maciej Fijalkowski)

 - Fix xsk sockets to always clear DMA mapping information when
   unmapping the pool (Larysa Zaremba)

 - Fix sk_mem_uncharge logic in tcp_bpf_sendmsg to only uncharge after
   sent bytes have been finalized (Zijian Zhang)

 - Fix BPF sockmap with vsocks which was missing a queue check in poll
   and sockmap cleanup on close (Michal Luczaj)

 - Fix tools infra to override makefile ARCH variable if defined but
   empty, which addresses cross-building tools. (Björn Töpel)

 - Fix two resolve_btfids build warnings on unresolved bpf_lsm symbols
   (Thomas Weißschuh)

 - Fix a NULL pointer dereference in bpftool (Amir Mohammadi)

 - Fix BPF selftests to check for CONFIG_PREEMPTION instead of
   CONFIG_PREEMPT (Sebastian Andrzej Siewior)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (31 commits)
  selftests/bpf: Add more test cases for LPM trie
  selftests/bpf: Move test_lpm_map.c to map_tests
  bpf: Use raw_spinlock_t for LPM trie
  bpf: Switch to bpf mem allocator for LPM trie
  bpf: Fix exact match conditions in trie_get_next_key()
  bpf: Handle in-place update for full LPM trie correctly
  bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
  bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
  bpf: Remove unnecessary check when updating LPM trie
  selftests/bpf: Add test for narrow spill into 64-bit spilled scalar
  selftests/bpf: Add test for reading from STACK_INVALID slots
  selftests/bpf: Introduce __caps_unpriv annotation for tests
  bpf: Fix narrow scalar spill onto 64-bit spilled scalar slots
  bpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_misc
  samples/bpf: Remove unnecessary -I flags from libbpf EXTRA_CFLAGS
  bpf: Zero index arg error string for dynptr and iter
  selftests/bpf: Add tests for iter arg check
  bpf: Ensure reg is PTR_TO_STACK in process_iter_arg
  tools: Override makefile ARCH variable if defined, but empty
  selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap
  ...
2024-12-06 15:07:48 -08:00
Jakub Kicinski
302cc446cb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc2).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05 11:50:14 -08:00
Christian Hopps
d1716d5a44 xfrm: add generic iptfs defines and functionality
Define `XFRM_MODE_IPTFS` and `IPSEC_MODE_IPTFS` constants, and add these to
switch case and conditionals adjacent with the existing TUNNEL modes.

Signed-off-by: Christian Hopps <chopps@labn.net>
Tested-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-12-05 10:01:28 +01:00
Eric Dumazet
5204ccbfa2 inet: add indirect call wrapper for getfrag() calls
UDP send path suffers from one indirect call to ip_generic_getfrag()

We can use INDIRECT_CALL_1() to avoid it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Link: https://patch.msgid.link/20241203173617.2595451-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-04 19:20:52 -08:00
Paolo Abeni
50b9420444 ipmr: tune the ipmr_can_free_table() checks.
Eric reported a syzkaller-triggered splat caused by recent ipmr changes:

WARNING: CPU: 2 PID: 6041 at net/ipv6/ip6mr.c:419
ip6mr_free_table+0xbd/0x120 net/ipv6/ip6mr.c:419
Modules linked in:
CPU: 2 UID: 0 PID: 6041 Comm: syz-executor183 Not tainted
6.12.0-syzkaller-10681-g65ae975e97d5 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:ip6mr_free_table+0xbd/0x120 net/ipv6/ip6mr.c:419
Code: 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c
02 00 75 58 49 83 bc 24 c0 0e 00 00 00 74 09 e8 44 ef a9 f7 90 <0f> 0b
90 e8 3b ef a9 f7 48 8d 7b 38 e8 12 a3 96 f7 48 89 df be 0f
RSP: 0018:ffffc90004267bd8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88803c710000 RCX: ffffffff89e4d844
RDX: ffff88803c52c880 RSI: ffffffff89e4d87c RDI: ffff88803c578ec0
RBP: 0000000000000001 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff88803c578000
R13: ffff88803c710000 R14: ffff88803c710008 R15: dead000000000100
FS: 00007f7a855ee6c0(0000) GS:ffff88806a800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7a85689938 CR3: 000000003c492000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
ip6mr_rules_exit+0x176/0x2d0 net/ipv6/ip6mr.c:283
ip6mr_net_exit_batch+0x53/0xa0 net/ipv6/ip6mr.c:1388
ops_exit_list+0x128/0x180 net/core/net_namespace.c:177
setup_net+0x4fe/0x860 net/core/net_namespace.c:394
copy_net_ns+0x2b4/0x6b0 net/core/net_namespace.c:500
create_new_namespaces+0x3ea/0xad0 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
ksys_unshare+0x45d/0xa40 kernel/fork.c:3334
__do_sys_unshare kernel/fork.c:3405 [inline]
__se_sys_unshare kernel/fork.c:3403 [inline]
__x64_sys_unshare+0x31/0x40 kernel/fork.c:3403
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f7a856332d9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f7a855ee238 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00007f7a856bd308 RCX: 00007f7a856332d9
RDX: 00007f7a8560f8c6 RSI: 0000000000000000 RDI: 0000000062040200
RBP: 00007f7a856bd300 R08: 00007fff932160a7 R09: 00007f7a855ee6c0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7a856bd30c
R13: 0000000000000000 R14: 00007fff93215fc0 R15: 00007fff932160a8
</TASK>

The root cause is a network namespace creation failing after successful
initialization of the ipmr subsystem. Such a case is not currently
matched by the ipmr_can_free_table() helper.

New namespaces are zeroed on allocation and inserted into net ns list
only after successful creation; when deleting an ipmr table, the list
next pointer can be NULL only on netns initialization failure.

Update the ipmr_can_free_table() checks leveraging such condition.

Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+6e8cb445d4b43d006e0c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6e8cb445d4b43d006e0c
Fixes: 11b6e701bc ("ipmr: add debug check for mr table cleanup")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/8bde975e21bbca9d9c27e36209b2dd4f1d7a3f00.1733212078.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-04 18:49:16 -08:00
Fernando Fernandez Mancera
3d501f562f Revert "udp: avoid calling sock_def_readable() if possible"
This reverts commit 612b1c0dec. On a
scenario with multiple threads blocking on a recvfrom(), we need to call
sock_def_readable() on every __udp_enqueue_schedule_skb() otherwise the
threads won't be woken up as __skb_wait_for_more_packets() is using
prepare_to_wait_exclusive().

Link: https://bugzilla.redhat.com/2308477
Fixes: 612b1c0dec ("udp: avoid calling sock_def_readable() if possible")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241202155620.1719-1-ffmancera@riseup.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-03 18:41:10 -08:00
Dong Chenchen
c44daa7e3c net: Fix icmp host relookup triggering ip_rt_bug
arp link failure may trigger ip_rt_bug while xfrm enabled, call trace is:

WARNING: CPU: 0 PID: 0 at net/ipv4/route.c:1241 ip_rt_bug+0x14/0x20
Modules linked in:
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-rc6-00077-g2e1b3cc9d7f7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:ip_rt_bug+0x14/0x20
Call Trace:
 <IRQ>
 ip_send_skb+0x14/0x40
 __icmp_send+0x42d/0x6a0
 ipv4_link_failure+0xe2/0x1d0
 arp_error_report+0x3c/0x50
 neigh_invalidate+0x8d/0x100
 neigh_timer_handler+0x2e1/0x330
 call_timer_fn+0x21/0x120
 __run_timer_base.part.0+0x1c9/0x270
 run_timer_softirq+0x4c/0x80
 handle_softirqs+0xac/0x280
 irq_exit_rcu+0x62/0x80
 sysvec_apic_timer_interrupt+0x77/0x90

The script below reproduces this scenario:
ip xfrm policy add src 0.0.0.0/0 dst 0.0.0.0/0 \
	dir out priority 0 ptype main flag localok icmp
ip l a veth1 type veth
ip a a 192.168.141.111/24 dev veth0
ip l s veth0 up
ping 192.168.141.155 -c 1

icmp_route_lookup() create input routes for locally generated packets
while xfrm relookup ICMP traffic.Then it will set input route
(dst->out = ip_rt_bug) to skb for DESTUNREACH.

For ICMP err triggered by locally generated packets, dst->dev of output
route is loopback. Generally, xfrm relookup verification is not required
on loopback interfaces (net.ipv4.conf.lo.disable_xfrm = 1).

Skip icmp relookup for locally generated packets to fix it.

Fixes: 8b7817f3a9 ("[IPSEC]: Add ICMP host relookup support")
Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241127040850.1513135-1-dongchenchen2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-30 14:17:10 -08:00