Commit Graph

15680 Commits

Author SHA1 Message Date
Alexander Duyck
0fe79f28bf net: allow gro_max_size to exceed 65536
Allow the gro_max_size to exceed a value larger than 65536.

There weren't really any external limitations that prevented this other
than the fact that IPv4 only supports a 16 bit length field. Since we have
the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to
exceed this value and for IPv4 and non-TCP flows we can cap things at 65536
via a constant rather than relying on gro_max_size.

[edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:18:56 +01:00
Eric Dumazet
09f3d1a3a5 ipv6/gso: remove temporary HBH/jumbo header
ipv6 tcp and gro stacks will soon be able to build big TCP packets,
with an added temporary Hop By Hop header.

If GSO is involved for these large packets, we need to remove
the temporary HBH header before segmentation happens.

v2: perform HBH removal from ipv6_gso_segment() instead of
    skb_segment() (Alexander feedback)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:18:56 +01:00
Eric Dumazet
7c96d8ec96 ipv6: add struct hop_jumbo_hdr definition
Following patches will need to add and remove local IPv6 jumbogram
options to enable BIG TCP.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:18:56 +01:00
David S. Miller
1a01a07517 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

This is v2 including deadlock fix in conntrack ecache rework
reported by Jakub Kicinski.

The following patchset contains Netfilter updates for net-next,
mostly updates to conntrack from Florian Westphal.

1) Add a dedicated list for conntrack event redelivery.

2) Include event redelivery list in conntrack dumps of dying type.

3) Remove per-cpu dying list for event redelivery, not used anymore.

4) Add netns .pre_exit to cttimeout to zap timeout objects before
   synchronize_rcu() call.

5) Remove nf_ct_unconfirmed_destroy.

6) Add generation id for conntrack extensions for conntrack
   timeout and helpers.

7) Detach timeout policy from conntrack on cttimeout module removal.

8) Remove __nf_ct_unconfirmed_destroy.

9) Remove unconfirmed list.

10) Remove unconditional local_bh_disable in init_conntrack().

11) Consolidate conntrack iterator nf_ct_iterate_cleanup().

12) Detect if ctnetlink listeners exist to short-circuit event
    path early.

13) Un-inline nf_ct_ecache_ext_add().

14) Add nf_conntrack_events autodetect ctnetlink listener mode
    and make it default.

15) Add nf_ct_ecache_exist() to check for event cache extension.

16) Extend flowtable reverse route lookup to include source, iif,
    tos and mark, from Sven Auhagen.

17) Do not verify zero checksum UDP packets in nf_reject,
    from Kevin Mitchell.

====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:10:37 +01:00
Jakub Kicinski
2c5f153647 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2022-05-13

1) Cleanups for the code behind the XFRM offload API. This is a
   preparation for the extension of the API for policy offload.
   From Leon Romanovsky.

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  xfrm: drop not needed flags variable in XFRM offload struct
  net/mlx5e: Use XFRM state direction instead of flags
  netdevsim: rely on XFRM state direction instead of flags
  ixgbe: propagate XFRM offload state direction instead of flags
  xfrm: store and rely on direction to construct offload flags
  xfrm: rename xfrm_state_offload struct to allow reuse
  xfrm: delete not used number of external headers
  xfrm: free not used XFRM_ESP_NO_TRAILER flag
====================

Link: https://lore.kernel.org/r/20220513151218.4010119-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-13 10:25:08 -07:00
Kevin Mitchell
4f9bd53084 netfilter: conntrack: skip verification of zero UDP checksum
The checksum is optional for UDP packets. However nf_reject would
previously require a valid checksum to elicit a response such as
ICMP_DEST_UNREACH.

Add some logic to nf_reject_verify_csum to determine if a UDP packet has
a zero checksum and should therefore not be verified.

Signed-off-by: Kevin Mitchell <kevmitch@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:28 +02:00
Florian Westphal
8edc813111 netfilter: prefer extension check to pointer check
The pointer check usually results in a 'false positive': its likely
that the ctnetlink module is loaded but no event monitoring is enabled.

After recent change to autodetect ctnetlink usage and only allocate
the ecache extension if a listener is active, check if the extension
is present on a given conntrack.

If its not there, there is nothing to report and calls to the
notification framework can be elided.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:28 +02:00
Florian Westphal
b0a7ab4a77 netfilter: conntrack: un-inline nf_ct_ecache_ext_add
Only called when new ct is allocated or the extension isn't present.
This function will be extended, place this in the conntrack module
instead of inlining.

The callers already depend on nf_conntrack module.
Return value is changed to bool, noone used the returned pointer.

Make sure that the core drops the newly allocated conntrack
if the extension is requested but can't be added.
This makes it necessary to ifdef the section, as the stub
always returns false we'd drop every new conntrack if the
the ecache extension is disabled in kconfig.

Add from data path (xt_CT, nft_ct) is unchanged.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:28 +02:00
Florian Westphal
2794cdb0b9 netfilter: nfnetlink: allow to detect if ctnetlink listeners exist
At this time, every new conntrack gets the 'event cache extension'
enabled for it.

This is because the 'net.netfilter.nf_conntrack_events' sysctl defaults
to 1.

Changing the default to 0 means that commands that rely on the event
notification extension, e.g. 'conntrack -E' or conntrackd, stop working.

We COULD detect if there is a listener by means of
'nfnetlink_has_listeners()' and only add the extension if this is true.

The downside is a dependency from conntrack module to nfnetlink module.

This adds a different way: inc/dec a counter whenever a ctnetlink group
is being (un)subscribed and toggle a flag in struct net.

Next patches will take advantage of this and will only add the event
extension if the flag is set.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:28 +02:00
Pablo Neira Ayuso
8169ff5840 netfilter: conntrack: add nf_ct_iter_data object for nf_ct_iterate_cleanup*()
This patch adds a structure to collect all the context data that is
passed to the cleanup iterator.

 struct nf_ct_iter_data {
       struct net *net;
       void *data;
       u32 portid;
       int report;
 };

There is a netns field that allows to clean up conntrack entries
specifically owned by the specified netns.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:27 +02:00
Florian Westphal
8a75a2c174 netfilter: conntrack: remove unconfirmed list
It has no function anymore and can be removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:53:27 +02:00
Florian Westphal
c56716c69c netfilter: extensions: introduce extension genid count
Multiple netfilter extensions store pointers to external data
in their extension area struct.

Examples:
1. Timeout policies
2. Connection tracking helpers.

No references are taken for these.

When a helper or timeout policy is removed, the conntrack table gets
traversed and affected extensions are cleared.

Conntrack entries not yet in the hashtable are referenced via a special
list, the unconfirmed list.

On removal of a policy or connection tracking helper, the unconfirmed
list gets traversed an all entries are marked as dying, this prevents
them from getting committed to the table at insertion time: core checks
for dying bit, if set, the conntrack entry gets destroyed at confirm
time.

The disadvantage is that each new conntrack has to be added to the percpu
unconfirmed list, and each insertion needs to remove it from this list.
The list is only ever needed when a policy or helper is removed -- a rare
occurrence.

Add a generation ID count: Instead of adding to the list and then
traversing that list on policy/helper removal, increment a counter
that is stored in the extension area.

For unconfirmed conntracks, the extension has the genid valid at ct
allocation time.

Removal of a helper/policy etc. increments the counter.
At confirmation time, validate that ext->genid == global_id.

If the stored number is not the same, do not allow the conntrack
insertion, just like as if a confirmed-list traversal would have flagged
the entry as dying.

After insertion, the genid is no longer relevant (conntrack entries
are now reachable via the conntrack table iterators and is set to 0.

This allows removal of the percpu unconfirmed list.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:16 +02:00
Florian Westphal
17438b42ce netfilter: remove nf_ct_unconfirmed_destroy helper
This helper tags connections not yet in the conntrack table as
dying.  These nf_conn entries will be dropped instead when the
core attempts to insert them from the input or postrouting
'confirm' hook.

After the previous change, the entries get unlinked from the
list earlier, so that by the time the actual exit hook runs,
new connections no longer have a timeout policy assigned.

Its enough to walk the hashtable instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:16 +02:00
Florian Westphal
78222bacfc netfilter: cttimeout: decouple unlink and free on netns destruction
Make it so netns pre_exit unlinks the objects from the pernet list, so
they cannot be found anymore.

netns core issues a synchronize_rcu() before calling the exit hooks so
any the time the exit hooks run unconfirmed nf_conn entries have been
free'd or they have been committed to the hashtable.

The exit hook still tags unconfirmed entries as dying, this can
now be removed in a followup change.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:16 +02:00
Florian Westphal
1397af5bfd netfilter: conntrack: remove the percpu dying list
Its no longer needed. Entries that need event redelivery are placed
on the new pernet dying list.

The advantage is that there is no need to take additional spinlock on
conntrack removal unless event redelivery failed or the conntrack entry
was never added to the table in the first place (confirmed bit not set).

The IPS_CONFIRMED bit now needs to be set as soon as the entry has been
unlinked from the unconfirmed list, else the destroy function may
attempt to unlink it a second time.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:16 +02:00
Florian Westphal
0d3cc504ba netfilter: conntrack: include ecache dying list in dumps
The new pernet dying list includes conntrack entries that await
delivery of the 'destroy' event via ctnetlink.

The old percpu dying list will be removed soon.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:16 +02:00
Florian Westphal
2ed3bf188b netfilter: ecache: use dedicated list for event redelivery
This disentangles event redelivery and the percpu dying list.

Because entries are now stored on a dedicated list, all
entries are in NFCT_ECACHE_DESTROY_FAIL state and all entries
still have confirmed bit set -- the reference count is at least 1.

The 'struct net' back-pointer can be removed as well.

The pcpu dying list will be removed eventually, it has no functionality.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:51:28 +02:00
Eric Dumazet
4915d50e30 inet: add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH()
INET_MATCH() runs without holding a lock on the socket.

We probably need to annotate most reads.

This patch makes INET_MATCH() an inline function
to ease our changes.

v2:

We remove the 32bit version of it, as modern compilers
should generate the same code really, no need to
try to be smarter.

Also make 'struct net *net' the first argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13 12:17:25 +01:00
Martin KaFai Lau
cae3873c5b net: inet: Retire port only listening_hash
The listen sk is currently stored in two hash tables,
listening_hash (hashed by port) and lhash2 (hashed by port and address).

After commit 0ee58dad5b ("net: tcp6: prefer listeners bound to an address")
and commit d9fbc7f643 ("net: tcp: prefer listeners bound to an address"),
the TCP-SYN lookup fast path does not use listening_hash.

The commit 05c0b35709 ("tcp: seq_file: Replace listening_hash with lhash2")
also moved the seq_file (/proc/net/tcp) iteration usage from
listening_hash to lhash2.

There are still a few listening_hash usages left.
One of them is inet_reuseport_add_sock() which uses the listening_hash
to search a listen sk during the listen() system call.  This turns
out to be very slow on use cases that listen on many different
VIPs at a popular port (e.g. 443).  [ On top of the slowness in
adding to the tail in the IPv6 case ].  The latter patch has a
selftest to demonstrate this case.

This patch takes this chance to move all remaining listening_hash
usages to lhash2 and then retire listening_hash.

Since most changes need to be done together, it is hard to cut
the listening_hash to lhash2 switch into small patches.  The
changes in this patch is highlighted here for the review
purpose.

1. Because of the listening_hash removal, lhash2 can use the
   sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node.
   This will also keep the sk_unhashed() check to work as is
   after stop adding sk to listening_hash.

   The union is removed from inet_listen_hashbucket because
   only nulls_head is needed.

2. icsk->icsk_listen_portaddr_node and its helpers are removed.

3. The current lhash2 users needs to iterate with sk_nulls_node
   instead of icsk_listen_portaddr_node.

   One case is in the inet[6]_lhash2_lookup().

   Another case is the seq_file iterator in tcp_ipv4.c.
   One thing to note is sk_nulls_next() is needed
   because the old inet_lhash2_for_each_icsk_continue()
   does a "next" first before iterating.

4. Move the remaining listening_hash usage to lhash2

   inet_reuseport_add_sock() which this series is
   trying to improve.

   inet_diag.c and mptcp_diag.c are the final two
   remaining use cases and is moved to lhash2 now also.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-12 16:52:18 -07:00
Martin KaFai Lau
8ea1eebb49 net: inet: Remove count from inet_listen_hashbucket
After commit 0ee58dad5b ("net: tcp6: prefer listeners bound to an address")
and commit d9fbc7f643 ("net: tcp: prefer listeners bound to an address"),
the count is no longer used.  This patch removes it.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-12 16:52:17 -07:00
Vladimir Oltean
bacf93b056 net: dsa: remove port argument from ->change_tag_protocol()
DSA has not supported (and probably will not support in the future
either) independent tagging protocols per CPU port.

Different switch drivers have different requirements, some may need to
replicate some settings for each CPU port, some may need to apply some
settings on a single CPU port, while some may have to configure some
global settings and then some per-CPU-port settings.

In any case, the current model where DSA calls ->change_tag_protocol for
each CPU port turns out to be impractical for drivers where there are
global things to be done. For example, felix calls dsa_tag_8021q_register(),
which makes no sense per CPU port, so it suppresses the second call.

Let drivers deal with replication towards all CPU ports, and remove the
CPU port argument from the function prototype.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-12 16:38:55 -07:00
Vladimir Oltean
72c3b0c735 net: dsa: felix: manage host flooding using a specific driver callback
At the time - commit 7569459a52 ("net: dsa: manage flooding on the CPU
ports") - not introducing a dedicated switch callback for host flooding
made sense, because for the only user, the felix driver, there was
nothing different to do for the CPU port than set the flood flags on the
CPU port just like on any other bridge port.

There are 2 reasons why this approach is not good enough, however.

(1) Other drivers, like sja1105, support configuring flooding as a
    function of {ingress port, egress port}, whereas the DSA
    ->port_bridge_flags() function only operates on an egress port.
    So with that driver we'd have useless host flooding from user ports
    which don't need it.

(2) Even with the felix driver, support for multiple CPU ports makes it
    difficult to piggyback on ->port_bridge_flags(). The way in which
    the felix driver is going to support host-filtered addresses with
    multiple CPU ports is that it will direct these addresses towards
    both CPU ports (in a sort of multicast fashion), then restrict the
    forwarding to only one of the two using the forwarding masks.
    Consequently, flooding will also be enabled towards both CPU ports.
    However, ->port_bridge_flags() gets passed the index of a single CPU
    port, and that leaves the flood settings out of sync between the 2
    CPU ports.

This is to say, it's better to have a specific driver method for host
flooding, which takes the user port as argument. This solves problem (1)
by allowing the driver to do different things for different user ports,
and problem (2) by abstracting the operation and letting the driver do
whatever, rather than explicitly making the DSA core point to the CPU
port it thinks needs to be touched.

This new method also creates a problem, which is that cross-chip setups
are not handled. However I don't have hardware right now where I can
test what is the proper thing to do, and there isn't hardware compatible
with multi-switch trees that supports host flooding. So it remains a
problem to be tackled in the future.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-12 16:38:55 -07:00
Vladimir Oltean
465c3de42b net: dsa: introduce the dsa_cpu_ports() helper
Similar to dsa_user_ports() which retrieves a port mask of all user
ports, introduce dsa_cpu_ports() which retrieves the mask of all CPU
ports of a switch.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-12 16:38:54 -07:00
Jakub Kicinski
9b19e57a3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Build issue in drivers/net/ethernet/sfc/ptp.c
  54fccfdd7c ("sfc: efx_default_channel_type APIs can be static")
  49e6123c65 ("net: sfc: fix memory leak due to ptp channel")
https://lore.kernel.org/all/20220510130556.52598fe2@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-12 16:15:30 -07:00
Itay Iellin
103a2f3255 Bluetooth: Fix the creation of hdev->name
Set a size limit of 8 bytes of the written buffer to "hdev->name"
including the terminating null byte, as the size of "hdev->name" is 8
bytes. If an id value which is greater than 9999 is allocated,
then the "snprintf(hdev->name, sizeof(hdev->name), "hci%d", id)"
function call would lead to a truncation of the id value in decimal
notation.

Set an explicit maximum id parameter in the id allocation function call.
The id allocation function defines the maximum allocated id value as the
maximum id parameter value minus one. Therefore, HCI_MAX_ID is defined
as 10000.

Signed-off-by: Itay Iellin <ieitayie@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-05-11 17:18:42 -07:00
Paolo Abeni
8b796475fd net/sched: act_pedit: really ensure the skb is writable
Currently pedit tries to ensure that the accessed skb offset
is writable via skb_unclone(). The action potentially allows
touching any skb bytes, so it may end-up modifying shared data.

The above causes some sporadic MPTCP self-test failures, due to
this code:

	tc -n $ns2 filter add dev ns2eth$i egress \
		protocol ip prio 1000 \
		handle 42 fw \
		action pedit munge offset 148 u8 invert \
		pipe csum tcp \
		index 100

The above modifies a data byte outside the skb head and the skb is
a cloned one, carrying a TCP output packet.

This change addresses the issue by keeping track of a rough
over-estimate highest skb offset accessed by the action and ensuring
such offset is really writable.

Note that this may cause performance regressions in some scenarios,
but hopefully pedit is not in the critical path.

Fixes: db2c24175d ("act_pedit: access skb->data safely")
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/1fcf78e6679d0a287dd61bb0f04730ce33b3255d.1652194627.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-11 15:06:42 -07:00
Eric Dumazet
d268c1f5cf net: add CONFIG_DEBUG_NET
This config option enables network debugging checks.

This patch adds DEBUG_NET_WARN_ON_ONCE(cond)
Note that this is not a replacement for WARN_ON_ONCE(cond)
as (cond) is not evaluated if CONFIG_DEBUG_NET is not set.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-11 12:43:10 +01:00
Eric Dumazet
5b87be9e49 net: add include/net/net_debug.h
Remove from include/linux/netdevice.h helpers
that send debug/info/warnings to syslog.

We plan adding more helpers in following patches.

v2: added two includes, and 'struct net_device' forward declaration
    to avoid compile errors (kernel bots)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-11 12:43:10 +01:00
Vladimir Oltean
fe5233b0ba net: dsa: delete dsa_port_walk_{fdbs,mdbs}
All the users of these functions are gone, delete them before they gain
new ones.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-06 21:00:12 -07:00
Leon Romanovsky
254c4a824c xfrm: drop not needed flags variable in XFRM offload struct
After drivers were converted to rely on direction, the flags is not
used anymore and can be removed.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-05-06 08:35:46 +02:00
Leon Romanovsky
482db2f1dd xfrm: store and rely on direction to construct offload flags
XFRM state doesn't need anything from flags except to understand
direction, so store it separately. For future patches, such change
will allow us to reuse xfrm_dev_offload for policy offload too, which
has three possible directions instead of two.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-05-06 08:31:04 +02:00
Leon Romanovsky
87e0a94e60 xfrm: rename xfrm_state_offload struct to allow reuse
The struct xfrm_state_offload has all fields needed to hold information
for offloaded policies too. In order to do not create new struct with
same fields, let's rename existing one and reuse it later.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-05-06 08:29:28 +02:00
Leon Romanovsky
a36708e646 xfrm: delete not used number of external headers
num_exthdrs is set but never used, so delete it.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-05-06 08:25:46 +02:00
Leon Romanovsky
b01a277a05 xfrm: free not used XFRM_ESP_NO_TRAILER flag
After removal of Innova IPsec support from mlx5 driver, the last user
of this XFRM_ESP_NO_TRAILER was gone too. This means that we can safely
remove it as no other hardware is capable (or need) to remove ESP trailer.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-05-06 08:24:20 +02:00
Paolo Abeni
ea66758c17 tcp: allow MPTCP to update the announced window
The MPTCP RFC requires that the MPTCP-level receive window's
right edge never moves backward. Currently the MPTCP code
enforces such constraint while tracking the right edge, but it
does not reflects it on the wire, as MPTCP lacks a suitable hook
to update accordingly the TCP header.

This change modifies the existing mptcp_write_options() hook,
providing the current packet's TCP header to the MPTCP protocol,
so that the next patch could implement the above mentioned
constraint.

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 19:00:15 -07:00
Jakub Kicinski
c4a67a21a6 Revert "Merge branch 'mlxsw-line-card-model'"
This reverts commit 5e927a9f4b, reversing
changes made to cfc1d91a7d.

The discussion is still ongoing so let's remove the uAPI
until the discussion settles.

Link: https://lore.kernel.org/all/20220425090021.32e9a98f@kernel.org/
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20220504154037.539442-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 15:47:23 -07:00
Jakub Kicinski
c8227d568d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/testing/selftests/net/forwarding/Makefile
  f62c5acc80 ("selftests/net/forwarding: add missing tests to Makefile")
  50fe062c80 ("selftests: forwarding: new test, verify host mdb entries")
https://lore.kernel.org/all/20220502111539.0b7e4621@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 13:03:18 -07:00
Willy Tarreau
b2d057560b secure_seq: use the 64 bits of the siphash for port offset calculation
SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit
7cd23e5300 ("secure_seq: use SipHash in place of MD5"), but the output
remained truncated to 32-bit only. In order to exploit more bits from the
hash, let's make the functions return the full 64-bit of siphash_3u32().
We also make sure the port offset calculation in __inet_hash_connect()
remains done on 32-bit to avoid the need for div_u64_rem() and an extra
cost on 32-bit systems.

Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-04 19:22:20 -07:00
Jakub Kicinski
f43f0cd2d9 wireless-next patches for v5.19
First set of patches for v5.19 and this is a big one. We have two new
 drivers, a change in mac80211 STA API affecting most drivers and
 ath11k getting support for WCN6750. And as usual lots of fixes and
 cleanups all over.
 
 Major changes:
 
 new drivers
 
 * wfx: silicon labs devices
 
 * plfxlc: pureLiFi X, XL, XC devices
 
 mac80211
 
 * host based BSS color collision detection
 
 * prepare sta handling for IEEE 802.11be Multi-Link Operation (MLO) support
 
 rtw88
 
 * support TP-Link T2E devices
 
 rtw89
 
 * support firmware crash simulation
 
 * preparation for 8852ce hardware support
 
 ath11k
 
 * Wake-on-WLAN support for QCA6390 and WCN6855
 
 * device recovery (firmware restart) support for QCA6390 and WCN6855
 
 * support setting Specific Absorption Rate (SAR) for WCN6855
 
 * read country code from SMBIOS for WCN6855/QCA6390
 
 * support for WCN6750
 
 wcn36xx
 
 * support for transmit rate reporting to user space
 -----BEGIN PGP SIGNATURE-----
 
 iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmJxS2sRHGt2YWxvQGtl
 cm5lbC5vcmcACgkQbhckVSbrbZuNgwf9H2oxMKLKrlFoX1qHtNBwZuHS6IERhOkM
 NI9DjS4MCyiUSbA5r3sWlpqXQeKIbG/05gUZ6Y0ircGFwnAGjZ6isPwo8pKFgbh5
 QljXQjUTHbkshrXW8K+VGJxw4F1oiPlOGUDVdXPy2FLx5ZvBlaUV2rWQUzsWX9I0
 EnrM6ygHBVejVYDe+JSkb1gzb/07xuZN410IJPuZTPKJfYiE0oGU3zpTbExFitaz
 ObjfFUWqHrVue525WFAJ9Dbk8kYEKyMThr7rkkWekYJjujJLJo0qhEiZVZu0eEsk
 Vq4PdKmQAqlgbShQ/3Mv8BRsSH2wy62+zKjPWL+8t4Gmm9DbLu+++A==
 =T7Ii
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2022-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v5.19

First set of patches for v5.19 and this is a big one. We have two new
drivers, a change in mac80211 STA API affecting most drivers and
ath11k getting support for WCN6750. And as usual lots of fixes and
cleanups all over.

Major changes:

new drivers
 - wfx: silicon labs devices
 - plfxlc: pureLiFi X, XL, XC devices

mac80211
 - host based BSS color collision detection
 - prepare sta handling for IEEE 802.11be Multi-Link Operation (MLO) support

rtw88
 - support TP-Link T2E devices

rtw89
 - support firmware crash simulation
 - preparation for 8852ce hardware support

ath11k
 - Wake-on-WLAN support for QCA6390 and WCN6855
 - device recovery (firmware restart) support for QCA6390 and WCN6855
 - support setting Specific Absorption Rate (SAR) for WCN6855
 - read country code from SMBIOS for WCN6855/QCA6390
 - support for WCN6750

wcn36xx
 - support for transmit rate reporting to user space

* tag 'wireless-next-2022-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (228 commits)
  rtw89: 8852c: rfk: add DPK
  rtw89: 8852c: rfk: add IQK
  rtw89: 8852c: rfk: add RX DCK
  rtw89: 8852c: rfk: add RCK
  rtw89: 8852c: rfk: add TSSI
  rtw89: 8852c: rfk: add LCK
  rtw89: 8852c: rfk: add DACK
  rtw89: 8852c: rfk: add RFK tables
  plfxlc: fix le16_to_cpu warning for beacon_interval
  rtw88: remove a copy of the NAPI_POLL_WEIGHT define
  carl9170: tx: fix an incorrect use of list iterator
  wil6210: use NAPI_POLL_WEIGHT for napi budget
  ath10k: remove a copy of the NAPI_POLL_WEIGHT define
  ath11k: Add support for WCN6750 device
  ath11k: Datapath changes to support WCN6750
  ath11k: HAL changes to support WCN6750
  ath11k: Add QMI changes for WCN6750
  ath11k: Fetch device information via QMI for WCN6750
  ath11k: Add register access logic for WCN6750
  ath11k: Add HW params for WCN6750
  ...
====================

Link: https://lore.kernel.org/r/20220503153622.C1671C385A4@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-03 17:27:51 -07:00
Jakub Kicinski
c5f50500a0 Stefan Schmidt says:
====================
pull-request: ieee802154-next 2022-05-01

Miquel Raynal landed two patch series bundled in this pull request.

The first series re-works the symbol duration handling to better
accommodate the needs of the various phy layers in ieee802154.

In the second series Miquel improves th errors handling from drivers
up mac802154. THis streamlines the error handling throughout the
ieee/mac802154 stack in preparation for sync TX to be introduced for
MLME frames.
====================

Link: https://lore.kernel.org/r/20220501194614.1198325-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-02 13:57:56 -07:00
Pavel Begunkov
de32bc6aad net: inline sock_alloc_send_skb
sock_alloc_send_skb() is simple and just proxying to another function,
so we can inline it and cut associated overhead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-30 12:58:44 +01:00
Erin MacNeil
6fd1d51cfa net: SO_RCVMARK socket option for SO_MARK with recvmsg()
Adding a new socket option, SO_RCVMARK, to indicate that SO_MARK
should be included in the ancillary data returned by recvmsg().

Renamed the sock_recv_ts_and_drops() function to sock_recv_cmsgs().

Signed-off-by: Erin MacNeil <lnx.erin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/r/20220427200259.2564-1-lnx.erin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-28 13:08:15 -07:00
Jakub Kicinski
0e55546b18 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/linux/netdevice.h
net/core/dev.c
  6510ea973d ("net: Use this_cpu_inc() to increment net->core_stats")
  794c24e992 ("net-core: rx_otherhost_dropped to core_stats")
https://lore.kernel.org/all/20220428111903.5f4304e0@canb.auug.org.au/

drivers/net/wan/cosa.c
  d48fea8401 ("net: cosa: fix error check return value of register_chrdev()")
  89fbca3307 ("net: wan: remove support for COSA and SRP synchronous serial boards")
https://lore.kernel.org/all/20220428112130.1f689e5e@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-28 13:02:01 -07:00
Paolo Abeni
febb2d2fa5 bluetooth pull request for net:
- Fix regression causing some HCI events to be discarded when they
    shouldn't.
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmJp06sZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKYIqD/9z5sdKwHfJCBhPA3f9RqrY
 uzJXaza10/rabRGj/sobWxbF662nvImUS9zXsyR1gH8BXG9yOiH/tL1vnP8+Hnze
 o0DHsD2iXOL1HDdt1ZXkyvkAoDlnV8pFyDuakeFG7ddOYIhoq12bMj7Tu8fGsei3
 miFFQwwHMvxPs77zJpkW0E5XqC1UI4LJA7a+ezpP+5Y7Oqzy7FJpv+RTjjtSxdKP
 kSlDUBuqzKuvXyeV4D6T0wyJFk4XFJVjyfwAtBiiXsADVCDSr+eJ6WUMixE04x35
 siykz7gu/Fl79SOJHmOm/ZnJDlO2GFoWmmXA2HomqT1N6CECA7CBfwOG/HQ8mETE
 z0TwYbwQbK4sewMWClz6InrPhfY2P6z47xsohY1DPWpJB+dsvYjLvqxnP7bnVxTc
 ZO3N8fNt3BDZnnqEaKXZyoIOwppS8+q0nGUAvi8nkhh3dMphYg6csDn8iNm0EdkI
 RGkiB+dgoTY9Wwe9GnVysBGC5rFV5uQBwKLkSZeBAgzE2zRVIJvn6RzmExVYDk/V
 nXaFmW8vG/rgXoOIfW1jO3YqgKOPgb+emX6ckmaFA+Z8ICUeYKazfLa2OUAsVILX
 S7LhP5aY9vPkaRr4iXXyt88nRl97NejSC4zc0iGJQijYwQKuClcUWy2fqzTSuUVv
 TrH8ti7QqGR4C+M0uV2N9Q==
 =sidM
 -----END PGP SIGNATURE-----

Merge tag 'for-net-2022-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - Fix regression causing some HCI events to be discarded when they
   shouldn't.

* tag 'for-net-2022-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: hci_sync: Cleanup hci_conn if it cannot be aborted
  Bluetooth: hci_event: Fix creating hci_conn object on error status
  Bluetooth: hci_event: Fix checking for invalid handle on error status
====================

Link: https://lore.kernel.org/r/20220427234031.1257281-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-28 10:18:51 +02:00
Jakub Kicinski
347cb5deae Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2022-04-27

We've added 5 non-merge commits during the last 20 day(s) which contain
a total of 6 files changed, 34 insertions(+), 12 deletions(-).

The main changes are:

1) Fix xsk sockets when rx and tx are separately bound to the same umem, also
   fix xsk copy mode combined with busy poll, from Maciej Fijalkowski.

2) Fix BPF tunnel/collect_md helpers with bpf_xmit lwt hook usage which triggered
   a crash due to invalid metadata_dst access, from Eyal Birger.

3) Fix release of page pool in XDP live packet mode, from Toke Høiland-Jørgensen.

4) Fix potential NULL pointer dereference in kretprobes, from Adam Zabrocki.

   (Masami & Steven preferred this small fix to be routed via bpf tree given it's
    follow-up fix to Masami's rethook work that went via bpf earlier, too.)

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  xsk: Fix possible crash when multiple sockets are created
  kprobes: Fix KRETPROBES when CONFIG_KRETPROBE_ON_RETHOOK is set
  bpf, lwt: Fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hook
  bpf: Fix release of page_pool in BPF_PROG_RUN in test runner
  xsk: Fix l2fwd for copy mode + busy poll combo
====================

Link: https://lore.kernel.org/r/20220427212748.9576-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-27 15:18:40 -07:00
Eric Dumazet
68822bdf76 net: generalize skb freeing deferral to per-cpu lists
Logic added in commit f35f821935 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.

For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_action_rx(),
after incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run net_action_rx() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.

Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.

Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 73076 Mbit.

Kernel profiles on cpu running user thread recvmsg() looks better:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

v2: kdoc issue (kernel bots)
    do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
    replace the sk_buff_head with a single-linked list (Jakub)
    add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-26 17:05:59 -07:00
Luiz Augusto von Dentz
9b3628d79b Bluetooth: hci_sync: Cleanup hci_conn if it cannot be aborted
This attempts to cleanup the hci_conn if it cannot be aborted as
otherwise it would likely result in having the controller and host
stack out of sync with respect to connection handle.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-04-26 20:10:51 +02:00
Luiz Augusto von Dentz
c86cc5a3ec Bluetooth: hci_event: Fix checking for invalid handle on error status
Commit d5ebaa7c5f introduces checks for handle range
(e.g HCI_CONN_HANDLE_MAX) but controllers like Intel AX200 don't seem
to respect the valid range int case of error status:

> HCI Event: Connect Complete (0x03) plen 11
        Status: Page Timeout (0x04)
        Handle: 65535
        Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
	Sound Products Inc)
        Link type: ACL (0x01)
        Encryption: Disabled (0x00)
[1644965.827560] Bluetooth: hci0: Ignoring HCI_Connection_Complete for invalid handle

Because of it is impossible to cleanup the connections properly since
the stack would attempt to cancel the connection which is no longer in
progress causing the following trace:

< HCI Command: Create Connection Cancel (0x01|0x0008) plen 6
        Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
	Sound Products Inc)
= bluetoothd: src/profile.c:record_cb() Unable to get Hands-Free Voice
	gateway SDP record: Connection timed out
> HCI Event: Command Complete (0x0e) plen 10
      Create Connection Cancel (0x01|0x0008) ncmd 1
        Status: Unknown Connection Identifier (0x02)
        Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
	Sound Products Inc)
< HCI Command: Create Connection Cancel (0x01|0x0008) plen 6
        Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
	Sound Products Inc)

Fixes: d5ebaa7c5f ("Bluetooth: hci_event: Ignore multiple conn complete events")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-04-26 20:09:07 +02:00
Maciej Fijalkowski
ba3beec2ec xsk: Fix possible crash when multiple sockets are created
Fix a crash that happens if an Rx only socket is created first, then a
second socket is created that is Tx only and bound to the same umem as
the first socket and also the same netdev and queue_id together with the
XDP_SHARED_UMEM flag. In this specific case, the tx_descs array page
pool was not created by the first socket as it was an Rx only socket.
When the second socket is bound it needs this tx_descs array of this
shared page pool as it has a Tx component, but unfortunately it was
never allocated, leading to a crash. Note that this array is only used
for zero-copy drivers using the batched Tx APIs, currently only ice and
i40e.

[ 5511.150360] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 5511.158419] #PF: supervisor write access in kernel mode
[ 5511.164472] #PF: error_code(0x0002) - not-present page
[ 5511.170416] PGD 0 P4D 0
[ 5511.173347] Oops: 0002 [#1] PREEMPT SMP PTI
[ 5511.178186] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G            E     5.18.0-rc1+ #97
[ 5511.187245] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
[ 5511.198418] RIP: 0010:xsk_tx_peek_release_desc_batch+0x198/0x310
[ 5511.205375] Code: c0 83 c6 01 84 c2 74 6d 8d 46 ff 23 07 44 89 e1 48 83 c0 14 48 c1 e1 04 48 c1 e0 04 48 03 47 10 4c 01 c1 48 8b 50 08 48 8b 00 <48> 89 51 08 48 89 01 41 80 bd d7 00 00 00 00 75 82 48 8b 19 49 8b
[ 5511.227091] RSP: 0018:ffffc90000003dd0 EFLAGS: 00010246
[ 5511.233135] RAX: 0000000000000000 RBX: ffff88810c8da600 RCX: 0000000000000000
[ 5511.241384] RDX: 000000000000003c RSI: 0000000000000001 RDI: ffff888115f555c0
[ 5511.249634] RBP: ffffc90000003e08 R08: 0000000000000000 R09: ffff889092296b48
[ 5511.257886] R10: 0000ffffffffffff R11: ffff889092296800 R12: 0000000000000000
[ 5511.266138] R13: ffff88810c8db500 R14: 0000000000000040 R15: 0000000000000100
[ 5511.274387] FS:  0000000000000000(0000) GS:ffff88903f800000(0000) knlGS:0000000000000000
[ 5511.283746] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5511.290389] CR2: 0000000000000008 CR3: 00000001046e2001 CR4: 00000000003706f0
[ 5511.298640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5511.306892] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5511.315142] Call Trace:
[ 5511.317972]  <IRQ>
[ 5511.320301]  ice_xmit_zc+0x68/0x2f0 [ice]
[ 5511.324977]  ? ktime_get+0x38/0xa0
[ 5511.328913]  ice_napi_poll+0x7a/0x6a0 [ice]
[ 5511.333784]  __napi_poll+0x2c/0x160
[ 5511.337821]  net_rx_action+0xdd/0x200
[ 5511.342058]  __do_softirq+0xe6/0x2dd
[ 5511.346198]  irq_exit_rcu+0xb5/0x100
[ 5511.350339]  common_interrupt+0xa4/0xc0
[ 5511.354777]  </IRQ>
[ 5511.357201]  <TASK>
[ 5511.359625]  asm_common_interrupt+0x1e/0x40
[ 5511.364466] RIP: 0010:cpuidle_enter_state+0xd2/0x360
[ 5511.370211] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 e9 00 7b ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 72 02 00 00 31 ff e8 02 0c 80 ff fb 45 85 f6 <0f> 88 11 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49
[ 5511.391921] RSP: 0018:ffffffff82a03e60 EFLAGS: 00000202
[ 5511.397962] RAX: ffff88903f800000 RBX: 0000000000000001 RCX: 000000000000001f
[ 5511.406214] RDX: 0000000000000000 RSI: ffffffff823400b9 RDI: ffffffff8234c046
[ 5511.424646] RBP: ffff88810a384800 R08: 000005032a28c046 R09: 0000000000000008
[ 5511.443233] R10: 000000000000000b R11: 0000000000000006 R12: ffffffff82bcf700
[ 5511.461922] R13: 000005032a28c046 R14: 0000000000000001 R15: 0000000000000000
[ 5511.480300]  cpuidle_enter+0x29/0x40
[ 5511.494329]  do_idle+0x1c7/0x250
[ 5511.507610]  cpu_startup_entry+0x19/0x20
[ 5511.521394]  start_kernel+0x649/0x66e
[ 5511.534626]  secondary_startup_64_no_verify+0xc3/0xcb
[ 5511.549230]  </TASK>

Detect such case during bind() and allocate this memory region via newly
introduced xp_alloc_tx_descs(). Also, use kvcalloc instead of kcalloc as
for other buffer pool allocations, so that it matches the kvfree() from
xp_destroy().

Fixes: d1bc532e99 ("i40e: xsk: Move tmp desc array from driver to pool")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20220425153745.481322-1-maciej.fijalkowski@intel.com
2022-04-26 16:19:54 +02:00
Miquel Raynal
5a1b57c0dd net: mac802154: Create an error helper for asynchronous offloading errors
A few drivers do the full transmit operation asynchronously, which means
that a bus error that happens when forwarding the packet to the
transmitter or a timeout happening when offloading the request to the
transmitter will not be reported immediately.

The solution in this case is to call this new helper to free the
necessary resources, restart the queue and always return the same
generic TRAC error code: IEEE802154_SYSTEM_ERROR.

Suggested-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/r/20220407100903.1695973-6-miquel.raynal@bootlin.com
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-04-25 20:51:12 +02:00