The code for gso_max_size was added originally to allow for debugging and
workaround of buggy devices that couldn't support TSO with blocks 64K in
size. The original reason for limiting it to 64K was because that was the
existing limits of IPv4 and non-jumbogram IPv6 length fields.
With the addition of Big TCP we can remove this limit and allow the value
to potentially go up to UINT_MAX and instead be limited by the tso_max_size
value.
So in order to support this we need to go through and clean up the
remaining users of the gso_max_size value so that the values will cap at
64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
limit for GSO_MAX_SIZE.
v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
in a new sk_trim_gso_size() helper.
netif_set_tso_max_size() caps the requested TSO size
with GSO_MAX_SIZE.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove from include/linux/netdevice.h helpers
that send debug/info/warnings to syslog.
We plan adding more helpers in following patches.
v2: added two includes, and 'struct net_device' forward declaration
to avoid compile errors (kernel bots)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a physical clock supports a free running cycle counter, then
timestamps shall be based on this time too. For TX it is known in
advance before the transmission if a timestamp based on the free running
cycle counter is needed. For RX it is impossible to know which timestamp
is needed before the packet is received and assigned to a socket.
Support late timestamp determination by a network device. Therefore, an
address/cookie is stored within the new netdev_data field of struct
skb_shared_hwtstamps. This address/cookie is provided to a new network
device function called ndo_get_tstamp(), which returns a timestamp based
on the normal/adjustable time or based on the free running cycle
counter. If function is not supported, then timestamp handling is not
changed.
This mechanism is intended for RX, but TX use is also possible.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add extack support to .ndo_fdb_del in netdevice.h and
all related methods.
Signed-off-by: Alaa Mohamed <eng.alaamohamedsoliman.am@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These are now internal to the core, no need to expose them.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Up until commit 46e6b992c2 ("rtnetlink: allow GSO maximums to
be set on device creation") the gso_max_segs and gso_max_size
of a device were not controlled from user space.
The quoted commit added the ability to control them because of
the following setup:
netns A | netns B
veth<->veth eth0
If eth0 has TSO limitations and user wants to efficiently forward
traffic between eth0 and the veths they should copy the TSO
limitations of eth0 onto the veths. This would happen automatically
for macvlans or ipvlan but veth users are not so lucky (given the
loose coupling).
Unfortunately the commit in question allowed users to also override
the limits on real HW devices.
It may be useful to control the max GSO size and someone may be using
that ability (not that I know of any user), so create a separate set
of knobs to reliably record the TSO limitations. Validate the user
requests.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make later patches smaller create a helper for inheriting
the TSO limitations of a lower device. The TSO in the name
is not an accident, subsequent patches will replace GSO
with TSO in more names.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most drivers should not have to worry about selecting the right
weight for their NAPI instances and pass NAPI_POLL_WEIGHT.
It'd be best if we didn't require the argument at all and selected
the default internally.
This change prepares the ground for such reshuffling, allowing
for a smooth transition. The following API should remain after
the next release cycle:
netif_napi_add()
netif_napi_add_weight()
netif_napi_add_tx()
netif_napi_add_tx_weight()
Where the _weight() variants take an explicit weight argument.
I opted for a _weight() suffix rather than a __ prefix, because
we use __ in places to mean that caller needs to also issue a
synchronize_net() call.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20220502232703.396351-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Inline dev_queue_xmit() and dev_queue_xmit_accel(), they both are small
proxy functions doing nothing but redirecting the control flow to
__dev_queue_xmit().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
netdev_core_stats_alloc() to return a per-CPU pointer.
netdev_core_stats_alloc() will allocate memory on its first invocation
which breaks on PREEMPT_RT because it requires non-atomic context for
memory allocation.
This can be avoided by enabling preemption in netdev_core_stats_alloc()
assuming the caller always disables preemption.
It might be better to replace local_inc() with this_cpu_inc() now that
dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does
not rely on already disabled preemption. This results in less
instructions on x86-64:
local_inc:
| incl %gs:__preempt_count(%rip) # __preempt_count
| movq 488(%rdi), %rax # _1->core_stats, _22
| testq %rax, %rax # _22
| je .L585 #,
| add %gs:this_cpu_off(%rip), %rax # this_cpu_off, tcp_ptr__
| .L586:
| testq %rax, %rax # _27
| je .L587 #,
| incq (%rax) # _6->a.counter
| .L587:
| decl %gs:__preempt_count(%rip) # __preempt_count
this_cpu_inc(), this patch:
| movq 488(%rdi), %rax # _1->core_stats, _5
| testq %rax, %rax # _5
| je .L591 #,
| .L585:
| incq %gs:(%rax) # _18->rx_dropped
Use unsigned long as type for the counter. Use this_cpu_inc() to
increment the counter. Use a plain read of the counter.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Logic added in commit f35f821935 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.
But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.
For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.
For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.
Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.
This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.
This new per-cpu list is drained at the end of net_action_rx(),
after incoming packets have been processed, to lower latencies.
In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run net_action_rx() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.
Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.
Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.
Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)
Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.
10 runs of one TCP_STREAM flow
Before:
Average throughput: 49685 Mbit.
Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)
57.81% [kernel] [k] copy_user_enhanced_fast_string
(*) 12.87% [kernel] [k] skb_release_data
(*) 4.25% [kernel] [k] __free_one_page
(*) 3.57% [kernel] [k] __list_del_entry_valid
1.85% [kernel] [k] __netif_receive_skb_core
1.60% [kernel] [k] __skb_datagram_iter
(*) 1.59% [kernel] [k] free_unref_page_commit
(*) 1.16% [kernel] [k] __slab_free
1.16% [kernel] [k] _copy_to_iter
(*) 1.01% [kernel] [k] kfree
(*) 0.88% [kernel] [k] free_unref_page
0.57% [kernel] [k] ip6_rcv_core
0.55% [kernel] [k] ip6t_do_table
0.54% [kernel] [k] flush_smp_call_function_queue
(*) 0.54% [kernel] [k] free_pcppages_bulk
0.51% [kernel] [k] llist_reverse_order
0.38% [kernel] [k] process_backlog
(*) 0.38% [kernel] [k] free_pcp_prepare
0.37% [kernel] [k] tcp_recvmsg_locked
(*) 0.37% [kernel] [k] __list_add_valid
0.34% [kernel] [k] sock_rfree
0.34% [kernel] [k] _raw_spin_lock_irq
(*) 0.33% [kernel] [k] __page_cache_release
0.33% [kernel] [k] tcp_v6_rcv
(*) 0.33% [kernel] [k] __put_page
(*) 0.29% [kernel] [k] __mod_zone_page_state
0.27% [kernel] [k] _raw_spin_lock
After patch:
Average throughput: 73076 Mbit.
Kernel profiles on cpu running user thread recvmsg() looks better:
81.35% [kernel] [k] copy_user_enhanced_fast_string
1.95% [kernel] [k] _copy_to_iter
1.95% [kernel] [k] __skb_datagram_iter
1.27% [kernel] [k] __netif_receive_skb_core
1.03% [kernel] [k] ip6t_do_table
0.60% [kernel] [k] sock_rfree
0.50% [kernel] [k] tcp_v6_rcv
0.47% [kernel] [k] ip6_rcv_core
0.45% [kernel] [k] read_tsc
0.44% [kernel] [k] _raw_spin_lock_irqsave
0.37% [kernel] [k] _raw_spin_lock
0.37% [kernel] [k] native_irq_return_iret
0.33% [kernel] [k] __inet6_lookup_established
0.31% [kernel] [k] ip6_protocol_deliver_rcu
0.29% [kernel] [k] tcp_rcv_established
0.29% [kernel] [k] llist_reverse_order
v2: kdoc issue (kernel bots)
do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
replace the sk_buff_head with a single-linked list (Jakub)
add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch fixes issue:
* If we install tc filters with act_skbedit in clsact hook.
It doesn't work, because netdev_core_pick_tx() overwrites
queue_mapping.
$ tc filter ... action skbedit queue_mapping 1
And this patch is useful:
* We can use FQ + EDT to implement efficient policies. Tx queues
are picked by xps, ndo_select_queue of netdev driver, or skb hash
in netdev_core_pick_tx(). In fact, the netdev driver, and skb
hash are _not_ under control. xps uses the CPUs map to select Tx
queues, but we can't figure out which task_struct of pod/containter
running on this cpu in most case. We can use clsact filters to classify
one pod/container traffic to one Tx queue. Why ?
In containter networking environment, there are two kinds of pod/
containter/net-namespace. One kind (e.g. P1, P2), the high throughput
is key in these applications. But avoid running out of network resource,
the outbound traffic of these pods is limited, using or sharing one
dedicated Tx queues assigned HTB/TBF/FQ Qdisc. Other kind of pods
(e.g. Pn), the low latency of data access is key. And the traffic is not
limited. Pods use or share other dedicated Tx queues assigned FIFO Qdisc.
This choice provides two benefits. First, contention on the HTB/FQ Qdisc
lock is significantly reduced since fewer CPUs contend for the same queue.
More importantly, Qdisc contention can be eliminated completely if each
CPU has its own FIFO Qdisc for the second kind of pods.
There must be a mechanism in place to support classifying traffic based on
pods/container to different Tx queues. Note that clsact is outside of Qdisc
while Qdisc can run a classifier to select a sub-queue under the lock.
In general recording the decision in the skb seems a little heavy handed.
This patch introduces a per-CPU variable, suggested by Eric.
The xmit.skip_txqueue flag is firstly cleared in __dev_queue_xmit().
- Tx Qdisc may install that skbedit actions, then xmit.skip_txqueue flag
is set in qdisc->enqueue() though tx queue has been selected in
netdev_tx_queue_mapping() or netdev_core_pick_tx(). That flag is cleared
firstly in __dev_queue_xmit(), is useful:
- Avoid picking Tx queue with netdev_tx_queue_mapping() in next netdev
in such case: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy:
For example, eth0, macvlan in pod, which root Qdisc install skbedit
queue_mapping, send packets to eth0.3, vlan in host. In __dev_queue_xmit() of
eth0.3, clear the flag, does not select tx queue according to skb->queue_mapping
because there is no filters in clsact or tx Qdisc of this netdev.
Same action taked in eth0, ixgbe in Host.
- Avoid picking Tx queue for next packet. If we set xmit.skip_txqueue
in tx Qdisc (qdisc->enqueue()), the proper way to clear it is clearing it
in __dev_queue_xmit when processing next packets.
For performance reasons, use the static key. If user does not config the NET_EGRESS,
the patch will not be compiled.
+----+ +----+ +----+
| P1 | | P2 | | Pn |
+----+ +----+ +----+
| | |
+-----------+-----------+
|
| clsact/skbedit
| MQ
v
+-----------+-----------+
| q0 | q1 | qn
v v v
HTB/FQ HTB/FQ ... FIFO
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add a new netdev op called ndo_fdb_del_bulk, it will be later used for
driver-specific bulk delete implementation dispatched from rtnetlink. The
first user will be the bridge, we need it to signal to rtnetlink from
the driver that we support bulk delete operation (NLM_F_BULK).
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Increment rx_otherhost_dropped counter when packet dropped due to
mismatched dest MAC addr.
An example when this drop can occur is when manually crafting raw
packets that will be consumed by a user space application via a tap
device. For testing purposes local traffic was generated using trafgen
for the client and netcat to start a server
Tested: Created 2 netns, sent 1 packet using trafgen from 1 to the other
with "{eth(daddr=$INCORRECT_MAC...}", verified that iproute2 showed the
counter was incremented. (Also had to modify iproute2 to show the stat,
additional patch for that coming next.)
Signed-off-by: Jeffrey Ji <jeffreyji@google.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220406172600.1141083-1-jeffreyjilinux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There's a number of functions and static variables used
under net/core/ but not from the outside. We currently
dump most of them into netdevice.h. That bad for many
reasons:
- netdevice.h is very cluttered, hard to figure out
what the APIs are;
- netdevice.h is very long;
- we have to touch netdevice.h more which causes expensive
incremental builds.
Create a header under net/core/ and move some declarations.
The new header is also a bit of a catch-all but that's
fine, if we create more specific headers people will
likely over-think where their declaration fit best.
And end up putting them in netdevice.h, again.
More work should be done on splitting netdevice.h into more
targeted headers, but that'd be more time consuming so small
steps.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This allows hardware flow offloading from Ethernet to WLAN on MT7622 SoC
Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
In [1], Will raised a potential issue that the cfg80211 code,
which does (from a locking perspective)
rtnl_lock()
wiphy_lock()
rtnl_unlock()
might be suspectible to ABBA deadlocks, because rtnl_unlock()
calls netdev_run_todo(), which might end up calling rtnl_lock()
again, which could then deadlock (see the comment in the code
added here for the scenario).
Some back and forth and thinking ensued, but clearly this can't
happen if the net_todo_list is empty at the rtnl_unlock() here.
Clearly, the code here cannot actually put an entry on it, and
all other users of rtnl_unlock() will empty it since that will
always go through netdev_run_todo(), emptying the list.
So the only other way to get there would be to add to the list
and then unlock the RTNL without going through rtnl_unlock(),
which is only possible through __rtnl_unlock(). However, this
isn't exported and not used in many places, and none of them
seem to be able to unregister before using it.
Therefore, add a WARN_ON() in the code to ensure this invariant
won't be broken, so that the cfg80211 (or any similar) code
stays safe.
[1] https://lore.kernel.org/r/Yjzpo3TfZxtKPMAG@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20220404113847.0ee02e4a70da.Ic73d206e217db20fd22dcec14fe5442ca732804b@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Building htmldocs complains:
include/linux/netdevice.h:2295: warning: Function parameter or member 'dm_private' not described in 'net_device'
Fixes: b26ef81c46 ("drop_monitor: remove quadratic behavior")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220322051053.1883186-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Network drivers can call to netif_get_num_default_rss_queues to get the
default number of receive queues to use. Right now, this default number
is min(8, num_online_cpus()).
Instead, as suggested by Jakub, use the number of physical cores divided
by 2 as a way to avoid wasting CPU resources and to avoid using both CPU
threads, but still allowing to scale for high-end processors with many
cores.
As an exception, select 2 queues for processors with 2 cores, because
otherwise it won't take any advantage of RSS despite being SMP capable.
Tested: Processor Intel Xeon E5-2620 (2 sockets, 6 cores/socket, 2
threads/core). NIC Broadcom NetXtreme II BCM57810 (10GBps). Ran some
tests with `perf stat iperf3 -R`, with parallelisms of 1, 8 and 24,
getting the following results:
- Number of queues: 6 (instead of 8)
- Network throughput: not affected
- CPU usage: utilized 0.05-0.12 CPUs more than before (having 24 CPUs
this is only 0.2-0.5% higher)
- Reduced the number of context switches by 7-50%, being more noticeable
when using a higher number of parallel threads.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Link: https://lore.kernel.org/r/20220315091832.13873-1-ihuguet@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before adding yet another possibly contended atomic_long_t,
it is time to add per-cpu storage for existing ones:
dev->tx_dropped, dev->rx_dropped, and dev->rx_nohandler
Because many devices do not have to increment such counters,
allocate the per-cpu storage on demand, so that dev_get_stats()
does not have to spend considerable time folding zero counters.
Note that some drivers have abused these counters which
were supposed to be only used by core networking stack.
v4: should use per_cpu_ptr() in dev_get_stats() (Jakub)
v3: added a READ_ONCE() in netdev_core_stats_alloc() (Paolo)
v2: add a missing include (reported by kernel test robot <lkp@intel.com>)
Change in netdev_core_stats_alloc() (Jakub)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: jeffreyji <jeffreyji@google.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220311051420.2608812-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The esp tunnel GSO handlers use skb_mac_gso_segment to
push the inner packet to the segmentation handlers.
However, skb_mac_gso_segment takes the Ethernet Protocol
ID from 'skb->protocol' which is wrong for inter address
family tunnels. We fix this by introducing a new
skb_eth_gso_segment function.
This function can be used if it is necessary to pass the
Ethernet Protocol ID directly to the segmentation handler.
First users of this function will be the esp4 and esp6
tunnel segmentation handlers.
Fixes: c35fe4106b ("xfrm: Add mode handlers for IPsec on layer 2")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Remove netif_rx_any_context and netif_rx_ni() because there are no more
users in tree.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Offloading switch device drivers may be able to collect statistics of the
traffic taking place in the HW datapath that pertains to a certain soft
netdevice, such as VLAN. Add the necessary infrastructure to allow exposing
these statistics to the offloaded netdevice in question. The API was shaped
by the following considerations:
- Collection of HW statistics is not free: there may be a finite number of
counters, and the act of counting may have a performance impact. It is
therefore necessary to allow toggling whether HW counting should be done
for any particular SW netdevice.
- As the drivers are loaded and removed, a particular device may get
offloaded and unoffloaded again. At the same time, the statistics values
need to stay monotonic (modulo the eventual 64-bit wraparound),
increasing only to reflect traffic measured in the device.
To that end, the netdevice keeps around a lazily-allocated copy of struct
rtnl_link_stats64. Device drivers then contribute to the values kept
therein at various points. Even as the driver goes away, the struct stays
around to maintain the statistics values.
- Different HW devices may be able to count different things. The
motivation behind this patch in particular is exposure of HW counters on
Nvidia Spectrum switches, where the only practical approach to counting
traffic on offloaded soft netdevices currently is to use router interface
counters, and count L3 traffic. Correspondingly that is the statistics
suite added in this patch.
Other devices may be able to measure different kinds of traffic, and for
that reason, the APIs are built to allow uniform access to different
statistics suites.
- Because soft netdevices and offloading drivers are only loosely bound, a
netdevice uses a notifier chain to communicate with the drivers. Several
new notifiers, NETDEV_OFFLOAD_XSTATS_*, have been added to carry messages
to the offloading drivers.
- Devices can have various conditions for when a particular counter is
available. As the device is configured and reconfigured, the device
offload may become or cease being suitable for counter binding. A
netdevice can use a notifier type NETDEV_OFFLOAD_XSTATS_REPORT_USED to
ping offloading drivers and determine whether anyone currently implements
a given statistics suite. This information can then be propagated to user
space.
When the driver decides to unoffload a netdevice, it can use a
newly-added function, netdev_offload_xstats_report_delta(), to record
outstanding collected statistics, before destroying the HW counter.
This patch adds a helper, call_netdevice_notifiers_info_robust(), for
dispatching a notifier with the possibility of unwind when one of the
consumers bails. Given the wish to eventually get rid of the global
notifier block altogether, this helper only invokes the per-netns notifier
block.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
drop_monitor is using an unique list on which all netdevices in
the host have an element, regardless of their netns.
This scales poorly, not only at device unregister time (what I
caught during my netns dismantle stress tests), but also at packet
processing time whenever trace_napi_poll_hit() is called.
If the intent was to avoid adding one pointer in 'struct net_device'
then surely we prefer O(1) behavior.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave suggested a while ago (eleven years by now) "Let's make netif_rx()
work in all contexts and get rid of netif_rx_ni()". Eric agreed and
pointed out that modern devices should use netif_receive_skb() to avoid
the overhead.
In the meantime someone added another variant, netif_rx_any_context(),
which behaves as suggested.
netif_rx() must be invoked with disabled bottom halves to ensure that
pending softirqs, which were raised within the function, are handled.
netif_rx_ni() can be invoked only from process context (bottom halves
must be enabled) because the function handles pending softirqs without
checking if bottom halves were disabled or not.
netif_rx_any_context() invokes on the former functions by checking
in_interrupts().
netif_rx() could be taught to handle both cases (disabled and enabled
bottom halves) by simply disabling bottom halves while invoking
netif_rx_internal(). The local_bh_enable() invocation will then invoke
pending softirqs only if the BH-disable counter drops to zero.
Eric is concerned about the overhead of BH-disable+enable especially in
regard to the loopback driver. As critical as this driver is, it will
receive a shortcut to avoid the additional overhead which is not needed.
Add a local_bh_disable() section in netif_rx() to ensure softirqs are
handled if needed.
Provide __netif_rx() which does not disable BH and has a lockdep assert
to ensure that interrupts are disabled. Use this shortcut in the
loopback driver and in drivers/net/*.c.
Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they
can be removed once they are no more users left.
Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert one dev_hold()/dev_put() pair in register_netdevice()
and unregister_netdevice_many() to dev_hold_track()
and dev_put_track().
This would allow to detect a rogue dev_put() a bit earlier.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220207184107.1401096-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We are still chasing some syzbot reports where we think a rogue dev_put()
is called with no corresponding prior dev_hold().
Unfortunately it eats a reference on dev->dev_refcnt taken by innocent
dev_hold_track(), meaning that the refcount saturation splat comes
too late to be useful.
Make sure that 'not tracked' dev_put() and dev_hold() better use
CONFIG_NET_DEV_REFCNT_TRACKER=y debug infrastructure:
Prior patch in the series allowed ref_tracker_alloc() and ref_tracker_free()
to be called with a NULL @trackerp parameter, and to use a separate refcount
only to detect too many put() even in the following case:
dev_hold_track(dev, tracker_1, GFP_ATOMIC);
dev_hold(dev);
dev_put(dev);
dev_put(dev); // Should complain loudly here.
dev_put_track(dev, tracker_1); // instead of here
Add clarification about netdev_tracker_alloc() role.
v2: I replaced the dev_put() in linkwatch_do_dev()
with __dev_put() because callers called netdev_tracker_free().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In one net namespace, after creating a packet socket without binding
it to a device, users in other net namespaces can observe the new
`packet_type` added by this packet socket by reading `/proc/net/ptype`
file. This is minor information leakage as packet socket is
namespace aware.
Add a net pointer in `packet_type` to keep the net namespace of
of corresponding packet socket. In `ptype_seq_show`, this net pointer
must be checked when it is not NULL.
Fixes: 2feb27dbe0 ("[NETNS]: Minor information leak via /proc/net/ptype file.")
Signed-off-by: Congyu Liu <liu3101@purdue.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet suggested to allow users to modify max GRO packet size.
We have seen GRO being disabled by users of appliances (such as
wifi access points) because of claimed bufferbloat issues,
or some work arounds in sch_cake, to split GRO/GSO packets.
Instead of disabling GRO completely, one can chose to limit
the maximum packet size of GRO packets, depending on their
latency constraints.
This patch adds a per device gro_max_size attribute
that can be changed with ip link command.
ip link set dev eth0 gro_max_size 16000
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit d64c2a7612 ("staging: irda: remove the irda network stack and
drivers") removes the config IRDA.
Remove the remaining references to this non-existing config in the network
header files.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/r/20211229113620.19368-1-lukas.bulwahn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use flow_indr_dev_register/flow_indr_dev_setup_offload to
offload tc action.
We need to call tc_cleanup_flow_action to clean up tc action entry since
in tc_setup_action, some actions may hold dev refcnt, especially the mirror
action.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The double 'as' in a comment is repeated, thus it should be removed.
Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most notable changes are in af_packet, tipc ones are trivial.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have 100+ syzbot reports about netns being dismantled too soon,
still unresolved as of today.
We think a missing get_net() or an extra put_net() is the root cause.
In order to find the bug(s), and be able to spot future ones,
this patch adds CONFIG_NET_NS_REFCNT_TRACKER and new helpers
to precisely pair all put_net() with corresponding get_net().
To use these helpers, each data structure owning a refcount
should also use a "netns_tracker" to pair the get and put.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a netdevice_tracker inside struct net_device, to track
the self reference when a device has an active watchdog timer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a netdevice_tracker inside struct net_device, to track
the self reference when a device is in lweventlist.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We want to track all dev_hold()/dev_put() to ease leak hunting.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This will help debugging pesky netdev reference leaks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net device are refcounted. Over the years we had numerous bugs
caused by imbalanced dev_hold() and dev_put() calls.
The general idea is to be able to precisely pair each decrement with
a corresponding prior increment. Both share a cookie, basically
a pointer to private data storing stack traces.
This patch adds dev_hold_track() and dev_put_track().
To use these helpers, each data structure owning a refcount
should also use a "netdevice_tracker" to pair the hold and put.
netdevice_tracker dev_tracker;
...
dev_hold_track(dev, &dev_tracker, GFP_ATOMIC);
...
dev_put_track(dev, &dev_tracker);
Whenever a leak happens, we will get precise stack traces
of the point dev_hold_track() happened, at device dismantle phase.
We will also get a stack trace if too many dev_put_track() for the same
netdevice_tracker are attempted.
This is guarded by CONFIG_NET_DEV_REFCNT_TRACKER option.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
.ndo_change_proto_down was added seemingly to enable out-of-tree
implementations. Over 2.5yrs later we still have no real users
upstream. Hardwire the generic implementation for now, we can
revert once real users materialize. (rocker is a test vehicle,
not a user.)
We need to drop the optimization on the sysfs side, because
unlike ndos priv_flags will be changed at runtime, so we'd
need READ_ONCE/WRITE_ONCE everywhere..
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->gso_max_segs is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.
Add netif_set_gso_max_segs() helper.
Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs()
where we can to better document what is going on.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->gso_max_size is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.
Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_size()
where we can to better document what is going on.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev->dev_addr should only be modified via helpers,
but someone may be casting off the const. Add a runtime
check to catch abuses.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 406f42fa0d ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. We converted all users to make modifications via appropriate
helpers, make netdev->dev_addr const.
The update helpers need to upcast from the buffer to
struct netdev_hw_addr.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
These are not fast path, there is no point in inlining them.
Also provide netif_freeze_queues()/netif_unfreeze_queues()
so that we can use them from dev_watchdog() in the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In following patches, dev_watchdog() will no longer stop all queues.
It will read queue->trans_start locklessly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tx_timeout_show() assumed dev_watchdog() would stop all
the queues, to fetch queue->trans_timeout under protection
of the queue->_xmit_lock.
As we want to no longer disrupt transmits, we use an
atomic_long_t instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: david decotigny <david.decotigny@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same rationale than prior patch : using the dedicated
section avoid holes and pack all these bool values.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move gro code and data from net/core/dev.c to net/core/gro.c
to ease maintenance.
gro_normal_list() and gro_normal_one() are inlined
because they are called from both files.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/gro.c will contain all core gro functions,
to shrink net/core/skbuff.c and net/core/dev.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helper is used once, no need to keep it in fat net/core/skbuff.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
include/linux/netdevice.h became too big, move gro stuff
into include/net/gro.h
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS for net-next:
1) Add new run_estimation toggle to IPVS to stop the estimation_timer
logic, from Dust Li.
2) Relax superfluous dynset check on NFT_SET_TIMEOUT.
3) Add egress hook, from Lukas Wunner.
4) Nowadays, almost all hook functions in x_table land just call the hook
evaluation loop. Remove remaining hook wrappers from iptables and IPVS.
From Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The Qdisc::running sequence counter has two uses:
1. Reliably reading qdisc's tc statistics while the qdisc is running
(a seqcount read/retry loop at gnet_stats_add_basic()).
2. As a flag, indicating whether the qdisc in question is running
(without any retry loops).
For the first usage, the Qdisc::running sequence counter write section,
qdisc_run_begin() => qdisc_run_end(), covers a much wider area than what
is actually needed: the raw qdisc's bstats update. A u64_stats sync
point was thus introduced (in previous commits) inside the bstats
structure itself. A local u64_stats write section is then started and
stopped for the bstats updates.
Use that u64_stats sync point mechanism for the bstats read/retry loop
at gnet_stats_add_basic().
For the second qdisc->running usage, a __QDISC_STATE_RUNNING bit flag,
accessed with atomic bitops, is sufficient. Using a bit flag instead of
a sequence counter at qdisc_run_begin/end() and qdisc_is_running() leads
to the SMP barriers implicitly added through raw_read_seqcount() and
write_seqcount_begin/end() getting removed. All call sites have been
surveyed though, and no required ordering was identified.
Now that the qdisc->running sequence counter is no longer used, remove
it.
Note, using u64_stats implies no sequence counter protection for 64-bit
architectures. This can lead to the qdisc tc statistics "packets" vs.
"bytes" values getting out of sync on rare occasions. The individual
values will still be valid.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support classifying packets with netfilter on egress to satisfy user
requirements such as:
* outbound security policies for containers (Laura)
* filtering and mangling intra-node Direct Server Return (DSR) traffic
on a load balancer (Laura)
* filtering locally generated traffic coming in through AF_PACKET,
such as local ARP traffic generated for clustering purposes or DHCP
(Laura; the AF_PACKET plumbing is contained in a follow-up commit)
* L2 filtering from ingress and egress for AVB (Audio Video Bridging)
and gPTP with nftables (Pablo)
* in the future: in-kernel NAT64/NAT46 (Pablo)
The egress hook introduced herein complements the ingress hook added by
commit e687ad60af ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key"). A patch for nftables to hook up
egress rules from user space has been submitted separately, so users may
immediately take advantage of the feature.
Alternatively or in addition to netfilter, packets can be classified
with traffic control (tc). On ingress, packets are classified first by
tc, then by netfilter. On egress, the order is reversed for symmetry.
Conceptually, tc and netfilter can be thought of as layers, with
netfilter layered above tc.
Traffic control is capable of redirecting packets to another interface
(man 8 tc-mirred). E.g., an ingress packet may be redirected from the
host namespace to a container via a veth connection:
tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)
In this case, netfilter egress classifying is not performed when leaving
the host namespace! That's because the packet is still on the tc layer.
If tc redirects the packet to a physical interface in the host namespace
such that it leaves the system, the packet is never subjected to
netfilter egress classifying. That is only logical since it hasn't
passed through netfilter ingress classifying either.
Packets can alternatively be redirected at the netfilter layer using
nft fwd. Such a packet *is* subjected to netfilter egress classifying
since it has reached the netfilter layer.
Internally, the skb->nf_skip_egress flag controls whether netfilter is
invoked on egress by __dev_queue_xmit(). Because __dev_queue_xmit() may
be called recursively by tunnel drivers such as vxlan, the flag is
reverted to false after sch_handle_egress(). This ensures that
netfilter is applied both on the overlay and underlying network.
Interaction between tc and netfilter is possible by setting and querying
skb->mark.
If netfilter egress classifying is not enabled on any interface, it is
patched out of the data path by way of a static_key and doesn't make a
performance difference that is discernible from noise:
Before: 1537 1538 1538 1537 1538 1537 Mb/sec
After: 1536 1534 1539 1539 1539 1540 Mb/sec
Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
After + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
Before + tc drop: 1620 1619 1619 1619 1620 1620 Mb/sec
After + tc drop: 1616 1624 1625 1624 1622 1619 Mb/sec
When netfilter egress classifying is enabled on at least one interface,
a minimal performance penalty is incurred for every egress packet, even
if the interface it's transmitted over doesn't have any netfilter egress
rules configured. That is caused by checking dev->nf_hooks_egress
against NULL.
Measurements were performed on a Core i7-3615QM. Commands to reproduce:
ip link add dev foo type dummy
ip link set dev foo up
modprobe pktgen
echo "add_device foo" > /proc/net/pktgen/kpktgend_3
samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1
Accept all traffic with tc:
tc qdisc add dev foo clsact
tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
Drop all traffic with tc:
tc qdisc add dev foo clsact
tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'
Apply this patch when measuring packet drops to avoid errors in dmesg:
https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Laura García Liébana <nevola@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The implement of function netdev_all_upper_get_next_dev_rcu has been
removed in:
commit f1170fd462 ("net: Remove all_adj_list and its references")
so delete redundant declaration in header file.
Fixes: f1170fd462 ("net: Remove all_adj_list and its references")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Link: https://lore.kernel.org/r/20211013094702.3931071-1-chenwandun@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__dev_addr_set() and dev_addr_mod() and pretty low level,
let the arguments be void, there's no chance for confusion
in callers converted to use them. Keep u8 in dev_addr_set()
because some of the callers are converted from a loop
and we want to make sure assignments are not from an array
of a different type.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Open code which is dev->priv_flags & IFF_MACSEC has already defined as
netif_is_macsec(). So use netif_is_macsec() instead of open code.
This patch doesn't change logic.
Signed-off-by: Juhee Kang <claudiajkang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__dev_get_by_name is currently used to either retrieve a net device
reference using its name or to check if a name is already used by a
registered net device (per ns). In the later case there is no need to
return a reference to a net device.
Introduce a new helper, netdev_name_in_use, to check if a name is
currently used by a registered net device without leaking a reference
the corresponding net device. This helper uses netdev_name_node_lookup
instead of __dev_get_by_name as we don't need the extra logic retrieving
a reference to the corresponding net device.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent work on converting address list to a tree made it obvious
we need an abstraction around writing netdev->dev_addr. Without
such abstraction updating the main device address is invisible
to the core.
Introduce a number of helpers which for now just wrap memcpy()
but in the future can make necessary changes to the address
tree.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A common implementation of isatty(3) involves calling a ioctl passing
a dummy struct argument and checking whether the syscall failed --
bionic and glibc use TCGETS (passing a struct termios), and musl uses
TIOCGWINSZ (passing a struct winsize). If the FD is a socket, we will
copy sizeof(struct ifreq) bytes of data from the argument and return
-EFAULT if that fails. The result is that the isatty implementations
may return a non-POSIX-compliant value in errno in the case where part
of the dummy struct argument is inaccessible, as both struct termios
and struct winsize are smaller than struct ifreq (at least on arm64).
Although there is usually enough stack space following the argument
on the stack that this did not present a practical problem up to now,
with MTE stack instrumentation it's more likely for the copy to fail,
as the memory following the struct may have a different tag.
Fix the problem by adding an early check for whether the ioctl is a
valid socket ioctl, and return -ENOTTY if it isn't.
Fixes: 44c02a2c3d ("dev_ioctl(): move copyin/copyout to callers")
Link: https://linux-review.googlesource.com/id/I869da6cf6daabc3e4b7b82ac979683ba05e27d4d
Signed-off-by: Peter Collingbourne <pcc@google.com>
Cc: <stable@vger.kernel.org> # 4.19
Signed-off-by: David S. Miller <davem@davemloft.net>
The source of most of the slow down is the `dev_addr_lists.c` module,
which mainatins a linked list of HW addresses.
When using IPv6, this list grows for each IPv6 address added on a
VLAN, since each IPv6 address has a multicast HW address associated with
it.
When performing any modification to the involved links, this list is
traversed many times, often for nothing, all while holding the RTNL
lock.
Instead, this patch adds an auxilliary rbtree which cuts down
traversal time significantly.
Performance can be seen with the following script:
#!/bin/bash
ip netns del test || true 2>/dev/null
ip netns add test
echo 1 | ip netns exec test tee /proc/sys/net/ipv6/conf/all/keep_addr_on_down > /dev/null
set -e
ip -n test link add foo type veth peer name bar
ip -n test link add b1 type bond
ip -n test link add florp type vrf table 10
ip -n test link set bar master b1
ip -n test link set foo up
ip -n test link set bar up
ip -n test link set b1 up
ip -n test link set florp up
VLAN_COUNT=1500
BASE_DEV=b1
echo Creating vlans
ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
do ip -n test link add link $BASE_DEV name foo.\$i type vlan id \$i; done"
echo Bringing them up
ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
do ip -n test link set foo.\$i up; done"
echo Assiging IPv6 Addresses
ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
do ip -n test address add dev foo.\$i 2000::\$i/64; done"
echo Attaching to VRF
ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
do ip -n test link set foo.\$i master florp; done"
On an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz machine, the performance
before the patch is (truncated):
Creating vlans
real 108.35
Bringing them up
real 4.96
Assiging IPv6 Addresses
real 19.22
Attaching to VRF
real 458.84
After the patch:
Creating vlans
real 5.59
Bringing them up
real 5.07
Assiging IPv6 Addresses
real 5.64
Attaching to VRF
real 25.37
Cc: David S. Miller <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Lu Wei <luwei32@huawei.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
bpf-next 2021-08-10
We've added 31 non-merge commits during the last 8 day(s) which contain
a total of 28 files changed, 3644 insertions(+), 519 deletions(-).
1) Native XDP support for bonding driver & related BPF selftests, from Jussi Maki.
2) Large batch of new BPF JIT tests for test_bpf.ko that came out as a result from
32-bit MIPS JIT development, from Johan Almbladh.
3) Rewrite of netcnt BPF selftest and merge into test_progs, from Stanislav Fomichev.
4) Fix XDP bpf_prog_test_run infra after net to net-next merge, from Andrii Nakryiko.
5) Follow-up fix in unix_bpf_update_proto() to enforce socket type, from Cong Wang.
6) Fix bpf-iter-tcp4 selftest to print the correct dest IP, from Jose Blanquicet.
7) Various misc BPF XDP sample improvements, from Niklas Söderlund, Matthew Cover,
and Muhammad Falak R Wani.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
bpf, tests: Add tail call test suite
bpf, tests: Add tests for BPF_CMPXCHG
bpf, tests: Add tests for atomic operations
bpf, tests: Add test for 32-bit context pointer argument passing
bpf, tests: Add branch conversion JIT test
bpf, tests: Add word-order tests for load/store of double words
bpf, tests: Add tests for ALU operations implemented with function calls
bpf, tests: Add more ALU64 BPF_MUL tests
bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64
bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH
bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations
bpf, tests: Fix typos in test case descriptions
bpf, tests: Add BPF_MOV tests for zero and sign extension
bpf, tests: Add BPF_JMP32 test cases
samples, bpf: Add an explict comment to handle nested vlan tagging.
selftests/bpf: Add tests for XDP bonding
selftests/bpf: Fix xdp_tx.c prog section name
net, core: Allow netdev_lower_get_next_private_rcu in bh context
bpf, devmap: Exclude XDP broadcast to master device
net, bonding: Add XDP support to the bonding driver
...
====================
Link: https://lore.kernel.org/r/20210810130038.16927-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX
into XDP_REDIRECT after BPF program run when the ingress device
is a bond slave.
The dev_xdp_prog_count is exposed so that slave devices can be checked
for loaded XDP programs in order to avoid the situation where both
bond master and slave have programs loaded according to xdp_state.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com
Add the case if dev is NULL in dev_{put, hold}, so the caller doesn't
need to care whether dev is NULL or not.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_set_real_num_rx_queues() and netif_set_real_num_tx_queues()
can fail which breaks drivers trying to implement reconfiguration
in a way that can't leave the device half-broken. In other words
those functions are incompatible with prepare/commit approach.
Luckily setting real number of queues can fail only if the number
is increased, meaning that if we order operations correctly we
can guarantee ending up with either new config (success), or
the old one (on error).
Provide a helper implementing such logic so that drivers don't
have to duplicate it.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is now only used by a handful of old ISA drivers,
and can be moved into the file they already all depend on.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change adds the infrastructure for managing MCTP netdevices; we add
a pointer to the AF_MCTP-specific data to struct netdevice, and hook up
the rtnetlink operations for adding and removing addresses.
Includes changes from Matt Johnston <matt@codeconstruct.com.au>.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
All other user triggered operations are gone from ndo_ioctl, so move
the SIOCBOND family into a custom operation as well.
The .ndo_ioctl() helper is no longer called by the dev_ioctl.c code now,
but there are still a few definitions in obsolete wireless drivers as well
as the appletalk and ieee802154 layers to call SIOCSIFADDR/SIOCGIFADDR
helpers from inside the kernel.
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to further reduce the scope of ndo_do_ioctl(), move
out the SIOCWANDEV handling into a new network device operation
function.
Adjust the prototype to only pass the if_settings sub-structure
in place of the ifreq, and remove the redundant 'cmd' argument
in the process.
Cc: Krzysztof Halasa <khc@pm.waw.pl>
Cc: "Jan \"Yenya\" Kasprzak" <kas@fi.muni.cz>
Cc: Kevin Curtis <kevin.curtis@farsite.co.uk>
Cc: Zhao Qiang <qiang.zhao@nxp.com>
Cc: Martin Schiller <ms@dev.tdt.de>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: linux-x25@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most users of ndo_do_ioctl are ethernet drivers that implement
the MII commands SIOCGMIIPHY/SIOCGMIIREG/SIOCSMIIREG, or hardware
timestamping with SIOCSHWTSTAMP/SIOCGHWTSTAMP.
Separate these from the few drivers that use ndo_do_ioctl to
implement SIOCBOND, SIOCBR and SIOCWANDEV commands.
This is a purely cosmetic change intended to help readers find
their way through the implementation.
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The compat handlers for SIOCDEVPRIVATE are incorrect for any driver that
passes data as part of struct ifreq rather than as an ifr_data pointer, or
that passes data back this way, since the compat_ifr_data_ioctl() helper
overwrites the ifr_data pointer and does not copy anything back out.
Since all drivers using devprivate commands are now converted to the
new .ndo_siocdevprivate callback, fix this by adding the missing piece
and passing the pointer separately the whole way.
This further unifies the native and compat logic for socket ioctls,
as the new code now passes the correct pointer as well as the correct
data for both native and compat ioctls.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIOCDEVPRIVATE ioctl commands are mainly used in really old
drivers, and they have a number of problems:
- They hide behind the normal .ndo_do_ioctl function that
is also used for other things in modern drivers, so it's
hard to spot a driver that actually uses one of these
- Since drivers use a number different calling conventions,
it is impossible to support compat mode for them in
a generic way.
- With all drivers using the same 16 commands codes, there
is no way to introspect the data being passed through
things like strace.
Add a new net_device_ops callback pointer, to address the
first two of these. Separating them from .ndo_do_ioctl
makes it easy to grep for drivers with a .ndo_siocdevprivate
callback, and the unwieldy name hopefully makes it easier
to spot in code review.
By passing the ifreq structure and the ifr_data pointer
separately, it is no longer necessary to overload these,
and the driver can use either one for a given command.
Cc: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
compat_ifreq_ioctl() is one of the last users of copy_in_user() and
compat_alloc_user_space(), as it attempts to convert the 'struct ifreq'
arguments from 32-bit to 64-bit format as used by dev_ioctl() and a
couple of socket family specific interpretations.
The current implementation works correctly when calling dev_ioctl(),
inet_ioctl(), ieee802154_sock_ioctl(), atalk_ioctl(), qrtr_ioctl()
and packet_ioctl(). The ioctl handlers for x25, netrom, rose and x25 do
not interpret the arguments and only block the corresponding commands,
so they do not care.
For af_inet6 and af_decnet however, the compat conversion is slightly
incorrect, as it will copy more data than the native handler accesses,
both of them use a structure that is shorter than ifreq.
Replace the copy_in_user() conversion with a pair of accessor functions
to read and write the ifreq data in place with the correct length where
needed, while leaving the other ones to copy the (already compatible)
structures directly.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dev_ifconf() calling conventions make compat handling
more complicated than necessary, simplify this by moving
the in_compat_syscall() check into the function.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since dynamic registration of the gifconf() helper is only used for
IPv4, and this can not be in a loadable module, this can be simplified
noticeably by turning it into a direct function call as a preparation
for cleaning up the compat handling.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helper can later be utilized in code that runs cpumap and devmap
programs in generic redirect mode and adjust skb based on changes made
to xdp_buff.
When returning XDP_REDIRECT/XDP_TX, it invokes __skb_push, so whenever a
generic redirect path invokes devmap/cpumap prog if set, it must
__skb_pull again as we expect mac header to be pulled.
It also drops the skb_reset_mac_len call after do_xdp_generic, as the
mac_header and network_header are advanced by the same offset, so the
difference (mac_len) remains constant.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210702111825.491065-2-memxor@gmail.com
Trivial conflict in net/netfilter/nf_tables_api.c.
Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.
skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The goal is to keep the mark during a bpf_redirect(), like it is done for
legacy encapsulation / decapsulation, when there is no x-netns.
This was initially done in commit 213dd74aee ("skbuff: Do not scrub skb
mark within the same name space").
When the call to skb_scrub_packet() was added in dev_forward_skb() (commit
8b27f27797 ("skb: allow skb_scrub_packet() to be used by tunnels")), the
second argument (xnet) was set to true to force a call to skb_orphan(). At
this time, the mark was always cleanned up by skb_scrub_packet(), whatever
xnet value was.
This call to skb_orphan() was removed later in commit
9c4c325252 ("skbuff: preserve sock reference when scrubbing the skb.").
But this 'true' stayed here without any real reason.
Let's correctly set xnet in ____dev_forward_skb(), this function has access
to the previous interface and to the new interface.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlx5 devices were observed generating MLX5_PORT_CHANGE_SUBTYPE_ACTIVE
events without an intervening MLX5_PORT_CHANGE_SUBTYPE_DOWN. This
breaks link flap detection based on Linux carrier state transition
count as netif_carrier_on() does nothing if carrier is already on.
Make sure we count such events.
netif_carrier_event() increments the counters and fires the linkwatch
events. The latter is not necessary for the use case but seems like
the right thing to do.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Here is only one place where we want to specify new_ifindex. In all
other cases, callers pass 0 as new_ifindex. It looks reasonable to add a
low-level function with new_ifindex and to convert
dev_change_net_namespace to a static inline wrapper.
Fixes: eeb85a14ee ("net: Allow to specify ifindex when device is moved to another namespace")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we can specify ifindex on link creation. This change allows
to specify ifindex when a device is moved to another network namespace.
Even now, a device ifindex can be changed if there is another device
with the same ifindex in the target namespace. So this change doesn't
introduce completely new behavior, it adds more control to the process.
CRIU users want to restore containers with pre-created network devices.
A user will provide network devices and instructions where they have to
be restored, then CRIU will restore network namespaces and move devices
into them. The problem is that devices have to be restored with the same
indexes that they have before C/R.
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switch might have already added the VLAN tag through PVID hardware
offload. Keep this extra VLAN in the flowtable but skip it on egress.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add .ndo_fill_forward_path for dsa slave port devices
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass on the PPPoE session ID, destination hardware address and the real
device.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>