Commit Graph

2270 Commits

Author SHA1 Message Date
Jakub Kicinski
2cb7b4890d devlink: expose instance locking and add locked port registering
It should be familiar and beneficial to expose devlink instance
lock to the drivers. This way drivers can block devlink from
calling them during critical sections without breakneck locking.

Add port helpers, port splitting callbacks will be the first
target.

Use 'devl_' prefix for "explicitly locked" API. Initial RFC used
'__devlink' but that's too much typing.

devl_lock_is_held() is not defined without lockdep, which is
the same behavior as lockdep_is_held() itself.

Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-16 12:56:31 -07:00
Eric Dumazet
65466904b0 tcp: adjust TSO packet sizes based on min_rtt
Back when tcp_tso_autosize() and TCP pacing were introduced,
our focus was really to reduce burst sizes for long distance
flows.

The simple heuristic of using sk_pacing_rate/1024 has worked
well, but can lead to too small packets for hosts in the same
rack/cluster, when thousands of flows compete for the bottleneck.

Neal Cardwell had the idea of making the TSO burst size
a function of both sk_pacing_rate and tcp_min_rtt()

Indeed, for local flows, sending bigger bursts is better
to reduce cpu costs, as occasional losses can be repaired
quite fast.

This patch is based on Neal Cardwell implementation
done more than two years ago.
bbr is adjusting max_pacing_rate based on measured bandwidth,
while cubic would over estimate max_pacing_rate.

/proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable
this new feature, in logarithmic steps.

Tested:

100Gbit NIC, two hosts in the same rack, 4K MTU.
600 flows rate-limited to 20000000 bytes per second.

Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO)

~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log
~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
  96005

 Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':

         65,945.29 msec task-clock                #    2.845 CPUs utilized
         1,314,632      context-switches          # 19935.279 M/sec
             5,292      cpu-migrations            #   80.249 M/sec
           940,641      page-faults               # 14264.023 M/sec
   201,117,030,926      cycles                    # 3049769.216 GHz                   (83.45%)
    17,699,435,405      stalled-cycles-frontend   #    8.80% frontend cycles idle     (83.48%)
   136,584,015,071      stalled-cycles-backend    #   67.91% backend cycles idle      (83.44%)
    53,809,530,436      instructions              #    0.27  insn per cycle
                                                  #    2.54  stalled cycles per insn  (83.36%)
     9,062,315,523      branches                  # 137422329.563 M/sec               (83.22%)
       153,008,621      branch-misses             #    1.69% of all branches          (83.32%)

      23.182970846 seconds time elapsed

TcpInSegs                       15648792           0.0
TcpOutSegs                      58659110           0.0  # Average of 3.7 4K segments per TSO packet
TcpExtTCPDelivered              58654791           0.0
TcpExtTCPDeliveredCE            19                 0.0

After patch:

~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log
~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
  96046

 Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':

         48,982.58 msec task-clock                #    2.104 CPUs utilized
           186,014      context-switches          # 3797.599 M/sec
             3,109      cpu-migrations            #   63.472 M/sec
           941,180      page-faults               # 19214.814 M/sec
   153,459,763,868      cycles                    # 3132982.807 GHz                   (83.56%)
    12,069,861,356      stalled-cycles-frontend   #    7.87% frontend cycles idle     (83.32%)
   120,485,917,953      stalled-cycles-backend    #   78.51% backend cycles idle      (83.24%)
    36,803,672,106      instructions              #    0.24  insn per cycle
                                                  #    3.27  stalled cycles per insn  (83.18%)
     5,947,266,275      branches                  # 121417383.427 M/sec               (83.64%)
        87,984,616      branch-misses             #    1.48% of all branches          (83.43%)

      23.281200256 seconds time elapsed

TcpInSegs                       1434706            0.0
TcpOutSegs                      58883378           0.0  # Average of 41 4K segments per TSO packet
TcpExtTCPDelivered              58878971           0.0
TcpExtTCPDeliveredCE            9664               0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-09 20:05:44 -08:00
Sebastian Andrzej Siewior
21f95a88ea docs: networking: Use netif_rx().
Since commit
   baebdf48c3 ("net: dev: Makes sure netif_rx() can be invoked in any context.")

the function netif_rx() can be used in preemptible/thread context as
well as in interrupt context.

Use netif_rx().

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-04 12:02:19 +00:00
Dust Li
f9f52c3474 net/smc: fix document build WARNING from smc-sysctl.rst
Stephen reported the following warning messages from smc-sysctl.rst

Documentation/networking/smc-sysctl.rst:3: WARNING: Title overline
too short.
Documentation/networking/smc-sysctl.rst: WARNING: document isn't
included in any toctree

Fix the title overline and add smc-sysctl entry into
Documentation/networking/index.rst

Fixes: 12bbb0d163 ("net/smc: add sysctl for autocorking")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220303113527.62047-1-dust.li@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-03 21:24:34 -08:00
Joe Damato
a3dd98281b Documentation: update networking/page_pool.rst
Add the new stats API, kernel config parameter, and stats structure
information to the page_pool documentation.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-03 09:55:28 +00:00
Dust Li
12bbb0d163 net/smc: add sysctl for autocorking
This add a new sysctl: net.smc.autocorking_size

We can dynamically change the behaviour of autocorking
by change the value of autocorking_size.
Setting to 0 disables autocorking in SMC

Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-01 14:25:12 +00:00
Vladimir Oltean
d27656d02d docs: net: dsa: sja1105: document limitations of tc-flower rule VLAN awareness
After change "net: dsa: tag_8021q: replace the SVL bridging with
VLAN-unaware IVL bridging", tag_8021q enforces two different pvids on a
port, depending on whether it is standalone or in a VLAN-unaware bridge.

Up until now, there was a single pvid, represented by
dsa_tag_8021q_rx_vid(), and that was used as the VLAN for VLAN-unaware
virtual link rules, regardless of whether the port was bridged or
standalone.

To keep VLAN-unaware virtual links working, we need to follow whether
the port is in a bridge or not, and update the VLAN ID from those rules.

In fact we can't fully do that. Depending on whether the switch is
VLAN-aware or not, we can accept Virtual Link rules with just the MAC
DA, or with a MAC DA and a VID. So we already deny changes to the VLAN
awareness of the switch. But the VLAN awareness may also change as a
result of joining or leaving a bridge.

One might say we could just allow the following: a port may leave a
VLAN-unaware bridge while it has VLAN-unaware VL (tc-flower) rules, and
the driver will update those with the new tag_8021q pvid for standalone
mode, but the driver won't accept joining a bridge at all while VL rules
were installed in standalone mode. This is sort of a compromise made
because leaving a bridge is an operation that cannot be vetoed.
But this sort of setup change is not fully supported, either: as
mentioned, VLAN filtering changes can also be triggered by leaving a
bridge, therefore, the existing veto we have in place for turning VLAN
filtering off with VLAN-aware VL rules active still isn't fully
effective.

I really don't know how to deal with this in a way that produces
predictable behavior for user space. Since at the moment, keeping this
feature fully functional on constellation changes (not changing the
tag_8021q port pvid when joining a bridge) is blocking progress for the
DSA FDB isolation, I'd rather document it as a (potentially temporary)
limitation and go on without it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-27 11:06:13 +00:00
Subbaraya Sundeep
1241e329ce ethtool: add support to set/get completion queue event size
Add support to set completion queue event size via ethtool -G
parameter and get it via ethtool -g parameter.

~ # ./ethtool -G eth0 cqe-size 512
~ # ./ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             1048576
RX Mini:        n/a
RX Jumbo:       n/a
TX:             1048576
Current hardware settings:
RX:             256
RX Mini:        n/a
RX Jumbo:       n/a
TX:             4096
RX Buf Len:             2048
CQE Size:                128

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-23 20:33:05 -08:00
Hangbin Liu
129e3c1bab bonding: add new option ns_ip6_target
This patch add a new bonding option ns_ip6_target, which correspond
to the arp_ip_target. With this we set IPv6 targets and send IPv6 NS
request to determine the health of the link.

For other related options like the validation, we still use
arp_validate, and will change to ns_validate later.

Note: the sysfs configuration support was removed based on
https://lore.kernel.org/netdev/8863.1645071997@famine

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-21 12:13:45 +00:00
Matt Johnston
63ed1aab3d mctp: Add SIOCMCTP{ALLOC,DROP}TAG ioctls for tag control
This change adds a couple of new ioctls for mctp sockets:
SIOCMCTPALLOCTAG and SIOCMCTPDROPTAG.  These ioctls provide facilities
for explicit allocation / release of tags, overriding the automatic
allocate-on-send/release-on-reply and timeout behaviours. This allows
userspace more control over messages that may not fit a simple
request/response model.

In order to indicate a pre-allocated tag to the sendmsg() syscall, we
introduce a new flag to the struct sockaddr_mctp.smctp_tag value:
MCTP_TAG_PREALLOC.

Additional changes from Jeremy Kerr <jk@codeconstruct.com.au>.

Contains a fix that was:
Reported-by: kernel test robot <lkp@intel.com>

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-09 12:00:11 +00:00
Jakub Kicinski
9690ae6042 ethtool: add header/data split indication
For applications running on a mix of platforms it's useful
to have a clear indication whether host's NIC supports the
geometry requirements of TCP zero-copy. TCP zero-copy Rx
requires data to be neatly placed into memory pages.
Most NICs can't do that.

This patch is adding GET support only, since the NICs
I work with either always have the feature enabled or
enable it whenever MTU is set to jumbo. In other words
I don't need SET. But adding set should be trivial.
(The only note on SET is that we will likely want
the setting to be "sticky" and use 0 / `unknown`
to reset it back to driver default.)

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28 14:43:47 +00:00
Linus Torvalds
6f38be8f2c This isn't a hugely busy cycle for documentation, but a few significant
things still showed up:
 
  - A documentation section for ARC processors
  - Reworked and enhanced KUnit documentation
  - The ability to pick your own theme for HTML builds; if the default
    "Read the Docs" theme isn't ugly enough for you, you can now pick
    an uglier one.
  - More Chinese translation work
 
 Plus the usual assortment of fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmHcgvUPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5YCkcH/0d/bDbjkwh/6fFleHJDic+jasuoAk4HTpKn
 G3dMvEHDzYm2uXJiCZTaxagUtP8S8BdZ1cMrThZ0zcCni30+awvCmq/puK2zo3G3
 urofj9FfjTjKBLOb8GFnLCITzKVeWlSojtSj6cLbd/bbMX4eCEliis0KexEBrKpF
 rmlY7qVr0WeBiW0Wg7xXjTd0H2dCO9O+5IQ/BzSPDJOzxyBWGLg0B/mXcm+BeOJ7
 MukRtR7e2QcAw6MIxUI4tc2yoeG8XssNbbIrmcya1xwsSiAtcxHv3V2qCrJqmAx1
 8mK1q4Mot9zGcOcBQWJoRwotRuS8wqH0jr+XgRbg12AwEZc66zg=
 =oPyM
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.17' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "This isn't a hugely busy cycle for documentation, but a few
  significant things still showed up:

   - A documentation section for ARC processors

   - Reworked and enhanced KUnit documentation

   - The ability to pick your own theme for HTML builds; if the default
     "Read the Docs" theme isn't ugly enough for you, you can now pick
     an uglier one.

   - More Chinese translation work

  Plus the usual assortment of fixes and cleanups"

* tag 'docs-5.17' of git://git.lwn.net/linux: (53 commits)
  scripts: sphinx-pre-install: Fix ctex support on Debian
  docs: discourage use of list tables
  docs: 5.Posting.rst: describe Fixes: and Link: tags
  Documentation: kgdb: Replace deprecated remotebaud
  docs: automarkup.py: Fix invalid HTML link output and broken URI fragments
  Documentation: refer to config RANDOMIZE_BASE for kernel address-space randomization
  Documentation: kgdb: properly capitalize the MAGIC_SYSRQ config
  docs/zh_CN: Update and fix a couple of typos
  scripts: sphinx-pre-install: add required ctex dependency
  Documentation: KUnit: Restyled Frequently Asked Questions
  Documentation: KUnit: Restyle Test Style and Nomenclature page
  Documentation: KUnit: Rework writing page to focus on writing tests
  Documentation: kunit: Reorganize documentation related to running tests
  Documentation: KUnit: Added KUnit Architecture
  Documentation: KUnit: Rewrite getting started
  Documentation: KUnit: Rewrite main page
  docs/zh_CN: Add zh_CN/accounting/delay-accounting.rst
  Documentation/sphinx: fix typos of "its"
  docs/zh_CN: Add sched-domains translation
  doc: fs: remove bdev_try_to_free_page related doc
  ...
2022-01-11 10:00:04 -08:00
Dario Binacchi
bc3897f79f docs: networking: device drivers: can: add flexcan
Add initial documentation for Flexcan driver.

Link: https://lore.kernel.org/all/20220107193105.1699523-8-mkl@pengutronix.de
Signed-off-by: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-01-08 21:22:58 +01:00
Dario Binacchi
32db1660ee docs: networking: device drivers: add can sub-folder
Add the container for CAN drivers documentation.

Link: https://lore.kernel.org/all/20220107193105.1699523-7-mkl@pengutronix.de
Signed-off-by: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-01-08 21:22:58 +01:00
Arthur Kiyanovski
273a2397fc net: ena: Update LLQ header length in ena documentation
LLQ entry length is 128 bytes. Therefore the maximum header in
the entry is calculated by:
tx_max_header_size =
LLQ_ENTRY_SIZE - DESCRIPTORS_NUM_BEFORE_HEADER * 16 =
128 - 2 * 16 = 96

This patch updates the documentation so that it states the correct
max header length.

Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-07 19:25:50 -08:00
Saeed Mahameed
745a13061a Documentation: devlink: mlx5.rst: Fix htmldoc build warning
Fix the following build warning:

Documentation/networking/devlink/mlx5.rst:13: WARNING: Error parsing content block for the "list-table" directive:
+uniform two-level bullet list expected, but row 2 does not contain the same number of items as row 1 (2 vs 3).
...

Add the missing item in the first row.

Fixes: 0844fa5f7b ("net/mlx5: Let user configure io_eq_size param")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06 16:22:55 -08:00
Jakub Kicinski
aec53e60e0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
  commit 077cdda764 ("net/mlx5e: TC, Fix memory leak with rules with internal port")
  commit 31108d142f ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'")
  commit 4390c6edc0 ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'")
  https://lore.kernel.org/all/20211229065352.30178-1-saeed@kernel.org/

net/smc/smc_wr.c
  commit 49dc9013e3 ("net/smc: Use the bitmap API when applicable")
  commit 349d43127d ("net/smc: fix kernel panic caused by race of smc_sock")
  bitmap_zero()/memset() is removed by the fix

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-30 12:12:12 -08:00
xu xin
be1c5b5322 Documentation: fix outdated interpretation of ip_no_pmtu_disc
The updating way of pmtu has changed, but documentation is still in the
old way. So this patch updates the interpretation of ip_no_pmtu_disc and
min_pmtu.

See commit 28d35bcdd3 ("net: ipv4: don't let PMTU updates increase
route MTU")

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-30 13:28:04 +00:00
Jakub Kicinski
8b3f913322 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/net/sock.h
  commit 8f905c0e73 ("inet: fully convert sk->sk_rx_dst to RCU rules")
  commit 43f51df417 ("net: move early demux fields close to sk_refcnt")
  https://lore.kernel.org/all/20211222141641.0caa0ab3@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-23 16:09:58 -08:00
Shay Drory
8680a60fc1 net/mlx5: Let user configure max_macs generic param
Currently, max_macs is taking 70Kbytes of memory per function. This
size is not needed in all use cases, and is critical with large scale.
Hence, allow user to configure the number of max_macs.

For example, to reduce the number of max_macs to 1, execute::
$ devlink dev param set pci/0000:00:0b.0 name max_macs value 1 \
              cmode driverinit
$ devlink dev reload pci/0000:00:0b.0

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-12-21 19:08:55 -08:00
Shay Drory
0ad598d0be devlink: Clarifies max_macs generic devlink param
The generic param max_macs documentation isn't clear.
Replace it with a more descriptive documentation

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-12-21 19:08:55 -08:00
Shay Drory
57ca767820 net/mlx5: Let user configure event_eq_size param
Event EQ is an EQ which received the notification of almost all the
events generated by the NIC.
Currently, each event EQ is taking 512KB of memory. This size is not
needed in most use cases, and is critical with large scale. Hence,
allow user to configure the size of the event EQ.

For example to reduce event EQ size to 64, execute::
$ devlink dev param set pci/0000:00:0b.0 name event_eq_size value 64 \
              cmode driverinit
$ devlink dev reload pci/0000:00:0b.0

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-12-21 19:08:55 -08:00
Shay Drory
0b5705ebc3 devlink: Add new "event_eq_size" generic device param
Add new device generic parameter to determine the size of the
asynchronous control events EQ.

For example, to reduce event EQ size to 64, execute:
$ devlink dev param set pci/0000:06:00.0 \
              name event_eq_size value 64 cmode driverinit
$ devlink dev reload pci/0000:06:00.0

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-12-21 19:08:54 -08:00
Shay Drory
0844fa5f7b net/mlx5: Let user configure io_eq_size param
Currently, each I/O EQ is taking 128KB of memory. This size
is not needed in all use cases, and is critical with large scale.
Hence, allow user to configure the size of I/O EQs.

For example, to reduce I/O EQ size to 64, execute:
$ devlink dev param set pci/0000:00:0b.0 name io_eq_size value 64 \
              cmode driverinit
$ devlink dev reload pci/0000:00:0b.0

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-12-21 19:08:54 -08:00
Shay Drory
47402385d0 devlink: Add new "io_eq_size" generic device param
Add new device generic parameter to determine the size of the
I/O completion EQs.

For example, to reduce I/O EQ size to 64, execute:
$ devlink dev param set pci/0000:06:00.0 \
              name io_eq_size value 64 cmode driverinit
$ devlink dev reload pci/0000:06:00.0

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-12-21 19:08:54 -08:00
Fernando Fernandez Mancera
1c15b05bae bonding: fix ad_actor_system option setting to default
When 802.3ad bond mode is configured the ad_actor_system option is set to
"00:00:00:00:00:00". But when trying to set the all-zeroes MAC as actors'
system address it was failing with EINVAL.

An all-zeroes ethernet address is valid, only multicast addresses are not
valid values.

Fixes: 171a42c38c ("bonding: add netlink support for sys prio, actor sys mac, and port key")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Link: https://lore.kernel.org/r/20211221111345.2462-1-ffmancera@riseup.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-21 16:54:19 -08:00
Willem de Bruijn
a9725e1d39 docs: networking: replace skb_hwtstamp_tx with skb_tstamp_tx
Tiny doc fix. The hardware transmit function was called skb_tstamp_tx
from its introduction in commit ac45f602ee ("net: infrastructure for
hardware time stamping") in the same series as this documentation.

Fixes: cb9eff0978 ("net: new user space API for time stamping of incoming and outgoing packets")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20211220144608.2783526-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-20 18:47:11 -08:00
Sean Anderson
662f11d55f docs: networking: dpaa2: Fix DPNI header
The DPNI object should get its own header, like the rest of the objects.

Fixes: 60b91319a3 ("staging: fsl-mc: Convert documentation to rst format")
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-20 11:46:42 +00:00
Jakub Kicinski
7cd2802d74 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-16 16:13:19 -08:00
Robert Schlabbach
271225fd57 ixgbe: Document how to enable NBASE-T support
Commit a296d665ea ("ixgbe: Add ethtool support to enable 2.5 and 5.0
Gbps support") introduced suppression of the advertisement of NBASE-T
speeds by default, according to Todd Fujinaka to accommodate customers
with network switches which could not cope with advertised NBASE-T
speeds, as posted in the E1000-devel mailing list:

https://sourceforge.net/p/e1000/mailman/message/37106269/

However, the suppression was not documented at all, nor was how to
enable NBASE-T support.

Properly document the NBASE-T suppression and how to enable NBASE-T
support.

Fixes: a296d665ea ("ixgbe: Add ethtool support to enable 2.5 and 5.0 Gbps support")
Reported-by: Robert Schlabbach <robert_s@gmx.net>
Signed-off-by: Robert Schlabbach <robert_s@gmx.net>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-12-15 11:09:29 -08:00
Jacob Keller
399e27dbbd ice: support immediate firmware activation via devlink reload
The ice hardware contains an embedded chip with firmware which can be
updated using devlink flash. The firmware which runs on this chip is
referred to as the Embedded Management Processor firmware (EMP
firmware).

Activating the new firmware image currently requires that the system be
rebooted. This is not ideal as rebooting the system can cause unwanted
downtime.

In practical terms, activating the firmware does not always require a
full system reboot. In many cases it is possible to activate the EMP
firmware immediately. There are a couple of different scenarios to
cover.

 * The EMP firmware itself can be reloaded by issuing a special update
   to the device called an Embedded Management Processor reset (EMP
   reset). This reset causes the device to reset and reload the EMP
   firmware.

 * PCI configuration changes are only reloaded after a cold PCIe reset.
   Unfortunately there is no generic way to trigger this for a PCIe
   device without a system reboot.

When performing a flash update, firmware is capable of responding with
some information about the specific update requirements.

The driver updates the flash by programming a secondary inactive bank
with the contents of the new image, and then issuing a command to
request to switch the active bank starting from the next load.

The response to the final command for updating the inactive NVM flash
bank includes an indication of the minimum reset required to fully
update the device. This can be one of the following:

 * A full power on is required
 * A cold PCIe reset is required
 * An EMP reset is required

The response to the command to switch flash banks includes an indication
of whether or not the firmware will allow an EMP reset request.

For most updates, an EMP reset is sufficient to load the new EMP
firmware without issues. In some cases, this reset is not sufficient
because the PCI configuration space has changed. When this could cause
incompatibility with the new EMP image, the firmware is capable of
rejecting the EMP reset request.

Add logic to ice_fw_update.c to handle the response data flash update
AdminQ commands.

For the reset level, issue a devlink status notification informing the
user of how to complete the update with a simple suggestion like
"Activate new firmware by rebooting the system".

Cache the status of whether or not firmware will restrict the EMP reset
for use in implementing devlink reload.

Implement support for devlink reload with the "fw_activate" flag. This
allows user space to request the firmware be activated immediately.

For the .reload_down handler, we will issue a request for the EMP reset
using the appropriate firmware AdminQ command. If we know that the
firmware will not allow an EMP reset, simply exit with a suitable
netlink extended ACK message indicating that the EMP reset is not
available.

For the .reload_up handler, simply wait until the driver has finished
resetting. Logic to handle processing of an EMP reset already exists in
the driver as part of its reset and rebuild flows.

Implement support for the devlink reload interface with the
"fw_activate" action. This allows userspace to request activation of
firmware without a reboot.

Note that support for indicating the required reset and EMP reset
restriction is not supported on old versions of firmware. The driver can
determine if the two features are supported by checking the device
capabilities report. I confirmed support has existed since at least
version 5.5.2 as reported by the 'fw.mgmt' version. Support to issue the
EMP reset request has existed in all version of the EMP firmware for the
ice hardware.

Check the device capabilities report to determine whether or not the
indications are reported by the running firmware. If the reset
requirement indication is not supported, always assume a full power on
is necessary. If the reset restriction capability is not supported,
always assume the EMP reset is available.

Users can verify if the EMP reset has activated the firmware by using
the devlink info report to check that the 'running' firmware version has
updated. For example a user might do the following:

 # Check current version
 $ devlink dev info

 # Update the device
 $ devlink dev flash pci/0000:af:00.0 file firmware.bin

 # Confirm stored version updated
 $ devlink dev info

 # Reload to activate new firmware
 $ devlink dev reload pci/0000:af:00.0 action fw_activate

 # Confirm running version updated
 $ devlink dev info

Finally, this change does *not* implement basic driver-only reload
support. I did look into trying to do this. However, it requires
significant refactor of how the ice driver probes and loads everything.
The ice driver probe and allocation flows were not designed with such
a reload in mind. Refactoring the flow to support this is beyond the
scope of this change.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-12-15 08:40:38 -08:00
Jakub Kicinski
be3158290d Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Andrii Nakryiko says:

====================
bpf-next 2021-12-10 v2

We've added 115 non-merge commits during the last 26 day(s) which contain
a total of 182 files changed, 5747 insertions(+), 2564 deletions(-).

The main changes are:

1) Various samples fixes, from Alexander Lobakin.

2) BPF CO-RE support in kernel and light skeleton, from Alexei Starovoitov.

3) A batch of new unified APIs for libbpf, logging improvements, version
   querying, etc. Also a batch of old deprecations for old APIs and various
   bug fixes, in preparation for libbpf 1.0, from Andrii Nakryiko.

4) BPF documentation reorganization and improvements, from Christoph Hellwig
   and Dave Tucker.

5) Support for declarative initialization of BPF_MAP_TYPE_PROG_ARRAY in
   libbpf, from Hengqi Chen.

6) Verifier log fixes, from Hou Tao.

7) Runtime-bounded loops support with bpf_loop() helper, from Joanne Koong.

8) Extend branch record capturing to all platforms that support it,
   from Kajol Jain.

9) Light skeleton codegen improvements, from Kumar Kartikeya Dwivedi.

10) bpftool doc-generating script improvements, from Quentin Monnet.

11) Two libbpf v0.6 bug fixes, from Shuyi Cheng and Vincent Minet.

12) Deprecation warning fix for perf/bpf_counter, from Song Liu.

13) MAX_TAIL_CALL_CNT unification and MIPS build fix for libbpf,
    from Tiezhu Yang.

14) BTF_KING_TYPE_TAG follow-up fixes, from Yonghong Song.

15) Selftests fixes and improvements, from Ilya Leoshkevich, Jean-Philippe
    Brucker, Jiri Olsa, Maxim Mikityanskiy, Tirthendu Sarkar, Yucong Sun,
    and others.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (115 commits)
  libbpf: Add "bool skipped" to struct bpf_map
  libbpf: Fix typo in btf__dedup@LIBBPF_0.0.2 definition
  bpftool: Switch bpf_object__load_xattr() to bpf_object__load()
  selftests/bpf: Remove the only use of deprecated bpf_object__load_xattr()
  selftests/bpf: Add test for libbpf's custom log_buf behavior
  selftests/bpf: Replace all uses of bpf_load_btf() with bpf_btf_load()
  libbpf: Deprecate bpf_object__load_xattr()
  libbpf: Add per-program log buffer setter and getter
  libbpf: Preserve kernel error code and remove kprobe prog type guessing
  libbpf: Improve logging around BPF program loading
  libbpf: Allow passing user log setting through bpf_object_open_opts
  libbpf: Allow passing preallocated log_buf when loading BTF into kernel
  libbpf: Add OPTS-based bpf_btf_load() API
  libbpf: Fix bpf_prog_load() log_buf logic for log_level 0
  samples/bpf: Remove unneeded variable
  bpf: Remove redundant assignment to pointer t
  selftests/bpf: Fix a compilation warning
  perf/bpf_counter: Use bpf_map_create instead of bpf_create_map
  samples: bpf: Fix 'unknown warning group' build warning on Clang
  samples: bpf: Fix xdp_sample_user.o linking with Clang
  ...
====================

Link: https://lore.kernel.org/r/20211210234746.2100561-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-10 15:56:13 -08:00
Jonathan Corbet
a7fb920b15 Linux 5.16-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmGtOFYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG1hUH/1bmlOYsscJ7biqd
 VECr5HhTg6iRvwWUiOpU27fLuBeAM1ZdF0oLuCtzvbK2h8lfTclcHfueihK0GIvX
 ci8BvwpOYfUdDUWHglgvGXqICqYch3PiBVMFiRRRkzcpZdyCFCirAynLdOeusdTU
 72Fi2RBaIM+U/5UVKcTx0J9WJsvFcG97lnNX5nT3dUmuoSW4WmX+h4vIe8VYFVmd
 8q1gD17hPL+ThTKcZApn7IsArU1LNEGRg0tYItgMJo8AMTvsZjwR6yQgXeyuQ0Xk
 xp6pZwzABtnL9dfNJ95q1GhsJBX5T5XvAVjt2uR1ADbgh6TDApC1VBKICm1Nva7g
 uT6S0yE=
 =JNL8
 -----END PGP SIGNATURE-----

Merge tag 'v5.16-rc4' into docs-next

I have a couple of fixes for warnings introduced after -rc1; catch up to
-rc4 so that the fixes have something to fix.
2021-12-10 13:57:09 -07:00
Christoph Hellwig
88691e9e1e bpf, docs: Split general purpose eBPF documentation out of filter.rst
filter.rst starts out documenting the classic BPF and then spills into
introducing and documentating eBPF.  Move the eBPF documentation into
rwo new files under Documentation/bpf/ for the instruction set and
the verifier and link to the BPF documentation from filter.rst.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211119163215.971383-6-hch@lst.de
2021-11-30 10:52:11 -08:00
Christoph Hellwig
bc84e959e5 bpf, docs: Move handling of maps to Documentation/bpf/maps.rst
Move the general maps documentation into the maps.rst file from the
overall networking filter documentation and add a link instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211119163215.971383-5-hch@lst.de
2021-11-30 10:52:11 -08:00
Christoph Hellwig
06edc59c1f bpf, docs: Prune all references to "internal BPF"
The eBPF name has completely taken over from eBPF in general usage for
the actual eBPF representation, or BPF for any general in-kernel use.
Prune all remaining references to "internal BPF".

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211119163215.971383-4-hch@lst.de
2021-11-30 10:52:11 -08:00
Hangbin Liu
5944b5abd8 Bonding: add arp_missed_max option
Currently, we use hard code number to verify if we are in the
arp_interval timeslice. But some user may want to reduce/extend
the verify timeslice. With the similar team option 'missed_max'
the uers could change that number based on their own environment.

Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-30 12:15:58 +00:00
Ahmed Zaki
c4c5509006 Doc: networking: Fix the title's Sphinx overline in rds.rst
A missing "=" caused the title and sections to not show up properly in
the htmldocs.

Signed-off-by: Ahmed Zaki <anzaki@gmail.com>
Link: https://lore.kernel.org/r/20211128171719.3286255-1-anzaki@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2021-11-29 15:18:21 -07:00
Jakub Kicinski
93d5404e89 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ipa/ipa_main.c
  8afc7e471a ("net: ipa: separate disabling setup from modem stop")
  76b5fbcd6b ("net: ipa: kill ipa_modem_init()")

Duplicated include, drop one.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-26 13:45:19 -08:00
Jakub Kicinski
cbb91dcbfb ptp: fix filter names in the documentation
All the filter names are missing _PTP in them.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Link: https://lore.kernel.org/r/20211126031921.2466944-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-26 11:19:19 -08:00
Shiraz Saleem
325e0d0aa6 devlink: Add 'enable_iwarp' generic device param
Add a new device generic parameter to enable and disable
iWARP functionality on a multi-protocol RDMA device.

Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Tested-by: Leszek Kaliszczuk <leszek.kaliszczuk@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-11-22 08:41:39 -08:00
Hao Chen
0b70c256eb ethtool: add support to set/get rx buf len via ethtool
Add support to set rx buf len via ethtool -G parameter and get
rx buf len via ethtool -g parameter.

Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-22 12:31:48 +00:00
David S. Miller
d6821c5bc6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter/IPVS fixes for net:

1) Add selftest for vrf+conntrack, from Florian Westphal.

2) Extend nfqueue selftest to cover nfqueue, also from Florian.

3) Remove duplicated include in nft_payload, from Wan Jiabing.

4) Several improvements to the nat port shadowing selftest,
   from Phil Sutter.

5) Fix filtering of reply tuple in ctnetlink, from Florent Fourcot.

6) Do not override error with -EINVAL in filter setup path, also
   from Florent.

7) Honor sysctl_expire_nodest_conn regardless conn_reuse_mode for
   reused connections, from yangxingwu.

8) Replace snprintf() by sysfs_emit() in xt_IDLETIMER as reported
   by Coccinelle, from Jing Yao.

9) Incorrect IPv6 tunnel match in flowtable offload, from Will
   Mortensen.

10) Switch port shadow selftest to use socat, from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-19 11:00:01 +00:00
Vasudev Kamath
738baea497 Documentation: networking: net_failover: Fix documentation
Update net_failover documentation with missing and incomplete
details to get a proper working setup.

Signed-off-by: Vasudev Kamath <vasudev@copyninja.info>
Reviewed-by: Krishna Kumar <krikku@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-17 13:59:49 +00:00
Russell King (Oracle)
b9241f5413 net: document SMII and correct phylink's new validation mechanism
SMII has not been documented in the kernel, but information on this PHY
interface mode has been recently found. Document it, and correct the
recently introduced phylink handling for this interface mode.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/E1mmfVl-0075nP-14@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-16 19:22:30 -08:00
Linus Torvalds
f54ca91fe6 Networking fixes for 5.16-rc1, including fixes from bpf, can
and netfilter.
 
 Current release - regressions:
 
  - bpf: do not reject when the stack read size is different
    from the tracked scalar size
 
  - net: fix premature exit from NAPI state polling in napi_disable()
 
  - riscv, bpf: fix RV32 broken build, and silence RV64 warning
 
 Current release - new code bugs:
 
  - net: fix possible NULL deref in sock_reserve_memory
 
  - amt: fix error return code in amt_init(); fix stopping the workqueue
 
  - ax88796c: use the correct ioctl callback
 
 Previous releases - always broken:
 
  - bpf: stop caching subprog index in the bpf_pseudo_func insn
 
  - security: fixups for the security hooks in sctp
 
  - nfc: add necessary privilege flags in netlink layer, limit operations
    to admin only
 
  - vsock: prevent unnecessary refcnt inc for non-blocking connect
 
  - net/smc: fix sk_refcnt underflow on link down and fallback
 
  - nfnetlink_queue: fix OOB when mac header was cleared
 
  - can: j1939: ignore invalid messages per standard
 
  - bpf, sockmap:
    - fix race in ingress receive verdict with redirect to self
    - fix incorrect sk_skb data_end access when src_reg = dst_reg
    - strparser, and tls are reusing qdisc_skb_cb and colliding
 
  - ethtool: fix ethtool msg len calculation for pause stats
 
  - vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries
    to access an unregistering real_dev
 
  - udp6: make encap_rcv() bump the v6 not v4 stats
 
  - drv: prestera: add explicit padding to fix m68k build
 
  - drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
 
  - drv: mvpp2: fix wrong SerDes reconfiguration order
 
 Misc & small latecomers:
 
  - ipvs: auto-load ipvs on genl access
 
  - mctp: sanity check the struct sockaddr_mctp padding fields
 
  - libfs: support RENAME_EXCHANGE in simple_rename()
 
  - avoid double accounting for pure zerocopy skbs
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGNQdwACgkQMUZtbf5S
 IrsiMQ//f66lTJ8PJ5Qj70hX9dC897olx7uGHB9eiKoyOcJI459hFlfXwRU2T4Tf
 fPNwPNUQ9Mynw9tX/jWEi+7zd6r6TSHGXK49U9/rIbQ95QjKY4LHowIE63x+vPl2
 5Cpf+80zXC3DUX1fijgyG1ujnU3kBaqopTxDLmlsHw2PGkwT5Ox1DUwkhc370eEL
 xlpq3PYGWA8/AQNyhSVBkG/UmoLaq0jYNP5yVcOj4jGjgcgLe1SLrqczENr35QHZ
 cRkuBsFBMBZF7wSX2f9qQIB/+b1pcLlD9IO+K3S7Ruq+rUd7qfL/tmwNxEh0axYK
 AyIun1Bxcy7QJGjtpGAz+Ku7jS9T3HxzyxhqilQo3co8jAW0WJ1YwHl+XPgQXyjV
 DLG6Vxt4syiwsoSXGn8MQugs4nlBT+0qWl8YamIR+o7KkAYPc2QWkXlzEDfNeIW8
 JNCZA3sy7VGi1ytorZGx16sQsEWnyRG9a6/WV20Dr+HVs1SKPcFzIfG6mVngR07T
 mQMHnbAF6Z5d8VTcPQfMxd7UH48s1bHtk5lcSTa3j0Cw+GkA6ytTmjPdJ1qRcdkH
 dl9jAfADe4O6frG+9XH7FEFqhmkghVI7bOCA4ZOhClVaIcDGgEZc2y7sY9/oZ7P4
 KXBD2R5X1caCUM0UtzwL7/8ddOtPtHIrFnhY+7+I6ijt9qmI0BY=
 =Ttgq
 -----END PGP SIGNATURE-----

Merge tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, can and netfilter.

  Current release - regressions:

   - bpf: do not reject when the stack read size is different from the
     tracked scalar size

   - net: fix premature exit from NAPI state polling in napi_disable()

   - riscv, bpf: fix RV32 broken build, and silence RV64 warning

  Current release - new code bugs:

   - net: fix possible NULL deref in sock_reserve_memory

   - amt: fix error return code in amt_init(); fix stopping the
     workqueue

   - ax88796c: use the correct ioctl callback

  Previous releases - always broken:

   - bpf: stop caching subprog index in the bpf_pseudo_func insn

   - security: fixups for the security hooks in sctp

   - nfc: add necessary privilege flags in netlink layer, limit
     operations to admin only

   - vsock: prevent unnecessary refcnt inc for non-blocking connect

   - net/smc: fix sk_refcnt underflow on link down and fallback

   - nfnetlink_queue: fix OOB when mac header was cleared

   - can: j1939: ignore invalid messages per standard

   - bpf, sockmap:
      - fix race in ingress receive verdict with redirect to self
      - fix incorrect sk_skb data_end access when src_reg = dst_reg
      - strparser, and tls are reusing qdisc_skb_cb and colliding

   - ethtool: fix ethtool msg len calculation for pause stats

   - vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries to
     access an unregistering real_dev

   - udp6: make encap_rcv() bump the v6 not v4 stats

   - drv: prestera: add explicit padding to fix m68k build

   - drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge

   - drv: mvpp2: fix wrong SerDes reconfiguration order

  Misc & small latecomers:

   - ipvs: auto-load ipvs on genl access

   - mctp: sanity check the struct sockaddr_mctp padding fields

   - libfs: support RENAME_EXCHANGE in simple_rename()

   - avoid double accounting for pure zerocopy skbs"

* tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (123 commits)
  selftests/net: udpgso_bench_rx: fix port argument
  net: wwan: iosm: fix compilation warning
  cxgb4: fix eeprom len when diagnostics not implemented
  net: fix premature exit from NAPI state polling in napi_disable()
  net/smc: fix sk_refcnt underflow on linkdown and fallback
  net/mlx5: Lag, fix a potential Oops with mlx5_lag_create_definer()
  gve: fix unmatched u64_stats_update_end()
  net: ethernet: lantiq_etop: Fix compilation error
  selftests: forwarding: Fix packet matching in mirroring selftests
  vsock: prevent unnecessary refcnt inc for nonblocking connect
  net: marvell: mvpp2: Fix wrong SerDes reconfiguration order
  net: ethernet: ti: cpsw_ale: Fix access to un-initialized memory
  net: stmmac: allow a tc-taprio base-time of zero
  selftests: net: test_vxlan_under_vrf: fix HV connectivity test
  net: hns3: allow configure ETS bandwidth of all TCs
  net: hns3: remove check VF uc mac exist when set by PF
  net: hns3: fix some mac statistics is always 0 in device version V2
  net: hns3: fix kernel crash when unload VF while it is being reset
  net: hns3: sync rx ring head in echo common pull
  net: hns3: fix pfc packet number incorrect after querying pfc parameters
  ...
2021-11-11 09:49:36 -08:00
yangxingwu
c95c07836f netfilter: ipvs: Fix reuse connection if RS weight is 0
We are changing expire_nodest_conn to work even for reused connections when
conn_reuse_mode=0, just as what was done with commit dc7b3eb900 ("ipvs:
Fix reuse connection if real server is dead").

For controlled and persistent connections, the new connection will get the
needed real server depending on the rules in ip_vs_check_template().

Fixes: d752c36457 ("ipvs: allow rescheduling of new connections when port reuse is detected")
Co-developed-by: Chuanqi Liu <legend050709@qq.com>
Signed-off-by: Chuanqi Liu <legend050709@qq.com>
Signed-off-by: yangxingwu <xingwu.yang@gmail.com>
Acked-by: Simon Horman <horms@verge.net.au>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-08 11:42:47 +01:00
Menglong Dong
69dfccbc11 net: udp: correct the document for udp_mem
udp_mem is a vector of 3 INTEGERs, which is used to limit the number of
pages allowed for queueing by all UDP sockets.

However, sk_has_memory_pressure() in __sk_mem_raise_allocated() always
return false for udp, as memory pressure is not supported by udp, which
means that __sk_mem_raise_allocated() will fail once pages allocated
for udp socket exceeds udp_mem[0].

Therefor, udp_mem[0] is the only one that limit the number of pages.
However, the document of udp_mem just express that udp_mem[2] is the
limitation. So, just fix it.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-05 10:42:46 +00:00
Linus Torvalds
624ad333d4 This is a relatively unexciting cycle for documentation.
- Some small scripts/kerneldoc fixes
 
  - More Chinese translation work, but at a much reduced rate.
 
  - The tip-tree maintainer's handbook
 
 ...plus the usual array of build fixes, typo fixes, etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmGBqbYPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5YzMoH+wQwUbILvf4FB9h6l0qhbPnvcLMJtDiuIGSu
 Ivfc1t+Vh7/waehaYfn9erCNps6oE13Arsy0DDFZcr6vbjXFO5clFoh5jvSJsk+G
 zTXzZNB99SoJcZw9r8F7aJDbQNJfSXoyTTOg1mSeXNo+nkBFTSWO7QwCx5M2obaT
 76+r8HQpnEYmrGePsOXriV4aOP+yJxuGgkPb+VPlPtQhA7v6dzo5hmoh5tPIzCLd
 Fz8ek4aAm9sPEtu1UUoA+7MALZHTFwPv6aSAuyVeNfF/UBL6M4iAwusQyLdPFfoj
 JPO7f+1h6a8jKTfTWGjPI3o33DtU/8nc6DgnuBaXBTKW8Dl5Exk=
 =q37Q
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.16' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "This is a relatively unexciting cycle for documentation.

   - Some small scripts/kerneldoc fixes

   - More Chinese translation work, but at a much reduced rate.

   - The tip-tree maintainer's handbook

  ...plus the usual array of build fixes, typo fixes, etc"

* tag 'docs-5.16' of git://git.lwn.net/linux: (53 commits)
  kernel-doc: support DECLARE_PHY_INTERFACE_MASK()
  docs/zh_CN: add core-api xarray translation
  docs/zh_CN: add core-api assoc_array translation
  speakup: Fix typo in documentation "boo" -> "boot"
  docs: submitting-patches: make section about the Link: tag more explicit
  docs: deprecated.rst: Clarify open-coded arithmetic with literals
  scripts: documentation-file-ref-check: fix bpf selftests path
  scripts: documentation-file-ref-check: ignore hidden files
  coding-style.rst: trivial: fix location of driver model macros
  docs: f2fs: fix text alignment
  docs/zh_CN add PCI pci.rst translation
  docs/zh_CN add PCI index.rst translation
  docs: translations: zh_CN: memory-hotplug.rst: fix a typo
  docs: translations: zn_CN: irq-affinity.rst: add a missing extension
  block: add documentation for inflight
  scripts: kernel-doc: Ignore __alloc_size() attribute
  docs: pdfdocs: Adjust \headheight for fancyhdr
  docs: UML: user_mode_linux_howto_v2 edits
  docs: use the lore redirector everywhere
  docs: proc.rst: mountinfo: align columns
  ...
2021-11-02 22:11:39 -07:00
James Prestwood
18ac597af2 net: ndisc: introduce ndisc_evict_nocarrier sysctl parameter
In most situations the neighbor discovery cache should be cleared on a
NOCARRIER event which is currently done unconditionally. But for wireless
roams the neighbor discovery cache can and should remain intact since
the underlying network has not changed.

This patch introduces a sysctl option ndisc_evict_nocarrier which can
be disabled by a wireless supplicant during a roam. This allows packets
to be sent after a roam immediately without having to wait for
neighbor discovery.

A user reported roughly a 1 second delay after a roam before packets
could be sent out (note, on IPv4). This delay was due to the ARP
cache being cleared. During testing of this same scenario using IPv6
no delay was noticed, but regardless there is no reason to clear
the ndisc cache for wireless roams.

Signed-off-by: James Prestwood <prestwoj@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-01 19:57:14 -07:00
James Prestwood
fcdb44d08a net: arp: introduce arp_evict_nocarrier sysctl parameter
This change introduces a new sysctl parameter, arp_evict_nocarrier.
When set (default) the ARP cache will be cleared on a NOCARRIER event.
This new option has been defaulted to '1' which maintains existing
behavior.

Clearing the ARP cache on NOCARRIER is relatively new, introduced by:

commit 859bd2ef1f
Author: David Ahern <dsahern@gmail.com>
Date:   Thu Oct 11 20:33:49 2018 -0700

    net: Evict neighbor entries on carrier down

The reason for this changes is to prevent the ARP cache from being
cleared when a wireless device roams. Specifically for wireless roams
the ARP cache should not be cleared because the underlying network has not
changed. Clearing the ARP cache in this case can introduce significant
delays sending out packets after a roam.

A user reported such a situation here:

https://lore.kernel.org/linux-wireless/CACsRnHWa47zpx3D1oDq9JYnZWniS8yBwW1h0WAVZ6vrbwL_S0w@mail.gmail.com/

After some investigation it was found that the kernel was holding onto
packets until ARP finished which resulted in this 1 second delay. It
was also found that the first ARP who-has was never responded to,
which is actually what caues the delay. This change is more or less
working around this behavior, but again, there is no reason to clear
the cache on a roam anyways.

As for the unanswered who-has, we know the packet made it OTA since
it was seen while monitoring. Why it never received a response is
unknown. In any case, since this is a problem on the AP side of things
all that can be done is to work around it until it is solved.

Some background on testing/reproducing the packet delay:

Hardware:
 - 2 access points configured for Fast BSS Transition (Though I don't
   see why regular reassociation wouldn't have the same behavior)
 - Wireless station running IWD as supplicant
 - A device on network able to respond to pings (I used one of the APs)

Procedure:
 - Connect to first AP
 - Ping once to establish an ARP entry
 - Start a tcpdump
 - Roam to second AP
 - Wait for operstate UP event, and note the timestamp
 - Start pinging

Results:

Below is the tcpdump after UP. It was recorded the interface went UP at
10:42:01.432875.

10:42:01.461871 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
10:42:02.497976 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
10:42:02.507162 ARP, Reply 192.168.254.1 is-at ac:86:74:55:b0:20, length 46
10:42:02.507185 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 1, length 64
10:42:02.507205 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 2, length 64
10:42:02.507212 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 3, length 64
10:42:02.507219 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 4, length 64
10:42:02.507225 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 5, length 64
10:42:02.507232 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 6, length 64
10:42:02.515373 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 1, length 64
10:42:02.521399 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 2, length 64
10:42:02.521612 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 3, length 64
10:42:02.521941 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 4, length 64
10:42:02.522419 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 5, length 64
10:42:02.523085 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 6, length 64

You can see the first ARP who-has went out very quickly after UP, but
was never responded to. Nearly a second later the kernel retries and
gets a response. Only then do the ping packets go out. If an ARP entry
is manually added prior to UP (after the cache is cleared) it is seen
that the first ping is never responded to, so its not only an issue with
ARP but with data packets in general.

As mentioned prior, the wireless interface was also monitored to verify
the ping/ARP packet made it OTA which was observed to be true.

Signed-off-by: James Prestwood <prestwoj@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-01 19:57:14 -07:00
Michael Chan
eff441f3b5 bnxt_en: Update bnxt.rst devlink documentation
Add 'enable_remote_dev_reset' documentation to bnxt.rst.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29 12:13:05 +01:00
Subbaraya Sundeep
442e796f0a devlink: add documentation for octeontx2 driver
Add a file to document devlink support for octeontx2
driver. Driver-specific parameters implemented by
AF, PF and VF drivers are documented.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 14:35:34 +01:00
Jakub Kicinski
6b3671746a net/mlx5: remove the recent devlink params
revert commit 46ae40b94d ("net/mlx5: Let user configure io_eq_size param")
revert commit a6cb08daa3 ("net/mlx5: Let user configure event_eq_size param")
revert commit 5546040619 ("net/mlx5: Let user configure max_macs param")

The EQE parameters are applicable to more drivers, they should
be configured via standard API, probably ethtool. Example of
another driver needing something similar:

https://lore.kernel.org/all/1633454136-14679-3-git-send-email-sbhatta@marvell.com/

The last param for "max_macs" is probably fine but the documentation
is severely lacking. The meaning and implications for changing the
param need to be stated.

Link: https://lore.kernel.org/r/20211026152939.3125950-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-26 10:18:32 -07:00
Parav Pandit
d67ab0a8c1 net/mlx5: SF_DEV Add SF device trace points
Add SF device add and delete specific trace points.

echo mlx5:mlx5_sf_dev_add >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_dev_del >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/debug/tracing/set_event

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-25 13:51:21 -07:00
Parav Pandit
b3ccada68b net/mlx5: SF, Add SF trace points
Add support for trace events for SFs to improve debugging.
This covers
(a) port add and free trace points
(b) device level trace points
(c) SF hardware context add, free trace points.
(d) SF function activate/deacticate and state trace points

SF events examples:
echo mlx5:mlx5_sf_add >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_free >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_update_state >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_activate >> /sys/kernel/debug/tracing/set_event
echo mlx5:mlx5_sf_deactivate >> /sys/kernel/debug/tracing/set_event

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-25 13:51:21 -07:00
Shay Drory
5546040619 net/mlx5: Let user configure max_macs param
Currently, max_macs is taking 70Kbytes of memory per function. This
size is not needed in all use cases, and is critical with large scale.
Hence, allow user to configure the number of max_macs.

For example, to reduce the number of max_macs to 1, execute::
$ devlink dev param set pci/0000:00:0b.0 name max_macs value 1 \
              cmode driverinit
$ devlink dev reload pci/0000:00:0b.0

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-25 13:51:21 -07:00
Shay Drory
a6cb08daa3 net/mlx5: Let user configure event_eq_size param
Event EQ is an EQ which received the notification of almost all the
events generated by the NIC.
Currently, each event EQ is taking 512KB of memory. This size is not
needed in most use cases, and is critical with large scale. Hence,
allow user to configure the size of the event EQ.

For example to reduce event EQ size to 64, execute::
$ devlink resource set pci/0000:00:0b.0 path /event_eq_size/ size 64
$ devlink dev reload pci/0000:00:0b.0

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-25 13:51:20 -07:00
Shay Drory
46ae40b94d net/mlx5: Let user configure io_eq_size param
Currently, each I/O EQ is taking 128KB of memory. This size
is not needed in all use cases, and is critical with large scale.
Hence, allow user to configure the size of I/O EQs.

For example, to reduce I/O EQ size to 64, execute:
$ devlink resource set pci/0000:00:0b.0 path /io_eq_size/ size 64
$ devlink dev reload pci/0000:00:0b.0

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-25 13:51:20 -07:00
Aya Levin
b87ef75cb5 net/mlx5: Print health buffer by log level
Add log macro which gets log level as a parameter. Use the severity
read from the health buffer and the new log macro to log the health buffer
with severity as log level.  Prior to this patch, health buffer was
printed in error log level regardless of its severity. Now the user may
filter dmesg (--level) or change kernel log level to focus on different
severity levels of firmware errors.

Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-25 13:51:19 -07:00
David S. Miller
bdfa75ad70 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Lots of simnple overlapping additions.

With a build fix from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-22 11:41:16 +01:00
Jeremy Kerr
b416beb25d mctp: unify sockaddr_mctp types
Use the more precise __kernel_sa_family_t for smctp_family, to match
struct sockaddr.

Also, use an unsigned int for the network member; negative networks
don't make much sense. We're already using unsigned for mctp_dev and
mctp_skb_cb, but need to change mctp_sock to suit.

Fixes: 60fc639816 ("mctp: Add sockaddr_mctp to uapi")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Acked-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18 13:47:09 +01:00
Brett Creeley
b726ddf984 ice: Print the api_patch as part of the fw.mgmt.api
Currently when a user uses "devlink dev info", the fw.mgmt.api will be
the major.minor numbers as shown below:

devlink dev info pci/0000:3b:00.0
pci/0000:3b:00.0:
  driver ice
  serial_number 00-01-00-ff-ff-00-00-00
  versions:
      fixed:
        board.id K91258-000
      running:
        fw.mgmt 6.1.2
        fw.mgmt.api 1.7 <--- No patch number included
        fw.mgmt.build 0xd75e7d06
        fw.mgmt.srev 5
        fw.undi 1.2992.0
        fw.undi.srev 5
        fw.psid.api 3.10
        fw.bundle_id 0x800085cc
        fw.app.name ICE OS Default Package
        fw.app 1.3.27.0
        fw.app.bundle_id 0xc0000001
        fw.netlist 3.10.2000-3.1e.0
        fw.netlist.build 0x2a76e110
      stored:
        fw.mgmt.srev 5
        fw.undi 1.2992.0
        fw.undi.srev 5
        fw.psid.api 3.10
        fw.bundle_id 0x800085cc
        fw.netlist 3.10.2000-3.1e.0
        fw.netlist.build 0x2a76e110

There are many features in the driver that depend on the major, minor,
and patch version of the FW. Without the patch number in the output for
fw.mgmt.api debugging issues related to the FW API version is difficult.
Also, using major.minor.patch aligns with the existing firmware version
which uses a 3 digit value.

Fix this by making the fw.mgmt.api print the major.minor.patch
versions. Shown below is the result:

devlink dev info pci/0000:3b:00.0
pci/0000:3b:00.0:
  driver ice
  serial_number 00-01-00-ff-ff-00-00-00
  versions:
      fixed:
        board.id K91258-000
      running:
        fw.mgmt 6.1.2
        fw.mgmt.api 1.7.9 <--- patch number included
        fw.mgmt.build 0xd75e7d06
        fw.mgmt.srev 5
        fw.undi 1.2992.0
        fw.undi.srev 5
        fw.psid.api 3.10
        fw.bundle_id 0x800085cc
        fw.app.name ICE OS Default Package
        fw.app 1.3.27.0
        fw.app.bundle_id 0xc0000001
        fw.netlist 3.10.2000-3.1e.0
        fw.netlist.build 0x2a76e110
      stored:
        fw.mgmt.srev 5
        fw.undi 1.2992.0
        fw.undi.srev 5
        fw.psid.api 3.10
        fw.bundle_id 0x800085cc
        fw.netlist 3.10.2000-3.1e.0
        fw.netlist.build 0x2a76e110

Fixes: ff2e5c700e ("ice: add basic handler for devlink .info_get")
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-14 10:14:46 -07:00
Thorsten Leemhuis
a9d85efb25 docs: use the lore redirector everywhere
Change all links from using the lkml redirector to the lore redirector,
as the kernel.org admin recently indicated: we shouldn't be using
lkml.kernel.org anymore because the domain can create confusion, as it
indicates it is only valid for messages sent to the LKML; the convention
has been to use https://lore.kernel.org/r/msgid for this reason.

In this process also change three links from using http to https.

Link: https://lore.kernel.org/r/20211006170025.qw3glxvocczfuhar@meerkat.local
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Hu Haowen <src.res@email.cn>
CC: Alex Shi <alexs@kernel.org>
CC: Federico Vaga <federico.vaga@vaga.pv.it>
Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info>
Reviewed-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://lore.kernel.org/r/5bb55bac6ba10fafab19bf2b21572dd0e2f8cea2.1633593385.git.linux@leemhuis.info
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2021-10-12 13:58:19 -06:00
Dust Li
2232642ec3 ipvs: add sysctl_run_estimation to support disable estimation
estimation_timer will iterate the est_list to do estimation
for each ipvs stats. When there are lots of services, the
list can be very large.
We found that estimation_timer() run for more then 200ms on a
machine with 104 CPU and 50K services.

yunhong-cgl jiang report the same phenomenon before:
https://www.spinics.net/lists/lvs-devel/msg05426.html

In some cases(for example a large K8S cluster with many ipvs services),
ipvs estimation may not be needed. So adding a sysctl blob to allow
users to disable this completely.

Default is: 1 (enable)

Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07 19:52:58 +02:00
Ido Schimmel
3dfb511260 ethtool: Add transceiver module extended state
Add an extended state and sub-state to describe link issues related to
transceiver modules.

The 'ETHTOOL_LINK_EXT_SUBSTATE_MODULE_CMIS_NOT_READY' extended sub-state
tells user space that port is unable to gain a carrier because the CMIS
Module State Machine did not reach the ModuleReady (Fully Operational)
state. For example, if the module is stuck at ModuleLowPwr or
ModuleFault state. In case of the latter, user space can read the fault
reason from the module's EEPROM and potentially reset it.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-06 17:47:50 -07:00
Ido Schimmel
353407d917 ethtool: Add ability to control transceiver modules' power mode
Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and
'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver
modules parameters and retrieve their status.

The first parameter to control is the power mode of the module. It is
only relevant for paged memory modules, as flat memory modules always
operate in low power mode.

When a paged memory module is in low power mode, its power consumption
is reduced to the minimum, the management interface towards the host is
available and the data path is deactivated.

User space can choose to put modules that are not currently in use in
low power mode and transition them to high power mode before putting the
associated ports administratively up. This is useful for user space that
favors reduced power consumption and lower temperatures over reduced
link up times. In QSFP-DD modules the transition from low power mode to
high power mode can take a few seconds and this transition is only
expected to get longer with future / more complex modules.

User space can control the power mode of the module via the power mode
policy attribute ('ETHTOOL_A_MODULE_POWER_MODE_POLICY'). Possible
values:

* high: Module is always in high power mode.

* auto: Module is transitioned by the host to high power mode when the
  first port using it is put administratively up and to low power mode
  when the last port using it is put administratively down.

The operational power mode of the module is available to user space via
the 'ETHTOOL_A_MODULE_POWER_MODE' attribute. The attribute is not
reported to user space when a module is not plugged-in.

The user API is designed to be generic enough so that it could be used
for modules with different memory maps (e.g., SFF-8636, CMIS).

The only implementation of the device driver API in this series is for a
MAC driver (mlxsw) where the module is controlled by the device's
firmware, but it is designed to be generic enough so that it could also
be used by implementations where the module is controlled by the CPU.

CMIS testing
============

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x03 (ModuleReady)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : Off

The module is not in low power mode, as it is not forced by hardware
(LowPwrAllowRequestHW is off) or by software (LowPwrRequestSW is off).

The power mode can be queried from the kernel. In case
LowPwrAllowRequestHW was on, the kernel would need to take into account
the state of the LowPwrRequestHW signal, which is not visible to user
space.

 $ ethtool --show-module swp11
 Module parameters for swp11:
 power-mode-policy high
 power-mode high

Change the power mode policy to 'auto':

 # ethtool --set-module swp11 power-mode-policy auto

Query the power mode again:

 $ ethtool --show-module swp11
 Module parameters for swp11:
 power-mode-policy auto
 power-mode low

Verify with the data read from the EEPROM:

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x01 (ModuleLowPwr)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : On

Put the associated port administratively up which will instruct the host
to transition the module to high power mode:

 # ip link set dev swp11 up

Query the power mode again:

 $ ethtool --show-module swp11
 Module parameters for swp11:
 power-mode-policy auto
 power-mode high

Verify with the data read from the EEPROM:

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x03 (ModuleReady)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : Off

Put the associated port administratively down which will instruct the
host to transition the module to low power mode:

 # ip link set dev swp11 down

Query the power mode again:

 $ ethtool --show-module swp11
 Module parameters for swp11:
 power-mode-policy auto
 power-mode low

Verify with the data read from the EEPROM:

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x01 (ModuleLowPwr)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : On

SFF-8636 testing
================

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 ...
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) enabled
 Power set                                 : Off
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.7733 mW / -1.12 dBm
 Transmit avg optical power (Channel 2)    : 0.7649 mW / -1.16 dBm
 Transmit avg optical power (Channel 3)    : 0.7790 mW / -1.08 dBm
 Transmit avg optical power (Channel 4)    : 0.7837 mW / -1.06 dBm
 Rcvr signal avg optical power(Channel 1)  : 0.9302 mW / -0.31 dBm
 Rcvr signal avg optical power(Channel 2)  : 0.9079 mW / -0.42 dBm
 Rcvr signal avg optical power(Channel 3)  : 0.8993 mW / -0.46 dBm
 Rcvr signal avg optical power(Channel 4)  : 0.8778 mW / -0.57 dBm

The module is not in low power mode, as it is not forced by hardware
(Power override is on) or by software (Power set is off).

The power mode can be queried from the kernel. In case Power override
was off, the kernel would need to take into account the state of the
LPMode signal, which is not visible to user space.

 $ ethtool --show-module swp13
 Module parameters for swp13:
 power-mode-policy high
 power-mode high

Change the power mode policy to 'auto':

 # ethtool --set-module swp13 power-mode-policy auto

Query the power mode again:

 $ ethtool --show-module swp13
 Module parameters for swp13:
 power-mode-policy auto
 power-mode low

Verify with the data read from the EEPROM:

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) not enabled
 Power set                                 : On
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 2)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 3)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 4)    : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 1)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 2)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 3)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 4)  : 0.0000 mW / -inf dBm

Put the associated port administratively up which will instruct the host
to transition the module to high power mode:

 # ip link set dev swp13 up

Query the power mode again:

 $ ethtool --show-module swp13
 Module parameters for swp13:
 power-mode-policy auto
 power-mode high

Verify with the data read from the EEPROM:

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 ...
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) enabled
 Power set                                 : Off
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.7934 mW / -1.01 dBm
 Transmit avg optical power (Channel 2)    : 0.7859 mW / -1.05 dBm
 Transmit avg optical power (Channel 3)    : 0.7885 mW / -1.03 dBm
 Transmit avg optical power (Channel 4)    : 0.7985 mW / -0.98 dBm
 Rcvr signal avg optical power(Channel 1)  : 0.9325 mW / -0.30 dBm
 Rcvr signal avg optical power(Channel 2)  : 0.9034 mW / -0.44 dBm
 Rcvr signal avg optical power(Channel 3)  : 0.9086 mW / -0.42 dBm
 Rcvr signal avg optical power(Channel 4)  : 0.8885 mW / -0.51 dBm

Put the associated port administratively down which will instruct the
host to transition the module to low power mode:

 # ip link set dev swp13 down

Query the power mode again:

 $ ethtool --show-module swp13
 Module parameters for swp13:
 power-mode-policy auto
 power-mode low

Verify with the data read from the EEPROM:

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 ...
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) not enabled
 Power set                                 : On
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 2)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 3)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 4)    : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 1)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 2)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 3)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 4)  : 0.0000 mW / -inf dBm

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-06 17:47:49 -07:00
M Chetan Kumar
b8aa16541d net: wwan: iosm: correct devlink extra params
1. Removed driver specific extra params like download_region,
   address & region_count. The required information is passed
   as part of flash API.
2. IOSM Devlink documentation updated to reflect the same.

Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-02 16:05:20 +01:00
Jacob Keller
a70e3f024d devlink: report maximum number of snapshots with regions
Each region has an independently configurable number of maximum
snapshots. This information is not reported to userspace, making it not
very discoverable. Fix this by adding a new
DEVLINK_ATTR_REGION_MAX_SNAPSHOST attribute which is used to report this
maximum.

Ex:

  $devlink region
  pci/0000:af:00.0/nvm-flash: size 10485760 snapshot [] max 1
  pci/0000:af:00.0/device-caps: size 4096 snapshot [] max 10
  pci/0000:af:00.1/nvm-flash: size 10485760 snapshot [] max 1
  pci/0000:af:00.1/device-caps: size 4096 snapshot [] max 10

This information enables users to understand why a new region command
may fail due to having too many existing snapshots.

Reported-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-01 14:28:55 +01:00
Jeremy Kerr
f4d41c5913 doc/mctp: Add a little detail about kernel internals
Describe common flows and refcounting behaviour.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-29 11:00:12 +01:00
Jakub Kicinski
2fcd14d0f7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/mptcp/protocol.c
  977d293e23 ("mptcp: ensure tx skbs always have the MPTCP ext")
  efe686ffce ("mptcp: ensure tx skbs always have the MPTCP ext")

same patch merged in both trees, keep net-next.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-23 11:19:49 -07:00
Eric Dumazet
d8b81175e4 tcp: remove sk_{tr}x_skb_cache
This reverts the following patches :

- commit 2e05fcae83 ("tcp: fix compile error if !CONFIG_SYSCTL")
- commit 4f661542a4 ("tcp: fix zerocopy and notsent_lowat issues")
- commit 472c2e07ee ("tcp: add one skb cache for tx")
- commit 8b27dae5a2 ("tcp: add one skb cache for rx")

Having a cache of one skb (in each direction) per TCP socket is fragile,
since it can cause a significant increase of memory needs,
and not good enough for high speed flows anyway where more than one skb
is needed.

We want instead to add a generic infrastructure, with more flexible
per-cpu caches, for alien NUMA nodes.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-23 12:50:26 +01:00
Masanari Iida
3e95cfa24e Doc: networking: Fox a typo in ice.rst
This patch fixes a spelling typo in ice.rst

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-21 11:01:25 +01:00
M Chetan Kumar
64302024bc net: wwan: iosm: devlink fw flashing & cd collection documentation
Documents devlink params, fw update & cd collection commands
and its usage.

Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-20 10:03:37 +01:00
Alejandro Concepcion-Rodriguez
48e6d083b3 docs: net: dsa: sja1105: fix reference to sja1105.txt
The file sja1105.txt was converted to nxp,sja1105.yaml.

Signed-off-by: Alejandro Concepcion-Rodriguez <asconcepcion@acoro.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-19 12:45:03 +01:00
Linus Torvalds
626bf91a29 Networking stragglers and fixes for 5.15-rc1, including changes from netfilter,
wireless and can.
 
 Current release - regressions:
 
  - qrtr: revert check in qrtr_endpoint_post(), fixes audio and wifi
 
  - ip_gre: validate csum_start only on pull
 
  - bnxt_en: fix 64-bit doorbell operation on 32-bit kernels
 
  - ionic: fix double use of queue-lock, fix a sleeping in atomic
 
  - can: c_can: fix null-ptr-deref on ioctl()
 
  - cs89x0: disable compile testing on powerpc
 
 Current release - new code bugs:
 
  - bridge: mcast: fix vlan port router deadlock, consistently disable BH
 
 Previous releases - regressions:
 
  - dsa: tag_rtl4_a: fix egress tags, only port 0 was working
 
  - mptcp: fix possible divide by zero
 
  - netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
 
  - netfilter: socket: icmp6: fix use-after-scope
 
  - stmmac: fix MAC not working when system resume back with WoL active
 
 Previous releases - always broken:
 
  - ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL
                address
 
  - seg6: set fc_nlinfo in nh_create_ipv4, nh_create_ipv6
 
  - mptcp: only send extra TCP acks in eligible socket states
 
  - dsa: lantiq_gswip: fix maximum frame length
 
  - stmmac: fix overall budget calculation for rxtx_napi
 
  - bnxt_en: fix firmware version reporting via devlink
 
  - renesas: sh_eth: add missing barrier to fix freeing wrong tx descriptor
 
 Stragglers:
 
  - netfilter: conntrack: switch to siphash
 
  - netfilter: refuse insertion if chain has grown too large
 
  - ncsi: add get MAC address command to get Intel i210 MAC address
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmE3uicACgkQMUZtbf5S
 IrtJVA//XdE8qAmw1JukjyYC87JH2ale20eoZ6ERn7/09e4tdv3M6dOTI4YfrM6+
 CMNP5MP2qit3IzY+lN0+yt9AAFH7k85z3MA8zLxsXN4z63OJcZvFv/G/OWy4Wp/0
 vOo/DH+rF3LR+fZZvjJI+8Xi9/orsRpD12cwGmjGRxybh+XcnHKI/GvK2RgE6oBR
 015RfBbbQBpzFQvESLnSwDzabN1XFEL1x/bz7N8ek3okfO/tab+f3E1tb6eYtTy+
 jyDyOWpayd4xDttKNMUuxwS1q+/oAWOAq8PzkaF/ZG2sBH1Z4yZN9ZtsLNZmPG8N
 5L1FEem/Nmgr54T9v/FhfiryhhGGysVfVgtQcCBkKRmVn1Kk2L6dFvtuanPtFFd3
 llbi5PvCDJy3rbMmxKmyoM3T4jpMwWxQRZKsosw+k/WQfb8/SUOjgpY713V1Wx/P
 S+2uadU4l9Ql9sF6X0IqZABnnt+j/BuDo6C6vVq7vyj0iQ9hEX9YxC0ybrAHOYpH
 suHWKndodRfTxxVOg8xRNYwXyRLNbm1AP6LMDNKBlFUjwNSZ362qFX7W7DuXoRup
 Rrnb8V1QFvM+pyFb2a0qNtBS68IXbjCdVQX5e8a5ELaAUnDPefNrfPN+/rrTLEtV
 LnusmBF+02llVSYdr88t1e+LmzqS/aqXFy2ry4y6owjq20ld2O0=
 =Zvuz
 -----END PGP SIGNATURE-----

Merge tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes and stragglers from Jakub Kicinski:
 "Networking stragglers and fixes, including changes from netfilter,
  wireless and can.

  Current release - regressions:

   - qrtr: revert check in qrtr_endpoint_post(), fixes audio and wifi

   - ip_gre: validate csum_start only on pull

   - bnxt_en: fix 64-bit doorbell operation on 32-bit kernels

   - ionic: fix double use of queue-lock, fix a sleeping in atomic

   - can: c_can: fix null-ptr-deref on ioctl()

   - cs89x0: disable compile testing on powerpc

  Current release - new code bugs:

   - bridge: mcast: fix vlan port router deadlock, consistently disable
     BH

  Previous releases - regressions:

   - dsa: tag_rtl4_a: fix egress tags, only port 0 was working

   - mptcp: fix possible divide by zero

   - netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex

   - netfilter: socket: icmp6: fix use-after-scope

   - stmmac: fix MAC not working when system resume back with WoL active

  Previous releases - always broken:

   - ip/ip6_gre: use the same logic as SIT interfaces when computing
     v6LL address

   - seg6: set fc_nlinfo in nh_create_ipv4, nh_create_ipv6

   - mptcp: only send extra TCP acks in eligible socket states

   - dsa: lantiq_gswip: fix maximum frame length

   - stmmac: fix overall budget calculation for rxtx_napi

   - bnxt_en: fix firmware version reporting via devlink

   - renesas: sh_eth: add missing barrier to fix freeing wrong tx
     descriptor

  Stragglers:

   - netfilter: conntrack: switch to siphash

   - netfilter: refuse insertion if chain has grown too large

   - ncsi: add get MAC address command to get Intel i210 MAC address"

* tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
  ieee802154: Remove redundant initialization of variable ret
  net: stmmac: fix MAC not working when system resume back with WoL active
  net: phylink: add suspend/resume support
  net: renesas: sh_eth: Fix freeing wrong tx descriptor
  bonding: 3ad: pass parameter bond_params by reference
  cxgb3: fix oops on module removal
  can: c_can: fix null-ptr-deref on ioctl()
  can: rcar_canfd: add __maybe_unused annotation to silence warning
  net: wwan: iosm: Unify IO accessors used in the driver
  net: wwan: iosm: Replace io.*64_lo_hi() with regular accessors
  net: qcom/emac: Replace strlcpy with strscpy
  ip6_gre: Revert "ip6_gre: add validation for csum_start"
  net: hns3: make hclgevf_cmd_caps_bit_map0 and hclge_cmd_caps_bit_map0 static
  selftests/bpf: Test XDP bonding nest and unwind
  bonding: Fix negative jump label count on nested bonding
  MAINTAINERS: add VM SOCKETS (AF_VSOCK) entry
  stmmac: dwmac-loongson:Fix missing return value
  iwlwifi: fix printk format warnings in uefi.c
  net: create netdev->dev_addr assignment helpers
  bnxt_en: Fix possible unintended driver initiated error recovery
  ...
2021-09-07 14:02:58 -07:00
Jakub Kicinski
10905b4a68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Protect nft_ct template with global mutex, from Pavel Skripkin.

2) Two recent commits switched inet rt and nexthop exception hashes
   from jhash to siphash. If those two spots are problematic then
   conntrack is affected as well, so switch voer to siphash too.
   While at it, add a hard upper limit on chain lengths and reject
   insertion if this is hit. Patches from Florian Westphal.

3) Fix use-after-scope in nf_socket_ipv6 reported by KASAN,
   from Benjamin Hesmans.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
  netfilter: socket: icmp6: fix use-after-scope
  netfilter: refuse insertion if chain has grown too large
  netfilter: conntrack: switch to siphash
  netfilter: conntrack: sanitize table size default settings
  netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
====================

Link: https://lore.kernel.org/r/20210903163020.13741-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-03 16:20:37 -07:00
Linus Torvalds
4ac6d90867 Yet another set of documentation changes:
- A reworking of PDF generation to yield better results for documents
    using CJK fonts in particular.
 
  - A new set of translations into traditional Chinese, a dialect for which
    I am assured there is a community of interested readers.
 
  - A lot more regular Chinese translation work as well.
 
 ...plus the usual assortment of updates, fixes, typo tweaks, etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmEugrgACgkQF0NaE2wM
 fliWWQf/RXf34QkMIe+r77WlTRKc+/6R/cO9VlYPtM9vqreKHZZvGgM1t76aOusb
 M5QHwQGoZDzaE1wrv0PPm00HtB0Tw7GfZRUbZ4D+niJD1+gcbDTkTR6NdjOvWWUR
 zHX2Sx8KJiNrFDtLtRtlUexM8GD124KZ0A8GF6Hpu3WR3HTFDInTdiylUOmj/4eO
 3zUGgrJnUVzkqHLGZzV/kmE4kEHGpxyps2JwGq2iF7362t8R6xH3mEdKKKc1pUpx
 lGSxfHs+OPWRsNxVJsdYh8kneIpML8OK6lKda1pzwNj8QhIMz/6tZoutKziHsalI
 HkbC3exh+SHak2U6Had303vqkIM7cg==
 =2QUy
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.15' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "Yet another set of documentation changes:

   - A reworking of PDF generation to yield better results for documents
     using CJK fonts in particular.

   - A new set of translations into traditional Chinese, a dialect for
     which I am assured there is a community of interested readers.

   - A lot more regular Chinese translation work as well.

  ... plus the usual assortment of updates, fixes, typo tweaks, etc"

* tag 'docs-5.15' of git://git.lwn.net/linux: (55 commits)
  docs: sphinx-requirements: Move sphinx_rtd_theme to top
  docs: pdfdocs: Enable language-specific font choice of zh_TW translations
  docs: pdfdocs: Teach xeCJK about character classes of quotation marks
  docs: pdfdocs: Permit AutoFakeSlant for CJK fonts
  docs: pdfdocs: One-half spacing for CJK translations
  docs: pdfdocs: Add conf.py local to translations for ascii-art alignment
  docs: pdfdocs: Preserve inter-phrase space in Korean translations
  docs: pdfdocs: Choose Serif font as CJK mainfont if possible
  docs: pdfdocs: Add CJK-language-specific font settings
  docs: pdfdocs: Refactor config for CJK document
  scripts/kernel-doc: Override -Werror from KCFLAGS with KDOC_WERROR
  docs/zh_CN: Add zh_CN/accounting/psi.rst
  doc: align Italian translation
  Documentation/features/vm: riscv supports THP now
  docs/zh_CN: add infiniband user_verbs translation
  docs/zh_CN: add infiniband user_mad translation
  docs/zh_CN: add infiniband tag_matching translation
  docs/zh_CN: add infiniband sysfs translation
  docs/zh_CN: add infiniband opa_vnic translation
  docs/zh_CN: add infiniband ipoib translation
  ...
2021-09-01 18:49:47 -07:00
Jakub Kicinski
19a31d7921 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
bpf-next 2021-08-31

We've added 116 non-merge commits during the last 17 day(s) which contain
a total of 126 files changed, 6813 insertions(+), 4027 deletions(-).

The main changes are:

1) Add opaque bpf_cookie to perf link which the program can read out again,
   to be used in libbpf-based USDT library, from Andrii Nakryiko.

2) Add bpf_task_pt_regs() helper to access userspace pt_regs, from Daniel Xu.

3) Add support for UNIX stream type sockets for BPF sockmap, from Jiang Wang.

4) Allow BPF TCP congestion control progs to call bpf_setsockopt() e.g. to switch
   to another congestion control algorithm during init, from Martin KaFai Lau.

5) Extend BPF iterator support for UNIX domain sockets, from Kuniyuki Iwashima.

6) Allow bpf_{set,get}sockopt() calls from setsockopt progs, from Prankur Gupta.

7) Add bpf_get_netns_cookie() helper for BPF_PROG_TYPE_{SOCK_OPS,CGROUP_SOCKOPT}
   progs, from Xu Liu and Stanislav Fomichev.

8) Support for __weak typed ksyms in libbpf, from Hao Luo.

9) Shrink struct cgroup_bpf by 504 bytes through refactoring, from Dave Marchevsky.

10) Fix a smatch complaint in verifier's narrow load handling, from Andrey Ignatov.

11) Fix BPF interpreter's tail call count limit, from Daniel Borkmann.

12) Big batch of improvements to BPF selftests, from Magnus Karlsson, Li Zhijian,
    Yucong Sun, Yonghong Song, Ilya Leoshkevich, Jussi Maki, Ilya Leoshkevich, others.

13) Another big batch to revamp XDP samples in order to give them consistent look
    and feel, from Kumar Kartikeya Dwivedi.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (116 commits)
  MAINTAINERS: Remove self from powerpc BPF JIT
  selftests/bpf: Fix potential unreleased lock
  samples: bpf: Fix uninitialized variable in xdp_redirect_cpu
  selftests/bpf: Reduce more flakyness in sockmap_listen
  bpf: Fix bpf-next builds without CONFIG_BPF_EVENTS
  bpf: selftests: Add dctcp fallback test
  bpf: selftests: Add connect_to_fd_opts to network_helpers
  bpf: selftests: Add sk_state to bpf_tcp_helpers.h
  bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt
  selftests: xsk: Preface options with opt
  selftests: xsk: Make enums lower case
  selftests: xsk: Generate packets from specification
  selftests: xsk: Generate packet directly in umem
  selftests: xsk: Simplify cleanup of ifobjects
  selftests: xsk: Decrease sending speed
  selftests: xsk: Validate tx stats on tx thread
  selftests: xsk: Simplify packet validation in xsk tests
  selftests: xsk: Rename worker_* functions that are not thread entry points
  selftests: xsk: Disassociate umem size with packets sent
  selftests: xsk: Remove end-of-test packet
  ...
====================

Link: https://lore.kernel.org/r/20210830225618.11634-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-30 16:42:47 -07:00
David S. Miller
9dfa859da0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Clean up and consolidate ct ecache infrastructure by merging ct and
   expect notifiers, from Florian Westphal.

2) Missing counters and timestamp in nfnetlink_queue and _log conntrack
   information.

3) Missing error check for xt_register_template() in iptables mangle,
   as a incremental fix for the previous pull request, also from
   Florian Westphal.

4) Add netfilter hooks for the SRv6 lightweigh tunnel driver, from
   Ryoga Sato. The hooks are enabled via nf_hooks_lwtunnel sysctl
   to make sure existing netfilter rulesets do not break. There is
   a static key to disable the hooks by default.

   The pktgen_bench_xmit_mode_netif_receive.sh shows no noticeable
   impact in the seg6_input path for non-netfilter users: similar
   numbers with and without this patch.

   This is a sample of the perf report output:

    11.67%  kpktgend_0       [ipv6]                    [k] ipv6_get_saddr_eval
     7.89%  kpktgend_0       [ipv6]                    [k] __ipv6_addr_label
     7.52%  kpktgend_0       [ipv6]                    [k] __ipv6_dev_get_saddr
     6.63%  kpktgend_0       [kernel.vmlinux]          [k] asm_exc_nmi
     4.74%  kpktgend_0       [ipv6]                    [k] fib6_node_lookup_1
     3.48%  kpktgend_0       [kernel.vmlinux]          [k] pskb_expand_head
     3.33%  kpktgend_0       [ipv6]                    [k] ip6_rcv_core.isra.29
     3.33%  kpktgend_0       [ipv6]                    [k] seg6_do_srh_encap
     2.53%  kpktgend_0       [ipv6]                    [k] ipv6_dev_get_saddr
     2.45%  kpktgend_0       [ipv6]                    [k] fib6_table_lookup
     2.24%  kpktgend_0       [kernel.vmlinux]          [k] ___cache_free
     2.16%  kpktgend_0       [ipv6]                    [k] ip6_pol_route
     2.11%  kpktgend_0       [kernel.vmlinux]          [k] __ipv6_addr_type
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-30 10:57:54 +01:00
Florian Westphal
d532bcd0b2 netfilter: conntrack: sanitize table size default settings
conntrack has two distinct table size settings:
nf_conntrack_max and nf_conntrack_buckets.

The former limits how many conntrack objects are allowed to exist
in each namespace.

The second sets the size of the hashtable.

As all entries are inserted twice (once for original direction, once for
reply), there should be at least twice as many buckets in the table than
the maximum number of conntrack objects that can exist at the same time.

Change the default multiplier to 1 and increase the chosen bucket sizes.
This results in the same nf_conntrack_max settings as before but reduces
the average bucket list length.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-30 11:49:54 +02:00
Ryoga Saito
7a3f5b0de3 netfilter: add netfilter hooks to SRv6 data plane
This patch introduces netfilter hooks for solving the problem that
conntrack couldn't record both inner flows and outer flows.

This patch also introduces a new sysctl toggle for enabling lightweight
tunnel netfilter hooks.

Signed-off-by: Ryoga Saito <contact@proelbtn.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-30 01:51:36 +02:00
Juhee Kang
246b184fff pktgen: document the latest pktgen usage options
Currently, the pktgen.rst documentation doesn't cover the latest pktgen
sample usage options such as count and IPv6, and so on. Also, this
documentation includes the old sample scripts which are no longer use
because it was removed by the commit a4b6ade835 ("samples/pktgen :
remove remaining old pktgen sample scripts")

Thus, this commit documents pktgen sample usage using the latest options
and removes old sample scripts, and fixes a minor typo.

Signed-off-by: Juhee Kang <claudiajkang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-25 13:44:30 +01:00
Yufeng Mo
029ee6b143 ethtool: add two coalesce attributes for CQE mode
Currently, there are many drivers who support CQE mode configuration,
some configure it as a fixed when initialized, some provide an
interface to change it by ethtool private flags. In order to make it
more generic, add two new 'ETHTOOL_A_COALESCE_USE_CQE_TX' and
'ETHTOOL_A_COALESCE_USE_CQE_RX' coalesce attributes, then these
parameters can be accessed by ethtool netlink coalesce uAPI.

Also add an new structure kernel_ethtool_coalesce, then the
new parameter can be added into this struct.

Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-24 07:38:28 -07:00
Benjamin Poirier
b1165777fe doc: Document unexpected tcp_l3mdev_accept=1 behavior
As suggested by David, document a somewhat unexpected behavior that results
from net.ipv4.tcp_l3mdev_accept=1. This behavior was encountered while
debugging FRR, a VRF-aware application, on a system which used
net.ipv4.tcp_l3mdev_accept=1 and where TCP connections for BGP with MD5
keys were failing to establish.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-23 11:53:24 +01:00
Vladimir Oltean
95ca38194c docs: net: dsa: document the new methods for bridge TX forwarding offload
Two new methods have been introduced, add some verbiage about what they do.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22 09:49:04 +01:00
Vladimir Oltean
37f299d989 docs: net: dsa: remove references to struct dsa_device_ops::filter
This function has disappeared in commit edac6f6332 ("Revert "net: dsa:
Allow drivers to filter packets they can decode source port from"").

Also, since commit 4e50025129 ("net: dsa: generalize overhead for
taggers that use both headers and trailers"), the next paragraph is no
longer true (it is still discouraged to do that, but it is now
supported, so no point in mentioning it). Delete.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22 09:49:04 +01:00
Vladimir Oltean
5702d94bd9 docs: net: dsa: sja1105: update list of limitations
Remove the paragraphs that talk about the various modes of traffic
support, bridging with foreign interfaces, etc etc. There is nothing
that the user needs to know now, it should all work out of the box as
expected.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22 09:49:04 +01:00
Vladimir Oltean
27dd613f10 docs: devlink: remove the references to sja1105
The sja1105 driver has removed its devlink params, so there is nothing
to see here.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-22 09:49:03 +01:00
David S. Miller
f96b48c621 mlx5-updates-2021-08-19
This series introduces the support for two new mlx5 features:
 
 1) Sample offload for tunneled traffic
 2) devlink rate objects support
 
 1) From Chris Mi: Sample offload for tunneled traffic
 =====================================================
 
 Background and solution
 -----------------------
 
 Currently the sample offload actions send the encapsulated packet
 to software. This series de-capsulates the packet before performing
 the sampling and set the tunnel properties on the skb metadata
 fields to make the behavior consistent with OVS sFlow.
 
 If de-capsulating first, we can't use the same match like before in
 default table. So instantiate a post action instance to continue
 processing the action list. If HW can preserve reg_c, also use the
 post action instance.
 
 Post action infrastructure
 --------------------------
 
 Some tc actions are modeled in hardware using multiple tables
 causing a tc action list split. For example, CT action is modeled
 by jumping to a ct table which is controlled by nf flow table.
 sFlow jumps in hardware to a sample table, which continues to a
 "default table" where it should continue processing the action list.
 
 Multi table actions are modeled in hardware using a unique fte_id.
 The fte_id is set before jumping to a table. Split actions continue
 to a post-action table where the matched fte_id value continues the
 execution the tc action list.
 
 This series also introduces post action infrastructure. Both ct and
 sample use it.
 
 Sample for tunnel in TC SW
 --------------------------
 
 tc filter add dev vxlan1 protocol ip parent ffff: prio 3		\
 	flower src_mac 24:25:d0:e1:00:00 dst_mac 02:25:d0:13:01:02	\
 	enc_src_ip 192.168.1.14 enc_dst_ip 192.168.1.13			\
 	enc_dst_port 4789 enc_key_id 4					\
 	action sample rate 1 group 6					\
 	action tunnel_key unset						\
 	action mirred egress redirect dev enp4s0f0_1
 
 MLX5 sample HW offload
 ----------------------
 
 For the following typical flow table:
 
 +-------------------------------+
 +       original flow table     +
 +-------------------------------+
 +         original match        +
 +-------------------------------+
 + sample action + other actions +
 +-------------------------------+
 
 We translate the tc filter with sample action to the following HW model:
 
         +---------------------+
         + original flow table +
         +---------------------+
         +   original match    +
         +---------------------+
               | set fte_id (if reg_c preserve cap)
               | do decap
               v
 +------------------------------------------------+
 +                Flow Sampler Object             +
 +------------------------------------------------+
 +                    sample ratio                +
 +------------------------------------------------+
 +    sample table id    |    default table id    +
 +------------------------------------------------+
            |                            |
            v                            v
 +-----------------------------+  +-------------------+
 +        sample table         +  +   default table   +
 +-----------------------------+  +-------------------+
 + forward to management vport +             |
 +-----------------------------+             |
                                     +-------+------+
                                     |              |reg_c preserve cap
                                     |              |or decap action
                                     v              v
                        +-----------------+   +-------------+
                        + per vport table +   + post action +
                        +-----------------+   +-------------+
                        + original match  +
                        +-----------------+
                        + other actions   +
                        +-----------------+
 
 2) From Dmytro Linkin: devlink rate object support for mlx5_core driver
 =======================================================================
 
 HIGH-LEVEL OVERVIEW
 
 Devlink leaf rate objects created per vport (VF/SF, and PF on BlueField)
 in switchdev mode on devlink port registration.
 Implement devlink ops callbacks to create/destroy rate groups, set TX
 rate values of the vport/group, assign vport to the group.
 Driver accepts TX rate values as fraction of 1Mbps.
 
 Refactor existing eswitch QoS infrastructure to be accessible by legacy
 NDO rate API and new devlink rate API. NDO rate API is not
 removed/disabled in switchdev mode to not break existing users. Rate
 values configured with NDO rate API are not visible for devlink
 infrastructure, therefore APIs should not be used simultaneously.
 
 IMPLEMENTATION DETAILS
 
 Driver provide two level rate hierarchy to manage bandwidth - group
 level and vport level. Initially each vport added to internal unlimited
 group created by default. Each rate element (vport or group) receive
 bandwidth relative to its parent element (for groups the parent is a
 physical link itself) in a Round Robin manner, where element get
 bandwidth value according to its weight. Example:
 
 Created four rate groups with tx_share limits:
 
 $ devlink port function rate add \
     pci/0000:06:00.0/group_1 tx_share 30gbit
 $ devlink port function rate add \
     pci/0000:06:00.0/group_2 tx_share 20gbit
 $ devlink port function rate add \
     pci/0000:06:00.0/group_3 tx_share 20gbit
 $ devlink port function rate add \
     pci/0000:06:00.0/group_4 tx_share 10gbit
 
 Weights created in HW for each group are relative to the bigest tx_share
 value, which is 30gbit:
 
 <group_1> 1.0
 <group_2> 0.67
 <group_3> 0.67
 <group_4> 0.33
 
 Assuming link speed is 50 Gbit/sec and each group can sustain such
 amount of traffic, maximum bandwidth is 50 / (1.0 + 0.67 + 0.67 + 0.33)
  = ~18.75 Gbit/sec. Normilized bandwidth values for groups:
 
 <group_1> 18.75 * 1.0  = 18.75 Gbit/sec
 <group_2> 18.75 * 0.67 = 12.5 Gbit/sec
 <group_3> 18.75 * 0.67 = 12.5 Gbit/sec
 <group_4> 18.75 * 0.33 = 6.25 Gbit/sec
 
 If in example above group_1 doesn't produce any traffic, then maximum
 bandwidth becomes 50 / (0.67 + 0.67 + 0.33) = ~30.0 Gbit/sec. Normalized
 values:
 
 <group_2> 30.0 * 0.67 = 20.0 Gbit/sec
 <group_3> 30.0 * 0.67 = 20.0 Gbit/sec
 <group_4> 30.0 * 0.33 = 10.0 Gbit/sec
 
 Same normalization applied to each vport in the group.
 
 Normalized values are internal, therefore driver provides QoS
 tracepoints for next events:
 
 * vport rate element creation/deletion:
 * vport rate element configuration;
 * group rate element creation/deletion;
 * group rate element configuration.
 
 PATCHES OVERVIEW
 
 1 - Moving and isolation of eswitch QoS logic in separate file;
 
 2 - Implement devlink leaf rate object support for vports;
 
 3 - Implement rate groups creation/deletion;
 
 4 - Implement TX rate management for the groups;
 
 5 - Implement parent set for vports;
 
 6 - Eswitch QoS tracepoints.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmEfNKEACgkQSD+KveBX
 +j4DVgf/ZTX3n7xDVrqNgM3hkUOT7QKVCq5zUlDKw1IizE6I+8xB6LO3KmUPyWIn
 +VBXE3c+aqUNZu4XlqdkVn1JskVdfTdAGXIIeBgMQskJ/VFCNqU4E9uieEmpiHnI
 DUUkEI6eBURJSu1KPD0xdMqtdGJE+/KjwmfZFnCsa4uxmRuV7B0BdxzGIA6AMFKn
 +jNS/PFbQM6bUDqP2UUwd97sThtTzDIVH86gu36yK/mdcwdLreqKeuxoHJWePWHC
 qLReBC5OQ9zXD2F1Dv2u3WU7EJT7qyCLNrBUrTwHcR9N0Di+2a6lGvjRL5tjWKKC
 KOrNkkviurmPp+VieJU+rHHYQwjoGQ==
 =XfOY
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2021-08-19' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2021-08-19

This series introduces the support for two new mlx5 features:

1) Sample offload for tunneled traffic
2) devlink rate objects support

1) From Chris Mi: Sample offload for tunneled traffic
=====================================================

Background and solution
-----------------------

Currently the sample offload actions send the encapsulated packet
to software. This series de-capsulates the packet before performing
the sampling and set the tunnel properties on the skb metadata
fields to make the behavior consistent with OVS sFlow.

If de-capsulating first, we can't use the same match like before in
default table. So instantiate a post action instance to continue
processing the action list. If HW can preserve reg_c, also use the
post action instance.

Post action infrastructure
--------------------------

Some tc actions are modeled in hardware using multiple tables
causing a tc action list split. For example, CT action is modeled
by jumping to a ct table which is controlled by nf flow table.
sFlow jumps in hardware to a sample table, which continues to a
"default table" where it should continue processing the action list.

Multi table actions are modeled in hardware using a unique fte_id.
The fte_id is set before jumping to a table. Split actions continue
to a post-action table where the matched fte_id value continues the
execution the tc action list.

This series also introduces post action infrastructure. Both ct and
sample use it.

Sample for tunnel in TC SW
--------------------------

tc filter add dev vxlan1 protocol ip parent ffff: prio 3		\
	flower src_mac 24:25:d0:e1:00:00 dst_mac 02:25:d0:13:01:02	\
	enc_src_ip 192.168.1.14 enc_dst_ip 192.168.1.13			\
	enc_dst_port 4789 enc_key_id 4					\
	action sample rate 1 group 6					\
	action tunnel_key unset						\
	action mirred egress redirect dev enp4s0f0_1

MLX5 sample HW offload
----------------------

For the following typical flow table:

+-------------------------------+
+       original flow table     +
+-------------------------------+
+         original match        +
+-------------------------------+
+ sample action + other actions +
+-------------------------------+

We translate the tc filter with sample action to the following HW model:

        +---------------------+
        + original flow table +
        +---------------------+
        +   original match    +
        +---------------------+
              | set fte_id (if reg_c preserve cap)
              | do decap
              v
+------------------------------------------------+
+                Flow Sampler Object             +
+------------------------------------------------+
+                    sample ratio                +
+------------------------------------------------+
+    sample table id    |    default table id    +
+------------------------------------------------+
           |                            |
           v                            v
+-----------------------------+  +-------------------+
+        sample table         +  +   default table   +
+-----------------------------+  +-------------------+
+ forward to management vport +             |
+-----------------------------+             |
                                    +-------+------+
                                    |              |reg_c preserve cap
                                    |              |or decap action
                                    v              v
                       +-----------------+   +-------------+
                       + per vport table +   + post action +
                       +-----------------+   +-------------+
                       + original match  +
                       +-----------------+
                       + other actions   +
                       +-----------------+

2) From Dmytro Linkin: devlink rate object support for mlx5_core driver
=======================================================================

HIGH-LEVEL OVERVIEW

Devlink leaf rate objects created per vport (VF/SF, and PF on BlueField)
in switchdev mode on devlink port registration.
Implement devlink ops callbacks to create/destroy rate groups, set TX
rate values of the vport/group, assign vport to the group.
Driver accepts TX rate values as fraction of 1Mbps.

Refactor existing eswitch QoS infrastructure to be accessible by legacy
NDO rate API and new devlink rate API. NDO rate API is not
removed/disabled in switchdev mode to not break existing users. Rate
values configured with NDO rate API are not visible for devlink
infrastructure, therefore APIs should not be used simultaneously.

IMPLEMENTATION DETAILS

Driver provide two level rate hierarchy to manage bandwidth - group
level and vport level. Initially each vport added to internal unlimited
group created by default. Each rate element (vport or group) receive
bandwidth relative to its parent element (for groups the parent is a
physical link itself) in a Round Robin manner, where element get
bandwidth value according to its weight. Example:

Created four rate groups with tx_share limits:

$ devlink port function rate add \
    pci/0000:06:00.0/group_1 tx_share 30gbit
$ devlink port function rate add \
    pci/0000:06:00.0/group_2 tx_share 20gbit
$ devlink port function rate add \
    pci/0000:06:00.0/group_3 tx_share 20gbit
$ devlink port function rate add \
    pci/0000:06:00.0/group_4 tx_share 10gbit

Weights created in HW for each group are relative to the bigest tx_share
value, which is 30gbit:

<group_1> 1.0
<group_2> 0.67
<group_3> 0.67
<group_4> 0.33

Assuming link speed is 50 Gbit/sec and each group can sustain such
amount of traffic, maximum bandwidth is 50 / (1.0 + 0.67 + 0.67 + 0.33)
 = ~18.75 Gbit/sec. Normilized bandwidth values for groups:

<group_1> 18.75 * 1.0  = 18.75 Gbit/sec
<group_2> 18.75 * 0.67 = 12.5 Gbit/sec
<group_3> 18.75 * 0.67 = 12.5 Gbit/sec
<group_4> 18.75 * 0.33 = 6.25 Gbit/sec

If in example above group_1 doesn't produce any traffic, then maximum
bandwidth becomes 50 / (0.67 + 0.67 + 0.33) = ~30.0 Gbit/sec. Normalized
values:

<group_2> 30.0 * 0.67 = 20.0 Gbit/sec
<group_3> 30.0 * 0.67 = 20.0 Gbit/sec
<group_4> 30.0 * 0.33 = 10.0 Gbit/sec

Same normalization applied to each vport in the group.

Normalized values are internal, therefore driver provides QoS
tracepoints for next events:

* vport rate element creation/deletion:
* vport rate element configuration;
* group rate element creation/deletion;
* group rate element configuration.

PATCHES OVERVIEW

1 - Moving and isolation of eswitch QoS logic in separate file;

2 - Implement devlink leaf rate object support for vports;

3 - Implement rate groups creation/deletion;

4 - Implement TX rate management for the groups;

5 - Implement parent set for vports;

6 - Eswitch QoS tracepoints.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20 12:49:04 +01:00
David S. Miller
815cc21d8d This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - update docs about move IRC channel away from freenode,
    by Sven Eckelmann
 
  - Switch to kstrtox.h for kstrtou64, by Sven Eckelmann
 
  - Update NULL checks, by Sven Eckelmann (2 patches)
 
  - remove remaining skb-copy calls for broadcast packets,
    by Linus Lüssing
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmEeeK8WHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeofhWD/0YwNndM/FFo/NHcO3GFDZx9eLM
 dFuO7zdMilzgg462q7+mgi0jXA2Kp50Y+JcCqS2XVRIsMgKTVABflgmSlUIOdDoC
 A3KKRVgQ1HNPD4WREaEV2CLvBdhR9wEI0jRHvZou7n/VWrfJcUHgdl9aDA2/ptlP
 NcuYCKC99HCQmvaBt4GZgOunYDeplmo2qLip2gpwJWf9/vkL7HiBe3HtQSh1HI2y
 EIn4SExZOFcxMmKeJMsYl35OZh9oFv7nTnpZBGyKjA+HS0pu03aaPNRGMjW/pdhF
 f7V61aDJBU0xU6PjWvUegY4VMInrjW8F10EEJck461J/B9PXjUHUaH8BXXuGBkRM
 0kU0Cv21a3Ovz23lgnXSnXu/xjqq5/zZHjnGvyPAMMppAI5f73q/0THtv9iOu+Cz
 Qf/tYl0BIRir20ZWtddQ9x2W3+cBYPOYrf/tnmWqFhPddenn+xitwTysVA6fOykQ
 pVksQ5UVpDZasZI9Al+R2M0CBttn7tS/iu95PV9CMST8aRgUuU90yd2Ocg3rRDNQ
 iEor0AozmO879W460BFQcTILw+D7OdlErUV8H8VW4507imZ7JXGPwZTFxhjM2Xhx
 wUXo/o2sxt/ITSdtZAeQj8zOXQMtOi3KlXtTl8ZzyRT//YLWah0j4oBf0a8K62/y
 i1Pd5MgXDQAm8fHkBg==
 =sHz+
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-pullrequest-20210819' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - update docs about move IRC channel away from freenode,
   by Sven Eckelmann

 - Switch to kstrtox.h for kstrtou64, by Sven Eckelmann

 - Update NULL checks, by Sven Eckelmann (2 patches)

 - remove remaining skb-copy calls for broadcast packets,
   by Linus Lüssing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-20 12:04:50 +01:00
Dmytro Linkin
3202ea65f8 net/mlx5: E-switch, Add QoS tracepoints
Add tracepoints to log QoS enabling/disabling/configuration for vports
and rate groups.

Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-19 21:50:41 -07:00
Andrii Nakryiko
fb7dd8bca0 bpf: Refactor BPF_PROG_RUN into a function
Turn BPF_PROG_RUN into a proper always inlined function. No functional and
performance changes are intended, but it makes it much easier to understand
what's going on with how BPF programs are actually get executed. It's more
obvious what types and callbacks are expected. Also extra () around input
parameters can be dropped, as well as `__` variable prefixes intended to avoid
naming collisions, which makes the code simpler to read and write.

This refactoring also highlighted one extra issue. BPF_PROG_RUN is both
a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN). Turning
BPF_PROG_RUN into a function causes naming conflict compilation error. So
rename BPF_PROG_RUN into lower-case bpf_prog_run(), similar to
bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of
BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210815070609.987780-2-andrii@kernel.org
2021-08-17 00:45:07 +02:00
Guangbin Huang
958ab281eb docs: ethtool: Add two link extended substates of bad signal integrity
Add documentation for two bad signal integrity substates:
ETHTOOL_LINK_EXT_SUBSTATE_BSI_SERDES_REFERENCE_CLOCK_LOST
ETHTOOL_LINK_EXT_SUBSTATE_BSI_SERDES_ALOS.

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-16 15:12:13 -07:00
Paolo Abeni
ff5a0b421c mptcp: faster active backup recovery
The msk can use backup subflows to transmit in-sequence data
only if there are no other active subflow. On active backup
scenario, the MPTCP connection can do forward progress only
due to MPTCP retransmissions - rtx can pick backup subflows.

This patch introduces a new flag flow MPTCP subflows: if the
underlying TCP connection made no progresses for long time,
and there are other less problematic subflows available, the
given subflow become stale.

Stale subflows are not considered active: if all non backup
subflows become stale, the MPTCP scheduler can pick backup
subflows for plain transmissions.

Stale subflows can return in active state, as soon as any reply
from the peer is observed.

Active backup scenarios can now leverage the available b/w
with no restrinction.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-14 11:37:25 +01:00
Jakub Kicinski
f4083a752a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h
  9e26680733 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp")
  9e518f2580 ("bnxt_en: 1PPS functions to configure TSIO pins")
  099fdeda65 ("bnxt_en: Event handler for PPS events")

kernel/bpf/helpers.c
include/linux/bpf-cgroup.h
  a2baf4e8bb ("bpf: Fix potentially incorrect results with bpf_get_local_storage()")
  c7603cfa04 ("bpf: Add ambient BPF runtime context stored in current")

drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
  5957cc557d ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray")
  2d0b41a376 ("net/mlx5: Refcount mlx5_irq with integer")

MAINTAINERS
  7b637cd52f ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo")
  7d901a1e87 ("net: phy: add Maxlinear GPY115/21x/24x driver")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-13 06:41:22 -07:00
Parav Pandit
076b2a9dbb devlink: Add new "enable_vnet" generic device param
Add new device generic parameter to enable/disable creation of
VDPA net auxiliary device and associated device functionality
in the devlink instance.

User who prefers to disable such functionality can disable it using below
example.

$ devlink dev param set pci/0000:06:00.0 \
              name enable_vnet value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device for the
VDPA net functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 14:34:21 +01:00
Parav Pandit
8ddaabee3c devlink: Add new "enable_rdma" generic device param
Add new device generic parameter to enable/disable creation of
RDMA auxiliary device and associated device functionality
in the devlink instance.

User who prefers to disable such functionality can disable it using below
example.

$ devlink dev param set pci/0000:06:00.0 \
              name enable_rdma value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device for the
RDMA functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 14:34:21 +01:00
Parav Pandit
f13a5ad881 devlink: Add new "enable_eth" generic device param
Add new device generic parameter to enable/disable creation of
Ethernet auxiliary device and associated device functionality
in the devlink instance.

User who prefers to disable such functionality can disable it using below
example.

$ devlink dev param set pci/0000:06:00.0 \
              name enable_eth value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device for the
Ethernet functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 14:34:21 +01:00
Sven Eckelmann
71d41c09f1 batman-adv: Move IRC channel to hackint.org
Due to recent developments around the Freenode.org IRC network, the
opinions about the usage of this service shifted dramatically. The majority
of the still active users of the #batman channel prefers a move to the
hackint.org network.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
2021-08-08 20:05:46 +02:00