The purpose of the delayed work in the SPAN module is to potentially
update the destination port and various encapsulation parameters of SPAN
agents that point to a VLAN device or a GRE tap. The destination port
can change following the insertion of a new route, for example.
SPAN agents that point to a physical port or the CPU port are static and
never change throughout the lifetime of the SPAN agent. Therefore, skip
over them in the delayed work.
This fixes an issue where the delayed work overwrites the policer
that was set on a SPAN agent pointing to the CPU. Modifying the delayed
work to inherit the original policer configuration is error-prone, as
the same will be needed for any new parameter.
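A minimal sketch of the resulting logic (types and names are illustrative, not the actual mlxsw symbols):

  struct span_agent {
          bool dest_is_static;    /* physical port or CPU port */
          /* destination port, encapsulation parameters, policer, ... */
  };

  static void span_respin_one(struct span_agent *agent)
  {
          /* Static destinations never change; returning early also
           * guarantees the delayed work cannot overwrite parameters
           * such as the policer that were set at creation time. */
          if (agent->dest_is_static)
                  return;

          /* Re-resolve destination port and encapsulation parameters,
           * e.g. after a new route was inserted. */
  }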
Fixes: 4039504e6a ("mlxsw: spectrum_span: Allow setting policer on a SPAN agent")
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Avoid trapping packets when the interface is down, and revive the traps when
the interface is back up. Add an API to mlx5 core that retrieves the action
by trap id. Use it to apply traps when the interface is up, and disable them
when the interface is down.
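A hedged sketch of the up/down flow (the trap list, its entry type and the
mlx5e_* helpers are illustrative; the get-action call approximates the core
API this patch describes):

  /* illustrative bookkeeping of configured traps */
  struct mlx5e_trap_entry {
          struct list_head list;
          int id;
  };

  static int mlx5e_apply_traps(struct mlx5e_priv *priv, bool up)
  {
          struct mlx5e_trap_entry *trap;
          enum devlink_trap_action action;
          int err;

          list_for_each_entry(trap, &priv->trap_list, list) {
                  if (!up) {
                          mlx5e_disable_trap(priv, trap->id);
                          continue;
                  }
                  /* the new core API: retrieve the action by trap id */
                  err = mlx5_devlink_traps_get_action(priv->mdev, trap->id,
                                                      &action);
                  if (err)
                          return err;
                  if (action == DEVLINK_TRAP_ACTION_TRAP)
                          mlx5e_enable_trap(priv, trap->id);
          }
          return 0;
  }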
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for trapping packets which didn't match any DMAC in the MAC
table. Add a listener which adds/removes the MAC trap rule in the flow
steering according to the trap's action (trap/drop).
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for listening to blocking events in the ETH driver. Listen
for trap events. When one is received, call mlx5e_handle_trap_event() which:
1) Verifies whether the driver needs to open/close the trap-RQ with respect
to the active trap count.
2) Inspects the trap id and its action (trap/drop) and adds/removes the flow
steering rule accordingly.
Otherwise, it returns an error.
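A sketch of the handler, close to the description above (helper names
approximate the patch):

  static int mlx5e_handle_trap_event(struct mlx5e_priv *priv,
                                     struct mlx5_trap_ctx *trap_ctx)
  {
          switch (trap_ctx->action) {
          case DEVLINK_TRAP_ACTION_TRAP:
                  /* opens the trap-RQ if this is the first active trap,
                   * then adds the flow steering rule */
                  return mlx5e_handle_action_trap(priv, trap_ctx->id);
          case DEVLINK_TRAP_ACTION_DROP:
                  /* removes the rule, and closes the trap-RQ if this
                   * was the last active trap */
                  return mlx5e_handle_action_drop(priv, trap_ctx->id);
          default:
                  return -EINVAL;
          }
  }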
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce mlx5e_trap, which includes a dedicated RQ and NAPI for trapped
packets. The trap-RQ processes packets that were destined to be dropped,
but for the sake of debug and visibility these packets are trapped and
reported to devlink.
The trap-RQ connects the HW and the driver and is not a part of a
channel. Expose mlx5e_create_rq() and mlx5_core_destroy_rq() as API and
add dedicated RQ handlers which report trapped packets to devlink.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to support RQs outside of channel context, change the
mlx5e_init_di_list() signature to accept a NUMA node instead of a CPU.
In addition, expose the DMA info helpers as API. This API will be used
for RQ creation in other files in downstream patches.
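The signature change, sketched (the helper is internal to the driver; exact
parameter names may differ):

  /* before: tied to a channel's CPU */
  static int mlx5e_init_di_list(struct mlx5e_rq *rq, int wq_sz, int cpu);

  /* after: any NUMA node, so non-channel RQs (e.g. the trap-RQ) can
   * allocate close to the device; channel callers pass cpu_to_node(cpu) */
  static int mlx5e_init_di_list(struct mlx5e_rq *rq, int wq_sz, int node);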
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a flow group to the L2 table to hold the catch-all DMAC rule. Add API
which adds/removes the DMAC trap rule. This rule catches packets that were
destined to be dropped due to no match with the previous DMAC rules. The
trap rule steers these packets to the trap TIR related to the trap-RQ.
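A sketch of such a catch-all rule via the mlx5 flow steering API (flow group
creation elided; a NULL spec matches all packets, and tir_num identifies the
trap TIR):

  static struct mlx5_flow_handle *
  add_dmac_trap_rule(struct mlx5_flow_table *ft, u32 tir_num)
  {
          struct mlx5_flow_act flow_act = {
                  .action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
          };
          struct mlx5_flow_destination dest = {
                  .type = MLX5_FLOW_DESTINATION_TYPE_TIR,
                  .tir_num = tir_num,
          };

          return mlx5_add_flow_rules(ft, NULL, &flow_act, &dest, 1);
  }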
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a flow group to the VLAN table to hold the catch-all VLAN rule. Add
API which adds/removes the VLAN trap rule. This rule catches packets that
were destined to be dropped due to no match with the previous VLAN rules.
The trap rule steers these packets to the trap TIR related to the
trap-RQ.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Change the steering flow to optimize traffic in promiscuous mode. On demand,
add a high-priority table containing a catch-all rule. All incoming
packets are caught by this rule and steered directly to the TTC table.
Prior to this change, packets in promiscuous mode could suffer up to
4 steering hops before reaching the TTC table.
In addition, this patch allows us to add a catch-all rule at the end
of the MAC table to serve the MAC trap, with no impact on promiscuous mode
performance.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to allow the mlx5 core driver to trigger synchronous operations
towards its consumers, add a blocking events handler. Add wrappers for
blocking_notifier_[call_chain/chain_register/chain_unregister]. Add a trap
callback for action set and notify about this change. Following patches
in the set add a listener for this event.
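A sketch of the wrappers, built on the standard kernel blocking-notifier
primitives (the mlx5_* names approximate the patch; sw_nh stands for the
new software-events notifier head):

  int mlx5_blocking_notifier_register(struct mlx5_core_dev *dev,
                                      struct notifier_block *nb)
  {
          struct mlx5_events *events = dev->priv.events;

          return blocking_notifier_chain_register(&events->sw_nh, nb);
  }

  int mlx5_blocking_notifier_call_chain(struct mlx5_core_dev *dev,
                                        unsigned int event, void *data)
  {
          struct mlx5_events *events = dev->priv.events;

          return blocking_notifier_call_chain(&events->sw_nh, event, data);
  }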
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Change the naming of the events notifier head to clarify that it handles
only firmware events. Coming patches in the set add an event notifier for
software events.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The core driver is registered to the devlink traps service, which enables
the admin to redeem packets that were destined to be dropped for a
particular reason. Register the DMAC filter trap, allowing visibility of
packets that were filtered out by the MAC table.
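A registration sketch using the standard devlink traps API (the trap array
contents are abbreviated; the generic DMAC_FILTER trap id is the one this
patch hooks up):

  static const struct devlink_trap mlx5_traps_arr[] = {
          /* one entry describing the generic DMAC_FILTER trap,
           * with a default action of DROP */
  };

  static int mlx5_devlink_traps_register(struct devlink *devlink)
  {
          struct mlx5_core_dev *core_dev = devlink_priv(devlink);

          return devlink_traps_register(devlink, mlx5_traps_arr,
                                        ARRAY_SIZE(mlx5_traps_arr),
                                        core_dev);
  }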
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A non-NAT tuple entry is inserted just to the regular tuples
rhashtable (ct_tuples_ht) and not to the NATed tuples rhashtable
(ct_nat_tuples_ht). Commit bc562be967 ("net/mlx5e: CT: Save ct entries
tuples in hashtables") mixed up the return labels and names, so that on
cleanup or failure we still try to remove the entry from the NATed tuples
rhashtable.
Fix that by correctly checking whether it was a NATed tuples insertion
before removing it. While here, make the code more readable.
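A sketch of the corrected flow: unwind only the rhashtable the entry was
actually inserted into (the params objects and the NAT predicate are
abbreviated):

  err = rhashtable_insert_fast(&ct_priv->ct_tuples_ht,
                               &entry->tuple_node, tuples_ht_params);
  if (err)
          goto err_tuple;

  if (entry_is_nat(entry)) {      /* illustrative predicate */
          err = rhashtable_insert_fast(&ct_priv->ct_nat_tuples_ht,
                                       &entry->tuple_nat_node,
                                       nat_tuples_ht_params);
          if (err)
                  goto err_tuple_nat;
  }
  return 0;

  err_tuple_nat:
          rhashtable_remove_fast(&ct_priv->ct_tuples_ht,
                                 &entry->tuple_node, tuples_ht_params);
  err_tuple:
          /* no NAT-table removal here: the entry was never inserted there */
          return err;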
Fixes: bc562be967 ("net/mlx5e: CT: Save ct entries tuples in hashtables")
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Sometimes, channel params are changed without recreating the channels.
It happens in two basic cases: when the channels are closed, and when
the parameter being changed doesn't affect how channels are configured.
Such changes invoke a hardware command that might fail. The whole
operation should be reverted in such cases, but the code that restores
the parameters' values in the driver was missing. This commit adds this
handling.
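The fix pattern, sketched (set_hw_state() is a stand-in for the actual
MTU/LRO command; old_params/new_params are struct mlx5e_params values):

  old_params = priv->channels.params;
  priv->channels.params = new_params;

  err = set_hw_state(priv);       /* hardware command, may fail */
  if (err)
          priv->channels.params = old_params;  /* the missing rollback */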
Fixes: 2e20a15120 ("net/mlx5e: Fail safe mtu and lro setting")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Trust state may be changed without recreating the channels. It happens
when the channels are closed, and when channel parameters (min inline
mode) stay the same after changing the trust state. Changing the trust
state is a hardware command that may fail. The current code didn't
restore the channel parameters to their old values if an error happened
and the channels were closed. This commit adds handling for this case.
Fixes: 6e0504c698 ("net/mlx5e: Change inline mode correctly when changing trust state")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This commit addresses two issues related to changing the number of
queues when the channels are closed:
1. Missing call to mlx5e_num_channels_changed to update
real_num_tx_queues when the number of TCs is changed.
2. When mlx5e_num_channels_changed returns an error, the channel
parameters must be reverted.
Two Fixes: tags correspond to the first commits where these two issues
were introduced.
Fixes: 3909a12e79 ("net/mlx5e: Fix configuration of XPS cpumasks and netdev queues in corner cases")
Fixes: fa3748775b ("net/mlx5e: Handle errors from netif_set_real_num_{tx,rx}_queues")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, if a neighbour isn't valid when offloading tunnel encap rules,
we offload the original match and replace the original action with
"goto slow path" action. For this we use a temporary flow attribute based
on the original flow attribute and then change the action. Flow flags,
among them the CT flag, are still shared with the slow path rule
offload, so we end up parsing this flow as a CT + goto slow path rule.
Besides being unnecessary, CT action offload saves extra information in
the passed flow attribute, such as the created ct_flow and mod_hdr, which
is lost once the temporary flow attribute is freed.
When a neigh is updated and becomes valid, we offload the original CT rule
with the original CT action, which again creates a ct_flow and mod_hdr
and saves them in the flow's original attribute. Then we delete the slow
path rule with a temporary flow attribute based on the original updated
flow attribute, and we free the relevant ct_flow and mod_hdr.
Then, when tc deletes this flow, we try to free the ct_flow and mod_hdr
on the flow's attribute again.
To fix the issue, skip all further processing (CT/Sample/Split rules)
in offload/unoffload of slow path rules.
Call trace:
[ 758.850525] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000218
[ 758.952987] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[ 758.964170] Modules linked in: act_csum(E) act_pedit(E) act_tunnel_key(E) act_ct(E) nf_flow_table(E) xt_nat(E) ip6table_filter(E) ip6table_nat(E) xt_comment(E) ip6_tables(E) xt_conntrack(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) xt_addrtype(E) iptable_filter(E) iptable_nat(E) bpfilter(E) br_netfilter(E) bridge(E) stp(E) llc(E) xfrm_user(E) overlay(E) act_mirred(E) act_skbedit(E) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) esp6_offload(E) esp6(E) esp4_offload(E) esp4(E) xfrm_algo(E) mlx5_ib(OE) ib_uverbs(OE) geneve(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink_cttimeout(E) nfnetlink(E) mlx5_core(OE) act_gact(E) cls_flower(E) sch_ingress(E) openvswitch(E) nsh(E) nf_conncount(E) nf_nat(E) mlxfw(OE) psample(E) nf_conntrack(E) nf_defrag_ipv4(E) vfio_mdev(E) mdev(E) ib_core(OE) mlx_compat(OE) crct10dif_ce(E) uio_pdrv_genirq(E) uio(E) i2c_mlx(E) mlxbf_pmc(E) sbsa_gwdt(E) mlxbf_gige(E) gpio_mlxbf2(E) mlxbf_pka(E) mlx_trio(E) mlx_bootctl(E) bluefield_edac(E) knem(O)
[ 758.964225] ip_tables(E) mlxbf_tmfifo(E) ipv6(E) crc_ccitt(E) nf_defrag_ipv6(E)
[ 759.154186] CPU: 5 PID: 122 Comm: kworker/u16:1 Tainted: G OE 5.4.60-mlnx.52.gde81e85 #1
[ 759.172870] Hardware name: https://www.mellanox.com BlueField SoC/BlueField SoC, BIOS BlueField:3.5.0-2-gc1b5d64 Jan 4 2021
[ 759.195466] Workqueue: mlx5e mlx5e_rep_neigh_update [mlx5_core]
[ 759.207344] pstate: a0000005 (NzCv daif -PAN -UAO)
[ 759.217003] pc : mlx5_del_flow_rules+0x5c/0x160 [mlx5_core]
[ 759.228229] lr : mlx5_del_flow_rules+0x34/0x160 [mlx5_core]
[ 759.405858] Call trace:
[ 759.410804] mlx5_del_flow_rules+0x5c/0x160 [mlx5_core]
[ 759.421337] __mlx5_eswitch_del_rule.isra.43+0x5c/0x1c8 [mlx5_core]
[ 759.433963] mlx5_eswitch_del_offloaded_rule_ct+0x34/0x40 [mlx5_core]
[ 759.446942] mlx5_tc_rule_delete_ct+0x68/0x74 [mlx5_core]
[ 759.457821] mlx5_tc_ct_delete_flow+0x160/0x21c [mlx5_core]
[ 759.469051] mlx5e_tc_unoffload_fdb_rules+0x158/0x168 [mlx5_core]
[ 759.481325] mlx5e_tc_encap_flows_del+0x140/0x26c [mlx5_core]
[ 759.492901] mlx5e_rep_update_flows+0x11c/0x1ec [mlx5_core]
[ 759.504127] mlx5e_rep_neigh_update+0x160/0x200 [mlx5_core]
[ 759.515314] process_one_work+0x178/0x400
[ 759.523350] worker_thread+0x58/0x3e8
[ 759.530685] kthread+0x100/0x12c
[ 759.537152] ret_from_fork+0x10/0x18
[ 759.544320] Code: 97ffef55 51000673 3100067f 54ffff41 (b9421ab3)
[ 759.556548] ---[ end trace fab818bb1085832d ]---
Fixes: 4c3844d9e9 ("net/mlx5e: CT: Introduce connection tracking")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The cited commit introduced a new kconfig variable, CONFIG_MLX5_CLS_ACT,
to control compilation of the TC hardware offloads implementation.
When this configuration is disabled, the driver still wrongly
reports in ethtool that hw-tc-offload is supported.
Fix by reporting that hw-tc-offload is supported only when
CONFIG_MLX5_CLS_ACT is enabled.
Fixes: d956873f90 ("net/mlx5e: Introduce kconfig var for TC support")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Pages for the host PF and ECPF were stored in the same tree, so the ECPF
pages were being freed along with the host PF's when the host driver
unloaded.
Combine the function ID and ECPF flag to use as an index into the
x-array containing the trees to get a different tree for the host PF and
ECPF.
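A sketch of the combined index (bit layout illustrative):

  /* fold the ECPF flag into the low bit, so the host PF and the ECPF
   * get distinct trees in the x-array even for the same function ID */
  static u32 mlx5_pages_tree_index(u16 func_id, bool ecpf)
  {
          return ((u32)func_id << 1) | (ecpf ? 1 : 0);
  }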
Fixes: c6168161f6 ("net/mlx5: Add support for release all pages event")
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When IPSEC offload isn't active, the number of stats is not zero, but
the strings are not filled, leading to exposing stats with empty names.
Fix this by using the same condition for NUM_STATS and FILL_STRS.
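A sketch of the fix: one predicate gates both the stat count and the string
fill, so a non-zero count can never be paired with unfilled names (the
check shown is illustrative):

  static bool ipsec_stats_active(struct mlx5e_priv *priv)
  {
          return priv->ipsec != NULL;     /* same check for both paths */
  }

  /* NUM_STATS: return ipsec_stats_active(priv) ? NUM_IPSEC_COUNTERS : 0;
   * FILL_STRS: fill the strings only if ipsec_stats_active(priv)    */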
Fixes: 0aab3e1b04 ("net/mlx5e: IPSec, Expose IPsec HW stat only for supporting HW")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
"Unsupported key used:" appears in kernel log when flows with
unsupported key are used, arp fields for example.
OpenVSwitch was changed to match on arp fields by default that
caused this warning to appear in kernel log for every arp rule, which
can be a lot.
Fix by lowering print level from warning to debug.
Fixes: e3a2b7ed01 ("net/mlx5e: Support offload cls_flower with drop action")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Instead of returning directly, go to the error handling label to free
the allocated page.
Fixes: 5f29458b77 ("net/mlx5e: Support dump callback in TX reporter")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
rate_bytes_ps is a 64-bit field. It is passed as a 32-bit field to
apply_police_params(). Due to this, when the police rate is higher
than 4Gbps, the 32-bit calculation ignores the carry. This results
in an incorrect rate configuration on the device.
Fix it by performing a 64-bit calculation.
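Sketched, with the types spelled out (div64_u64() is the standard 64-bit
helper; the body is abbreviated):

  static int apply_police_params(struct mlx5e_priv *priv,
                                 u64 rate_bytes_ps)   /* was u32 */
  {
          /* 64-bit multiply/divide keeps the carry that a 32-bit
           * argument would drop for rates above ~4 Gbps */
          u64 rate_mbps = div64_u64(rate_bytes_ps * 8, 1000000);

          /* ... program rate_mbps into the device ... */
          return 0;
  }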
Fixes: fcb64c0f56 ("net/mlx5: E-Switch, add ingress rate support")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When we create the ft object we also init the rhltable in ft->fgs_hash.
So in the error flow, before kfree() of ft, we need to destroy that
rhltable.
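Sketched error flow (init_fg_stage() is a stand-in for whatever can fail
after rhltable_init()):

  err = rhltable_init(&ft->fgs_hash, &rhash_fg);
  if (err)
          goto free_ft;

  err = init_fg_stage(ft);        /* any later init step that can fail */
  if (err)
          goto destroy_fgs_hash;  /* was missing before this fix */

  return ft;

  destroy_fgs_hash:
          rhltable_destroy(&ft->fgs_hash);
  free_ft:
          kfree(ft);
          return ERR_PTR(err);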
Fixes: 693c6883bb ("net/mlx5: Add hash table for flow groups in flow table")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The switch ASIC has a limited capacity of physical ('flavour physical'
in devlink terminology) ports that it can support. While each system is
brought up with a different number of ports, this number can be
increased via splitting up to the ASIC's limit.
Expose physical ports as a devlink resource so that user space will have
visibility into the maximum number of ports that can be supported and the
current occupancy.
In addition, add a "Generic Resources" section in devlink-resource
documentation so the different drivers will be aligned by the same resource
name when exposing to user space.
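A registration sketch using the standard devlink resource API (the resource
name, id and the ASIC limit shown are illustrative; in practice the limit is
queried from the hardware and an occupancy getter is registered as well):

  static int mlxsw_sp_resources_ports_register(struct devlink *devlink)
  {
          struct devlink_resource_size_params size_params;
          u64 max_ports = 128;    /* illustrative ASIC limit */

          devlink_resource_size_params_init(&size_params, max_ports,
                                            max_ports, 1,
                                            DEVLINK_RESOURCE_UNIT_ENTRY);
          return devlink_resource_register(devlink, "physical_ports",
                                           max_ports,
                                           MLXSW_SP_RESOURCE_PORTS,
                                           DEVLINK_RESOURCE_ID_PARENT_TOP,
                                           &size_params);
  }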
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit adds support for HTB offload in the mlx5e driver.
Performance:
NIC: Mellanox ConnectX-6 Dx
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (24 cores with HT)
100 Gbit/s line rate, 500 UDP streams @ ~200 Mbit/s each
48 traffic classes, flower used for steering
No shaping (rate limits set to 4 Gbit/s per TC) - checking for max
throughput.
Baseline: 98.7 Gbps, 8.25 Mpps
HTB: 6.7 Gbps, 0.56 Mpps
HTB offload: 95.6 Gbps, 8.00 Mpps
Limitations:
1. 256 leaf nodes, 3 levels of depth.
2. Granularity for ceil is 1 Mbit/s. Rates are converted to weights, and
the bandwidth is split among the siblings according to these weights.
Other parameters for classes are not supported.
Ethtool statistics support for QoS SQs is also added. The counters are
called qos_txN_*, where N is the QoS queue number (starting from 0; the
numbering is separate from the normal SQs), and * is the counter name
(the counters are the same as for the normal SQs).
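A minimal usage sketch (assuming an iproute2 build with the HTB 'offload'
flag; the device name and rates are illustrative):
$ tc qdisc replace dev eth0 root handle 1: htb offload
$ tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit ceil 200mbit
$ ethtool -S eth0 | grep qos_tx0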
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Support changing the state of the SF port's function through devlink.
When activating the SF port's function, enable the vHCA in the device,
followed by adding its auxiliary device.
When deactivating the SF port's function, delete its auxiliary device,
followed by disabling the vHCA.
Port function attribute get/set callbacks are invoked with the devlink
instance lock held. Such callbacks need to synchronize with the SF port
table getting disabled, e.g. via the SR-IOV sysfs callback. The callbacks
synchronize with the table-disable context by holding the table refcount.
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:88:88 state inactive opstate detached
$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active
$ devlink port show ens2f0npf0sf88 -jp
{
"port": {
"pci/0000:06:00.0/32768": {
"type": "eth",
"netdev": "ens2f0npf0sf88",
"flavour": "pcisf",
"controller": 0,
"pfnum": 0,
"sfnum": 88,
"external": false,
"splittable": false,
"function": {
"hw_addr": "00:00:00:00:88:88",
"state": "active",
"opstate": "attached"
}
}
}
}
On port function activation, an auxiliary device is created, as in the
below example.
$ devlink dev show
devlink dev show auxiliary/mlx5_core.sf.4
$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
To handle SF port management outside of the eswitch as an independent
software layer, introduce eswitch notifier APIs so that mlx5 upper
layers that wish to support SF port management in switchdev mode can
perform their task whenever the eswitch mode is set to switchdev or
before the eswitch is disabled.
Initialize the SF port table on such an eswitch event.
Add SF port add and delete functionality in switchdev mode.
Destroy all SF ports when eswitch is disabled.
Expose SF port add and delete to user via devlink commands.
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
or by its unique port index:
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
$ devlink port show ens2f0npf0sf88 -jp
{
"port": {
"pci/0000:06:00.0/32768": {
"type": "eth",
"netdev": "ens2f0npf0sf88",
"flavour": "pcisf",
"controller": 0,
"pfnum": 0,
"sfnum": 88,
"external": false,
"splittable": false,
"function": {
"hw_addr": "00:00:00:00:00:00",
"state": "inactive",
"opstate": "detached"
}
}
}
}
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add helpers to enable/disable an eswitch port, register its devlink port
and load its representor.
Signed-off-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Prepare the eswitch to handle SF vports during:
(a) querying of eswitch functions
(b) egress ACL creation
(c) total vports calculation (accounting for SF vports)
Assign a dedicated placeholder for SF vports and their representors.
They are placed after the VF vports and before the ECPF vports, as below:
[PF,VF0,...,VFn,SF0,...SFm,ECPF,UPLINK].
Change functions to map SF's vport numbers to indices when
accessing the vports or representors arrays, and vice versa.
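A sketch of the mapping for the layout above (the base-number helper is
illustrative; num_vfs comes from the eswitch function bookkeeping):

  /* index layout: [PF, VF0..VFn, SF0..SFm, ECPF, UPLINK] */
  static int sf_vport_num_to_index(struct mlx5_eswitch *esw, u16 vport_num)
  {
          /* SF indices start right after the PF and all VFs */
          return 1 + esw->esw_funcs.num_vfs +
                 (vport_num - sf_vport_base_num(esw));
  }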
Signed-off-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add an auxiliary device driver for the mlx5 subfunction auxiliary device.
An mlx5 subfunction is similar to a PCI PF and VF. For a subfunction,
an auxiliary device is created.
As a result, when the mlx5 SF auxiliary device binds to the driver,
its netdev and rdma device are created; they appear as:
$ ls -l /sys/bus/auxiliary/devices/
mlx5_core.sf.4 -> ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_core.sf.4
$ ls -l /sys/class/net/eth1/device
/sys/class/net/eth1/device -> ../../../mlx5_core.sf.4
$ cat /sys/bus/auxiliary/devices/mlx5_core.sf.4/sfnum
88
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4
$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
$ rdma link show mlx5_0/1
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88
$ rdma dev show
8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
In the future, the devlink device instance name will be adapted to carry
an sfnum annotation, using either an alias or the devlink instance name,
as described in RFC [1].
[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Introduce API to add and delete an auxiliary device for an SF.
Each SF has its own dedicated window in the PCI BAR 2.
An SF device is similar to a PCI PF and VF in that it supports multiple
classes of devices, such as net, rdma and vdpa.
The SF device will be added or removed in a subsequent patch, during the
SF devlink port function state change command.
A subfunction device exposes the user-supplied subfunction number, which
will be further used by systemd/udev to provide a deterministic name for
its netdevice and rdma device.
An mlx5 subfunction auxiliary device example:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:08:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:88:88 state inactive opstate detached
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88 state active
On activation,
$ ls -l /sys/bus/auxiliary/devices/
mlx5_core.sf.4 -> ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_core.sf.4
$ cat /sys/bus/auxiliary/devices/mlx5_core.sf.4/sfnum
88
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
vhca state events indicate a change in the state of the vhca that may
occur due to SF allocation, deallocation, or enabling/disabling of the
SF HCA.
Introduce a vhca state event handler which will be used by the SF devlink
port manager and the SF hardware id allocator in subsequent patches
to act on the event.
This enables a single entity to subscribe, query and rearm the event
for a function.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This reverts commit fbdd0049d9.
Due to the commit in the fixes tag, netdevice events were received only in
one net namespace of mlx5_core_dev. Because of this, when netdevice events
arrive in a net namespace other than the net namespace of mlx5_core_dev,
they are missed.
This results in an empty GID table due to the RDMA device being detached
from its net device.
Hence, revert back to receiving netdevice events in all net namespaces to
restore RDMA functionality in non-init_net net namespaces. The
deadlock will have to be addressed in another patch.
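The two registration styles involved (both are standard netdev APIs;
roce->nb is illustrative):

  /* per-namespace: events are delivered only for 'net' */
  err = register_netdevice_notifier_net(net, &roce->nb);

  /* global, restored by this revert: events from every namespace */
  err = register_netdevice_notifier(&roce->nb);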
Fixes: fbdd0049d9 ("RDMA/mlx5: Fix devlink deadlock on net namespace deletion")
Link: https://lore.kernel.org/r/20210117092633.10690-1-leon@kernel.org
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-01-16
1) Extend atomic operations to the BPF instruction set along with x86-64 JIT support,
that is, atomic{,64}_{xchg,cmpxchg,fetch_{add,and,or,xor}}, from Brendan Jackman.
2) Add support for using kernel module global variables (__ksym externs in BPF
programs) retrieved via module's BTF, from Andrii Nakryiko.
3) Generalize BPF stackmap's buildid retrieval and add support to have buildid
stored in mmap2 event for perf, from Jiri Olsa.
4) Various fixes for cross-building BPF selftests out-of-tree which will then
unblock wider automated testing on ARM hardware, from Jean-Philippe Brucker.
5) Allow to retrieve SOL_SOCKET opts from sock_addr progs, from Daniel Borkmann.
6) Clean up driver's XDP buffer init and split into two helpers to init per-
descriptor and non-changing fields during processing, from Lorenzo Bianconi.
7) Minor misc improvements to libbpf & bpftool, from Ian Rogers.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits)
perf: Add build id data in mmap2 event
bpf: Add size arg to build_id_parse function
bpf: Move stack_map_get_build_id into lib
bpf: Document new atomic instructions
bpf: Add tests for new BPF atomic operations
bpf: Add bitwise atomic instructions
bpf: Pull out a macro for interpreting atomic ALU operations
bpf: Add instructions for atomic_[cmp]xchg
bpf: Add BPF_FETCH field / create atomic_fetch_add instruction
bpf: Move BPF_STX reserved field check into BPF_STX verifier code
bpf: Rename BPF_XADD and prepare to encode other atomics in .imm
bpf: x86: Factor out a lookup table for some ALU opcodes
bpf: x86: Factor out emission of REX byte
bpf: x86: Factor out emission of ModR/M for *(reg + off)
tools/bpftool: Add -Wall when building BPF programs
bpf, libbpf: Avoid unused function warning on bpf_tail_call_static
selftests/bpf: Install btf_dump test cases
selftests/bpf: Fix installation of urandom_read
selftests/bpf: Move generated test files to $(TEST_GEN_FILES)
selftests/bpf: Fix out-of-tree build
...
====================
Link: https://lore.kernel.org/r/20210116012922.17823-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
MLX5_IPSEC_DEV() is always defined, so there is no need to protect it under
the config flag CONFIG_MLX5_EN_IPSEC, especially in the slow path.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Feature check functions are in the TX fast path of all SKBs, not only
IPsec traffic.
Move the IPsec feature check function into a header and make it inline.
Use a stub, and clean up the config flag condition in the Eth main driver
file.
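A sketch of the header pattern (the name and signature approximate the
patch): a real inline under the config flag, a trivial stub otherwise, so
the fast path pays nothing when IPsec is compiled out.

  #ifdef CONFIG_MLX5_EN_IPSEC
  static inline netdev_features_t
  mlx5e_ipsec_feature_check(struct sk_buff *skb, netdev_features_t features)
  {
          /* inspect the SKB's SA state and mask unsupported features */
          return features;
  }
  #else
  static inline netdev_features_t
  mlx5e_ipsec_feature_check(struct sk_buff *skb, netdev_features_t features)
  {
          return features;        /* stub: IPsec compiled out */
  }
  #endif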
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All IPsec logic should be wrapped under the compile flag,
including its checksum logic.
Introduce an inline function in the IPsec datapath header,
with a corresponding stub.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The zone member is of type u16, so there is no reason to apply
the zone mask on it. This also matches the call to set a
match in other places, which don't need and don't apply the mask.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The miss_rule and prio_s args are not referenced before being assigned,
so there is no need to init them.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
No need to pass a zero spec to mlx5_add_flow_rules(), as the
function can handle a null spec.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Miss path handling of tc multi chain filters (i.e. filters that are
defined on chain > 0) requires the hardware to communicate to the
driver the last chain that was processed. This is possible only when
the hardware is capable of performing the combination of modify header
and forward to table actions. Currently, if the hardware is missing
this capability then the driver only offloads rules that are defined
on tc chain 0 prio 1. However, this restriction can be relaxed because
packets that miss from chain 0 are processed through all the
priorities by tc software.
Allow the offload of all the supported priorities for chain 0, even
when the hardware is not capable of performing the modify header and
goto table actions.
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use tunnel_stateless_mpls_over_udp instead of
MLX5_FLEX_PROTO_CW_MPLS_UDP since new devices have native support for
mpls over udp and do not rely on flex parser.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
esw->qos.enabled will only be true if both MLX5_CAP_GEN(dev, qos) and
MLX5_CAP_QOS(dev, esw_scheduling) are true. Therefore, remove them from
the condition in and rely only on esw->qos.enabled.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As of commit 457e20d659 ("mlxsw: spectrum_switchdev: Avoid returning
errors in commit phase"), the mlxsw driver performs the VLAN object
offloading during the prepare phase. So conversion just seems to be a
matter of removing the code that was running in the commit phase.
Ido Schimmel explains that the reason why mlxsw_sp_span_respin is called
unconditionally is because the bridge driver will ignore -EOPNOTSUPP and
actually add the VLAN on the bridge device - see commit 9c86ce2c1a
("net: bridge: Notify about bridge VLANs") and commit ea47217519
("mlxsw: spectrum_switchdev: Ignore bridge VLAN events"). Since the VLAN
was successfully added on the bridge device, mlxsw_sp_span_respin_work()
should be able to resolve the egress port for a packet that is mirrored
to a gre tap and passes through the bridge device. Therefore keep the
logic as it is.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the introduction of the switchdev API, port attributes were
transmitted to drivers for offloading using a two-step transactional
model, with a prepare phase that was supposed to catch all errors, and a
commit phase that was supposed to never fail.
Some classes of failures can never be avoided, like hardware access, or
memory allocation. In the latter case, merely attempting to move the
memory allocation to the preparation phase makes it impossible to avoid
memory leaks, since commit 91cf8eceff ("switchdev: Remove unused
transaction item queue") which has removed the unused mechanism of
passing on the allocated memory between one phase and another.
It is time we admit that separating the preparation from the commit
phase is something that is best left for the driver to decide, and not
something that should be baked into the API, especially since there are
no switchdev callers that depend on this.
This patch removes the struct switchdev_trans member from switchdev port
attribute notifier structures, and converts drivers to not look at this
member.
In part, this patch contains a revert of my previous commit 2e554a7a5d
("net: dsa: propagate switchdev vlan_filtering prepare phase to
drivers").
For the most part, the conversion was trivial except for:
- Rocker's world implementation based on Broadcom OF-DPA had an odd
implementation of ofdpa_port_attr_bridge_flags_set. The conversion was
done mechanically, by pasting the implementation twice, then only
keeping the code that would get executed during prepare phase on top,
then only keeping the code that gets executed during the commit phase
on bottom, then simplifying the resulting code until this was obtained.
- DSA's offloading of STP state, bridge flags, VLAN filtering and
multicast router could be converted right away. But the ageing time
could not, so a shim was introduced and this was left for a further
commit.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Reviewed-by: Linus Walleij <linus.walleij@linaro.org> # RTL8366RB
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the introduction of the switchdev API, port objects were
transmitted to drivers for offloading using a two-step transactional
model, with a prepare phase that was supposed to catch all errors, and a
commit phase that was supposed to never fail.
Some classes of failures can never be avoided, like hardware access, or
memory allocation. In the latter case, merely attempting to move the
memory allocation to the preparation phase makes it impossible to avoid
memory leaks, since commit 91cf8eceff ("switchdev: Remove unused
transaction item queue") which has removed the unused mechanism of
passing on the allocated memory between one phase and another.
It is time we admit that separating the preparation from the commit
phase is something that is best left for the driver to decide, and not
something that should be baked into the API, especially since there are
no switchdev callers that depend on this.
This patch removes the struct switchdev_trans member from switchdev port
object notifier structures, and converts drivers to not look at this
member.
Where driver conversion is trivial (like in the case of the Marvell
Prestera driver, NXP DPAA2 switch, TI CPSW, and Rocker drivers), it is
done in this patch.
Where driver conversion needs more attention (DSA, Mellanox Spectrum),
the conversion is left for subsequent patches and here we only fake the
prepare/commit phases at a lower level, just not in the switchdev
notifier itself.
Where the code has a natural structure that is best left alone as a
preparation and a commit phase (as in the case of the Ocelot switch),
that structure is left in place, just made to not depend upon the
switchdev transactional model.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The call path of a switchdev VLAN addition to the bridge looks something
like this today:
nbp_vlan_init
| __br_vlan_set_default_pvid
| | |
| | br_afspec |
| | | |
| | v |
| | br_process_vlan_info |
| | | |
| | v |
| | br_vlan_info |
| | / \ /
| | / \ /
| | / \ /
| | / \ /
v v v v v
nbp_vlan_add br_vlan_add ------+
| ^ ^ | |
| / | | |
| / / / |
\ br_vlan_get_master/ / v
\ ^ / / br_vlan_add_existing
\ | / / |
\ | / / /
\ | / / /
\ | / / /
\ | / / /
v | | v /
__vlan_add /
/ | /
/ | /
v | /
__vlan_vid_add | /
\ | /
v v v
br_switchdev_port_vlan_add
The ranges UAPI was introduced to the bridge in commit bdced7ef78
("bridge: support for multiple vlans and vlan ranges in setlink and
dellink requests") (Jan 10 2015). But the VLAN ranges (parsed in br_afspec)
have always been passed one by one, through struct bridge_vlan_info
tmp_vinfo, to br_vlan_info. So the range never went too far in depth.
Then Scott Feldman introduced the switchdev_port_bridge_setlink function
in commit 47f8328bb1 ("switchdev: add new switchdev bridge setlink").
That marked the introduction of the SWITCHDEV_OBJ_PORT_VLAN, which made
full use of the range. But switchdev_port_bridge_setlink was called like
this:
br_setlink
-> br_afspec
-> switchdev_port_bridge_setlink
Basically, the switchdev and the bridge code were not tightly integrated.
Then commit 41c498b935 ("bridge: restore br_setlink back to original")
came, and switchdev drivers were required to implement
.ndo_bridge_setlink = switchdev_port_bridge_setlink for a while.
In the meantime, commits such as 0944d6b5a2 ("bridge: try switchdev op
first in __vlan_vid_add/del") finally made switchdev penetrate the
br_vlan_info() barrier and start to develop the call path we have today.
But remember, br_vlan_info() still receives VLANs one by one.
Then Arkadi Sharshevsky refactored the switchdev API in 2017 in commit
29ab586c3d ("net: switchdev: Remove bridge bypass support from
switchdev") so that drivers would not implement .ndo_bridge_setlink any
longer. The switchdev_port_bridge_setlink also got deleted.
This refactoring removed the parallel bridge_setlink implementation from
switchdev, and left the only switchdev VLAN objects to be the ones
offloaded from __vlan_vid_add (basically RX filtering) and __vlan_add
(the latter coming from commit 9c86ce2c1a ("net: bridge: Notify about
bridge VLANs")).
That is to say, today the switchdev VLAN object ranges are not used in
the kernel. Refactoring the above call path is a bit complicated, when
the bridge VLAN call path is already a bit complicated.
Let's go off and finish the job of commit 29ab586c3d by deleting the
bogus iteration through the VLAN ranges from the drivers. Some aspects
of this feature never made too much sense in the first place. For
example, what is a range of VLANs all having the BRIDGE_VLAN_INFO_PVID
flag supposed to mean, when a port can obviously have a single pvid?
This particular configuration _is_ denied as of commit 6623c60dc2
("bridge: vlan: enforce no pvid flag in vlan ranges"), but from an API
perspective, the driver still has to play pretend, and only offload the
vlan->vid_end as pvid. And the addition of a switchdev VLAN object can
modify the flags of another, completely unrelated, switchdev VLAN
object! (a VLAN that is PVID will invalidate the PVID flag from whatever
other VLAN had previously been offloaded with switchdev and had that
flag. Yet switchdev never notifies about that change, drivers are
supposed to guess).
Nonetheless, having a VLAN range in the API makes error handling look
scarier than it really is - unwinding on errors and all of that.
When in reality, no one really calls this API with more than one VLAN.
It is all unnecessary complexity.
And despite appearing pretentious (two-phase transactional model and
all), the switchdev API is really sloppy because the VLAN addition and
removal operations are not paired with one another (you can add a VLAN
100 times and delete it just once). The bridge notifies through
switchdev of a VLAN addition not only when the flags of an existing VLAN
change, but also when nothing changes. There are switchdev drivers out
there who don't like adding a VLAN that has already been added, and
those checks don't really belong at driver level. But the fact that the
API contains ranges is yet another factor that prevents this from being
addressed in the future.
Of the existing switchdev pieces of hardware, it appears that only
Mellanox Spectrum supports offloading more than one VLAN at a time,
through mlxsw_sp_port_vlan_set. I have kept that code internal to the
driver, because there is some more bookkeeping that makes use of it, but
I deleted it from the switchdev API. But since the switchdev support for
ranges has already been de facto deleted by a Mellanox employee and
nobody noticed for 4 years, I'm going to assume it's not a biggie.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com> # switchdev and mlxsw
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Increase the critical threshold for the ASIC thermal zone from 110C to 140C
according to the system hardware requirements. All the supported ASICs
(Spectrum-1, Spectrum-2, Spectrum-3) can still be operational with an ASIC
temperature below 140C. With the old critical threshold value, the system
could perform an unjustified shutdown.
All the systems equipped with the above ASICs implement a thermal
protection mechanism at the firmware level, and the firmware can decide to
perform a system thermal shutdown while the temperature is still below
140C. So with the new threshold the system will not melt down, while the
thermal operating range will be aligned with the hardware abilities.
Fixes: 41e760841d ("mlxsw: core: Replace thermal temperature trips with defines")
Fixes: a50c1e3565 ("mlxsw: core: Implement thermal zone")
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Validate thresholds to avoid a single failure due to some transceiver
unreliability. Ignore the last readouts in case the warning temperature is
above the alarm temperature, since that can cause an unexpected thermal
shutdown. Stay with the previous values and refresh the thresholds within
the next iteration.
This is a rare scenario, but it was observed at a customer site.
Fixes: 6a79507cfe ("mlxsw: core: Extend thermal module with per QSFP module thermal zones")
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All UDP tunnel port management is now routed via udp_tunnel_nic
infra directly. Remove the old callbacks.
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When mlx5_create_flow_group() fails, ft->g should be
freed just like when kvzalloc() fails. The caller of
mlx5e_create_l2_table_groups() does not catch this
issue on failure, which leads to a memory leak.
Fixes: 33cfaaa8f3 ("net/mlx5e: Split the main flow steering table")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
mlx5e_create_ttc_table_groups() frees ft->g on failure of
kvzalloc(), but such a failure will be caught by its caller
in mlx5e_create_ttc_table() and ft->g will be freed again
in mlx5e_destroy_flow_table(). The same issue also occurs
in mlx5e_create_inner_ttc_table_groups(). Set ft->g to NULL after
kfree() to avoid the double free.
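The fix pattern, sketched:

  err_free_g:
          kfree(ft->g);
          ft->g = NULL;   /* mlx5e_destroy_flow_table() will kfree(ft->g)
                           * again on the unwind path; kfree(NULL) is a
                           * no-op, so poisoning the pointer makes the
                           * second free harmless */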
Fixes: 7b3722fa9e ("net/mlx5e: Support RSS for GRE tunneled packets")
Fixes: 33cfaaa8f3 ("net/mlx5e: Split the main flow steering table")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Prior to this patch, configuring the speed to 50G with autoneg off over
devices supporting 50G per lane failed.
Support for 50G per lane introduced a new set of link modes, on which the
driver always performed speed validation as if only legacy link modes
were configured. Fix the driver's speed validation to force setting
autoneg over 56G only if in a legacy link mode.
Fixes: 3d7cadae51 ("net/mlx5e: ethtool, Fix analysis of speed setting")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The sop_drop_qpn field in the CQE is used by two features: in SWITCHDEV
mode to restore the chain id in case of a miss, and in LEGACY mode to
support the skbedit mark action. In build RX skb, the skb mark field is
set regardless of the configured mode, which causes corruption of the
mark field in switchdev mode.
Fix by overriding the mark value back to 0 in the representor tc update
skb flow.
Fixes: 8f1e0b97cc ("net/mlx5: E-Switch, Mark miss packets with new chain id mapping")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Adding a VF VLAN ID for the first time, or after having cleared a
previously defined VLAN ID, works fine; however, attempting to change an
existing VF VLAN ID clears the rules on the firmware, but does not add
new rules for the new VF VLAN ID.
Fix this by changing the logic in function esw_acl_egress_lgcy_setup()
so that it will always configure egress rules.
Fixes: ea651a86d4 ("net/mlx5: E-Switch, Refactor eswitch egress acl codes")
Signed-off-by: Alaa Hleihel <alaa@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In case the WQE includes an inline header, the VLAN is inserted by the
driver even if VLAN offload is set. On a geneve over VLAN interface,
where the software parser is used, the SWP offsets should be updated
according to the added VLAN.
Fixes: e3cfc7e6b7 ("net/mlx5e: TX, Add geneve tunnel stateless offload support")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Connection counters may be shared for both directions when the counter
is used for connection aging purposes. However, if TC flow
accounting is enabled then a unique counter is required per direction.
Instantiate a unique counter per direction if the conntrack accounting
extension is enabled. Use a shared counter when the connection accounting
extension is disabled.
Fixes: 1edae2335a ("net/mlx5e: CT: Use the same counter for both directions")
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In multi-port mode, FW reports syndrome 0x2ea48 (invalid vhca_port_number)
if the port_num is not 1 or 2.
Fixes: 80f09dfc23 ("net/mlx5: Eswitch, enable RoCE loopback traffic")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Expose the firmware indication that it supports setting the eswitch uplink
state to "follow" (follow the physical link). Condition setting the eswitch
uplink admin-state on this capability bit. Older FW may not support
the uplink state setting.
Fixes: 7d0314b11c ("net/mlx5e: Modify uplink state on interface up/down")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Use kzalloc rather than kcalloc(1,...)
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
@@
- kcalloc(1,
+ kzalloc(
...)
// </smpl>
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the HW-specific modify header fields and logic to the STEv0 file
and use the new STE context callbacks.
Since the STEv0 and STEv1 modify action values are different, each
version has its own implementation.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Use the STE tx/rx actions per-device API: move the HW-specific
action apply logic from dr_ste to the STEv0 file. The STEv0 and
STEv1 action formats are different, so each version should
have its own implementation.
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Use the new setters and getters API for STEv0: move the HW-specific
setters and getters from dr_ste to the STEv0 file. Since the STEv0 and
STEv1 formats are different, each version should implement its own
setters and getters.
Also rename the remaining static functions to be without the mlx5 prefix.
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The action apply logic is device-specific per STE version;
moving it to the STE layer will allow implementing it for
both devices while keeping the DR upper layers the same.
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The lookup types are device-specific and should not be
exposed to the DR upper layers (matchers/tables).
Each HW STE version should keep them internal.
The lu_type size is updated to support larger lu_types as
required for STEv1.
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Merge similar DR_STE SET macros for better code reuse: the macros
DR_STE_SET_MASK_V and DR_STE_SET_TAG are merged to avoid creating
separate tag and bit_mask functions, which are usually the same.
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Check vport_cap only if a match on source gvmi is required.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Move the current STE match logic to a separate file.
This file will be used for the HW-specific STEv0.
Future patches will add functionality for v1 steering.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Split the STE builders functionality into the common part and
device-specific part. All the device-specific part (with 'v0' in
the function names) is accessed through the STE context structure.
Subsequent patches will have the device-specific logic moved to a
separate file.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Move some macros from dr_ste.c to the header - these macros
will be used by all the format-specific functions.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add a struct of device-specific callbacks for the STE layer below dr_ste.
Each device will implement its HW-specific functions, and common logic
from the DR code will access these functions through the new ste_ctx API.
More callbacks will follow in the subsequent patches.
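A sketch of the context structure (the member list is illustrative and
grows over the series; mlx5dr_ste_build/mlx5dr_match_param are existing DR
types):

  struct mlx5dr_ste_ctx {
          /* match-builder callbacks, one per STE lookup type */
          void (*build_eth_l2_src_dst_init)(struct mlx5dr_ste_build *sb,
                                            struct mlx5dr_match_param *mask);
          /* action-apply and setter/getter callbacks are added in
           * later patches of the series (parameters elided here) */
  };

  /* DR code selects a context by the device's STE version */
  struct mlx5dr_ste_ctx *mlx5dr_ste_get_ctx(u8 version);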
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Merge tag 'irq-core-2020-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"This is the second attempt after the first one failed miserably and
got zapped to unblock the rest of the interrupt related patches.
A treewide cleanup of interrupt descriptor (ab)use with all sorts of
racy accesses, inefficient and dysfunctional code. The goal is to
remove the export of irq_to_desc() to prevent these things from
creeping up again"
* tag 'irq-core-2020-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
genirq: Restrict export of irq_to_desc()
xen/events: Implement irq distribution
xen/events: Reduce irq_info:: Spurious_cnt storage size
xen/events: Only force affinity mask for percpu interrupts
xen/events: Use immediate affinity setting
xen/events: Remove disfunct affinity spreading
xen/events: Remove unused bind_evtchn_to_irq_lateeoi()
net/mlx5: Use effective interrupt affinity
net/mlx5: Replace irq_to_desc() abuse
net/mlx4: Use effective interrupt affinity
net/mlx4: Replace irq_to_desc() abuse
PCI: mobiveil: Use irq_data_get_irq_chip_data()
PCI: xilinx-nwl: Use irq_data_get_irq_chip_data()
NTB/msi: Use irq_has_action()
mfd: ab8500-debugfs: Remove the racy fiddling with irq_desc
pinctrl: nomadik: Use irq_has_action()
drm/i915/pmu: Replace open coded kstat_irqs() copy
drm/i915/lpe_audio: Remove pointless irq_to_desc() usage
s390/irq: Use irq_desc_kstat_cpu() in show_msi_interrupt()
parisc/irq: Use irq_desc_kstat_cpu() in show_interrupts()
...
Using the interrupt affinity mask for checking locality is not really
working well on architectures which support effective affinity masks.
The affinity mask is either the system wide default or set by user space,
but the architecture can or even must reduce the mask to the effective set,
which means that checking the affinity mask itself does not really tell
about the actual target CPUs.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20201210194044.876342330@linutronix.de
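For illustration, a hedged sketch of the kind of locality check this moves the driver to, consulting the effective affinity through the generic irq accessors (the helper name is an assumption, not the driver's):

```c
#include <linux/irq.h>
#include <linux/cpumask.h>

static bool eq_irq_is_local_to_cpu(unsigned int irqn, int cpu)
{
	struct irq_data *idata = irq_get_irq_data(irqn);

	/* Test the effective mask, i.e. where the IRQ can actually
	 * fire, rather than the requested affinity mask. */
	return idata &&
	       cpumask_test_cpu(cpu,
				irq_data_get_effective_affinity_mask(idata));
}
```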
No driver has any business with the internals of an interrupt
descriptor. Storing a pointer to it just to use yet another helper at the
actual usage site to retrieve the affinity mask is creative at best. Just
because C does not allow encapsulation does not mean that the kernel has no
limits.
Retrieve a pointer to the affinity mask itself and use that. It's still
using an interface which is usually not for random drivers, but definitely
less hideous than the previous hack.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20201210194044.769458162@linutronix.de
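A hedged sketch of the replacement pattern: cache the affinity mask pointer via irq_get_affinity_mask() instead of stashing the irq_desc and digging into its internals later (the surrounding struct is illustrative):

```c
#include <linux/irq.h>

struct eq_irq_info {               /* illustrative container */
	const struct cpumask *aff; /* was: struct irq_desc *desc */
};

static void eq_irq_info_init(struct eq_irq_info *info, unsigned int irqn)
{
	info->aff = irq_get_affinity_mask(irqn);
}
```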
Using the interrupt affinity mask for checking locality is not really
working well on architectures which support effective affinity masks.
The affinity mask is either the system wide default or set by user space,
but the architecture can or even must reduce the mask to the effective set,
which means that checking the affinity mask itself does not really tell
about the actual target CPUs.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20201210194044.672935978@linutronix.de
No driver has any business with the internals of an interrupt
descriptor. Storing a pointer to it just to use yet another helper at the
actual usage site to retrieve the affinity mask is creative at best. Just
because C does not allow encapsulation does not mean that the kernel has no
limits.
Retrieve a pointer to the affinity mask itself and use that. It's still
using an interface which is usually not for random drivers, but definitely
less hideous than the previous hack.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20201210194044.580936243@linutronix.de
In case the eXtended mezzanine is present on the system, use it for IPv4
router offload.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Set a profile option to instruct FW to use 1/2 of KVH for XLT cache, not
the whole one.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Upon route insertion and removal, possibly cached entries need to be
flushed from the XM cache. Extend the XM op context to carry the
information needed for the flush. Implement the flush in delayed work
since, for HW design reasons, there is a need to wait 50usec before the
flush can be done. If the same flush request arrives during this time,
consolidate it with the first one. Track these queued flushes in a
hashtable; a sketch of the scheme follows below.
v2:
* Fix GENMASK() high bit
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
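A minimal sketch of the consolidation scheme, assuming an rhashtable keyed by the flush parameters (all names and the key derivation are illustrative, not the mlxsw ones):

```c
#include <linux/jiffies.h>
#include <linux/rhashtable.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct xm_flush_node {
	struct rhash_head ht_node;
	struct delayed_work dw;
	u32 key;	/* e.g. derived from virtual router + prefix */
};

static const struct rhashtable_params xm_flush_ht_params = {
	.key_offset  = offsetof(struct xm_flush_node, key),
	.key_len     = sizeof(u32),
	.head_offset = offsetof(struct xm_flush_node, ht_node),
};

/* Queue a flush ~50usec out; a duplicate request arriving within that
 * window finds the pending node and is consolidated into it. The work
 * handler (not shown) performs the flush, removes the node from the
 * hashtable and frees it. */
static int xm_flush_schedule(struct rhashtable *ht, u32 key,
			     work_func_t flush_fn)
{
	struct xm_flush_node *node;
	int err;

	node = rhashtable_lookup_fast(ht, &key, xm_flush_ht_params);
	if (node)
		return 0; /* already queued: consolidate with it */

	node = kzalloc(sizeof(*node), GFP_KERNEL);
	if (!node)
		return -ENOMEM;
	node->key = key;
	INIT_DELAYED_WORK(&node->dw, flush_fn);
	err = rhashtable_insert_fast(ht, &node->ht_node, xm_flush_ht_params);
	if (err) {
		kfree(node);
		return err;
	}
	schedule_delayed_work(&node->dw, usecs_to_jiffies(50));
	return 0;
}
```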
The RLPMCE register allows disabling the LPM cache. The setting can be
changed on the fly.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The RLCMLD register is used to bulk delete the XLT-LPM cache ML entries.
This can be used by SW when L is increased or decreased, as entries
with old ML values then need to be removed.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a table that assigns an L-value per M-index. The L is always
the biggest among the currently inserted prefixes. Set up a hashtable
to track the M-index information and the prefixes that are related to
it. Ensure the L-value is always correctly set.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The XRMT configures the M-Table for the XLT-LPM.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During the router init flow, call into the XM code and initialize a
couple of items needed for XM functionality:
1) Query the capabilities and sizes. Check the XM device id.
2) Initialize the M-value. Note that currently the M-value is fixed at
16 for IPv4. In the future this may change to better cover the actually
inserted routes.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The XLTQ is used to query HW for XM-related info.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The RXLTM configures and selects the M for the XM lookups.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
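For flavor, a hedged sketch in mlxsw's usual register-definition style; the ID, length, offsets and field widths below are placeholders, not the real RXLTM layout:

```c
#define MLXSW_REG_RXLTM_ID 0x8050	/* placeholder ID */
#define MLXSW_REG_RXLTM_LEN 0x08	/* placeholder length */

MLXSW_REG_DEFINE(rxltm, MLXSW_REG_RXLTM_ID, MLXSW_REG_RXLTM_LEN);

/* reg_rxltm_m0_val_v4
 * M value selected for IPv4 XM lookups (placeholder offset/width).
 */
MLXSW_ITEM32(reg, rxltm, m0_val_v4, 0x00, 0, 8);

static inline void mlxsw_reg_rxltm_pack(char *payload, u8 m0_val_v4)
{
	MLXSW_REG_ZERO(rxltm, payload);
	mlxsw_reg_rxltm_m0_val_v4_set(payload, m0_val_v4);
}
```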
Use the info stored in the bus_info struct about the ports connected to
the eXtended mezzanine and don't expose them.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The output of the boardinfo command was extended to contain information
about the XM: whether it is present and, if so, which local ports are
used for the connection. Parse this info and store it in the bus_info
passed up to the driver.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to offload entries to XM, implement a set of low-level
functions to work with LPM trees in XM and also to pack and write
FIB entries into XM.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The RXLTE enables XLT (eXtended Lookup Table) LPM lookups if a capable
XM is present on the system.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The XMDR allows direct access to the XM device via the switch.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xdp_return_frame_bulk() needs to pass an xdp_buff
to __xdp_return().
strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.
Conflicts:
net/netfilter/nf_tables_api.c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In case an error CQE was found while polling the TX CQ, the QP is in
error state and all posted WQEs will generate error CQEs without any
data transmitted. Fix this by reopening the channels, using the same
method used for TX timeout handling.
In addition, add some more info on the error CQE and WQE for debug.
Fixes: bd2f631d7c ("net/mlx4_en: Notify user when TX ring in error state")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a restarting state flag to avoid scheduling another restart task
while such a task is already running; a minimal sketch of the guard
follows below. Rename the task from watchdog_task to restart_task to
better fit its role.
Fixes: 1e338db56e ("mlx4_en: Fix a race at restart task")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
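A minimal sketch of the guard, assuming a test_and_set_bit()-style flag (the flag and field names are illustrative, not necessarily the mlx4_en ones):

```c
static void en_queue_restart(struct mlx4_en_priv *priv)
{
	/* The bit is cleared by the restart task when it finishes, so
	 * at most one restart is ever in flight. */
	if (test_and_set_bit(EN_STATE_FLAG_RESTARTING, &priv->state))
		return;
	queue_work(priv->mdev->workqueue, &priv->restart_task);
}
```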
Replace a comma between expression statements by a semicolon.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous patches added support for a VxLAN device enslaved to an
802.1ad bridge in the Spectrum-2 ASIC and vetoed it in Spectrum-1.
Do not veto VxLAN with an 802.1ad bridge.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The implementation of Q-in-VNI differs between ASIC types; this set
adds support only for Spectrum-2.
Return an error when trying to create a VxLAN device and enslave it to
an 802.1ad bridge in Spectrum-1.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently mlxsw_sp_switchdev_vxlan_vlan_add() always calls
mlxsw_sp_bridge_8021q_vxlan_join() because VLANs were only ever added to
a VLAN-filtering bridge, which could only be an 802.1q bridge.
This set adds support for VxLAN with an 802.1ad bridge, so a
VLAN-filtering bridge is no longer necessarily 802.1q.
Call ops->vxlan_join(), so that mlxsw_sp_bridge_802{1q, 1ad}_vxlan_join()
will be called according to the bridge type.
This is needed to ensure that VxLAN with an 802.1ad bridge will be
vetoed in Spectrum-1 by the next patch.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Spectrum-2, the default setting is not to push a VLAN to the
decapsulated packet. This is controlled by SPVTR.ipvid_mode.
Set SPVTR.ipvid_mode to always push a VLAN.
Without this setting, Spectrum-2 uses the VLAN tag of the decapsulated
packet for bridging.
In addition, set the SPVID register to use the EtherType saved in
mlxsw_sp_nve_config when a VLAN is pushed for the NVE tunnel.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Declare mlxsw_sp_ethtype_to_sver_type() in spectrum.h to enable using it
in other files.
It will be used in the next patch to map between an EtherType and the
relevant value configured by the SVER register.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add EtherType field to mlxsw_sp_nve_config struct.
Set EtherType according to mlxsw_sp_nve_params.ethertype.
Pass 'mlxsw_sp_nve_params' instead of 'mlxsw_sp_nve_params->dev' to the
function which initializes the mlxsw_sp_nve_config struct, so it knows
which EtherType to use.
This field is needed to configure which EtherType will be used when
VLAN is pushed at ingress of the tunnel port.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an EtherType field to the mlxsw_sp_nve_params struct.
Set it when a VxLAN device is added to a bridge device.
This field is needed to configure which EtherType will be used when a
VLAN is pushed at ingress of the tunnel port.
Use ETH_P_8021Q for a tunnel port enslaved to 802.1d and 802.1q bridges.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code in mlxsw_sp_bridge_8021q_vxlan_join() can also be used for an
802.1ad bridge.
Move the code to a new function called
mlxsw_sp_bridge_vlan_aware_vxlan_join() and call it from
mlxsw_sp_bridge_8021q_vxlan_join() to enable code reuse.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the spvid_tport field, which indicates whether the port is a tunnel
port. When spvid_tport is true, the local_port field is expected to
contain a tunnel port type.
It will be used to configure which EtherType is used when a VLAN is
pushed at ingress for the tunnel port.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SPVTR register configures the VLAN mode of the port to enable VLAN
stacking.
It will be used to configure VxLAN to push a VLAN to the decapsulated
packet. Without this setting, Spectrum-2 uses the VLAN tag of the
decapsulated packet for bridging.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the SFN, TNUMT and TNPC registers use separate enums for
tunnel_port.
Create one enum with a neutral name and use it.
Remove the enums that are no longer required.
The next patches add two more registers that contain a tunnel_port
field; the new enum can be used for them as well.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
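A hedged sketch of the consolidated enum; the member set is inferred from typical tunnel-port users and is illustrative:

```c
/* One neutral tunnel_port enum shared by SFN, TNUMT, TNPC and the two
 * registers added later in the series (illustrative member names). */
enum mlxsw_reg_tunnel_port {
	MLXSW_REG_TUNNEL_PORT_NVE,
	MLXSW_REG_TUNNEL_PORT_VPLS,
	MLXSW_REG_TUNNEL_PORT_FLEX_TUNNEL0,
	MLXSW_REG_TUNNEL_PORT_FLEX_TUNNEL1,
};
```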
Create a function to fill the fields of struct mlx5e_create_cq_param
based on a channel. The purpose is code reuse between normal CQs, XSK
CQs and the upcoming QoS CQs.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
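A minimal sketch of the helper's shape, under the assumption that the param struct carries napi, channel stats, NUMA node and index (field names are assumptions, not verified against the tree):

```c
struct mlx5e_create_cq_param {
	struct napi_struct *napi;
	struct mlx5e_ch_stats *ch_stats;
	int node;
	int ix;
};

/* Fill the CQ creation parameters from a channel, so normal, XSK and
 * the upcoming QoS CQs can share one open path. */
static void mlx5e_build_create_cq_param(struct mlx5e_create_cq_param *ccp,
					struct mlx5e_channel *c)
{
	*ccp = (struct mlx5e_create_cq_param) {
		.napi     = &c->napi,
		.ch_stats = c->stats,
		.node     = cpu_to_node(c->cpu),
		.ix       = c->ix,
	};
}
```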
Use the new FW caps to advertise for ip-in-ip tunnel support separately
for RX and TX.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Fix smatch warnings:
drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_lgcy.c:105 esw_acl_egress_lgcy_setup() warn: passing zero to 'PTR_ERR'
drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c:177 esw_acl_egress_ofld_setup() warn: passing zero to 'PTR_ERR'
drivers/net/ethernet/mellanox/mlx5/core/esw/acl/ingress_lgcy.c:184 esw_acl_ingress_lgcy_setup() warn: passing zero to 'PTR_ERR'
drivers/net/ethernet/mellanox/mlx5/core/esw/acl/ingress_ofld.c:262 esw_acl_ingress_ofld_setup() warn: passing zero to 'PTR_ERR'
esw_acl_table_create() never returns NULL, so the
NULL test should be removed.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
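The corrected pattern looks roughly like this (argument names are illustrative): since esw_acl_table_create() returns either a valid pointer or an ERR_PTR()-encoded error, only IS_ERR() needs checking.

```c
acl_table = esw_acl_table_create(esw, vport, ns_type, table_size);
if (IS_ERR(acl_table)) {
	err = PTR_ERR(acl_table);
	goto out;
}
/* No NULL check: the function never returns NULL, and a NULL test
 * would make PTR_ERR() return zero -- the smatch complaint above. */
```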
Currently, when more than one EQ shares an IRQ and an interrupt arrives
on that IRQ, all the EQs sharing it are armed. This is done regardless
of whether an EQ has an EQE.
When multiple EQs share an IRQ, one or more of them can have valid
EQEs.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Since kvzalloc() zero-initializes the allocated memory, it is not
necessary to initialize it once again.
Fixes: 11b717d615 ("net/mlx5: E-Switch, Get reg_c0 value on CQE")
Signed-off-by: Zhu Yanjun <yanjunz@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
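Illustrative before/after of the redundancy (buffer name is hypothetical):

```c
buf = kvzalloc(inlen, GFP_KERNEL);
if (!buf)
	return -ENOMEM;
/* memset(buf, 0, inlen); -- redundant: kvzalloc() already zeroes */
```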
Transmitted packet timestamping accuracy can be improved when using
timestamp from the port, instead of packet CQE creation timestamp, as
it better reflects the actual time of a packet's transmit.
TX port timestamping is supported starting from ConnectX-6 Dx hardware.
Although at the original completion, only CQE timestamp can be attached,
we are able to get TX port timestamping via an additional completion over
a special CQ associated with the SQ (in addition to the regular CQ).
Driver to ignore the original packet completion timestamp, and report
back the timestamp of the special CQ completion. If the absolute
timestamp diff between the two completions is greater than 1/128 second,
ignore the TX port timestamp, as its jitter is too big; a sketch of this
check follows below.
No skb will be generated from the extra completion.
Allocate an additional CQ per ptpsq to receive the TX port timestamp.
Driver to hold an skb FIFO in order to map between a transmitted skb
and the two expected completions. When using a ptpsq, hold a double
refcount on the skb to guarantee it will not get released before both
completions arrive.
Expose dedicated counters of the ptp additional CQ and connect it to the
TX health reporter.
This patch improves the TX hardware timestamping offset to be less than
40ns at a 100Gbps line rate, compared to 600ns before.
With that, the HW is compliant with G.8273.2 class C, allowing Linux
systems to be deployed at the 5G telco edge, where this standard is a
must.
Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
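A hedged sketch of the 1/128-second sanity check described above (helper name and exact comparison are assumptions):

```c
#include <linux/ktime.h>

static bool ptp_port_ts_usable(u64 cqe_ts_ns, u64 port_ts_ns)
{
	u64 diff = port_ts_ns > cqe_ts_ns ? port_ts_ns - cqe_ts_ns :
					    cqe_ts_ns - port_ts_ns;

	/* If the two completions disagree by more than 1/128 second,
	 * the port timestamp jitter is too big; fall back to CQE ts. */
	return diff <= NSEC_PER_SEC / 128;
}
```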
Add TX PTP port object support for better TX timestamping accuracy.
Currently, driver supports CQE based TX port timestamp. Device
also offers TX port timestamp, which has less jitter and better
reflects the actual time of a packet's transmit.
Define new driver layout called ptpsq, on which driver will create
SQs that will support TX port timestamp for their transmitted packets.
Driver to identify PTP TX skbs and steer them to these dedicated SQs
as part of the select queue ndo.
Driver to hold a ptpsq per TC and report them via
netif_set_real_num_tx_queues().
Add support for all needed functionality in order to xmit and poll
completions received via ptpsq.
Add ptpsq to the TX reporter recover, diagnose and dump methods.
Creation of ptpsqs is disabled by default, and can be enabled via
tx_port_ts private flag.
This patch steers all timestamp-related packets to a ptpsq, but does
not yet enable port timestamp support for it. That support will be
added in the following patch.
Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
MLX5E_RX_ERR_CQE Macro is used only in data-path, move it to the
appropriate header file.
Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The SW group counter update function aggregates sw stats out of many
mlx5e_*_stats residing in a given mlx5e_channel_stats struct.
Split the function into a few helper functions.
This will be used later in the series to calculate specific
mlx5e_*_stats which are not defined inside mlx5e_channel_stats.
Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The skb fifo push/pop API used pre-defined attributes within the
mlx5e_txqsq.
In order to share the skb fifo API with other, non-SQ use cases,
change the API to take a newly defined mlx5e_skb_fifo struct as input.
Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
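A minimal sketch of the extracted struct and the push/pop pair, assuming the fifo tracks producer/consumer counters and a power-of-two mask (field names follow the commit text but are unverified):

```c
#include <linux/skbuff.h>

struct mlx5e_skb_fifo {
	struct sk_buff **fifo;
	u16 *pc;	/* producer counter */
	u16 *cc;	/* consumer counter */
	u16 mask;	/* fifo size - 1, size being a power of two */
};

static void mlx5e_skb_fifo_push(struct mlx5e_skb_fifo *fifo,
				struct sk_buff *skb)
{
	fifo->fifo[(*fifo->pc)++ & fifo->mask] = skb;
}

static struct sk_buff *mlx5e_skb_fifo_pop(struct mlx5e_skb_fifo *fifo)
{
	return fifo->fifo[(*fifo->cc)++ & fifo->mask];
}
```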
In order to be able to create an SQ outside of a channel context, remove
sq->channel direct pointer. This requires adding a direct pointer to:
netdevice, priv and mlx5_core in order to support SQs that are part of
mlx5e_channel. Use channel_stats from the corresponding CQ.
Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In order to be able to create an RQ outside of a channel context, remove
rq->channel direct pointer. This requires adding a direct pointer to:
ICOSQ and priv in order to support RQs that are part of mlx5e_channel.
Use channel_stats from the corresponding CQ.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In order to be able to create a CQ outside of a channel context, remove
cq->channel direct pointer. This requires adding a direct pointer to
channel statistics, netdevice, priv and to mlx5_core in order to support
CQs that are a part of mlx5e_channel.
In addition, parameters that were previously derived from the channel,
like napi, NUMA node, channel stats and index, are now assembled in
struct mlx5e_create_cq_param, which is given to mlx5e_open_cq() instead
of a channel pointer. Generalizing mlx5e_open_cq() allows opening a CQ
outside of a channel context, which will be used in the following
patches of the patch-set.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The drop RQ has very few objects to be freed, and differs
from regular RQs in the context it is freed from.
Add a dedicated function for it, use it where needed, and remove
the drop_rq-specific checks in the generic function.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Saeed Mahameed says:
====================
mlx5-next auxbus support
This pull request is targeting net-next and rdma-next branches.
This series provides mlx5 support for auxiliary bus devices.
It starts with a merge commit of tag 'auxbus-5.11-rc1' from
gregkh/driver-core into mlx5-next, then the mlx5 patches that will convert
mlx5 ulp devices (netdev, rdma, vdpa) to use the proper auxbus
infrastructure instead of the internal mlx5 device and interface management
implementation, which Leon is deleting at the end of this patchset.
Link: https://lore.kernel.org/alsa-devel/20201026111849.1035786-1-leon@kernel.org/
Thanks to everyone for the joint effort!
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
RDMA/mlx5: Remove IB representors dead code
net/mlx5: Simplify eswitch mode check
net/mlx5: Delete custom device management logic
RDMA/mlx5: Convert mlx5_ib to use auxiliary bus
net/mlx5e: Connect ethernet part to auxiliary bus
vdpa/mlx5: Connect mlx5_vdpa to auxiliary bus
net/mlx5: Register mlx5 devices to auxiliary virtual bus
vdpa/mlx5: Make hardware definitions visible to all mlx5 devices
net/mlx5_core: Clean driver version and name
net/mlx5: Properly convey driver version to firmware
driver core: auxiliary bus: minor coding style tweaks
driver core: auxiliary bus: make remove function return void
driver core: auxiliary bus: move slab.h from include file
Add auxiliary bus support
====================
Link: https://lore.kernel.org/r/20201207053349.402772-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It turned out that mlxsw_sp_ipip_fib_entry_op_gre4() does not need to
figure out the IP address and virtual router id; those are exactly
the same as in the fib_entry it is called for. So just use them, and
reduce mlxsw_sp_ipip_fib_entry_op_gre4() to only call
mlxsw_sp_ipip_fib_entry_op_gre4_rtdp(), making the ipip decap op
code similar to nve.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The indicated version fixes an issue whereby the MOMTE register would by
default enable mirroring of ECN-marked traffic from all traffic classes,
once the ECN mirroring was configured. This fix is necessary for offload
of RED "ecn_mark" qevent.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suppresses the following coccinelle warning:
drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c:139:3-7:
WARNING use flexible-array member instead
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suppresses the following coccinelle warning:
drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c:18:15-19: WARNING use flexible-array member instead
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, mlxsw triggers the 'devlink:devlink_hwmsg' tracepoint
whenever a request is sent to the device and whenever a response is
received from it. However, the tracepoint is not triggered when an event
(e.g., port up / down) is received from the device.
Also trace EMAD events in order to log a more complete picture of all
the exchanged hardware messages.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case a router interface (RIF) is configured for a LAG, make sure its
configuration is applied on the new LAG member.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide the mlx5_core device instead of the "priv" pointer when checking
the eswitch mode.
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
After the conversion to the auxiliary bus, the custom device management
logic is not needed anymore; delete it.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
The conversion to the auxiliary bus solves a long-standing issue with
the existing mlx5_ib<->mlx5_core coupling, which required having both
modules in the initramfs if one of them was needed for boot.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>