mlx5_core_create_{cq/dct} functions are non-trivial mlx5 commands
functions. They check command execution status themselves and hide
valuable FW failure information.
For mlx5_core/eth kernel users this is what we actually want, but for a
devx/rdma user the hidden information is essential and should be propagated
up to the caller. Thus we convert these commands to use mlx5_cmd_do,
which returns the FW/driver and command outbox status as is, and lets the
caller decide what to do with it.
Kernel callers of mlx5_core_create_{cq/dct}, or those who only care about
the binary status (FAIL/SUCCESS), must check the status themselves via
mlx5_cmd_check() to restore the current behavior:
err = mlx5_create_cq(in, out)
err = mlx5_cmd_check(err, in, out)
if (err)
// handle err
DEVX users and those who care about full visibility will just propagate
the error to user space; the application can then check whether
err == -EREMOTEIO, in which case outbox.{status,syndrome} are valid.
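A minimal sketch of the full-visibility (DEVX-style) path, with the
plumbing simplified:

	err = mlx5_cmd_do(dev, cmd_in, in_len, cmd_out, out_len);
	if (err && err != -EREMOTEIO)
		return err;	/* command was never executed by FW */
	/* on -EREMOTEIO the outbox (status + syndrome) is valid and is
	 * copied to user space unchanged; err is propagated as is
	 */
	return err;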
API Note:
mlx5_cmd_check() must be used by kernel users since it allows the driver
to intercept the command execution status and return a driver simulated
status in case of driver induced error handling or reset/recovery flows.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add mlx5_cmd_do. Unlike mlx5_cmd_exec, this function will not modify
or translate outbox.status.
The function will return:
return = 0: Command was executed, outbox.status == MLX5_CMD_STAT_OK.
return = -EREMOTEIO: Executed, outbox.status != MLX5_CMD_STAT_OK.
return < 0: Command execution couldn't be performed by FW or driver.
And document other mlx5_cmd_exec functions.
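Conceptually, mlx5_cmd_exec() now behaves as mlx5_cmd_do() followed by
mlx5_cmd_check() (simplified sketch):

	int mlx5_cmd_exec(struct mlx5_core_dev *dev, void *in, int in_size,
			  void *out, int out_size)
	{
		int err = mlx5_cmd_do(dev, in, in_size, out, out_size);

		return mlx5_cmd_check(dev, err, in, out);
	}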
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Do not mangle the command outbox in the internal low level cmd_exec and
cmd_invoke functions.
Instead return a proper unique error code and move the driver error
checking to be at a higher level in mlx5_cmd_exec().
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
If the flow table isn't an autogrouped table, the upper driver has to create
the flow groups explicitly. Until now this information couldn't later be used
when creating rules in order to insert them into a specific flow group. Allow
such a use case.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
mlx5 has some unused static inline helpers in include/.
While at it, also clean up unused static inlines in the driver itself.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
struct mlx5_ifc_modify_tir_bitmask_bits is used for the bitmask
of MODIFY_TIR operations.
Remove the unused bitmask enums.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently IRQs are requested one by one. Balancing the spread of IRQs
among CPUs with such a scheme requires remembering the CPU mask of the
CPUs used for a given device. This would complicate the IRQ allocation
scheme introduced in a subsequent patch.
Hence, prepare the code for bulk IRQ allocation. This enables
spreading IRQs among CPUs in a subsequent patch.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When modifying TTL, the packet's csum has to be recalculated.
Due to a HW issue in ConnectX-5, csum recalculation for modify TTL
is supported through a work-around that has to be specifically enabled
by configuration.
If the work-around isn't enabled, ignore the modify TTL action
rather than adding an unsupported action.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Match on geneve_tlv_option_0_exist field on devices that support STEv1.
Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Add support for misc5 match parameter as per HW spec, this will allow
matching on tunnel_header fields.
Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Macro prefixes should be in capital letters - fix the prefix in
mlx5_FLEX_PARSER_MPLS_OVER_UDP_ENABLED.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Saeed Mahameed says:
====================
mlx5-next branch 2021-12-15
Hi Dave, Jakub, Jason
This pulls mlx5-next branch into net-next and rdma branches.
All patches already reviewed on both rdma and netdev mailing lists.
Please pull and let me know if there's any problem.
1) Add multiple FDB steering priorities [1]
2) Introduce HW bits needed to configure MAC list size of VF/SF.
Required for ("net/mlx5: Memory optimizations") upcoming series [2].
[1] https://lore.kernel.org/netdev/20211201193621.9129-1-saeed@kernel.org/
[2] https://lore.kernel.org/lkml/20211208141722.13646-1-shayd@nvidia.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A downstream patch will use this bit in order to know whether the device
supports changing max_uc_list.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This patch doesn't add additional namespaces, it just separates the
naming to be used by each FDB user, bypass and kernel.
Downstream patches will actually split this up and allow more than a
single priority for the bypass users.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The flow counters bulk query buffer is allocated once during
mlx5_fc_init_stats(). For PFs and VFs this buffer usually takes a little
more than 512KB of memory, which is aligned to the next power of 2, to
1MB. For SFs, this buffer is reduced and takes around 128 Bytes.
The buffer size determines the maximum number of flow counters that
can be queried at a time. Thus, having a bigger buffer can improve
performance for users that need to query many flow counters.
There are cases that don't use many flow counters and don't need a big
buffer (e.g. SFs, VFs). Since this size is critical with large scale,
in these cases the buffer size should be reduced.
In order to reduce memory consumption while maintaining query
performance, change the query buffer's allocation scheme to the
following:
- First allocate the buffer with small initial size.
- If the number of counters surpasses the initial size, resize the
buffer to the maximum size.
The buffer only grows and is never shrunk, because users with many flow
counters don't care about the buffer size and we don't want to add
resize overhead if the current number of counters drops.
This solution is preferable to the current one, which is less accurate
and only addresses SFs.
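The grow-only policy can be sketched as follows (names here are
illustrative, not the actual driver symbols):

	/* called before issuing a bulk query */
	if (num_counters > stats->bulk_query_len) {
		/* grow once, straight to the maximum size; never shrink */
		err = fc_bulk_query_buf_realloc(dev, MAX_BULK_QUERY_LEN);
		if (err)
			return err;
	}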
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Validate MRTC register is supported before triggering a delayed work
which accesses it.
Fixes: 5a1023deee ("net/mlx5: Add periodic update of host time to firmware")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
E-Switch encap mode is relevant only when in switchdev mode.
The RDMA driver can query the encap configuration via
mlx5_eswitch_get_encap_mode(). Make sure it returns the currently
used mode and not the set one.
This reverts the cited commit which reset the encap mode
on entering switchdev and fixes the original issue properly.
Fixes: 9a64144d68 ("net/mlx5: E-Switch, Fix default encap mode")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
CT creates a counter for each CT rule, and for each such counter,
fs_counters tries to queue mlx5_fc_stats_work() again via a
mod_delayed_work(0) call to refresh all counters. This call has a
large performance impact when reaching a high insertion rate and
accounts for ~8% of the insertion time when using software steering.
Allow skipping the refresh of all counters during counter creation.
Change CT to use this refresh skipping for its counters.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This patch adds the new CQE SHAMPO fields:
- flush: indicates that we must close the current session and pass the SKB
to the network stack.
- match: indicates that the current packet matches the opened session;
the packet will be merged into the current SKB.
- header_size: the size of the packet headers that were written into the
headers buffer.
- header_entry_index: the entry index in the headers buffer.
- data_offset: the packet's data offset in the WQE.
Also a new CQE handler is added to handle SHAMPO packets:
- The new handler uses the CQE SHAMPO fields to build the SKB.
The CQE's flush and match fields are not used in this patch; packets are
not merged yet.
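A rough sketch of how the new handler consumes these fields (names are
illustrative only):

	/* locate the packet headers that HW wrote to the headers buffer */
	hdr = headers_buf + cqe->header_entry_index * header_entry_size;
	skb = build_header_skb(hdr, cqe->header_size);
	/* attach the payload fragment, located by data_offset in the WQE */
	add_data_frag(skb, wqe, cqe->data_offset, payload_len);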
Signed-off-by: Khalid Manaa <khalidm@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This commit adds the needed definitions for using the klm_umr_wqe.
UMR stands for user-mode memory registration; it is a mechanism to alter
the address translation properties of an MKEY by posting a WorkQueueElement
(WQE) on a send queue.
MKEY stands for memory key; MKEYs are used to describe a region in memory that
can later be used by the HW.
KLM stands for {Key, Length, MemVa}; a KLM_MKEY is an indirect MKEY that allows
mapping multiple memory spaces with different sizes in a unified MKEY.
klm_umr_wqe is a UMR WQE that is used to update a KLM_MKEY.
The SHAMPO feature uses a KLM_MKEY for memory registration of its headers
buffer.
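For reference, a single KLM entry can be pictured as follows (conceptual
layout; field names are illustrative):

	struct klm {
		__be32 byte_count;	/* Length */
		__be32 mkey;		/* Key    */
		__be64 address;		/* MemVa  */
	};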
Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This series introduces a new packet merge type, therefore rename the lro
functions to packet merge to support the new merge type:
- Generalize + rename mlx5e_build_tir_ctx_lro to
mlx5e_build_tir_ctx_packet_merge.
- Rename mlx5e_modify_tirs_lro to mlx5e_modify_tirs_packet_merge.
- Rename lro bit in mlx5_ifc_modify_tir_bitmask_bits to packet_merge.
- Rename lro_en in mlx5e_params to packet_merge_type type and combine
packet_merge params into one struct mlx5e_packet_merge_param.
Signed-off-by: Khalid Manaa <khalidm@nvidia.com>
Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This commit adds the SHAMPO bit to hca_cap and the SHAMPO capabilities
structure, along with SHAMPO-related HW spec fields and enumerations.
SHAMPO stands for: split headers and merge payload offload.
SHAMPO new fields:
WQ:
- headers_mkey: mkey that represents the headers buffer, where the packets
headers will be written by the HW.
- shampo_enable: flag that indicates whether the WQ supports the SHAMPO feature.
- log_reservation_size: the log of the reservation size where the data of
the packet will be written by the HW.
- log_max_num_of_packets_per_reservation: log of the maximum number of
packets that can be written to the same reservation.
- log_headers_entry_size: log of the header entry size of the headers buffer.
- log_headers_buffer_entry_num: log of the number of entries in the headers buffer.
RQ:
- shampo_no_match_alignment_granularity: the HW alignment granularity
in case the received packet doesn't match the current session.
- shampo_match_criteria_type: the type of match criteria.
- reservation_timeout: the maximum time that the HW will hold the
reservation.
mlx5_ifc_shampo_cap_bits, the capabilities of the SHAMPO feature:
- shampo_log_max_reservation_size: the maximum allowed value of the field
WQ.log_reservation_size.
- shampo_log_min_reservation_size: the minimum allowed value of the field
WQ.log_reservation_size.
- shampo_min_mss_size: the minimum payload size of packet that can open
a new session or be merged to a session.
- shampo_max_log_headers_entry_size: the maximum allowed value of the field
WQ.log_headers_entry_size
Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
TIR stands for transport interface receive; the TIR object is
responsible for performing all transport-related operations on
the receive side, like packet processing, demultiplexing the packets
to different RQs, etc.
lro_timeout is a field in the TIR that is used to set the timeout for an lro
session. This series introduces a new packet merge type, therefore rename
lro_timeout to packet_merge_timeout for all packet merge types.
Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
revert commit 46ae40b94d ("net/mlx5: Let user configure io_eq_size param")
revert commit a6cb08daa3 ("net/mlx5: Let user configure event_eq_size param")
revert commit 5546040619 ("net/mlx5: Let user configure max_macs param")
The EQE parameters are applicable to more drivers, they should
be configured via standard API, probably ethtool. Example of
another driver needing something similar:
https://lore.kernel.org/all/1633454136-14679-3-git-send-email-sbhatta@marvell.com/
The last param for "max_macs" is probably fine but the documentation
is severely lacking. The meaning and implications for changing the
param need to be stated.
Link: https://lore.kernel.org/r/20211026152939.3125950-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, max_macs takes 70Kbytes of memory per function. This
size is not needed in all use cases, and is critical with large scale.
Hence, allow the user to configure the number of max_macs.
For example, to reduce the number of max_macs to 1, execute::
$ devlink dev param set pci/0000:00:0b.0 name max_macs value 1 \
cmode driverinit
$ devlink dev reload pci/0000:00:0b.0
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Event EQ is an EQ which receives notifications of almost all the
events generated by the NIC.
Currently, each event EQ takes 512KB of memory. This size is not
needed in most use cases, and is critical with large scale. Hence,
allow the user to configure the size of the event EQ.
For example to reduce event EQ size to 64, execute::
$ devlink resource set pci/0000:00:0b.0 path /event_eq_size/ size 64
$ devlink dev reload pci/0000:00:0b.0
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, each I/O EQ takes 128KB of memory. This size
is not needed in all use cases, and is critical with large scale.
Hence, allow the user to configure the size of the I/O EQs.
For example, to reduce I/O EQ size to 64, execute:
$ devlink resource set pci/0000:00:0b.0 path /io_eq_size/ size 64
$ devlink dev reload pci/0000:00:0b.0
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Firmware logs its asserts also to non-volatile memory. In order to
reduce drift between the NIC and the host, the driver sets the host
epoch-time to the firmware every hour.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
A small series to clean up the mlx5 mkey code across the mlx5_core and
InfiniBand.
* branch 'mlx5_mkey':
RDMA/mlx5: Attach ndescs to mlx5_ib_mkey
RDMA/mlx5: Move struct mlx5_core_mkey to mlx5_ib
RDMA/mlx5: Replace struct mlx5_core_mkey by u32 key
RDMA/mlx5: Remove pd from struct mlx5_core_mkey
RDMA/mlx5: Remove size from struct mlx5_core_mkey
RDMA/mlx5: Remove iova from struct mlx5_core_mkey
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Both multipath and bonding events change the HW LAG state
independently.
Handling one feature's events while the other is already
enabled can cause unwanted behavior, for example handling a
bonding event while multipath is enabled will disable the LAG and
cause multipath to stop working.
Fix it by ignoring bonding event while in multipath and ignoring FIB
events while in bonding mode.
Fixes: 544fe7c2e6 ("net/mlx5e: Activate HW multipath and handle port affinity based on FIB events")
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Move mlx5_core_mkey struct to mlx5_ib, as the mlx5_core doesn't use it
at this point.
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
In mlx5_core and vdpa there is no use of mlx5_core_mkey members except
for the key itself.
As preparation for moving mlx5_core_mkey to mlx5_ib, the occurrences of
struct mlx5_core_mkey in all modules except for mlx5_ib are replaced by
a u32 key.
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
There is no read of mkey->pd, only write. Remove it.
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
mkey->size is already stored in ibmr->length, no need to store it here.
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
iova is already stored in ibmr->iova, no need to store it here.
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
The uplink destination type should be used in rules to steer the
packet to the uplink when the device is in steering based LAG mode.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Introduce new APIs to create and destroy a flow matcher
for a given format id.
The flow match definer object is used for defining the fields and
mask used for the hash calculation. The user should mask the desired
fields as done in the match criteria.
This object is assigned to a flow group of type hash. In this flow
group type, packet lookup is done based on the hash result.
This patch also adds the required bits to create such a flow group.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add a new port selection flow steering namespace. Flow steering rules in
this namespace are used to determine the physical port for egress
packets.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Using "native_port_num" can support more NICs.
Fallback to PCIe IDs if "native_port_num" query fails.
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, when a user disables RoCE via the devlink param, this change
isn't passed down to the device.
If the device allows disabling RoCE at the device level, make use of it. This
instructs the device to skip memory allocations related to RoCE
functionality which are otherwise done by the device.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Replace hard-coded timeouts with values stored in the firmware's init
segment. Timeouts are read from the init segment during driver load. If init
segment timeouts are not supported, then fall back to hard-coded defaults
instead. Also move pre-initialization timeouts, which cannot be read from
firmware, to the new mechanism.
Signed-off-by: Amir Tzin <amirtz@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add needed structures and defines for DTOR (default timeouts register).
This will be used to get timeouts values from FW instead of hard coded
values in the driver code thus enabling support for slower devices which
need longer timeouts.
Signed-off-by: Amir Tzin <amirtz@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
tools/testing/selftests/net/ioam6.sh
7b1700e009 ("selftests: net: modify IOAM tests for undef bits")
bf77b1400a ("selftests: net: Test for the IOAM encapsulation with IPv6")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Due to current HW arch limitations, RX-FCS (scattering FCS frame field
to software) and RX-port-timestamp (improved timestamp accuracy on the
receive side) can't work together.
RX-port-timestamp is not controlled by the user and it is enabled by
default when supported by the HW/FW.
This patch sets RX-port-timestamp to the opposite of the RX-FCS configuration.
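In other words (sketch; the helper name is an assumption):

	/* RX port timestamping is enabled only when RX-FCS is off */
	mlx5e_set_rx_port_ts(mdev, !(netdev->features & NETIF_F_RXFCS));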
Fixes: 102722fc68 ("net/mlx5e: Add support for RXFCS feature flag")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add additional flow steering priorities in the RDMA namespace.
This allows adding flow counters to count filtered RDMA traffic and then
continue processing in the regular RDMA steering flow.
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Adding bth_opcode field and the relevant bits. This field will be used
to capture and count congestion notification packets (CNP).
Adding source_vhca_port support bit.
This field will be used to check the capability to use the
source_vhca_port as a match criterion in cases of dual port.
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Prior to this patch the driver required two IRQs to function properly,
one IRQ for control and at least one IRQ for IO.
This requirement can be relaxed to one, as the driver now allows
sharing of IRQs, so control and IO EQs can share the same IRQ.
This is needed for a high scale number of VFs.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Control IRQ is the first IRQ vector. This complicates handling of
completion IRQs as we need to offset them by one.
In the next patch, there are scenarios where completion and control EQs
will share the same IRQ, for example: functions with a single IRQ. To ease
such scenarios, we shift the control IRQ to the end of the IRQ array.
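An illustrative index mapping after the change (helper names are made up):

	/* completion vector i now maps directly to irq i */
	static int comp_irq_index(int vector) { return vector; }
	/* the control irq moves to the last entry of the irq array */
	static int ctrl_irq_index(int nvec) { return nvec - 1; }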
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
On ingress VLAN push also assign value 0x7FE to the reg_c1 tunnel id+opts
bits (tunnel id 0, which is not a valid tunnel id, and option 0x7FE which
was reserved by one of the previous patches in the series). In a following
patch the reg value is matched on egress miss to restore the packet to its
original state by removing the VLAN before passing it to the software data
path.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The UID field was added to the alloc_uar and dealloc_uar PRM commands to
specify the DevX UID for the UAR. This change enables the firmware to
validate user access to its own UAR resources.
For kernel-allocated UARs the UID will stay 0, as of today.
Signed-off-by: Meir Lichtinger <meirl@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Add uid field to mlx5_ifc_alloc_uar_in_bits and
mlx5_ifc_dealloc_uar_out_bits structs.
This field will be used by FW to manage UAR according to uid.
Signed-off-by: Meir Lichtinger <meirl@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Extend eswitch API with rate limiting groups:
- Define new struct mlx5_esw_rate_group that is used to hold all
internal group data.
- Implement functions that allow creation, destruction and cleanup of
groups.
- Assign all vports to internal unlimited zero group by default.
This commit lays the groundwork for group rate limiting by implementing
devlink_ops->rate_node_{new|del}() callbacks to support creating and
deleting groups through devlink rate node objects. APIs that allow
setting rates and adding/removing members are implemented in the following
patches.
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Huy Nguyen <huyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Fixes in virtio,vhost,vdpa drivers.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Fixes in virtio, vhost, and vdpa drivers"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vdpa/mlx5: Fix queue type selection logic
vdpa/mlx5: Avoid destroying MR on empty iotlb
tools/virtio: fix build
virtio_ring: pull in spinlock header
vringh: pull in spinlock header
virtio-blk: Add validation for block size in config space
vringh: Use wiov->used to check for read/write desc order
virtio_vdpa: reject invalid vq indices
vdpa: Add documentation for vdpa_alloc_device() macro
vDPA/ifcvf: Fix return value check for vdpa_alloc_device()
vp_vdpa: Fix return value check for vdpa_alloc_device()
vdpa_sim: Fix return value check for vdpa_alloc_device()
vhost: Fix the calculation in vhost_overflow()
vhost-vdpa: Fix integer overflow in vhost_vdpa_process_iotlb_update()
virtio_pci: Support surprise removal of virtio pci device
virtio: Protect vqs list access
virtio: Keep vring_del_virtqueue() mirror of VQ create
virtio: Improve vq->broken access to avoid any compiler optimization
Currently mlx5_core_dev contains an array of capabilities. It contains 19
valid capabilities of the device, 2 reserved entries and 12 holes.
Because of this, for the 14 unused entries, mlx5_core_dev allocates
14 * 8K = 112K bytes of memory which is never used, and the mlx5_core_dev
structure size is 270Kbytes odd. This allocation further aligns to the next
power of 2, to 512Kbytes.
By skipping non-existent entries,
(a) 112Kbyte is saved,
(b) mlx5_core_dev reduces to 8KB with alignment
(c) 350KB saved in alignment
In the future, individual capability allocation can be used to skip
allocation when such a capability is disabled at the device level. This
patch prepares mlx5_core_dev to hold capabilities using a pointer instead
of an inline array.
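Sketch of the change (struct and field names are illustrative):

	/* before: capabilities embedded inline, holes allocated too */
	struct mlx5_hca_cap caps[MLX5_CAP_NUM];
	/* after: only existing capability types get an allocation */
	struct mlx5_hca_cap *caps[MLX5_CAP_NUM];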
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In the current code, the current and maximal capabilities are
maintained in separate arrays which are both per type. In order to
allow the creation of such a basic structure as a dynamically
allocated array, we move curr and max fields to a unified
structure so that specific capabilities can be allocated as one unit.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The new mlx5_core device structure is allocated through devlink_alloc
with kzalloc, and that ensures that all fields are equal to zero,
including ->state.
That means that checks of that field in mlx5_init_one() are
completely redundant, because that function is called only once
at the beginning of the mlx5_core_dev lifetime.
PCI:
 .probe()
  -> probe_one()
   -> mlx5_init_one()
The recovery flow can't run at that time or before it, because the relevant
work is initialized later in mlx5_init_once().
Such an initialization flow ensures that dev->state can't be
MLX5_DEVICE_STATE_UNINITIALIZED at all, so remove such impossible
checks.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
get_queue_type() comments that split virtqueue is preferred; however,
the actual logic preferred packed virtqueues. Since firmware has not
supported packed virtqueues, we ended up using split virtqueues as was
desired.
Since we do not advertise support for packed virtqueues, we add a check
to verify split virtqueues are indeed supported.
Fixes: 1a86b377aa ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Link: https://lore.kernel.org/r/20210811053759.66752-1-elic@nvidia.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Saeed Mahameed says:
====================
pull-request: mlx5-next 2021-08-09
This pulls mlx5-next branch which includes patches already reviewed on
net-next and rdma mailing lists.
1) mlx5 single E-Switch FDB for lag
2) IB/mlx5: Rename is_apu_thread_cq function to is_apu_cq
3) Add DCS caps & fields support
[1] https://patchwork.kernel.org/project/netdevbpf/cover/20210803231959.26513-1-saeed@kernel.org/
[2] 0e3364dab7.1626609184.git.leonro@nvidia.com/
[3] 55e1d69bef.1624258894.git.leonro@nvidia.com/
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Lag, Create shared FDB when in switchdev mode
net/mlx5: E-Switch, add logic to enable shared FDB
net/mlx5: Lag, move lag destruction to a workqueue
net/mlx5: Lag, properly lock eswitch if needed
net/mlx5: Add send to vport rules on paired device
net/mlx5: E-Switch, Add event callback for representors
net/mlx5e: Use shared mappings for restoring from metadata
net/mlx5e: Add an option to create a shared mapping
net/mlx5: E-Switch, set flow source for send to uplink rule
RDMA/mlx5: Add shared FDB support
{net, RDMA}/mlx5: Extend send to vport rules
RDMA/mlx5: Fill port info based on the relevant eswitch
net/mlx5: Lag, add initial logic for shared FDB
net/mlx5: Return mdev from eswitch
IB/mlx5: Rename is_apu_thread_cq function to is_apu_cq
net/mlx5: Add DCS caps & fields support
====================
Link: https://lore.kernel.org/r/20210809202522.316930-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The CQ destroy is performed based on the IRQ number that is stored in
cq->irqn. That number wasn't set explicitly during CQ creation and, as
expected, some of the API users of mlx5_core_create_cq() forgot to update
it.
This caused a synchronization call on the wrong IRQ, number 0, instead of
the real one.
As a fix, set the IRQ number directly in the mlx5_core_create_cq() and
update all users accordingly.
Fixes: 1a86b377aa ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Fixes: ef1659ade3 ("IB/mlx5: Add DEVX support for CQ events")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This callback will allow notifying representors about relevant events
when in OFFLOADS mode. In downstream patches, this will be used to notify
about PAIR/UNPAIR devcom events.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In shared FDB there is only one eswitch which is active, and it receives
traffic from all representors and all vports in the HCA.
While the Ethernet representor will always reside on its native PF,
the IB representor will not. Extend send-to-vport rule creation to
support such flows. We need to account for the source vport that sends the
traffic (on which the representor resides) and the target eswitch
which the traffic reaches.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
As shared FDB requires changes in two subsystems, first expose the needed
core functions so the RDMA side can be changed.
mlx5_lag_is_master(): Returns true if a given mlx5 device is the LAG master.
mlx5_lag_is_shared_fdb(): Returns true if the LAG mode is shared FDB.
mlx5_lag_get_peer_mdev(): Returns the peer mdev in the LAG.
The mentioned functions will be used by downstream patches in order
to add support for shared FDB for the RDMA side.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Export a function so users can retrieve the Mellanox device that manages
the eswitch from the eswitch device.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Now that the TTC logic is not dependent on mlx5e structs, move it to
lib/fs_ttc.c so it can be used by other parts of the mlx5 driver.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This commit does a cleanup in LRO configuration.
LRO is a parameter of an RQ, but its state is changed by modifying a TIR
related to the RQ.
The current status: LRO for tunneled packets is not supported in the
driver, inner TIRs may enable LRO on creation, but LRO status of inner
TIRs isn't changed in mlx5e_modify_tirs_lro(). This is inconsistent, but
as long as the firmware doesn't declare support for tunneled LRO, it
works, because the same RQs are shared between the inner and outer TIRs.
This commit does two fixes:
1. If the firmware has the tunneled LRO capability, LRO is blocked
altogether, because it's not possible to block it for inner TIRs only,
when the same RQs are shared between inner and outer TIRs, and the
driver won't be able to handle tunneled LRO traffic.
2. mlx5e_modify_tirs_lro() is patched to modify LRO state for all TIRs,
including inner ones, because all TIRs related to an RQ should agree on
their LRO state.
Fixes: 7b3722fa9e ("net/mlx5e: Support RSS for GRE tunneled packets")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
is_apu_thread_cq() used to detect CQs which are attached to APU
threads. This was extended to support other elements as well,
so the function was renamed to is_apu_cq().
c_eqn_or_apu_element was extended from 8 bits to 32 bits, which wasn't
reflected when the APU support was first introduced.
Acked-by: Michael S. Tsirkin <mst@redhat.com> # vdpa
Signed-off-by: Tal Gilboa <talgi@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
These fields will be needed when adding support for DCS offload:
max_dci_stream_channels - maximum DCI stream channels supported per DCI.
max_dci_errored_streams - maximum DCI errored stream channels
supported per DCI before the DCI moves to an error state.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Currently all resources must be created with uid != 0, which is essential
when userspace processes are allocating virtqueue resources. Since this
is a kernel implementation, it is perfectly legal to open resources with
uid == 0.
In case firmware supports it, avoid allocating a user context.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Link: https://lore.kernel.org/r/20210531160404.31368-1-elic@nvidia.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
This PR contains a replacement driver for Intel iWarp hardware. This new
driver supports the old ethernet hardware and also newer chips that can do
ROCE. Otherwise this contains the typical mix of patches:
- Driver updates and cleanups for bnxt_re, cxgb4, mlx4, and mlx5
- Many static checker driven code clean ups, including a wide refcount_t
conversion
- Several series for the hns driver, more HIP09 HW capabilities, migration
to new HW register manipulators, and code cleanups
- Minor fixes and improvements in srp, rts, and cm
- Improvements throughout for sysfs related code to use DEVICE_ATTR_*,
make the ib_port sysfs first-class, and overall use sysfs APIs properly
- Intel's new irdma driver replacing i40iw
- rxe general clean ups and Memory Window support
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This contains a replacement driver for Intel iWarp hardware. This new
driver supports the old ethernet hardware and also newer chips that
can do ROCE.
Other than that, this contains the typical mix of patches:
- Driver updates and cleanups for bnxt_re, cxgb4, mlx4, and mlx5
- Many static checker driven code clean ups, including a wide
refcount_t conversion
- Several series for the hns driver, more HIP09 HW capabilities,
migration to new HW register manipulators, and code cleanups
- Minor fixes and improvements in srp, rts, and cm
- Improvements throughout for sysfs related code to use
DEVICE_ATTR_*, make the ib_port sysfs first-class, and overall use
sysfs APIs properly
- Intel's new irdma driver replacing i40iw
- rxe general clean ups and Memory Window support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (211 commits)
RDMA/core: Always release restrack object
RDMA/mlx5: Don't access NULL-cleared mpi pointer
RDMA/irdma: Fix potential overflow expression in irdma_prm_get_pbles
RDMA/irdma: Check contents of user-space irdma_mem_reg_req object
RDMA/rxe: Missing unlock on error in get_srq_wqe()
RDMA/cma: Fix rdma_resolve_route() memory leak
RDMA/core/sa_query: Remove unused argument
RDMA/cma: Fix incorrect Packet Lifetime calculation
RDMA/cma: Protect RMW with qp_mutex
RDMA/cma: Remove unnecessary INIT->INIT transition
RDMA/hns: Add window selection field of congestion control
RDMA/hfi1: Remove use of kmap()
RDMA/irdma: Remove use of kmap()
RDMA/bnxt_re: Fix uninitialized struct bit field rsvd1
IB/isert: Align target max I/O size to initiator size
RDMA/hns: Fix incorrect vlan enable bit in QPC
MAINTAINERS: Update Broadcom RDMA maintainers
RDMA/irdma: Use the queried port attributes
RDMA/rxe: Fix redundant skb_put_zero
RDMA/rxe: Fix extra copy in prepare_ack_packet
...
Aharon Landau says:
====================
In case the device supports only real-time timestamps, the kernel will fail
to create a QP even though rdma-core requested such a timestamp type.
This is because the device returns a free-running timestamp, and the
conversion from free-running to real-time is performed in user space.
This series fixes it by returning a real-time timestamp.
====================
* mlx5_realtime_ts:
RDMA/mlx5: Support real-time timestamp directly from the device
RDMA/mlx5: Refactor get_ts_format functions to simplify code
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
QPC, SQC and RQC timestamp formats and capabilities are always equal
because they represent general hardware support. So instead of code
duplication, let's merge them into a general enum and logic.
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Trivial conflicts in net/can/isotp.c and
tools/testing/selftests/net/mptcp/mptcp_connect.sh
scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c
to include/linux/ptp_clock_kernel.h in -next so re-apply
the fix there.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Running the devlink reload command for a port in switchdev mode causes
resource corruption: the driver can't release the allocated EQ and reclaim
memory pages, because the "rdma" auxiliary device has added CQs which block
the EQ from deletion.
The erroneous sequence happens during the reload-down phase, and is as
follows:
1. detach device - suspends auxiliary devices which support it, destroys
others. During this step "eth-rep" and "rdma-rep" are destroyed,
"eth" - suspended.
2. disable SRIOV - moves the device to legacy mode; as part of disablement -
rescans drivers. This step adds the "rdma" auxiliary device.
3. destroy EQ table - <failure>.
The driver shouldn't create any device during unload flows. To handle that,
implement the MLX5_PRIV_FLAGS_DETACH flag, set it on device detach and unset
it on device attach. If the flag is set, do a no-op on driver rescan.
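The rescan no-op can be sketched as follows (simplified):

	/* during auxiliary drivers rescan */
	if (dev->priv.flags & MLX5_PRIV_FLAGS_DETACH)
		return 0;	/* device is detaching; don't add new aux devices */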
Fixes: a925b5e309 ("net/mlx5: Register mlx5 devices to auxiliary virtual bus")
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
FW now supports more than 256 MSI-X vectors per PF (up to 2K).
Hence, enlarge the interrupt field in CREATE_EQ to make use of the new
MSI-X vectors.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The users of EQ are running their code on different CPUs and with
various affinity patterns. Move the cpumask setting close to their
actual usage.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Create new files bridge.{c|h} in the en/rep directory that implement bridge
interaction with representor netdevices and handle the required
events/notifications, and bridge.{c|h} in the esw directory that implement all
the necessary eswitch offloading infrastructure and work on the vport/eswitch
level. Provide a new kconfig MLX5_BRIDGE which is automatically selected when
both the kernel bridge and mlx5 eswitch configs are enabled.
Provide basic infrastructure for bridge offloads:
- struct mlx5_esw_bridge_offloads - per-eswitch bridge offload structure
that encapsulates generic bridge-offloads data (notifier blocks, ingress
flow table/group, etc.) that is created/deleted on enable/disable eswitch
offloads.
- struct mlx5_esw_bridge - per-bridge structure that encapsulates
per-bridge data (reference counter, FDB, egress flow table/group, etc.)
that is created when the first eswitch representor is attached to a new
bridge and deleted when the last representor is removed from the bridge as
a result of a NETDEV_CHANGEUPPER event.
The bridge tables are created with a new priority, FDB_BR_OFFLOAD, in the
FDB namespace. The new priority is between the tc-miss and slow path
priorities. The priority consists of two levels: the ingress table, which is
global per eswitch, matches incoming packets by src_mac/vid and redirects
them to the next level (the egress table), which is chosen according to the
ingress port's bridge membership and matches on dst_mac/vid in order to
redirect the packet to a vport according to the following diagram:
+
|
+---------v----------+
| |
| FDB_TC_OFFLOAD |
| |
+---------+----------+
|
|
+---------v----------+
| |
| FDB_FT_OFFLOAD |
| |
+---------+----------+
|
|
+---------v----------+
| |
| FDB_TC_MISS |
| |
+---------+----------+
|
+--------------------------------------+
| | |
| +------+ |
| | |
| +------v--------+ FDB_BR_OFFLOAD |
| | INGRESS_TABLE | |
| +------+---+----+ |
| | | match |
| | +---------+ |
| | | | +-------+
| | +-------v-------+ match | | |
| | | EGRESS_TABLE +------------> vport |
| | +-------+-------+ | | |
| | | | +-------+
| | miss | |
| +------+------+ |
| | |
+--------------------------------------+
|
|
+---------v----------+
| |
| FDB_SLOW_PATH |
| |
+---------+----------+
|
v
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In order to adhere to the kernel software datapath model, bridge offloads must
come after the TC and NF FDBs. Following patches in this series add a new FDB
priority for the bridge after FDB_FT_OFFLOAD. However, since netfilter offload
is implemented with unmanaged tables, its miss path is not automatically
connected to the next priority and requires the code to manually connect it
with the slow table. To keep bridge offloads encapsulated and not mix them with
eswitch offloads, create a new FDB_TC_MISS priority between FDB_FT_OFFLOAD
and FDB_SLOW_PATH:
+
|
+---------v----------+
| |
| FDB_TC_OFFLOAD |
| |
+---------+----------+
|
|
|
+---------v----------+
| |
| FDB_FT_OFFLOAD |
| |
+---------+----------+
|
|
|
+---------v----------+
| |
| FDB_TC_MISS |
| |
+---------+----------+
|
|
|
+---------v----------+
| |
| FDB_SLOW_PATH |
| |
+---------+----------+
|
v
Initialize the new priority with a single default empty managed table and
use the table as the TC/NF miss path instead of the slow table. This approach
allows bridge offloads to be created as a new FDB namespace priority between
FDB_TC_MISS and FDB_SLOW_PATH without exposing its internal tables to any
other modules, since the miss path of the managed TC-miss table is
automatically wired to the next priority.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Adding new reformat context type (INSERT_HEADER) requires adding two new
parameters to reformat context - reformat_param_0 and reformat_param_1.
As defined by the HW spec, these parameters have different meanings for
different reformat context types.
The first parameter (reformat_param_0) is not new to HW spec, but it
wasn't used by any of the supported reformats. The second parameter
(reformat_param_1) is new to the HW spec - it was added to allow
supporting INSERT_HEADER.
For INSERT_HEADER, reformat_param_0 indicates the header used to
reference the location of the inserted header, and reformat_param_1
indicates the offset of the inserted header from the reference point
defined by reformat_param_0.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add support for HCA caps 2, which contains capabilities for the new
insert/remove header actions.
Add the required definitions for supporting the new reformat type:
packet reformat parameters, reformat anchors and definitions
to allow copy/set into the inserted EMD (Embedded MetaData) tag.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
A flow table that contains a flow pointing to multiple flow tables or
multiple TIRs must have a level lower than 64. In our case this applies to
the multi-destination flow table.
Fix the level of the created table to comply with HW Spec definitions, and
still make sure that its level is lower than SW-owned tables, so that it
would be possible to point from the multi-destination FW table to SW
tables.
Fixes: 34583beea4 ("net/mlx5: DR, Create multi-destination table for SW-steering use")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Misc update for mlx5 driver,
1) Clean up patches for lag and SF
2) Reserve bit 31 in steering register C1 for IPSec offload usage
3) Move steering tables pool logic into the steering core and
increase the maximum table size to 2G entries when software steering
is enabled.
Merge tag 'mlx5-updates-2021-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2021-05-26
Misc update for mlx5 driver,
1) Clean up patches for lag and SF
2) Reserve bit 31 in steering register C1 for IPSec offload usage
3) Move steering tables pool logic into the steering core and
increase the maximum table size to 2G entries when software steering
is enabled.
* tag 'mlx5-updates-2021-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Fix lag port remapping logic
net/mlx5: Use boolean arithmetic to evaluate roce_lag
net/mlx5: Remove unnecessary spin lock protection
net/mlx5: Cap the maximum flow group size to 16M entries
net/mlx5: DR, Set max table size to 2G entries
net/mlx5: Move chains ft pool to be used by all firmware steering
net/mlx5: Move table size calculation to steering cmd layer
net/mlx5: Add case for FS_FT_NIC_TX FT in MLX5_CAP_FLOWTABLE_TYPE
net/mlx5: DR, Remove unused field of send_ring struct
net/mlx5e: RX, Remove unnecessary check in RX CQE compression handling
net/mlx5e: IPsec/rep_tc: Fix rep_tc_update_skb drops IPsec packet
net/mlx5e: TC: Reserved bit 31 of REG_C1 for IPsec offload
net/mlx5e: TC: Use bit counts for register mapping
net/mlx5: CT: Avoid reusing modify header context for natted entries
net/mlx5e: CT, Remove newline from ct_dbg call
====================
Link: https://lore.kernel.org/r/20210527185624.694304-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Firmware FT pool is per device, but the software tracking of this pool
only services fs_chains users, and if another layer takes a flow table,
the pool will not be updated, and fs_chains will fail creating a flow
table, with no recovery till the flow table is returned.
Move FT pool to be global per device, and stored at the cmd level,
so all layers can use it.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently ASAP features fully utilize all the bits of the CQE's flow tag
and ft_metadata field. The flow tag field cannot be used because the
flow table tagging in the FTE does not allow partial writes.
We agree to reserve bit 31 of the CQE's ft_metadata for IPsec to prevent
ASAP CT from dropping IPsec offloaded packets.
Here is the new bit layout of REG_C1. Tunnel option id is reduced to
11 bits:
< IPSEC MARKER (1) | ESW_TUN_ID(12) | ESW_TUN_OPTS(11) | ESW_ZONE_ID(8) >
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
net/mlx5: Expose MPFS configuration API
MPFS is the multi physical function switch that bridges traffic between
the physical port and any physical functions associated with it. The
driver is required to add or remove MAC entries to properly forward
incoming traffic to the correct physical function.
We export the API to control MPFS so that other drivers, such as
mlx5_vdpa are able to add MAC addresses of their network interfaces.
The MAC address of the vdpa interface must be configured into the MPFS L2
address. Failing to do so could cause, in some NIC configurations, failure
to forward packets to the vdpa network device instance.
Fix this by adding calls to update the MPFS table.
CC: <mst@redhat.com>
CC: <jasowang@redhat.com>
CC: <virtualization@lists.linux-foundation.org>
Fixes: 1a86b377aa ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
mlx5_core_dev holds a pointer to a static profile, hence when the
log_max_qp of the profile is overridden by some device, it
affects all other mlx5 devices that share the same profile.
Fix it by having a profile instance for every mlx5 device.
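Sketch of the fix (simplified):

	/* before: dev->profile points at a shared static profile, so writing
	 * dev->profile->log_max_qp affects every device using that profile
	 */
	/* after: each device carries its own copy */
	dev->profile = profile[prof_sel];	/* struct copy into the device */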
Fixes: 883371c453 ("net/mlx5: Check FW limitations on log_max_qp before setting it")
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This is significantly bug fixes and general cleanups. The noteworthy new
features are fairly small:
- XRC support for HNS and improves RQ operations
- Bug fixes and updates for hns, mlx5, bnxt_re, hfi1, i40iw, rxe, siw and
qib
- Quite a few general cleanups on spelling, error handling, static checker
detections, etc
- Increase the number of device ports supported beyond 255. High port
count software switches now exist
- Several bug fixes for rtrs
- mlx5 Device Memory support for host controlled atomics
- Report SRQ tables through to rdma-tool
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This is significantly bug fixes and general cleanups. The noteworthy
new features are fairly small:
- XRC support for HNS and improves RQ operations
- Bug fixes and updates for hns, mlx5, bnxt_re, hfi1, i40iw, rxe, siw
and qib
- Quite a few general cleanups on spelling, error handling, static
checker detections, etc
- Increase the number of device ports supported beyond 255. High port
count software switches now exist
- Several bug fixes for rtrs
- mlx5 Device Memory support for host controlled atomics
- Report SRQ tables through to rdma-tool"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (145 commits)
IB/qib: Remove redundant assignment to ret
RDMA/nldev: Add copy-on-fork attribute to get sys command
RDMA/bnxt_re: Fix a double free in bnxt_qplib_alloc_res
RDMA/siw: Fix a use after free in siw_alloc_mr
IB/hfi1: Remove redundant variable rcd
RDMA/nldev: Add QP numbers to SRQ information
RDMA/nldev: Return SRQ information
RDMA/restrack: Add support to get resource tracking for SRQ
RDMA/nldev: Return context information
RDMA/core: Add CM to restrack after successful attachment to a device
RDMA/cma: Skip device which doesn't support CM
RDMA/rxe: Fix a bug in rxe_fill_ip_info()
RDMA/mlx5: Expose private query port
RDMA/mlx4: Remove an unused variable
RDMA/mlx5: Fix type assignment for ICM DM
IB/mlx5: Set right RoCE l3 type and roce version while deleting GID
RDMA/i40iw: Fix error unwinding when i40iw_hmc_sd_one fails
RDMA/cxgb4: add missing qpid increment
IB/ipoib: Remove unnecessary struct declaration
RDMA/bnxt_re: Get rid of custom module reference counting
...
Total vports are already stored during eswitch initialization. Instead
of calculating them every time, read them directly from the eswitch.
Additionally, the host PF's SF vport information is available using the
QUERY_HCA_CAP command. It is not available through the HCA_CAP of the
eswitch manager PF.
Hence, this patch prepares returning the total eswitch vport count from the
existing eswitch struct.
This further helps to keep eswitch port counting macros and logic within
eswitch.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
mlx5_eswitch_get_total_vports() doesn't honor the MLX5_ESWITCH Kconfig flag.
When MLX5_ESWITCH is disabled, the FS layer continues to initialize eswitch-
specific ACL namespaces.
Instead, start honoring the MLX5_ESWITCH flag and perform vport-specific
initialization only when the vport count is non-zero.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When using SW steering, the rule insertion rate depends on the performance
of the RDMA RC QP used for writing to the ICM. During stress this QP
competes for HW resources with all the other QPs that are used to send data.
To protect the SW steering QP's performance in such cases, we set this QP to
use an isolated VL. The VL number is reserved by FW and is not exposed to the
driver.
Support for this QP on isolated VL exists only when both force-loopback and
isolate_vl_tc capabilities are set.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When supported by the device, the SW steering RoCE RC QP that is used to
write/read to/from the ICM will be created with the force-loopback attribute.
Such a QP doesn't require a GID index upon creation.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>