Some mlx5 devices do not support the default advertised maximum frequency
adjustment value for the PTP hardware clock that is set by the driver.
These devices need to be queried when initializing the clock functionality
in order to get the maximum supported frequency adjustment value. This
value can be greater than the minimum supported frequency adjustment across
mlx5 devices (50 million ppb).
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
vdpa/mlx5:
VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
new maintainer
vdpa:
support for vq descriptor mappings
decouple reset of iotlb mapping from device reset
fixes, cleanups all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmVCUMYPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRp4L0H/RKcnNXPRqzhhBI1XVQ11Z8CO8WjovcmJalu
ADHNEGmvuWnY79fp9eLiZ4iVaTx1qbzqIB5Q500DJ65jh71W7UQ8ww6CGjNUoRGs
Zoe4G09WoOf4bvDZZzVV7ml/AzMdsHWSZK8pxY3QI9CsC9Zfp9hg20QYxPylCqYx
SIJx7w2MkoojfmtOHRx1WUxaQz99yfU4Z0C5PxtRE1HGN6/a1aY0P0CAl5jq8uCK
U5sCRsfCmP7VKlspeEddMiPA35ADbCiysSobCbwGVQEs5cHpMUX7KWa+oV0tF/PY
9uyJb2rJy6zG3tXmL4XNib665ZR86HX6qiWRfm2nBQQStuHaJyg=
=mXgo
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"vhost,virtio,vdpa: features, fixes, cleanups.
vdpa/mlx5:
- VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
- new maintainer
vdpa:
- support for vq descriptor mappings
- decouple reset of iotlb mapping from device reset
and fixes, cleanups all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (34 commits)
vdpa_sim: implement .reset_map support
vdpa/mlx5: implement .reset_map driver op
vhost-vdpa: clean iotlb map during reset for older userspace
vdpa: introduce .compat_reset operation callback
vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
vhost-vdpa: reset vendor specific mapping to initial state in .release
vdpa: introduce .reset_map operation callback
virtio_pci: add check for common cfg size
virtio-blk: fix implicit overflow on virtio_max_dma_size
virtio_pci: add build offset check for the new common cfg items
virtio: add definition of VIRTIO_F_NOTIF_CONFIG_DATA feature bit
vduse: make vduse_class constant
vhost-scsi: Spelling s/preceeding/preceding/g
virtio: kdoc for struct virtio_pci_modern_device
vdpa: Update sysfs ABI documentation
MAINTAINERS: Add myself as mlx5_vdpa driver
virtio-balloon: correct the comment of virtballoon_migratepage()
mlx5_vdpa: offer VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
vdpa/mlx5: Update cvq iotlb mapping on ASID change
vdpa/mlx5: Make iotlb helper functions more generic
...
The usual collection of patches:
- Bug fixes for hns, mlx5, and hfi1
- Hardening patches for size_*, counted_by, strscpy
- rts fixes from static analysis
- Dump SRQ objects in rdma netlink, with hns support
- Fix a performance regression in mlx5 MR deregistration
- New XDR (200Gb/lane) link speed
- SRQ record doorbell latency optimization for hns
- IPSEC support for mlx5 multi-port mode
- ibv_rereg_mr() support for irdma
- Affiliated event support for bnxt_re
- Opt out for the spec compliant qkey security enforcement as we
discovered SW that breaks under enforcement
- Comment and trivial updates
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZUEvXwAKCRCFwuHvBreF
YQ/iAPkBl7vCH+oFGJKnh9/pUjKrWwRYBkGNgNesH7lX4Q/cIAEA+R3bAI2d4O4i
tMrRRowTnxohbVRebuy4Pfsoo6m9dA4=
=TxMJ
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Nothing exciting this cycle, most of the diffstat is changing SPDX
'or' to 'OR'.
Summary:
- Bugfixes for hns, mlx5, and hfi1
- Hardening patches for size_*, counted_by, strscpy
- rts fixes from static analysis
- Dump SRQ objects in rdma netlink, with hns support
- Fix a performance regression in mlx5 MR deregistration
- New XDR (200Gb/lane) link speed
- SRQ record doorbell latency optimization for hns
- IPSEC support for mlx5 multi-port mode
- ibv_rereg_mr() support for irdma
- Affiliated event support for bnxt_re
- Opt out for the spec compliant qkey security enforcement as we
discovered SW that breaks under enforcement
- Comment and trivial updates"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (50 commits)
IB/mlx5: Fix init stage error handling to avoid double free of same QP and UAF
RDMA/mlx5: Fix mkey cache WQ flush
RDMA/hfi1: Workaround truncation compilation error
IB/hfi1: Fix potential deadlock on &irq_src_lock and &dd->uctxt_lock
RDMA/core: Remove NULL check before dev_{put, hold}
RDMA/hfi1: Remove redundant assignment to pointer ppd
RDMA/mlx5: Change the key being sent for MPV device affiliation
RDMA/bnxt_re: Fix clang -Wimplicit-fallthrough in bnxt_re_handle_cq_async_error()
RDMA/hns: Fix init failure of RoCE VF and HIP08
RDMA/hns: Fix unnecessary port_num transition in HW stats allocation
RDMA/hns: The UD mode can only be configured with DCQCN
RDMA/hns: Add check for SL
RDMA/hns: Fix signed-unsigned mixed comparisons
RDMA/hns: Fix uninitialized ucmd in hns_roce_create_qp_common()
RDMA/hns: Fix printing level of asynchronous events
RDMA/core: Add support to set privileged QKEY parameter
RDMA/bnxt_re: Do not report SRQ error in srq notification
RDMA/bnxt_re: Report async events and errors
RDMA/bnxt_re: Update HW interface headers
IB/mlx5: Fix rdma counter binding for RAW QP
...
Vq descriptor mappings are supported in hardware by filling in an
additional mkey which contains the descriptor mappings to the hw vq.
A previous patch in this series added support for hw mkey (mr) creation
for ASID 1.
This patch fills in both the vq data and vq descriptor mkeys based on
group ASID mapping.
The feature is signaled to the vdpa core through the presence of the
.get_vq_desc_group op.
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20231018171456.1624030-16-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Tested-by: Si-Wei Liu <si-wei.liu@oracle.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Commit 2ac9cfe782 ("net/mlx5e: IPSec, Add Innova IPSec offload TX data path")
declared mlx5e_ipsec_inverse_table_init() but never implemented it.
Commit f52f2faee5 ("net/mlx5e: Introduce flow steering API")
declared mlx5e_fs_set_tc() but never implemented it.
Commit f2f3df5501 ("net/mlx5: EQ, Privatize eq_table and friends")
declared mlx5_eq_comp_cpumask() but never implemented it.
Commit cac1eb2cf2 ("net/mlx5: Lag, properly lock eswitch if needed")
removed mlx5_lag_update() but not its declaration.
Commit 35ba005d82 ("net/mlx5: DR, Set flex parser for TNL_MPLS dynamically")
removed mlx5dr_ste_build_tnl_mpls() but not its declaration.
Commit e126ba97db ("mlx5: Add driver for Mellanox Connect-IB adapters")
declared but never implemented mlx5_alloc_cmd_mailbox_chain() and mlx5_free_cmd_mailbox_chain().
Commit 0cf53c1247 ("net/mlx5: FWPage, Use async events chain")
removed mlx5_core_req_pages_handler() but not its declaration.
Commit 938fe83c8d ("net/mlx5_core: New device capabilities handling")
removed mlx5_query_odp_caps() but not its declaration.
Commit f6a8a19bb1 ("RDMA/netdev: Hoist alloc_netdev_mqs out of the driver")
removed mlx5_rdma_netdev_alloc() but not its declaration.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
LAG peer device lookout bus logic required the usage of global lock,
mlx5_intf_mutex.
As part of the effort to remove this global lock, refactor LAG peer
device lookout to use mlx5 devcom layer.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
At present, mlx5 driver have a general purpose
event handler which not only handles vhca event
but also many other events. This incurs a huge
bottleneck because the event handler is
implemented by single threaded workqueue and all
events are forced to be handled in serial manner
even though application tries to create multiple
SFs simultaneously.
Introduce a dedicated vhca event handler which
manages SFs parallel creation.
Signed-off-by: Wei Zhang <weizhang@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Leon Romanovsky says:
====================
This PR is collected from
https://lore.kernel.org/all/cover.1695296682.git.leon@kernel.org
This series from Patrisious extends mlx5 to support IPsec packet offload
in multiport devices (MPV, see [1] for more details).
These devices have single flow steering logic and two netdev interfaces,
which require extra logic to manage IPsec configurations as they performed
on netdevs.
[1] https://lore.kernel.org/linux-rdma/20180104152544.28919-1-leon@kernel.org/
* 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Handle IPsec steering upon master unbind/bind
net/mlx5: Configure IPsec steering for ingress RoCEv2 MPV traffic
net/mlx5: Configure IPsec steering for egress RoCEv2 MPV traffic
net/mlx5: Add create alias flow table function to ipsec roce
net/mlx5: Implement alias object allow and create functions
net/mlx5: Add alias flow table bits
net/mlx5: Store devcom pointer inside IPsec RoCE
net/mlx5: Register mlx5e priv to devcom in MPV mode
RDMA/mlx5: Send events from IB driver about device affiliation state
net/mlx5: Introduce ifc bits for migration in a chunk mode
====================
Link: https://lore.kernel.org/r/20231002083832.19746-1-leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This series from Patrisious extends mlx5 to support IPsec packet offload
in multiport devices (MPV, see [1] for more details).
These devices have single flow steering logic and two netdev interfaces,
which require extra logic to manage IPsec configurations as they performed
on netdevs.
Thanks
[1] https://lore.kernel.org/linux-rdma/20180104152544.28919-1-leon@kernel.org/
Link: https://lore.kernel.org/all/20231002083832.19746-1-leon@kernel.org
Signed-of-by: Leon Romanovsky <leon@kernel.org>
* mlx5-next: (576 commits)
net/mlx5: Handle IPsec steering upon master unbind/bind
net/mlx5: Configure IPsec steering for ingress RoCEv2 MPV traffic
net/mlx5: Configure IPsec steering for egress RoCEv2 MPV traffic
net/mlx5: Add create alias flow table function to ipsec roce
net/mlx5: Implement alias object allow and create functions
net/mlx5: Add alias flow table bits
net/mlx5: Store devcom pointer inside IPsec RoCE
net/mlx5: Register mlx5e priv to devcom in MPV mode
RDMA/mlx5: Send events from IB driver about device affiliation state
net/mlx5: Introduce ifc bits for migration in a chunk mode
Linux 6.6-rc3
...
Necessary for improved live migration flow. Actual support will be added
in a downstream patch.
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://lore.kernel.org/r/20230928164550.980832-3-dtatulea@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add all the capabilities needed to check for alias object support.
As well as all the fields or commands needed for its creation and
the creation of flow table that is able to jump to an alias object.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/544c030f2a78c4adf3fe6b64f97a39cc1bbdabb9.1695296682.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Send blocking events from IB driver whenever the device is done being
affiliated or if it is removed from an affiliation.
This is useful since now the EN driver can register to those event and
know when a device is affiliated or not.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/a7491c3e483cfd8d962f5f75b9a25f253043384a.1695296682.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Introduce ifc related stuff to enable migration in a chunk mode.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-2-yishaih@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add a check for 800G_8X speed when querying PTYS and report it back
correctly when needed.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/26fd0b6e1fac071c3eb779657bb3d8ba47f47c4f.1695204156.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add new health error syndrome to indicate that pci data poisoned error
has been received while fetching device ICM data.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In order to have mcast offloads the driver needs the following:
It should know if that mcast comes from wire port, in addition the flow
should not be marked as any specific source, that way it will give the
flexibility for the driver not to be depended on the way iterator
implemented in the FW.
Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Benefit from the existence of internal mlx5 notifier and extend it by
event MLX5_DRIVER_EVENT_SF_PEER_DEVLINK. Use this event from SF
auxiliary device probe/remove functions to pass the registered SF
devlink instance to the SF representor.
Process the new event in SF representor code and call
devl_port_fn_devlink_set() to do the assignments. Implement this in work
to avoid possible deadlock when probe/remove function of SF may be
called with devlink instance lock held during devlink reload.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement SyncE support using newly introduced DPLL support.
Make sure that each PFs/VFs/SFs probed with appropriate capability
will spawn a dpll auxiliary device and register appropriate dpll device
and pin instances.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlx5 HW can't perform IPsec offload operation simultaneously both on PF
and VFs at the same time. While the previous patches added devlink knobs
to change IPsec capabilities dynamically, there is a need to add a logic
to block such IPsec capabilities for the cases when IPsec is already
configured.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230825062836.103744-7-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add hardware definitions to allow to control IPSec capabilities.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230825062836.103744-6-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Leon Romanovsky says:
====================
mlx5 MACsec RoCEv2 support
From Patrisious:
This series extends previously added MACsec offload support
to cover RoCE traffic either.
In order to achieve that, we need configure MACsec with offload between
the two endpoints, like below:
REMOTE_MAC=10:70:fd:43:71:c0
* ip addr add 1.1.1.1/16 dev eth2
* ip link set dev eth2 up
* ip link add link eth2 macsec0 type macsec encrypt on
* ip macsec offload macsec0 mac
* ip macsec add macsec0 tx sa 0 pn 1 on key 00 dffafc8d7b9a43d5b9a3dfbbf6a30c16
* ip macsec add macsec0 rx port 1 address $REMOTE_MAC
* ip macsec add macsec0 rx port 1 address $REMOTE_MAC sa 0 pn 1 on key 01 ead3664f508eb06c40ac7104cdae4ce5
* ip addr add 10.1.0.1/16 dev macsec0
* ip link set dev macsec0 up
And in a similar manner on the other machine, while noting the keys order
would be reversed and the MAC address of the other machine.
RDMA traffic is separated through relevant GID entries and in case
of IP ambiguity issue - meaning we have a physical GIDs and a MACsec
GIDs with the same IP/GID, we disable our physical GID in order
to force the user to only use the MACsec GID.
v0: https://lore.kernel.org/netdev/20230813064703.574082-1-leon@kernel.org/
* 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
RDMA/mlx5: Handles RoCE MACsec steering rules addition and deletion
net/mlx5: Add RoCE MACsec steering infrastructure in core
net/mlx5: Configure MACsec steering for ingress RoCEv2 traffic
net/mlx5: Configure MACsec steering for egress RoCEv2 traffic
IB/core: Reorder GID delete code for RoCE
net/mlx5: Add MACsec priorities in RDMA namespaces
RDMA/mlx5: Implement MACsec gid addition and deletion
net/mlx5: Maintain fs_id xarray per MACsec device inside macsec steering
net/mlx5: Remove netdevice from MACsec steering
net/mlx5e: Move MACsec flow steering and statistics database from ethernet to core
net/mlx5e: Rename MACsec flow steering functions/parameters to suit core naming style
net/mlx5: Remove dependency of macsec flow steering on ethernet
net/mlx5e: Move MACsec flow steering operations to be used as core library
macsec: add functions to get macsec real netdevice and check offload
====================
Link: https://lore.kernel.org/r/20230821073833.59042-1-leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add RoCE MACsec rules when a gid is added for the MACsec netdevice and
handle their cleanup when the gid is removed or the MACsec SA is deleted.
Also support alias IP for the MACsec device, as long as we don't have
more ips than what the gid table can hold.
In addition handle the case where a gid is added but there are still no
SAs added for the MACsec device, so the rules are added later on when
the SAs are added.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Adds all the core steering helper functions that are needed in order
to setup RoCE steering rules which includes both the RX and TX rules
addition and deletion.
As well as exporting the function to be ready to use from the IB driver
where we expose functions to allow deletion of all rules, which is
needed when a GID is deleted, or a deletion of a specific rule when an SA
is deleted, and a similar manner for the rules addition.
These functions are used in a later patch by IB driver to trigger the
rules addition/deletion when needed.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Add MACsec flow steering priorities in RDMA namespaces. This allows
adding tables/rules to forward RoCEv2 traffic to the MACsec crypto
tables in NIC_TX domain, and accept RoCEv2 traffic from NIC_RX domain.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Handle MACsec IP ambiguity issue, since mlx5 hw can't support
programming both the MACsec and the physical gid when they have the same
IP address, because it wouldn't know to whom to steer the traffic.
Hence in such case we delete the physical gid from the hw gid table,
which would then cause all traffic sent over it to fail, and we'll only
be able to send traffic over the MACsec gid.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Since now MACsec flow steering (macsec_fs) and MACsec statistics (stats)
are maintained by the core driver, move their data as well to be saved
inside core structures instead of staying part of ethernet MACsec database.
In addition cleanup all MACsec stats functions from the ethernet MACsec
code and move what's needed to be part of macsec_fs instead.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Each device cap has two modes: MAX and CUR. The driver maintains a
cache of both modes of the capabilities. For most device caps, the MAX
cap mode is never used.
Hence, remove all driver queries of the MAX mode of the said caps as
well as their helper MACROs.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
mlx5 driver queries the device for VECTOR_CALC and SHAMPO caps, but
there isn't any user who requires them.
As well as, MLX5_MCAM_REGS_0x9080_0x90FF is queried but not used.
Thus, drop all usages and definitions of the mentioned caps above.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Even if the PF driver had no error on his part of the sync reset flow,
the firmware can see wider picture as it syncs all the PFs in the flow.
So add at end of sync reset flow check with firmware by reading MFRL
register and initialization segment that the flow had no issue from
firmware point of view too.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Expose NIC temperature by implementing hwmon kernel API, which turns
current thermal zone kernel API to redundant.
For each one of the supported and exposed thermal diode sensors, expose
the following attributes:
1) Input temperature.
2) Highest temperature.
3) Temperature label:
Depends on the firmware capability, if firmware doesn't support
sensors naming, the fallback naming convention would be: "sensorX",
where X is the HW spec (MTMP register) sensor index.
4) Temperature critical max value:
refers to the high threshold of Warning Event. Will be exposed as
`tempY_crit` hwmon attribute (RO attribute). For example for
ConnectX5 HCA's this temperature value will be 105 Celsius, 10
degrees lower than the HW shutdown temperature).
5) Temperature reset history: resets highest temperature.
For example, for dualport ConnectX5 NIC with a single IC thermal diode
sensor will have 2 hwmon directories (one for each PCI function)
under "/sys/class/hwmon/hwmon[X,Y]".
Listing one of the directories above (hwmonX/Y) generates the
corresponding output below:
$ grep -H -d skip . /sys/class/hwmon/hwmon0/*
Output
=======================================================================
/sys/class/hwmon/hwmon0/name:mlx5
/sys/class/hwmon/hwmon0/temp1_crit:105000
/sys/class/hwmon/hwmon0/temp1_highest:48000
/sys/class/hwmon/hwmon0/temp1_input:46000
/sys/class/hwmon/hwmon0/temp1_label:asic
grep: /sys/class/hwmon/hwmon0/temp1_reset_history: Permission denied
In addition, displaying the sensors data via lm_sensors generates the
corresponding output below:
$ sensors
Output
=======================================================================
mlx5-pci-0800
Adapter: PCI adapter
asic: +46.0°C (crit = +105.0°C, highest = +48.0°C)
mlx5-pci-0801
Adapter: PCI adapter
asic: +46.0°C (crit = +105.0°C, highest = +48.0°C)
CC: Jean Delvare <jdelvare@suse.com>
Signed-off-by: Adham Faris <afaris@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230807180507.22984-3-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit enables the dynamic allocation of EQs at runtime, allowing
for more flexibility in managing completion EQs and reducing the memory
overhead of driver load. Whenever a CQ is created for a given vector
index, the driver will lookup to see if there is an already mapped
completion EQ for that vector, if so, utilize it. Otherwise, allocate a
new EQ on demand and then utilize it for the CQ completion events.
Add a protection lock to the EQ table to protect from concurrent EQ
creation attempts.
While at it, replace mlx5_vector2irqn()/mlx5_vector2eqn() with
mlx5_comp_eqn_get() and mlx5_comp_irqn_get() which will allocate an
EQ on demand if no EQ is found for the given vector.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
To accurately represent its purpose, rename the function that retrieves
the value of maximum vectors from mlx5_comp_vectors_count() to
mlx5_comp_vectors_max().
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, once driver load completes, IRQ requests were performed for all
vectors. However, as we move to support dynamic creation of EQs, this will
not be the case as some IRQs will not exist at this stage. Thus, in such
case, use the default CPU to IRQ mapping which is the serial mapping based
on IRQ vector index. Meaning, the n'th vector gets mapped to the n'th CPU.
Introduce an API function mlx5_comp_vector_cpu() that takes an IRQ index and
provides the corresponding CPU mapping. It utilizes the existing IRQ
affinity if defined, or resorts to the default serialized CPU mapping
otherwise.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
For IPsec packet offload mode, the order of TC offload and IPsec
offload on the same netdevice is not aligned with the order in the
non-offload software. For example, for RX, the software performs TC
first and then IPsec transformation, but the implementation for
offload does that in the opposite way.
To resolve the difference for now, either IPsec offload or TC offload,
not both, is allowed for a specific interface.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/8e2e5e3b0984d785066e8663aaf97b3ba1bb873f.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The IPsec encryption is done at the last, so add new prio for IPsec
offload in FDB, and put it just lower than the slow path prio and
higher than the per-vport prio.
Three levels are added for TX. The first one is for ip xfrm policy.
The sa table is created in the second level for ip xfrm state. The
status table is created at the last to count the number of packets
encrypted.
The rules, which forward packets to uplink, are changed to forward
them to IPsec TX tables first. These rules are restored after those
tables are destroyed, which is done immediately when there is no
reference to them, just as what does in legacy mode. The support for
slow path is added here, by refreshing uplink's channels. But, the
handling for TC fast path, which is more complicated, will be added
later. Besides, reg c4 is used instead to match reqid.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/cfd0e6ffaf0b8c55ebaa9fb0649b7c504b6b8ec6.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reuse tun opts bits in reg c1, to pass IPsec obj id to datapath.
As this is only for RX SA and there are only 11 bits, xarray is used
to map IPsec obj id to an index, which is between 1 and 0x7ff, and
replace obj id to write to reg c1.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/43d60fbcc9cd672a97d7e2a2f7fe6a3d9e9a776d.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As decryption must be done first, add new prio for IPsec offload in
FDB, and put it just lower than BYPASS prio and higher than TC prio.
Three levels are added for RX. The first one is for ip xfrm policy. SA
table is created in the second level for ip xfrm state. The status
table is created in the last to check the decryption result. If
success, packets continue with the next process, or dropped otherwise.
For now, the set of reg c1 is removed for swtichdev mode, and the
datapath process will be added in the next patch.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/c91063554cf643fb50b99cf093e8a9bf11729de5.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Command stats is an array with more than 2K entries, which amounts to
~180KB. This is way more than actually needed, as only ~190 entries
are being used.
Therefore, replace the array with xarray.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Downstream patch will split mlx5_cmd_init() to probe and reload
routines. As a preparation, organize mlx5_cmd struct so that any
field that will be used in the reload routine are grouped at new
nested struct.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Update devcom infrastructure to be more generic, without
depending on max supported ports definition or a device guid,
and also more encapsulated so callers don't need to pass
the register devcom component id per event call.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Provide an ability to check if flow steering supports UDP
encapsulation and decapsulation of IPsec ESP packets.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This cycle saw a focus on rxe and bnxt_re drivers:
- Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma
- rxe uses workqueues instead of tasklets
- rxe has better compliance around access checks for MRs and rereg_mr
- mana supportst he 'v2' FW interface for RX coalescing
- hfi1 bug fix for stale cache entries in its MR cache
- mlx5 buf fix to handle FW failures when destroying QPs
- erdma HW has a new doorbell allocation mechanism for uverbs that is
secure
- Lots of small cleanups and rework in bnxt_re
* Use the common mmap functions
* Support disassociation
* Improve FW command flow
- bnxt_re support for "low latency push", this allows a packet
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZJxq+wAKCRCFwuHvBreF
YSN+AQDKAmWdKCqmc3boMIq5wt4h7yYzdW47LpzGarOn5Hf+UgEA6mpPJyRqB43C
CNXYIbASl/LLaWzFvxCq/AYp6tzuog4=
=I8ju
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This cycle saw a focus on rxe and bnxt_re drivers:
- Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma
- rxe uses workqueues instead of tasklets
- rxe has better compliance around access checks for MRs and rereg_mr
- mana supportst he 'v2' FW interface for RX coalescing
- hfi1 bug fix for stale cache entries in its MR cache
- mlx5 buf fix to handle FW failures when destroying QPs
- erdma HW has a new doorbell allocation mechanism for uverbs that is
secure
- Lots of small cleanups and rework in bnxt_re:
- Use the common mmap functions
- Support disassociation
- Improve FW command flow
- support for 'low latency push'"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits)
RDMA/bnxt_re: Fix an IS_ERR() vs NULL check
RDMA/bnxt_re: Fix spelling mistake "priviledged" -> "privileged"
RDMA/bnxt_re: Remove duplicated include in bnxt_re/main.c
RDMA/bnxt_re: Refactor code around bnxt_qplib_map_rc()
RDMA/bnxt_re: Remove incorrect return check from slow path
RDMA/bnxt_re: Enable low latency push
RDMA/bnxt_re: Reorg the bar mapping
RDMA/bnxt_re: Move the interface version to chip context structure
RDMA/bnxt_re: Query function capabilities from firmware
RDMA/bnxt_re: Optimize the bnxt_re_init_hwrm_hdr usage
RDMA/bnxt_re: Add disassociate ucontext support
RDMA/bnxt_re: Use the common mmap helper functions
RDMA/bnxt_re: Initialize opcode while sending message
RDMA/cma: Remove NULL check before dev_{put, hold}
RDMA/rxe: Simplify cq->notify code
RDMA/rxe: Fixes mr access supported list
RDMA/bnxt_re: optimize the parameters passed to helper functions
RDMA/bnxt_re: remove redundant cmdq_bitmap
RDMA/bnxt_re: use firmware provided max request timeout
RDMA/bnxt_re: cancel all control path command waiters upon error
...
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmSYzfYeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/ucH/iOM/1Py/fSg0qSs
7NJ4XXlourT5zrnRMom3cm3d9gYqgTzgvKFL3kjMEexTRVYbhlcO4ZPRsiry8zxF
ToGX+V8tDMqb8WSdFHzkljRY+zDRyfEUDMlTzROAD9DunLmQtkJKyrggkeGdjkpP
OyfGqKpwlLXZRAXBil/U8Mx9MHdjJubloZwghLZr33VdUZa68+JJ9l6w163Oe/ET
K264NM0wxN/kvN57JvePgqMccQwpINylg8IhRI+XelgczjUXeJBsOA8TDv4bDN4Q
bjCLhkWbIaZtTYqvOXa/kD0T8wd7KETsMBQN8YzyDh6W0GmAlJjTawyAhA6jA5in
x3uz2W8=
=L3zp
-----END PGP SIGNATURE-----
Merge tag 'v6.4' into rdma.git for-next
Linux 6.4
Resolve conflicts between rdma rc and next in rxe_cq matching linux-next:
drivers/infiniband/sw/rxe/rxe_cq.c:
https://lore.kernel.org/r/20230622115246.365d30ad@canb.auug.org.au
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
A member of struct mlx5_ifc_cmd_hca_cap_bits has been mistakenly
assigned the wrong reserved_at offset value. Correct it to align to the
right value, thus avoid future miscalculation.
Signed-off-by: Lama Kayal <lkayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add needed HW bits for querying local loopback counter and the
HCA capability for it.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Added a new event handler to firmware sync reset, which is used to
support firmware sync reset flow on smart NIC. Adding this new stage to
the flow enables the firmware to ensure host PFs unload before ECPFs
unload, to avoid race of PFs recovery.
If firmware sends sync_reset_unload event to driver the driver should
unload and close all HW resources of the function. Once the driver
finishes unloading part, it can't get any more events from firmware as
event queues are closed, so it polls the reset state field to know when
to continue to next stage of the sync reset flow.
Added capability bit for supporting sync_reset_unload event.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Expose new timoueout in Default Timeouts Register to be used on sync
reset flow running on smart NIC. In this flow the driver should know how
much time to wait from getting unload request till firmware will ask the
PF to continue to next stage of the flow.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Many rc bug fixes:
- Two rtrs bug fixes for error unwind bugs
- Several rxe bug fixes:
* Incorrect Rx packet validation
* Using memory without a refcount
* Syzkaller found use before initialization
* Regression fix for missing locking with the tasklet conversion from
this merge window
- Have bnxt report the correct link properties to userspace, this was
a regression in v6.3
- Several mlx5 bug fixes:
* Kernel crash triggerable by userspace for the RAW ethernet profile
* Defend against steering refcounting issues created by userspace
* Incorrect change of QP port affinity parameters in some LAG configurations
- Fix mlx5 Q counters:
* Do not over allocate Q counters to allow userspace to use the full
port capacity
* Kernel crash triggered by eswitch due to mis-use of Q counters
* Incorrect mlx5_device for Q counters in some LAG configurations
- Properly implement the IBA spec restricting privileged qkeys to root
- Always an error when reading from a disassociated device's event queue
- isert bug fixes:
* Avoid a deadlock with the CM handler and CM ID destruction
* Correct list corruption due to incorrect locking
* Fix a use after free around connection tear down
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZIoZ7gAKCRCFwuHvBreF
YUSyAPsH+VdqzYd1z/bk0zMds9JR2oqMOR8l02e4gq8K9hjTJAD/ePEuT5PTpFZu
FrT4SDhOfA30XNz+RofRNisJC92OOAo=
=KyF7
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"This is an unusually large bunch of bug fixes for the later rc cycle,
rxe and mlx5 both dumped a lot of things at once. rxe continues to fix
itself, and mlx5 is fixing a bunch of "queue counters" related bugs.
There is one highly notable bug fix regarding the qkey. This small
security check was missed in the original 2005 implementation and it
allows some significant issues.
Summary:
- Two rtrs bug fixes for error unwind bugs
- Several rxe bug fixes:
* Incorrect Rx packet validation
* Using memory without a refcount
* Syzkaller found use before initialization
* Regression fix for missing locking with the tasklet conversion
from this merge window
- Have bnxt report the correct link properties to userspace, this was
a regression in v6.3
- Several mlx5 bug fixes:
* Kernel crash triggerable by userspace for the RAW ethernet
profile
* Defend against steering refcounting issues created by userspace
* Incorrect change of QP port affinity parameters in some LAG
configurations
- Fix mlx5 Q counters:
* Do not over allocate Q counters to allow userspace to use the
full port capacity
* Kernel crash triggered by eswitch due to mis-use of Q counters
* Incorrect mlx5_device for Q counters in some LAG configurations
- Properly implement the IBA spec restricting privileged qkeys to
root
- Always an error when reading from a disassociated device's event
queue
- isert bug fixes:
* Avoid a deadlock with the CM handler and CM ID destruction
* Correct list corruption due to incorrect locking
* Fix a use after free around connection tear down"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/rxe: Fix rxe_cq_post
IB/isert: Fix incorrect release of isert connection
IB/isert: Fix possible list corruption in CMA handler
IB/isert: Fix dead lock in ib_isert
RDMA/mlx5: Fix affinity assignment
IB/uverbs: Fix to consider event queue closing also upon non-blocking mode
RDMA/uverbs: Restrict usage of privileged QKEYs
RDMA/cma: Always set static rate to 0 for RoCE
RDMA/mlx5: Fix Q-counters query in LAG mode
RDMA/mlx5: Remove vport Q-counters dependency on normal Q-counters
RDMA/mlx5: Fix Q-counters per vport allocation
RDMA/mlx5: Create an indirect flow table for steering anchor
RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions
RDMA/rxe: Fix the use-before-initialization error of resp_pkts
RDMA/bnxt_re: Fix reporting active_{speed,width} attributes
RDMA/rxe: Fix ref count error in check_rkey()
RDMA/rxe: Fix packet length checks
RDMA/rtrs: Fix rxe_dealloc_pd warning
RDMA/rtrs: Fix the last iu->buf leak in err path
The cited commit aimed to ensure that Virtual Functions (VFs) assign a
queue affinity to a Queue Pair (QP) to distribute traffic when
the LAG master creates a hardware LAG. If the affinity was set while
the hardware was not in LAG, the firmware would ignore the affinity value.
However, this commit unintentionally assigned an affinity to QPs on the LAG
master's VPORT even if the RDMA device was not marked as LAG-enabled.
In most cases, this was not an issue because when the hardware entered
hardware LAG configuration, the RDMA device of the LAG master would be
destroyed and a new one would be created, marked as LAG-enabled.
The problem arises when a user configures Equal-Cost Multipath (ECMP).
In ECMP mode, traffic can be directed to different physical ports based on
the queue affinity, which is intended for use by VPORTS other than the
E-Switch manager. ECMP mode is supported only if both E-Switch managers are
in switchdev mode and the appropriate route is configured via IP. In this
configuration, the RDMA device is not destroyed, and we retain the RDMA
device that is not marked as LAG-enabled.
To ensure correct behavior, Send Queues (SQs) opened by the E-Switch
manager through verbs should be assigned strict affinity. This means they
will only be able to communicate through the native physical port
associated with the E-Switch manager. This will prevent the firmware from
assigning affinity and will not allow the SQs to be remapped in case of
failover.
Fixes: 802dcc7fc5 ("RDMA/mlx5: Support TX port affinity for VF drivers in LAG mode")
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/425b05f4da840bc684b0f7e8ebf61aeb5cef09b0.1685960567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When the embedded cpu supports SRIOV it can be enabled and disabled
independently from the host SRIOV. Track the pages separately so we can
properly wait for returned VF pages.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
These functions are for query/set by vport, there was an underlying
assumption that vport was equal to function ID. That's not the case for
EC VF functions. Set the ec_vf_function bit accordingly.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Enable creation of a devlink port for EC VF vports.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add ec_vf_vport_base to HCA Capabilities 2. This indicates the base vport
of embedded CPU virtual functions that are connected to the eswitch.
Add ec_vf_function to query/set_hca_caps. If set this indicates
accessing a virtual function on the embedded CPU by function ID. This
should only be used with other_function set to 1.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: Bodong Wang <bodong@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add generated_pkt_steering_fail and handled_pkt_steering_fail to devlink
heatlth reporter.
generated_pkt_steering_fail indicates the number of packets dropped due to
illegal steering operation within the vport steering domain.
handled_pkt_steering_fail indicates the number of packets dropped due to
illegal steering operation, originated by the vport.
Also, update devlink reporter functionality documentation with the newly
exposed counters.
Signed-off-by: Lama Kayal <lkayal@nvidia.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Introduce a generic APIs to iterate over all the devices which are part
of the LAG. This API replace mlx5_lag_get_peer_mdev() which retrieve
only a single peer device from the lag.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Don't query the firmware so many times (num rqs * num wqes * wqe frags)
because it slows down linearly the interface creation time when the
product is larger. Do it only once per mdev and store the result in
mlx5e_param.
Due to helper function being called from different files, move it to
an appropriate location. Rename the function with a proper prefix and
add a small cleanup.
This fix applies only for legacy rq.
Fixes: 1b1e486883 ("net/mlx5e: Use query_special_contexts for mkeys")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
SW Steering uses RC QP for writing STEs to ICM. This writingis done in LB
(loopback), and FL (force-loopback) QP is preferred for performance. FL is
available when RoCE is enabled or disabled based on RoCE caps.
This patch adds reading of FL capability from HCA caps in addition to the
existing reading from RoCE caps, thus fixing the case where we didn't
have loopback enabled when RoCE was disabled.
Fixes: 7304d603a5 ("net/mlx5: DR, Add support for force-loopback QP")
Signed-off-by: Itamar Gozlan <igozlan@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Usual wide collection of unrelated items in drivers:
- Driver bug fixes and treewide cleanups in hfi1, siw, qib, mlx5, rxe,
usnic, usnic, bnxt_re, ocrdma, iser
* Unnecessary NULL checks
* kmap obsolescence
* pci_enable_pcie_error_reporting() obsolescence
* Unused variables and macros
* trace event related warnings
* casting warnings
- Code cleanups for irdm and erdma
- EFA reporting of 128 byte PCIe TLP support
- mlx5 more agressively uses the out of order HW feature
- Big rework of how state machines and tasks work in rxe
- Fix a syzkaller found crash netdev refcount leak in siw
- bnxt_re revises their HW description header
- Congestion control for bnxt_re
- Use mmu_notifiers more safely in hfi1
- mlx5 gets better support for PCIe relaxed ordering inside VMs
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZEva5wAKCRCFwuHvBreF
YZFmAQC9T3b/XQ3bRknYciuzbatC98o9xB0FTqmEFYGj+Y2lVAD9EEVe3HKfHfi3
t/GxXYB5r22oxg5bgsblZfEdEdTVCg8=
=akMm
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Usual wide collection of unrelated items in drivers:
- Driver bug fixes and treewide cleanups in hfi1, siw, qib, mlx5,
rxe, usnic, usnic, bnxt_re, ocrdma, iser:
- remove unnecessary NULL checks
- kmap obsolescence
- pci_enable_pcie_error_reporting() obsolescence
- unused variables and macros
- trace event related warnings
- casting warnings
- Code cleanups for irdm and erdma
- EFA reporting of 128 byte PCIe TLP support
- mlx5 more agressively uses the out of order HW feature
- Big rework of how state machines and tasks work in rxe
- Fix a syzkaller found crash netdev refcount leak in siw
- bnxt_re revises their HW description header
- Congestion control for bnxt_re
- Use mmu_notifiers more safely in hfi1
- mlx5 gets better support for PCIe relaxed ordering inside VMs"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (81 commits)
RDMA/efa: Add rdma write capability to device caps
RDMA/mlx5: Use correct device num_ports when modify DC
RDMA/irdma: Drop spurious WQ_UNBOUND from alloc_ordered_workqueue() call
RDMA/rxe: Fix spinlock recursion deadlock on requester
RDMA/mlx5: Fix flow counter query via DEVX
RDMA/rxe: Protect QP state with qp->state_lock
RDMA/rxe: Move code to check if drained to subroutine
RDMA/rxe: Remove qp->req.state
RDMA/rxe: Remove qp->comp.state
RDMA/rxe: Remove qp->resp.state
RDMA/mlx5: Allow relaxed ordering read in VFs and VMs
net/mlx5: Update relaxed ordering read HCA capabilities
RDMA/mlx5: Check pcie_relaxed_ordering_enabled() in UMR
RDMA/mlx5: Remove pcie_relaxed_ordering_enabled() check for RO write
RDMA: Add ib_virt_dma_to_page()
RDMA/rxe: Fix the error "trying to register non-static key in rxe_cleanup_task"
RDMA/irdma: Slightly optimize irdma_form_ah_cm_frame()
RDMA/rxe: Fix incorrect TASKLET_STATE_SCHED check in rxe_task.c
IB/hfi1: Place struct mmu_rb_handler on cache line start
IB/hfi1: Fix bugs with non-PAGE_SIZE-end multi-iovec user SDMA requests
...
1) Dragos Improves RX page pool, and provides some fixes to his previous series:
1.1) Fix releasing page_pool for striding RQ and legacy RQ nonlinear case
1.2) Hook NAPIs to page pools to gain more performance.
2) From Roi, Some cleanups to TC and eswitch modules.
3) Maher migrates vnic diagnostic counters reporting from debugfs to a
dedicated devlink health reporter
Maher Says:
===========
net/mlx5: Expose vnic diagnostic counters using devlink
Currently, vnic diagnostic counters are exposed through the following
debugfs:
$ ls /sys/kernel/debug/mlx5/0000:08:00.0/esw/vf_0/vnic_diag/
cq_overrun
quota_exceeded_command
total_q_under_processor_handle
invalid_command
send_queue_priority_update_flow
nic_receive_steering_discard
The current design does not allow the hypervisor to view the diagnostic
counters of its VFs, in case the VFs get bound to a VM. In other words,
the counters are not exposed for representor interfaces.
Furthermore, the debugfs design is inconvenient future-wise, in case more
counters need to be reported by the driver in the future.
As these counters pertain to vNIC health, it is more appropriate to
utilize the devlink health reporter to expose them.
Thus, this patchest includes the following changes:
* Drop the current vnic diagnostic counters debugfs interface.
* Add a vnic devlink health reporter for PFs/VFs core devices, which
when diagnosed will dump vnic diagnostic counter values that are
queried from FW.
* Add a vnic devlink health reporter for the representor interface, which
serves the same purpose listed in the previous point, in addition to
allowing the hypervisor to view its VFs diagnostic counters, even when
the VFs are bounded to external VMs.
Example of devlink health reporter usage is:
$devlink health diagnose pci/0000:08:00.0 reporter vnic
vNIC env counters:
total_error_queues: 0 send_queue_priority_update_flow: 0
comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
invalid_command: 0 quota_exceeded_command: 0
nic_receive_steering_discard: 0
===========
4) SW steering fixes and improvements
Yevgeny Kliteynik Says:
=======================
These short patch series are just small fixes / improvements for
SW steering:
- Patch 1: Fix dumping of legacy modify_hdr in debug dump to
align to what is expected by parser
- Patch 2: Have separate threshold for ICM sync per ICM type
- Patch 3: Add more info to the steering debug dump - Linux
version and device name
- Patch 4: Keep track of number of buddies that are currently
in use per domain per buddy type
=======================
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmRB6HcACgkQSD+KveBX
+j4+nAf/cCJ7poDQQOp/Oug6H3xn/COYPv3Iasl0dhT8yu+LlNMQTF0bRrU4UpAQ
aIKcja2biOBAnD96EhA1nJoo9bJUTtKLokUDDyK/xRHS+wIyr8Lia6vxTz1yjj3C
jDqX3+ZP4rFuhAvh+92AT1I0JvS0g+ocokPVKmm+Pwf4y7sG69CZ7phVGSc0iFfT
y+gnP4C6cdIr7kNLByeeX6alDHL/q83vfNFWrugRPna2uXjcSR5Gtp03pJ0OVkI5
qHxG7Bz0BE/hcMYwNcNVTu/5e02+5PS6B8kN/ho5DkhVFhp+h17XWWvOMZxC3jfI
k0ijRSWocG1jIgRgkioBRgzXmc6nNw==
=jUe8
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2023-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-04-20
1) Dragos Improves RX page pool, and provides some fixes to his previous
series:
1.1) Fix releasing page_pool for striding RQ and legacy RQ nonlinear case
1.2) Hook NAPIs to page pools to gain more performance.
2) From Roi, Some cleanups to TC and eswitch modules.
3) Maher migrates vnic diagnostic counters reporting from debugfs to a
dedicated devlink health reporter
Maher Says:
===========
net/mlx5: Expose vnic diagnostic counters using devlink
Currently, vnic diagnostic counters are exposed through the following
debugfs:
$ ls /sys/kernel/debug/mlx5/0000:08:00.0/esw/vf_0/vnic_diag/
cq_overrun
quota_exceeded_command
total_q_under_processor_handle
invalid_command
send_queue_priority_update_flow
nic_receive_steering_discard
The current design does not allow the hypervisor to view the diagnostic
counters of its VFs, in case the VFs get bound to a VM. In other words,
the counters are not exposed for representor interfaces.
Furthermore, the debugfs design is inconvenient future-wise, in case more
counters need to be reported by the driver in the future.
As these counters pertain to vNIC health, it is more appropriate to
utilize the devlink health reporter to expose them.
Thus, this patchest includes the following changes:
* Drop the current vnic diagnostic counters debugfs interface.
* Add a vnic devlink health reporter for PFs/VFs core devices, which
when diagnosed will dump vnic diagnostic counter values that are
queried from FW.
* Add a vnic devlink health reporter for the representor interface, which
serves the same purpose listed in the previous point, in addition to
allowing the hypervisor to view its VFs diagnostic counters, even when
the VFs are bounded to external VMs.
Example of devlink health reporter usage is:
$devlink health diagnose pci/0000:08:00.0 reporter vnic
vNIC env counters:
total_error_queues: 0 send_queue_priority_update_flow: 0
comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
invalid_command: 0 quota_exceeded_command: 0
nic_receive_steering_discard: 0
===========
4) SW steering fixes and improvements
Yevgeny Kliteynik Says:
=======================
These short patch series are just small fixes / improvements for
SW steering:
- Patch 1: Fix dumping of legacy modify_hdr in debug dump to
align to what is expected by parser
- Patch 2: Have separate threshold for ICM sync per ICM type
- Patch 3: Add more info to the steering debug dump - Linux
version and device name
- Patch 4: Keep track of number of buddies that are currently
in use per domain per buddy type
=======================
* tag 'mlx5-updates-2023-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Update op_mode to op_mod for port selection
net/mlx5: E-Switch, Remove unused mlx5_esw_offloads_vport_metadata_set()
net/mlx5: E-Switch, Remove redundant dev arg from mlx5_esw_vport_alloc()
net/mlx5: Include linux/pci.h for pci_msix_can_alloc_dyn()
net/mlx5e: RX, Hook NAPIs to page pools
net/mlx5e: RX, Fix XDP_TX page release for legacy rq nonlinear case
net/mlx5e: RX, Fix releasing page_pool pages twice for striding RQ
net/mlx5e: Add vnic devlink health reporter to representors
net/mlx5: Add vnic devlink health reporter to PFs/VFs
Revert "net/mlx5: Expose vnic diagnostic counters for eswitch managed vports"
Revert "net/mlx5: Expose steering dropped packets counter"
net/mlx5: DR, Add memory statistics for domain object
net/mlx5: DR, Add more info in domain dbg dump
net/mlx5: DR, Calculate sync threshold of each pool according to its type
net/mlx5: DR, Fix dumping of legacy modify_hdr in debug dump
====================
Link: https://lore.kernel.org/r/20230421013850.349646-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To be consistent with the other enum keys use OP_MOD
instead of OP_MODE.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Create a vnic devlink health reporter for PFs/VFs interfaces.
The reporter's diagnose callback displays the values of vNIC/vport
transport debug counters of PFs/VFs, as follows:
$ devlink health diagnose pci/0000:08:00.0 reporter vnic
vNIC env counters:
total_error_queues: 0 send_queue_priority_update_flow: 0
comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
invalid_command: 0 quota_exceeded_command: 0
nic_receive_steering_discard: 0
Moreover, add documentation on the reporter functionality and the
counters description.
While at it, expose the vNIC counters diagnose function to be used by
the downstream patch, which will reveal the counters for representor
interfaces.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Commit cited in "fixes" tag added bulk support for flow counters but it
didn't account that's also possible to query a counter using a non-base id
if the counter was allocated as bulk.
When a user performs a query, validate the flow counter id given in the
mailbox is inside the valid range taking bulk value into account.
Fixes: 208d70f562 ("IB/mlx5: Support flow counters offset for bulk counters")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://lore.kernel.org/r/79d7fbe291690128e44672418934256254d93115.1681377114.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Rename existing HCA capability relaxed_ordering_read to
relaxed_ordering_read_pci_enabled. This is in accordance with recent PRM
change to better describe the capability, as it's set only if both the
device supports relaxed ordering (RO) read and RO is enabled in PCI
config space.
In addition, add new HCA capability relaxed_ordering_read which is set
if the device supports RO read, regardless of RO in PCI config space.
This will be used in the following patch to allow RO in VFs and VMs.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Link: https://lore.kernel.org/r/caa0002fd8135086357dfcc368e2f5cc73b08480.1681131553.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
1) Vlad adds the support for linux bridge multicast offload support
Patches #1 through #9
Synopsis
Vlad Says:
==============
Implement support of bridge multicast offload in mlx5. Handle port object
attribute SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED notification to toggle multicast
offload and bridge snooping support on bridge. Handle port object
SWITCHDEV_OBJ_ID_PORT_MDB notification to attach a bridge port to MDB.
Steering architecture
Existing offload infrastructure relies on two levels of flow tables - bridge
ingress and egress. For multicast offload the architecture is extended with
additional layer of per-port multicast replication tables. Such tables filter
loopback traffic (so packets are not replicated to their source port) and pop
VLAN headers for "untagged" VLANs. The tables are referenced by the MDB rules in
egress table. MDB egress rule can point to multiple per-port multicast tables,
which causes matching multicast traffic to be replicated to all of them, and,
consecutively, to several bridge ports:
+--------+--+
+---------------------------------------> Port 1 | |
| +-^------+--+
| |
| |
+-----------------------------------------+ | +---------------------------+ |
| EGRESS table | | +--> PORT 1 multicast table | |
+----------------------------------+ +-----------------------------------------+ | | +---------------------------+ |
| INGRESS table | | | | | | | |
+----------------------------------+ | dst_mac=P1,vlan=X -> pop vlan, goto P1 +--+ | | FG0: | |
| | | dst_mac=P1,vlan=Y -> pop vlan, goto P1 | | | src_port=dst_port -> drop | |
| src_mac=M1,vlan=X -> goto egress +---> dst_mac=P2,vlan=X -> pop vlan, goto P2 +--+ | | FG1: | |
| ... | | dst_mac=P2,vlan=Y -> goto P2 | | | | VLAN X -> pop, goto port | |
| | | dst_mac=MDB1,vlan=Y -> goto mcast P1,P2 +-----+ | ... | |
+----------------------------------+ | | | | | VLAN Y -> pop, goto port +-------+
+-----------------------------------------+ | | | FG3: |
| | | matchall -> goto port |
| | | |
| | +---------------------------+
| |
| |
| | +--------+--+
+---------------------------------------> Port 2 | |
| +-^------+--+
| |
| |
| +---------------------------+ |
+--> PORT 2 multicast table | |
+---------------------------+ |
| | |
| FG0: | |
| src_port=dst_port -> drop | |
| FG1: | |
| VLAN X -> pop, goto port | |
| ... | |
| | |
| FG3: | |
| matchall -> goto port +-------+
| |
+---------------------------+
Patches overview:
- Patch 1 adds hardware definition bits for capabilities required to replicate
multicast packets to multiple per-port tables. These bits are used by
following patches to only attempt multicast offload if firmware and hardware
provide necessary support.
- Pathces 2-4 patches are preparations and refactoring.
- Patch 5 implements necessary infrastructure to toggle multicast offload
via SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED port object attribute notification.
This also enabled IGMP and MLD snooping.
- Patch 6 implements per-port multicast replication tables. It only supports
filtering of loopback packets.
- Patch 7 extends per-port multicast tables with VLAN pop support for 'untagged'
VLANs.
- Patch 8 handles SWITCHDEV_OBJ_ID_PORT_MDB port object notifications. It
creates MDB replication rules in egress table that can replicate packets to
multiple per-port multicast tables.
- Patch 9 adds tracepoints for MDB events.
==============
2) Parav Create a new allocation profile for SFs, to save on memory
3) Yevgeny provides some initial patches for upcoming software steering
support new pattern/arguments type of modify_header actions.
Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
The current modify_header object allows for having only limited number of
these FW objects, which means that we are limited in the number of offloaded
flows that require modify_header action.
As a preparation Yevgeny provides the following 4 patches:
- Patch 1: Add required mlx5_ifc HW bits
- Patch 2, 3: Add new WQE type and opcode that is required for pattern/arg
support and adds appropriate support in dr_send.c
- Patch 4: Add ICM pool for modify-header-pattern objects and implement
patterns cache, allowing patterns reuse for different flows
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmQ2LDIACgkQSD+KveBX
+j59UAf+NfLEq7i/gYOri1rLwXOrgPd13w64Oob79+pKVXzRqik/gBg4d0CKdc2X
Qafntudus7IvVrGJCM0vurw7nC75H9BMFh3dQW2u3uaR9m6i0k4tSKIAoJUFHMdj
dMgAJywYkCCXlBbGVhRE3yybal2EXfOSJ+Fr2tGeqpL4vRlg4kIaVuFGcGk2gMpq
gOahX6wtuop3ABzY3sqKCIQ1Q2jWx9AWWz+lEmw7HJGmHGugscYrgSIloHDnu1fz
Bt6uODn7GFr866rQr9+JemG0VyK8ahyisEFnX1oJ3642ZidPcDrbIrrtMWeEqo7X
1tcmY/wNti87xFbg3gL+DZekhHTHoQ==
=oIN8
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-04-11
1) Vlad adds the support for linux bridge multicast offload support
Patches #1 through #9
Synopsis
Vlad Says:
==============
Implement support of bridge multicast offload in mlx5. Handle port object
attribute SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED notification to toggle multicast
offload and bridge snooping support on bridge. Handle port object
SWITCHDEV_OBJ_ID_PORT_MDB notification to attach a bridge port to MDB.
Steering architecture
Existing offload infrastructure relies on two levels of flow tables - bridge
ingress and egress. For multicast offload the architecture is extended with
additional layer of per-port multicast replication tables. Such tables filter
loopback traffic (so packets are not replicated to their source port) and pop
VLAN headers for "untagged" VLANs. The tables are referenced by the MDB rules in
egress table. MDB egress rule can point to multiple per-port multicast tables,
which causes matching multicast traffic to be replicated to all of them, and,
consecutively, to several bridge ports:
+--------+--+
+---------------------------------------> Port 1 | |
| +-^------+--+
| |
| |
+-----------------------------------------+ | +---------------------------+ |
| EGRESS table | | +--> PORT 1 multicast table | |
+----------------------------------+ +-----------------------------------------+ | | +---------------------------+ |
| INGRESS table | | | | | | | |
+----------------------------------+ | dst_mac=P1,vlan=X -> pop vlan, goto P1 +--+ | | FG0: | |
| | | dst_mac=P1,vlan=Y -> pop vlan, goto P1 | | | src_port=dst_port -> drop | |
| src_mac=M1,vlan=X -> goto egress +---> dst_mac=P2,vlan=X -> pop vlan, goto P2 +--+ | | FG1: | |
| ... | | dst_mac=P2,vlan=Y -> goto P2 | | | | VLAN X -> pop, goto port | |
| | | dst_mac=MDB1,vlan=Y -> goto mcast P1,P2 +-----+ | ... | |
+----------------------------------+ | | | | | VLAN Y -> pop, goto port +-------+
+-----------------------------------------+ | | | FG3: |
| | | matchall -> goto port |
| | | |
| | +---------------------------+
| |
| |
| | +--------+--+
+---------------------------------------> Port 2 | |
| +-^------+--+
| |
| |
| +---------------------------+ |
+--> PORT 2 multicast table | |
+---------------------------+ |
| | |
| FG0: | |
| src_port=dst_port -> drop | |
| FG1: | |
| VLAN X -> pop, goto port | |
| ... | |
| | |
| FG3: | |
| matchall -> goto port +-------+
| |
+---------------------------+
Patches overview:
- Patch 1 adds hardware definition bits for capabilities required to replicate
multicast packets to multiple per-port tables. These bits are used by
following patches to only attempt multicast offload if firmware and hardware
provide necessary support.
- Pathces 2-4 patches are preparations and refactoring.
- Patch 5 implements necessary infrastructure to toggle multicast offload
via SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED port object attribute notification.
This also enabled IGMP and MLD snooping.
- Patch 6 implements per-port multicast replication tables. It only supports
filtering of loopback packets.
- Patch 7 extends per-port multicast tables with VLAN pop support for 'untagged'
VLANs.
- Patch 8 handles SWITCHDEV_OBJ_ID_PORT_MDB port object notifications. It
creates MDB replication rules in egress table that can replicate packets to
multiple per-port multicast tables.
- Patch 9 adds tracepoints for MDB events.
==============
2) Parav Create a new allocation profile for SFs, to save on memory
3) Yevgeny provides some initial patches for upcoming software steering
support new pattern/arguments type of modify_header actions.
Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
The current modify_header object allows for having only limited number of
these FW objects, which means that we are limited in the number of offloaded
flows that require modify_header action.
As a preparation Yevgeny provides the following 4 patches:
- Patch 1: Add required mlx5_ifc HW bits
- Patch 2, 3: Add new WQE type and opcode that is required for pattern/arg
support and adds appropriate support in dr_send.c
- Patch 4: Add ICM pool for modify-header-pattern objects and implement
patterns cache, allowing patterns reuse for different flows
* tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: DR, Add modify-header-pattern ICM pool
net/mlx5: DR, Prepare sending new WQE type
net/mlx5: Add new WQE for updating flow table
net/mlx5: Add mlx5_ifc bits for modify header argument
net/mlx5: DR, Set counter ID on the last STE for STEv1 TX
net/mlx5: Create a new profile for SFs
net/mlx5: Bridge, add tracepoints for multicast
net/mlx5: Bridge, implement mdb offload
net/mlx5: Bridge, support multicast VLAN pop
net/mlx5: Bridge, add per-port multicast replication tables
net/mlx5: Bridge, snoop igmp/mld packets
net/mlx5: Bridge, extract code to lookup parent bridge of port
net/mlx5: Bridge, move additional data structures to priv header
net/mlx5: Bridge, increase bridge tables sizes
net/mlx5: Add mlx5_ifc definitions for bridge multicast support
====================
Link: https://lore.kernel.org/r/20230412040752.14220-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type
via mapping table.
The mlx5 hardware can also identify and RSS hash IPSEC. This indicate
hash includes SPI (Security Parameters Index) as part of IPSEC hash.
Extend xdp core enum xdp_rss_hash_type with IPSEC hash type.
Fixes: bc8d405b1b ("net/mlx5e: Support RX XDP metadata")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132892548.340624.11185734579430124869.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add new WQE type: FLOW_TBL_ACCESS, which will be used for
writing modify header arguments.
This type has specific control segment and special data segment.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add enum value for modify-header argument object and mlx5_bits
for the related capabilities.
Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Create a new profile for SFs in order to disable the command cache.
Each function command cache consumes ~500KB of memory, when using a
large number of SFs this savings is notable on memory constarined
systems.
Use a new profile to provide for future differences between SFs and PFs.
The mr_cache not used for non-PF functions, so it is excluded from the
new profile.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Bodong Wang <bodong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
These new fields in QUERY_Q_COUNTER command allow us to access
another vport counters during the query command, which is specially
useful to query representor vports.
In addition also add the required caps to check if this capability
is actually supported.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/75c73a4a0e60f18c37b35a4a11ca2e2415e4a6f3.1679566038.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
mlx5 dynamic msix
This patch series adds support for dynamic msix vectors allocation in mlx5.
Eli Cohen Says:
================
The following series of patches modifies mlx5_core to work with the
dynamic MSIX API. Currently, mlx5_core allocates all the interrupt
vectors it needs and distributes them amongst the consumers. With the
introduction of dynamic MSIX support, which allows for allocation of
interrupts more than once, we now allocate vectors as we need them.
This allows other drivers running on top of mlx5_core to allocate
interrupt vectors for their own use. An example for this is mlx5_vdpa,
which uses these vectors to propagate interrupts directly from the
hardware to the vCPU [1].
As a preparation for using this series, a use after free issue is fixed
in lib/cpu_rmap.c and the allocator for rmap entries has been modified.
A complementary API for irq_cpu_rmap_add() has also been introduced.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/patch/?id=0f2bf1fcae96a83b8c5581854713c9fc3407556e
================
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmQeTIUACgkQSD+KveBX
+j7oCQgAx9yNHM4BZD2UfIx/P+W13v1B+xOds04Vezl9JlakoqvviPxm3vvuKkl+
j/8DdyoqMUbWV0j5XxgZ+GG91bc14jN1GQ+4fUf63SzA99vAGb9GJPV2aQt5roGh
JmMqI2utDfoz+29qtQ+kVchY5AN5AoPXSQH2zkEZmJaPUjYb9Dr/4IayL0JaViAw
S31QLHKkSJ8bL8Wc6Op1emNVV7eXs18f7IIjVs3sYOb3WJRPVpmdKneRqLgVYplf
Td40Gwobl1elpjEqSSRTJI5YUSR8gcAJlBqIwHeJzFFpO3Pnciopl761osNKKs/a
5ctES5DS6JHqqFGbWV1gKYcRMil3LA==
=9i8l
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-03-20
mlx5 dynamic msix
This patch series adds support for dynamic msix vectors allocation in mlx5.
Eli Cohen Says:
================
The following series of patches modifies mlx5_core to work with the
dynamic MSIX API. Currently, mlx5_core allocates all the interrupt
vectors it needs and distributes them amongst the consumers. With the
introduction of dynamic MSIX support, which allows for allocation of
interrupts more than once, we now allocate vectors as we need them.
This allows other drivers running on top of mlx5_core to allocate
interrupt vectors for their own use. An example for this is mlx5_vdpa,
which uses these vectors to propagate interrupts directly from the
hardware to the vCPU [1].
As a preparation for using this series, a use after free issue is fixed
in lib/cpu_rmap.c and the allocator for rmap entries has been modified.
A complementary API for irq_cpu_rmap_add() has also been introduced.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/patch/?id=0f2bf1fcae96a83b8c5581854713c9fc3407556e
================
* tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Provide external API for allocating vectors
net/mlx5: Use one completion vector if eth is disabled
net/mlx5: Refactor calculation of required completion vectors
net/mlx5: Move devlink registration before mlx5_load
net/mlx5: Use dynamic msix vectors allocation
net/mlx5: Refactor completion irq request/release code
net/mlx5: Improve naming of pci function vectors
net/mlx5: Use newer affinity descriptor
net/mlx5: Modify struct mlx5_irq to use struct msi_map
net/mlx5: Fix wrong comment
net/mlx5e: Coding style fix, add empty line
lib: cpu_rmap: Add irq_cpu_rmap_remove to complement irq_cpu_rmap_add
lib: cpu_rmap: Use allocator for rmap entries
lib: cpu_rmap: Avoid use after free on rmap->obj array entries
====================
Link: https://lore.kernel.org/r/20230324231341.29808-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Provide external API to be used by other drivers relying on mlx5_core,
for allocating MSIX vectors. An example for such a driver would be
mlx5_vdpa.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Downstream patches require devlink params to access the PTYS register,
move the needed functions from mlx5e to the core layer.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230314054234.267365-11-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Implement thermal zone support for mlx5 based HW. The NIC
uses temperature sensor provided by ASIC to report current temperature
to thermal core.
Signed-off-by: Sandipan Patra <spatra@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230314054234.267365-5-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Small cycle this time:
- Minor driver updates for hfi1, cxgb4, erdma, hns, irdma, mlx5, siw, mana
- inline CQE support for hns
- Have mlx5 display device error codes
- Pinned DMABUF support for irdma
- Continued rxe cleanups, particularly converting the MRs to use xarray
- Improvements to what can be cached in the mlx5 mkey cache
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCY/gPmgAKCRCFwuHvBreF
YW5IAP4xOAiTif4f87vD1twRU/ebq4VEX0r+C2NX5x5fwlCJrAEA7RLV8uG9Uii2
ez0BuWNxfajuvFHntnZ1E+7UDP0S8gk=
=CgUH
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Quite a small cycle this time, even with the rc8. I suppose everyone
went to sleep over xmas.
- Minor driver updates for hfi1, cxgb4, erdma, hns, irdma, mlx5, siw,
mana
- inline CQE support for hns
- Have mlx5 display device error codes
- Pinned DMABUF support for irdma
- Continued rxe cleanups, particularly converting the MRs to use
xarray
- Improvements to what can be cached in the mlx5 mkey cache"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (61 commits)
IB/mlx5: Extend debug control for CC parameters
IB/hfi1: Fix sdma.h tx->num_descs off-by-one errors
IB/hfi1: Fix math bugs in hfi1_can_pin_pages()
RDMA/irdma: Add support for dmabuf pin memory regions
RDMA/mlx5: Use query_special_contexts for mkeys
net/mlx5e: Use query_special_contexts for mkeys
net/mlx5: Change define name for 0x100 lkey value
net/mlx5: Expose bits for querying special mkeys
RDMA/rxe: Fix missing memory barriers in rxe_queue.h
RDMA/mana_ib: Fix a bug when the PF indicates more entries for registering memory on first packet
RDMA/rxe: Remove rxe_alloc()
RDMA/cma: Distinguish between sockaddr_in and sockaddr_in6 by size
Subject: RDMA/rxe: Handle zero length rdma
iw_cxgb4: Fix potential NULL dereference in c4iw_fill_res_cm_id_entry()
RDMA/mlx5: Use rdma_umem_for_each_dma_block()
RDMA/umem: Remove unused 'work' member from struct ib_umem
RDMA/irdma: Cap MSIX used to online CPUs + 1
RDMA/mlx5: Check reg_create() create for errors
RDMA/restrack: Correct spelling
RDMA/cxgb4: Fix potential null-ptr-deref in pass_establish()
...
This patch adds rtt_resp_dscp to the current debug controllability of
congestion control (CC) parameters.
rtt_resp_dscp can be read or written through debugfs.
If set, its value overwrites the DSCP of the generated RTT response.
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://lore.kernel.org/r/1dcc3440ee53c688f19f579a051ded81a2aaa70a.1676538714.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Synchronize the shared mlx5 branch with net:
- From Jiri: fixe a deadlock in mlx5_ib's netdev notifier unregister.
- From Mark and Patrisious: add IPsec RoCEv2 support.
- From Or: Rely on firmware to get special mkeys
* branch mlx5-next:
RDMA/mlx5: Use query_special_contexts for mkeys
net/mlx5e: Use query_special_contexts for mkeys
net/mlx5: Change define name for 0x100 lkey value
net/mlx5: Expose bits for querying special mkeys
net/mlx5: Configure IPsec steering for egress RoCEv2 traffic
net/mlx5: Configure IPsec steering for ingress RoCEv2 traffic
net/mlx5: Add IPSec priorities in RDMA namespaces
net/mlx5: Implement new destination type TABLE_TYPE
net/mlx5: Introduce new destination type TABLE_TYPE
RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister
net/mlx5e: Propagate an internal event in case uplink netdev changes
net/mlx5e: Fix trap event handling
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Change define of 0x100 lkey value from MLX5_INVALID_LKEY to be
MLX5_TERMINATE_SCATTER_LIST_LKEY as 0x100 is the value of
terminate_scatter_list_mkey.
Link: https://lore.kernel.org/r/3a116dc3fbae4cb6b76a63d27d418830b06ade0c.1673960981.git.leon@kernel.org
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add IPSec flow steering priorities in RDMA namespaces. This allows
adding tables/rules to forward RoCEv2 traffic to the IPSec crypto
tables in NIC_TX domain, and accept RoCEv2 traffic from NIC_RX domain.
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Implement new destination type to support flow transition between
different table types.
e.g. from NIC_RX to RDMA_RX or from RDMA_TX to NIC_TX.
The new destination is described in the tracepoint as follows:
"mlx5_fs_add_rule: rule=00000000d53cd0ed fte=0000000048a8a6ed index=0 sw_action=<> [dst] flow_table_type=7 id:262152"
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This new destination type supports flow transition between different
table types, e.g. from NIC_RX to RDMA_RX or from RDMA_TX to NIC_TX.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
No need to have dl_port which is tightly coupled with mlx5e code
in mlx5 core code. Move it to struct mlx5e_dev and loose
mlx5e_devlink_get_dl_port() helper.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In MultiPort E-Switch mode a single RDMA is created. This device has multiple
RDMA ports that represent the uplink ports that are connected to the E-Switch.
Account for this when creating the RDMA device so it has an additional port for
the non native uplink.
As a side effect of this patch, use shared fdb in multiport eswitch mode.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In a follow-up commit multiport eswitch mode will use a shared fdb.
In shared fdb there is a single eswitch fdb and traffic could come from any
port. to distinguish between the ports set a different metadata per uplink port.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Extend the action stats callback implementation to update stats for actions
that are associated with hw counters.
Note that the callback may be called from tc action utility or from tc
flower. Both apis expect the driver to return the stats difference from
the last update. As such, query the raw counter value and maintain
the diff from the last api call in the tc layer, instead of the fs_core
layer.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This series from Jiri solves a deadlock when removing a network namespace
with mlx5 devlink instance being in it.
The deadlock is between:
1) mlx5_ib->unregister_netdevice_notifier()
AND
2) mlx5_core->devlink_reload->cleanup_net()
To slove this introduced mlx5 netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmPkeZAACgkQSD+KveBX
+j6Digf/fTtMmV2I2GwKQJCza4+MAP8Nt9tKInj3x02AoNVXNwHupL72HWZiaKnB
YGvPAwjDvxPy2Ok1BsHJLyEOTZpZse8QtS/Sjzk00lovtOYzCwLdJfBrNnVRS5KV
Cz/dNtlQcpsAoErFSfmvraLhn7tMNrHMTDahzaNalDkO3wZYXUh+2VDwnXErQy+3
1HI9m2pGy8hQ3sNQTNhqcyY4mp1Qw3nTVIkE8c9E5TJcawVkk4xqlgQuT43nqcn5
H+CTXJTFyUMNkF8kNPTMvMoYfTYWhBqbZKuf+YDyQKwdf5IZyc1kuRIaqJNs5VjU
mUtwKHMk5apKLbE8rvmZlg/+geTlJA==
=Aqcc
-----END PGP SIGNATURE-----
Merge tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Saeed Mahameed says:
====================
mlx5-next-netdev-deadlock
This series from Jiri solves a deadlock when removing a network namespace
with mlx5 devlink instance being in it.
The deadlock is between:
1) mlx5_ib->unregister_netdevice_notifier()
AND
2) mlx5_core->devlink_reload->cleanup_net()
To slove this introduced mlx5 netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.
* tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister
net/mlx5e: Propagate an internal event in case uplink netdev changes
net/mlx5e: Fix trap event handling
net/mlx5: Introduce CQE error syndrome
====================
Link: https://lore.kernel.org/r/20230208005626.72930-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Whenever uplink netdev is set/cleared, propagate newly introduced event
to inform notifier blocks netdev was added/removed.
Move the set() helper to core.c from header, introduce clear() and
netdev_added_event_replay() helpers. The last one is going to be called
from rdma driver, so export it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>