Commit 2ac9cfe782 ("net/mlx5e: IPSec, Add Innova IPSec offload TX data path")
declared mlx5e_ipsec_inverse_table_init() but never implemented it.
Commit f52f2faee5 ("net/mlx5e: Introduce flow steering API")
declared mlx5e_fs_set_tc() but never implemented it.
Commit f2f3df5501 ("net/mlx5: EQ, Privatize eq_table and friends")
declared mlx5_eq_comp_cpumask() but never implemented it.
Commit cac1eb2cf2 ("net/mlx5: Lag, properly lock eswitch if needed")
removed mlx5_lag_update() but not its declaration.
Commit 35ba005d82 ("net/mlx5: DR, Set flex parser for TNL_MPLS dynamically")
removed mlx5dr_ste_build_tnl_mpls() but not its declaration.
Commit e126ba97db ("mlx5: Add driver for Mellanox Connect-IB adapters")
declared but never implemented mlx5_alloc_cmd_mailbox_chain() and mlx5_free_cmd_mailbox_chain().
Commit 0cf53c1247 ("net/mlx5: FWPage, Use async events chain")
removed mlx5_core_req_pages_handler() but not its declaration.
Commit 938fe83c8d ("net/mlx5_core: New device capabilities handling")
removed mlx5_query_odp_caps() but not its declaration.
Commit f6a8a19bb1 ("RDMA/netdev: Hoist alloc_netdev_mqs out of the driver")
removed mlx5_rdma_netdev_alloc() but not its declaration.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
LAG peer device lookout bus logic required the usage of global lock,
mlx5_intf_mutex.
As part of the effort to remove this global lock, refactor LAG peer
device lookout to use mlx5 devcom layer.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
At present, mlx5 driver have a general purpose
event handler which not only handles vhca event
but also many other events. This incurs a huge
bottleneck because the event handler is
implemented by single threaded workqueue and all
events are forced to be handled in serial manner
even though application tries to create multiple
SFs simultaneously.
Introduce a dedicated vhca event handler which
manages SFs parallel creation.
Signed-off-by: Wei Zhang <weizhang@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Leon Romanovsky says:
====================
This PR is collected from
https://lore.kernel.org/all/cover.1695296682.git.leon@kernel.org
This series from Patrisious extends mlx5 to support IPsec packet offload
in multiport devices (MPV, see [1] for more details).
These devices have single flow steering logic and two netdev interfaces,
which require extra logic to manage IPsec configurations as they performed
on netdevs.
[1] https://lore.kernel.org/linux-rdma/20180104152544.28919-1-leon@kernel.org/
* 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Handle IPsec steering upon master unbind/bind
net/mlx5: Configure IPsec steering for ingress RoCEv2 MPV traffic
net/mlx5: Configure IPsec steering for egress RoCEv2 MPV traffic
net/mlx5: Add create alias flow table function to ipsec roce
net/mlx5: Implement alias object allow and create functions
net/mlx5: Add alias flow table bits
net/mlx5: Store devcom pointer inside IPsec RoCE
net/mlx5: Register mlx5e priv to devcom in MPV mode
RDMA/mlx5: Send events from IB driver about device affiliation state
net/mlx5: Introduce ifc bits for migration in a chunk mode
====================
Link: https://lore.kernel.org/r/20231002083832.19746-1-leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This series from Patrisious extends mlx5 to support IPsec packet offload
in multiport devices (MPV, see [1] for more details).
These devices have single flow steering logic and two netdev interfaces,
which require extra logic to manage IPsec configurations as they performed
on netdevs.
Thanks
[1] https://lore.kernel.org/linux-rdma/20180104152544.28919-1-leon@kernel.org/
Link: https://lore.kernel.org/all/20231002083832.19746-1-leon@kernel.org
Signed-of-by: Leon Romanovsky <leon@kernel.org>
* mlx5-next: (576 commits)
net/mlx5: Handle IPsec steering upon master unbind/bind
net/mlx5: Configure IPsec steering for ingress RoCEv2 MPV traffic
net/mlx5: Configure IPsec steering for egress RoCEv2 MPV traffic
net/mlx5: Add create alias flow table function to ipsec roce
net/mlx5: Implement alias object allow and create functions
net/mlx5: Add alias flow table bits
net/mlx5: Store devcom pointer inside IPsec RoCE
net/mlx5: Register mlx5e priv to devcom in MPV mode
RDMA/mlx5: Send events from IB driver about device affiliation state
net/mlx5: Introduce ifc bits for migration in a chunk mode
Linux 6.6-rc3
...
Necessary for improved live migration flow. Actual support will be added
in a downstream patch.
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://lore.kernel.org/r/20230928164550.980832-3-dtatulea@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add all the capabilities needed to check for alias object support.
As well as all the fields or commands needed for its creation and
the creation of flow table that is able to jump to an alias object.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/544c030f2a78c4adf3fe6b64f97a39cc1bbdabb9.1695296682.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Send blocking events from IB driver whenever the device is done being
affiliated or if it is removed from an affiliation.
This is useful since now the EN driver can register to those event and
know when a device is affiliated or not.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/a7491c3e483cfd8d962f5f75b9a25f253043384a.1695296682.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Introduce ifc related stuff to enable migration in a chunk mode.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-2-yishaih@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add a check for 800G_8X speed when querying PTYS and report it back
correctly when needed.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/26fd0b6e1fac071c3eb779657bb3d8ba47f47c4f.1695204156.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add new health error syndrome to indicate that pci data poisoned error
has been received while fetching device ICM data.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In order to have mcast offloads the driver needs the following:
It should know if that mcast comes from wire port, in addition the flow
should not be marked as any specific source, that way it will give the
flexibility for the driver not to be depended on the way iterator
implemented in the FW.
Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Benefit from the existence of internal mlx5 notifier and extend it by
event MLX5_DRIVER_EVENT_SF_PEER_DEVLINK. Use this event from SF
auxiliary device probe/remove functions to pass the registered SF
devlink instance to the SF representor.
Process the new event in SF representor code and call
devl_port_fn_devlink_set() to do the assignments. Implement this in work
to avoid possible deadlock when probe/remove function of SF may be
called with devlink instance lock held during devlink reload.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement SyncE support using newly introduced DPLL support.
Make sure that each PFs/VFs/SFs probed with appropriate capability
will spawn a dpll auxiliary device and register appropriate dpll device
and pin instances.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlx5 HW can't perform IPsec offload operation simultaneously both on PF
and VFs at the same time. While the previous patches added devlink knobs
to change IPsec capabilities dynamically, there is a need to add a logic
to block such IPsec capabilities for the cases when IPsec is already
configured.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230825062836.103744-7-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add hardware definitions to allow to control IPSec capabilities.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230825062836.103744-6-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Leon Romanovsky says:
====================
mlx5 MACsec RoCEv2 support
From Patrisious:
This series extends previously added MACsec offload support
to cover RoCE traffic either.
In order to achieve that, we need configure MACsec with offload between
the two endpoints, like below:
REMOTE_MAC=10:70:fd:43:71:c0
* ip addr add 1.1.1.1/16 dev eth2
* ip link set dev eth2 up
* ip link add link eth2 macsec0 type macsec encrypt on
* ip macsec offload macsec0 mac
* ip macsec add macsec0 tx sa 0 pn 1 on key 00 dffafc8d7b9a43d5b9a3dfbbf6a30c16
* ip macsec add macsec0 rx port 1 address $REMOTE_MAC
* ip macsec add macsec0 rx port 1 address $REMOTE_MAC sa 0 pn 1 on key 01 ead3664f508eb06c40ac7104cdae4ce5
* ip addr add 10.1.0.1/16 dev macsec0
* ip link set dev macsec0 up
And in a similar manner on the other machine, while noting the keys order
would be reversed and the MAC address of the other machine.
RDMA traffic is separated through relevant GID entries and in case
of IP ambiguity issue - meaning we have a physical GIDs and a MACsec
GIDs with the same IP/GID, we disable our physical GID in order
to force the user to only use the MACsec GID.
v0: https://lore.kernel.org/netdev/20230813064703.574082-1-leon@kernel.org/
* 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
RDMA/mlx5: Handles RoCE MACsec steering rules addition and deletion
net/mlx5: Add RoCE MACsec steering infrastructure in core
net/mlx5: Configure MACsec steering for ingress RoCEv2 traffic
net/mlx5: Configure MACsec steering for egress RoCEv2 traffic
IB/core: Reorder GID delete code for RoCE
net/mlx5: Add MACsec priorities in RDMA namespaces
RDMA/mlx5: Implement MACsec gid addition and deletion
net/mlx5: Maintain fs_id xarray per MACsec device inside macsec steering
net/mlx5: Remove netdevice from MACsec steering
net/mlx5e: Move MACsec flow steering and statistics database from ethernet to core
net/mlx5e: Rename MACsec flow steering functions/parameters to suit core naming style
net/mlx5: Remove dependency of macsec flow steering on ethernet
net/mlx5e: Move MACsec flow steering operations to be used as core library
macsec: add functions to get macsec real netdevice and check offload
====================
Link: https://lore.kernel.org/r/20230821073833.59042-1-leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add RoCE MACsec rules when a gid is added for the MACsec netdevice and
handle their cleanup when the gid is removed or the MACsec SA is deleted.
Also support alias IP for the MACsec device, as long as we don't have
more ips than what the gid table can hold.
In addition handle the case where a gid is added but there are still no
SAs added for the MACsec device, so the rules are added later on when
the SAs are added.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Adds all the core steering helper functions that are needed in order
to setup RoCE steering rules which includes both the RX and TX rules
addition and deletion.
As well as exporting the function to be ready to use from the IB driver
where we expose functions to allow deletion of all rules, which is
needed when a GID is deleted, or a deletion of a specific rule when an SA
is deleted, and a similar manner for the rules addition.
These functions are used in a later patch by IB driver to trigger the
rules addition/deletion when needed.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Add MACsec flow steering priorities in RDMA namespaces. This allows
adding tables/rules to forward RoCEv2 traffic to the MACsec crypto
tables in NIC_TX domain, and accept RoCEv2 traffic from NIC_RX domain.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Handle MACsec IP ambiguity issue, since mlx5 hw can't support
programming both the MACsec and the physical gid when they have the same
IP address, because it wouldn't know to whom to steer the traffic.
Hence in such case we delete the physical gid from the hw gid table,
which would then cause all traffic sent over it to fail, and we'll only
be able to send traffic over the MACsec gid.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Since now MACsec flow steering (macsec_fs) and MACsec statistics (stats)
are maintained by the core driver, move their data as well to be saved
inside core structures instead of staying part of ethernet MACsec database.
In addition cleanup all MACsec stats functions from the ethernet MACsec
code and move what's needed to be part of macsec_fs instead.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Each device cap has two modes: MAX and CUR. The driver maintains a
cache of both modes of the capabilities. For most device caps, the MAX
cap mode is never used.
Hence, remove all driver queries of the MAX mode of the said caps as
well as their helper MACROs.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
mlx5 driver queries the device for VECTOR_CALC and SHAMPO caps, but
there isn't any user who requires them.
As well as, MLX5_MCAM_REGS_0x9080_0x90FF is queried but not used.
Thus, drop all usages and definitions of the mentioned caps above.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Even if the PF driver had no error on his part of the sync reset flow,
the firmware can see wider picture as it syncs all the PFs in the flow.
So add at end of sync reset flow check with firmware by reading MFRL
register and initialization segment that the flow had no issue from
firmware point of view too.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Expose NIC temperature by implementing hwmon kernel API, which turns
current thermal zone kernel API to redundant.
For each one of the supported and exposed thermal diode sensors, expose
the following attributes:
1) Input temperature.
2) Highest temperature.
3) Temperature label:
Depends on the firmware capability, if firmware doesn't support
sensors naming, the fallback naming convention would be: "sensorX",
where X is the HW spec (MTMP register) sensor index.
4) Temperature critical max value:
refers to the high threshold of Warning Event. Will be exposed as
`tempY_crit` hwmon attribute (RO attribute). For example for
ConnectX5 HCA's this temperature value will be 105 Celsius, 10
degrees lower than the HW shutdown temperature).
5) Temperature reset history: resets highest temperature.
For example, for dualport ConnectX5 NIC with a single IC thermal diode
sensor will have 2 hwmon directories (one for each PCI function)
under "/sys/class/hwmon/hwmon[X,Y]".
Listing one of the directories above (hwmonX/Y) generates the
corresponding output below:
$ grep -H -d skip . /sys/class/hwmon/hwmon0/*
Output
=======================================================================
/sys/class/hwmon/hwmon0/name:mlx5
/sys/class/hwmon/hwmon0/temp1_crit:105000
/sys/class/hwmon/hwmon0/temp1_highest:48000
/sys/class/hwmon/hwmon0/temp1_input:46000
/sys/class/hwmon/hwmon0/temp1_label:asic
grep: /sys/class/hwmon/hwmon0/temp1_reset_history: Permission denied
In addition, displaying the sensors data via lm_sensors generates the
corresponding output below:
$ sensors
Output
=======================================================================
mlx5-pci-0800
Adapter: PCI adapter
asic: +46.0°C (crit = +105.0°C, highest = +48.0°C)
mlx5-pci-0801
Adapter: PCI adapter
asic: +46.0°C (crit = +105.0°C, highest = +48.0°C)
CC: Jean Delvare <jdelvare@suse.com>
Signed-off-by: Adham Faris <afaris@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230807180507.22984-3-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit enables the dynamic allocation of EQs at runtime, allowing
for more flexibility in managing completion EQs and reducing the memory
overhead of driver load. Whenever a CQ is created for a given vector
index, the driver will lookup to see if there is an already mapped
completion EQ for that vector, if so, utilize it. Otherwise, allocate a
new EQ on demand and then utilize it for the CQ completion events.
Add a protection lock to the EQ table to protect from concurrent EQ
creation attempts.
While at it, replace mlx5_vector2irqn()/mlx5_vector2eqn() with
mlx5_comp_eqn_get() and mlx5_comp_irqn_get() which will allocate an
EQ on demand if no EQ is found for the given vector.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
To accurately represent its purpose, rename the function that retrieves
the value of maximum vectors from mlx5_comp_vectors_count() to
mlx5_comp_vectors_max().
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, once driver load completes, IRQ requests were performed for all
vectors. However, as we move to support dynamic creation of EQs, this will
not be the case as some IRQs will not exist at this stage. Thus, in such
case, use the default CPU to IRQ mapping which is the serial mapping based
on IRQ vector index. Meaning, the n'th vector gets mapped to the n'th CPU.
Introduce an API function mlx5_comp_vector_cpu() that takes an IRQ index and
provides the corresponding CPU mapping. It utilizes the existing IRQ
affinity if defined, or resorts to the default serialized CPU mapping
otherwise.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
For IPsec packet offload mode, the order of TC offload and IPsec
offload on the same netdevice is not aligned with the order in the
non-offload software. For example, for RX, the software performs TC
first and then IPsec transformation, but the implementation for
offload does that in the opposite way.
To resolve the difference for now, either IPsec offload or TC offload,
not both, is allowed for a specific interface.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/8e2e5e3b0984d785066e8663aaf97b3ba1bb873f.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The IPsec encryption is done at the last, so add new prio for IPsec
offload in FDB, and put it just lower than the slow path prio and
higher than the per-vport prio.
Three levels are added for TX. The first one is for ip xfrm policy.
The sa table is created in the second level for ip xfrm state. The
status table is created at the last to count the number of packets
encrypted.
The rules, which forward packets to uplink, are changed to forward
them to IPsec TX tables first. These rules are restored after those
tables are destroyed, which is done immediately when there is no
reference to them, just as what does in legacy mode. The support for
slow path is added here, by refreshing uplink's channels. But, the
handling for TC fast path, which is more complicated, will be added
later. Besides, reg c4 is used instead to match reqid.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/cfd0e6ffaf0b8c55ebaa9fb0649b7c504b6b8ec6.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reuse tun opts bits in reg c1, to pass IPsec obj id to datapath.
As this is only for RX SA and there are only 11 bits, xarray is used
to map IPsec obj id to an index, which is between 1 and 0x7ff, and
replace obj id to write to reg c1.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/43d60fbcc9cd672a97d7e2a2f7fe6a3d9e9a776d.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As decryption must be done first, add new prio for IPsec offload in
FDB, and put it just lower than BYPASS prio and higher than TC prio.
Three levels are added for RX. The first one is for ip xfrm policy. SA
table is created in the second level for ip xfrm state. The status
table is created in the last to check the decryption result. If
success, packets continue with the next process, or dropped otherwise.
For now, the set of reg c1 is removed for swtichdev mode, and the
datapath process will be added in the next patch.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/c91063554cf643fb50b99cf093e8a9bf11729de5.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Command stats is an array with more than 2K entries, which amounts to
~180KB. This is way more than actually needed, as only ~190 entries
are being used.
Therefore, replace the array with xarray.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Downstream patch will split mlx5_cmd_init() to probe and reload
routines. As a preparation, organize mlx5_cmd struct so that any
field that will be used in the reload routine are grouped at new
nested struct.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Update devcom infrastructure to be more generic, without
depending on max supported ports definition or a device guid,
and also more encapsulated so callers don't need to pass
the register devcom component id per event call.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Provide an ability to check if flow steering supports UDP
encapsulation and decapsulation of IPsec ESP packets.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This cycle saw a focus on rxe and bnxt_re drivers:
- Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma
- rxe uses workqueues instead of tasklets
- rxe has better compliance around access checks for MRs and rereg_mr
- mana supportst he 'v2' FW interface for RX coalescing
- hfi1 bug fix for stale cache entries in its MR cache
- mlx5 buf fix to handle FW failures when destroying QPs
- erdma HW has a new doorbell allocation mechanism for uverbs that is
secure
- Lots of small cleanups and rework in bnxt_re
* Use the common mmap functions
* Support disassociation
* Improve FW command flow
- bnxt_re support for "low latency push", this allows a packet
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZJxq+wAKCRCFwuHvBreF
YSN+AQDKAmWdKCqmc3boMIq5wt4h7yYzdW47LpzGarOn5Hf+UgEA6mpPJyRqB43C
CNXYIbASl/LLaWzFvxCq/AYp6tzuog4=
=I8ju
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This cycle saw a focus on rxe and bnxt_re drivers:
- Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma
- rxe uses workqueues instead of tasklets
- rxe has better compliance around access checks for MRs and rereg_mr
- mana supportst he 'v2' FW interface for RX coalescing
- hfi1 bug fix for stale cache entries in its MR cache
- mlx5 buf fix to handle FW failures when destroying QPs
- erdma HW has a new doorbell allocation mechanism for uverbs that is
secure
- Lots of small cleanups and rework in bnxt_re:
- Use the common mmap functions
- Support disassociation
- Improve FW command flow
- support for 'low latency push'"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits)
RDMA/bnxt_re: Fix an IS_ERR() vs NULL check
RDMA/bnxt_re: Fix spelling mistake "priviledged" -> "privileged"
RDMA/bnxt_re: Remove duplicated include in bnxt_re/main.c
RDMA/bnxt_re: Refactor code around bnxt_qplib_map_rc()
RDMA/bnxt_re: Remove incorrect return check from slow path
RDMA/bnxt_re: Enable low latency push
RDMA/bnxt_re: Reorg the bar mapping
RDMA/bnxt_re: Move the interface version to chip context structure
RDMA/bnxt_re: Query function capabilities from firmware
RDMA/bnxt_re: Optimize the bnxt_re_init_hwrm_hdr usage
RDMA/bnxt_re: Add disassociate ucontext support
RDMA/bnxt_re: Use the common mmap helper functions
RDMA/bnxt_re: Initialize opcode while sending message
RDMA/cma: Remove NULL check before dev_{put, hold}
RDMA/rxe: Simplify cq->notify code
RDMA/rxe: Fixes mr access supported list
RDMA/bnxt_re: optimize the parameters passed to helper functions
RDMA/bnxt_re: remove redundant cmdq_bitmap
RDMA/bnxt_re: use firmware provided max request timeout
RDMA/bnxt_re: cancel all control path command waiters upon error
...
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmSYzfYeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/ucH/iOM/1Py/fSg0qSs
7NJ4XXlourT5zrnRMom3cm3d9gYqgTzgvKFL3kjMEexTRVYbhlcO4ZPRsiry8zxF
ToGX+V8tDMqb8WSdFHzkljRY+zDRyfEUDMlTzROAD9DunLmQtkJKyrggkeGdjkpP
OyfGqKpwlLXZRAXBil/U8Mx9MHdjJubloZwghLZr33VdUZa68+JJ9l6w163Oe/ET
K264NM0wxN/kvN57JvePgqMccQwpINylg8IhRI+XelgczjUXeJBsOA8TDv4bDN4Q
bjCLhkWbIaZtTYqvOXa/kD0T8wd7KETsMBQN8YzyDh6W0GmAlJjTawyAhA6jA5in
x3uz2W8=
=L3zp
-----END PGP SIGNATURE-----
Merge tag 'v6.4' into rdma.git for-next
Linux 6.4
Resolve conflicts between rdma rc and next in rxe_cq matching linux-next:
drivers/infiniband/sw/rxe/rxe_cq.c:
https://lore.kernel.org/r/20230622115246.365d30ad@canb.auug.org.au
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
A member of struct mlx5_ifc_cmd_hca_cap_bits has been mistakenly
assigned the wrong reserved_at offset value. Correct it to align to the
right value, thus avoid future miscalculation.
Signed-off-by: Lama Kayal <lkayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add needed HW bits for querying local loopback counter and the
HCA capability for it.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Added a new event handler to firmware sync reset, which is used to
support firmware sync reset flow on smart NIC. Adding this new stage to
the flow enables the firmware to ensure host PFs unload before ECPFs
unload, to avoid race of PFs recovery.
If firmware sends sync_reset_unload event to driver the driver should
unload and close all HW resources of the function. Once the driver
finishes unloading part, it can't get any more events from firmware as
event queues are closed, so it polls the reset state field to know when
to continue to next stage of the sync reset flow.
Added capability bit for supporting sync_reset_unload event.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Expose new timoueout in Default Timeouts Register to be used on sync
reset flow running on smart NIC. In this flow the driver should know how
much time to wait from getting unload request till firmware will ask the
PF to continue to next stage of the flow.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Many rc bug fixes:
- Two rtrs bug fixes for error unwind bugs
- Several rxe bug fixes:
* Incorrect Rx packet validation
* Using memory without a refcount
* Syzkaller found use before initialization
* Regression fix for missing locking with the tasklet conversion from
this merge window
- Have bnxt report the correct link properties to userspace, this was
a regression in v6.3
- Several mlx5 bug fixes:
* Kernel crash triggerable by userspace for the RAW ethernet profile
* Defend against steering refcounting issues created by userspace
* Incorrect change of QP port affinity parameters in some LAG configurations
- Fix mlx5 Q counters:
* Do not over allocate Q counters to allow userspace to use the full
port capacity
* Kernel crash triggered by eswitch due to mis-use of Q counters
* Incorrect mlx5_device for Q counters in some LAG configurations
- Properly implement the IBA spec restricting privileged qkeys to root
- Always an error when reading from a disassociated device's event queue
- isert bug fixes:
* Avoid a deadlock with the CM handler and CM ID destruction
* Correct list corruption due to incorrect locking
* Fix a use after free around connection tear down
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZIoZ7gAKCRCFwuHvBreF
YUSyAPsH+VdqzYd1z/bk0zMds9JR2oqMOR8l02e4gq8K9hjTJAD/ePEuT5PTpFZu
FrT4SDhOfA30XNz+RofRNisJC92OOAo=
=KyF7
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"This is an unusually large bunch of bug fixes for the later rc cycle,
rxe and mlx5 both dumped a lot of things at once. rxe continues to fix
itself, and mlx5 is fixing a bunch of "queue counters" related bugs.
There is one highly notable bug fix regarding the qkey. This small
security check was missed in the original 2005 implementation and it
allows some significant issues.
Summary:
- Two rtrs bug fixes for error unwind bugs
- Several rxe bug fixes:
* Incorrect Rx packet validation
* Using memory without a refcount
* Syzkaller found use before initialization
* Regression fix for missing locking with the tasklet conversion
from this merge window
- Have bnxt report the correct link properties to userspace, this was
a regression in v6.3
- Several mlx5 bug fixes:
* Kernel crash triggerable by userspace for the RAW ethernet
profile
* Defend against steering refcounting issues created by userspace
* Incorrect change of QP port affinity parameters in some LAG
configurations
- Fix mlx5 Q counters:
* Do not over allocate Q counters to allow userspace to use the
full port capacity
* Kernel crash triggered by eswitch due to mis-use of Q counters
* Incorrect mlx5_device for Q counters in some LAG configurations
- Properly implement the IBA spec restricting privileged qkeys to
root
- Always an error when reading from a disassociated device's event
queue
- isert bug fixes:
* Avoid a deadlock with the CM handler and CM ID destruction
* Correct list corruption due to incorrect locking
* Fix a use after free around connection tear down"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/rxe: Fix rxe_cq_post
IB/isert: Fix incorrect release of isert connection
IB/isert: Fix possible list corruption in CMA handler
IB/isert: Fix dead lock in ib_isert
RDMA/mlx5: Fix affinity assignment
IB/uverbs: Fix to consider event queue closing also upon non-blocking mode
RDMA/uverbs: Restrict usage of privileged QKEYs
RDMA/cma: Always set static rate to 0 for RoCE
RDMA/mlx5: Fix Q-counters query in LAG mode
RDMA/mlx5: Remove vport Q-counters dependency on normal Q-counters
RDMA/mlx5: Fix Q-counters per vport allocation
RDMA/mlx5: Create an indirect flow table for steering anchor
RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions
RDMA/rxe: Fix the use-before-initialization error of resp_pkts
RDMA/bnxt_re: Fix reporting active_{speed,width} attributes
RDMA/rxe: Fix ref count error in check_rkey()
RDMA/rxe: Fix packet length checks
RDMA/rtrs: Fix rxe_dealloc_pd warning
RDMA/rtrs: Fix the last iu->buf leak in err path
The cited commit aimed to ensure that Virtual Functions (VFs) assign a
queue affinity to a Queue Pair (QP) to distribute traffic when
the LAG master creates a hardware LAG. If the affinity was set while
the hardware was not in LAG, the firmware would ignore the affinity value.
However, this commit unintentionally assigned an affinity to QPs on the LAG
master's VPORT even if the RDMA device was not marked as LAG-enabled.
In most cases, this was not an issue because when the hardware entered
hardware LAG configuration, the RDMA device of the LAG master would be
destroyed and a new one would be created, marked as LAG-enabled.
The problem arises when a user configures Equal-Cost Multipath (ECMP).
In ECMP mode, traffic can be directed to different physical ports based on
the queue affinity, which is intended for use by VPORTS other than the
E-Switch manager. ECMP mode is supported only if both E-Switch managers are
in switchdev mode and the appropriate route is configured via IP. In this
configuration, the RDMA device is not destroyed, and we retain the RDMA
device that is not marked as LAG-enabled.
To ensure correct behavior, Send Queues (SQs) opened by the E-Switch
manager through verbs should be assigned strict affinity. This means they
will only be able to communicate through the native physical port
associated with the E-Switch manager. This will prevent the firmware from
assigning affinity and will not allow the SQs to be remapped in case of
failover.
Fixes: 802dcc7fc5 ("RDMA/mlx5: Support TX port affinity for VF drivers in LAG mode")
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/425b05f4da840bc684b0f7e8ebf61aeb5cef09b0.1685960567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When the embedded cpu supports SRIOV it can be enabled and disabled
independently from the host SRIOV. Track the pages separately so we can
properly wait for returned VF pages.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
These functions are for query/set by vport, there was an underlying
assumption that vport was equal to function ID. That's not the case for
EC VF functions. Set the ec_vf_function bit accordingly.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Enable creation of a devlink port for EC VF vports.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add ec_vf_vport_base to HCA Capabilities 2. This indicates the base vport
of embedded CPU virtual functions that are connected to the eswitch.
Add ec_vf_function to query/set_hca_caps. If set this indicates
accessing a virtual function on the embedded CPU by function ID. This
should only be used with other_function set to 1.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: Bodong Wang <bodong@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add generated_pkt_steering_fail and handled_pkt_steering_fail to devlink
heatlth reporter.
generated_pkt_steering_fail indicates the number of packets dropped due to
illegal steering operation within the vport steering domain.
handled_pkt_steering_fail indicates the number of packets dropped due to
illegal steering operation, originated by the vport.
Also, update devlink reporter functionality documentation with the newly
exposed counters.
Signed-off-by: Lama Kayal <lkayal@nvidia.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Introduce a generic APIs to iterate over all the devices which are part
of the LAG. This API replace mlx5_lag_get_peer_mdev() which retrieve
only a single peer device from the lag.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Don't query the firmware so many times (num rqs * num wqes * wqe frags)
because it slows down linearly the interface creation time when the
product is larger. Do it only once per mdev and store the result in
mlx5e_param.
Due to helper function being called from different files, move it to
an appropriate location. Rename the function with a proper prefix and
add a small cleanup.
This fix applies only for legacy rq.
Fixes: 1b1e486883 ("net/mlx5e: Use query_special_contexts for mkeys")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
SW Steering uses RC QP for writing STEs to ICM. This writingis done in LB
(loopback), and FL (force-loopback) QP is preferred for performance. FL is
available when RoCE is enabled or disabled based on RoCE caps.
This patch adds reading of FL capability from HCA caps in addition to the
existing reading from RoCE caps, thus fixing the case where we didn't
have loopback enabled when RoCE was disabled.
Fixes: 7304d603a5 ("net/mlx5: DR, Add support for force-loopback QP")
Signed-off-by: Itamar Gozlan <igozlan@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Usual wide collection of unrelated items in drivers:
- Driver bug fixes and treewide cleanups in hfi1, siw, qib, mlx5, rxe,
usnic, usnic, bnxt_re, ocrdma, iser
* Unnecessary NULL checks
* kmap obsolescence
* pci_enable_pcie_error_reporting() obsolescence
* Unused variables and macros
* trace event related warnings
* casting warnings
- Code cleanups for irdm and erdma
- EFA reporting of 128 byte PCIe TLP support
- mlx5 more agressively uses the out of order HW feature
- Big rework of how state machines and tasks work in rxe
- Fix a syzkaller found crash netdev refcount leak in siw
- bnxt_re revises their HW description header
- Congestion control for bnxt_re
- Use mmu_notifiers more safely in hfi1
- mlx5 gets better support for PCIe relaxed ordering inside VMs
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZEva5wAKCRCFwuHvBreF
YZFmAQC9T3b/XQ3bRknYciuzbatC98o9xB0FTqmEFYGj+Y2lVAD9EEVe3HKfHfi3
t/GxXYB5r22oxg5bgsblZfEdEdTVCg8=
=akMm
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Usual wide collection of unrelated items in drivers:
- Driver bug fixes and treewide cleanups in hfi1, siw, qib, mlx5,
rxe, usnic, usnic, bnxt_re, ocrdma, iser:
- remove unnecessary NULL checks
- kmap obsolescence
- pci_enable_pcie_error_reporting() obsolescence
- unused variables and macros
- trace event related warnings
- casting warnings
- Code cleanups for irdm and erdma
- EFA reporting of 128 byte PCIe TLP support
- mlx5 more agressively uses the out of order HW feature
- Big rework of how state machines and tasks work in rxe
- Fix a syzkaller found crash netdev refcount leak in siw
- bnxt_re revises their HW description header
- Congestion control for bnxt_re
- Use mmu_notifiers more safely in hfi1
- mlx5 gets better support for PCIe relaxed ordering inside VMs"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (81 commits)
RDMA/efa: Add rdma write capability to device caps
RDMA/mlx5: Use correct device num_ports when modify DC
RDMA/irdma: Drop spurious WQ_UNBOUND from alloc_ordered_workqueue() call
RDMA/rxe: Fix spinlock recursion deadlock on requester
RDMA/mlx5: Fix flow counter query via DEVX
RDMA/rxe: Protect QP state with qp->state_lock
RDMA/rxe: Move code to check if drained to subroutine
RDMA/rxe: Remove qp->req.state
RDMA/rxe: Remove qp->comp.state
RDMA/rxe: Remove qp->resp.state
RDMA/mlx5: Allow relaxed ordering read in VFs and VMs
net/mlx5: Update relaxed ordering read HCA capabilities
RDMA/mlx5: Check pcie_relaxed_ordering_enabled() in UMR
RDMA/mlx5: Remove pcie_relaxed_ordering_enabled() check for RO write
RDMA: Add ib_virt_dma_to_page()
RDMA/rxe: Fix the error "trying to register non-static key in rxe_cleanup_task"
RDMA/irdma: Slightly optimize irdma_form_ah_cm_frame()
RDMA/rxe: Fix incorrect TASKLET_STATE_SCHED check in rxe_task.c
IB/hfi1: Place struct mmu_rb_handler on cache line start
IB/hfi1: Fix bugs with non-PAGE_SIZE-end multi-iovec user SDMA requests
...
1) Dragos Improves RX page pool, and provides some fixes to his previous series:
1.1) Fix releasing page_pool for striding RQ and legacy RQ nonlinear case
1.2) Hook NAPIs to page pools to gain more performance.
2) From Roi, Some cleanups to TC and eswitch modules.
3) Maher migrates vnic diagnostic counters reporting from debugfs to a
dedicated devlink health reporter
Maher Says:
===========
net/mlx5: Expose vnic diagnostic counters using devlink
Currently, vnic diagnostic counters are exposed through the following
debugfs:
$ ls /sys/kernel/debug/mlx5/0000:08:00.0/esw/vf_0/vnic_diag/
cq_overrun
quota_exceeded_command
total_q_under_processor_handle
invalid_command
send_queue_priority_update_flow
nic_receive_steering_discard
The current design does not allow the hypervisor to view the diagnostic
counters of its VFs, in case the VFs get bound to a VM. In other words,
the counters are not exposed for representor interfaces.
Furthermore, the debugfs design is inconvenient future-wise, in case more
counters need to be reported by the driver in the future.
As these counters pertain to vNIC health, it is more appropriate to
utilize the devlink health reporter to expose them.
Thus, this patchest includes the following changes:
* Drop the current vnic diagnostic counters debugfs interface.
* Add a vnic devlink health reporter for PFs/VFs core devices, which
when diagnosed will dump vnic diagnostic counter values that are
queried from FW.
* Add a vnic devlink health reporter for the representor interface, which
serves the same purpose listed in the previous point, in addition to
allowing the hypervisor to view its VFs diagnostic counters, even when
the VFs are bounded to external VMs.
Example of devlink health reporter usage is:
$devlink health diagnose pci/0000:08:00.0 reporter vnic
vNIC env counters:
total_error_queues: 0 send_queue_priority_update_flow: 0
comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
invalid_command: 0 quota_exceeded_command: 0
nic_receive_steering_discard: 0
===========
4) SW steering fixes and improvements
Yevgeny Kliteynik Says:
=======================
These short patch series are just small fixes / improvements for
SW steering:
- Patch 1: Fix dumping of legacy modify_hdr in debug dump to
align to what is expected by parser
- Patch 2: Have separate threshold for ICM sync per ICM type
- Patch 3: Add more info to the steering debug dump - Linux
version and device name
- Patch 4: Keep track of number of buddies that are currently
in use per domain per buddy type
=======================
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmRB6HcACgkQSD+KveBX
+j4+nAf/cCJ7poDQQOp/Oug6H3xn/COYPv3Iasl0dhT8yu+LlNMQTF0bRrU4UpAQ
aIKcja2biOBAnD96EhA1nJoo9bJUTtKLokUDDyK/xRHS+wIyr8Lia6vxTz1yjj3C
jDqX3+ZP4rFuhAvh+92AT1I0JvS0g+ocokPVKmm+Pwf4y7sG69CZ7phVGSc0iFfT
y+gnP4C6cdIr7kNLByeeX6alDHL/q83vfNFWrugRPna2uXjcSR5Gtp03pJ0OVkI5
qHxG7Bz0BE/hcMYwNcNVTu/5e02+5PS6B8kN/ho5DkhVFhp+h17XWWvOMZxC3jfI
k0ijRSWocG1jIgRgkioBRgzXmc6nNw==
=jUe8
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2023-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-04-20
1) Dragos Improves RX page pool, and provides some fixes to his previous
series:
1.1) Fix releasing page_pool for striding RQ and legacy RQ nonlinear case
1.2) Hook NAPIs to page pools to gain more performance.
2) From Roi, Some cleanups to TC and eswitch modules.
3) Maher migrates vnic diagnostic counters reporting from debugfs to a
dedicated devlink health reporter
Maher Says:
===========
net/mlx5: Expose vnic diagnostic counters using devlink
Currently, vnic diagnostic counters are exposed through the following
debugfs:
$ ls /sys/kernel/debug/mlx5/0000:08:00.0/esw/vf_0/vnic_diag/
cq_overrun
quota_exceeded_command
total_q_under_processor_handle
invalid_command
send_queue_priority_update_flow
nic_receive_steering_discard
The current design does not allow the hypervisor to view the diagnostic
counters of its VFs, in case the VFs get bound to a VM. In other words,
the counters are not exposed for representor interfaces.
Furthermore, the debugfs design is inconvenient future-wise, in case more
counters need to be reported by the driver in the future.
As these counters pertain to vNIC health, it is more appropriate to
utilize the devlink health reporter to expose them.
Thus, this patchest includes the following changes:
* Drop the current vnic diagnostic counters debugfs interface.
* Add a vnic devlink health reporter for PFs/VFs core devices, which
when diagnosed will dump vnic diagnostic counter values that are
queried from FW.
* Add a vnic devlink health reporter for the representor interface, which
serves the same purpose listed in the previous point, in addition to
allowing the hypervisor to view its VFs diagnostic counters, even when
the VFs are bounded to external VMs.
Example of devlink health reporter usage is:
$devlink health diagnose pci/0000:08:00.0 reporter vnic
vNIC env counters:
total_error_queues: 0 send_queue_priority_update_flow: 0
comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
invalid_command: 0 quota_exceeded_command: 0
nic_receive_steering_discard: 0
===========
4) SW steering fixes and improvements
Yevgeny Kliteynik Says:
=======================
These short patch series are just small fixes / improvements for
SW steering:
- Patch 1: Fix dumping of legacy modify_hdr in debug dump to
align to what is expected by parser
- Patch 2: Have separate threshold for ICM sync per ICM type
- Patch 3: Add more info to the steering debug dump - Linux
version and device name
- Patch 4: Keep track of number of buddies that are currently
in use per domain per buddy type
=======================
* tag 'mlx5-updates-2023-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Update op_mode to op_mod for port selection
net/mlx5: E-Switch, Remove unused mlx5_esw_offloads_vport_metadata_set()
net/mlx5: E-Switch, Remove redundant dev arg from mlx5_esw_vport_alloc()
net/mlx5: Include linux/pci.h for pci_msix_can_alloc_dyn()
net/mlx5e: RX, Hook NAPIs to page pools
net/mlx5e: RX, Fix XDP_TX page release for legacy rq nonlinear case
net/mlx5e: RX, Fix releasing page_pool pages twice for striding RQ
net/mlx5e: Add vnic devlink health reporter to representors
net/mlx5: Add vnic devlink health reporter to PFs/VFs
Revert "net/mlx5: Expose vnic diagnostic counters for eswitch managed vports"
Revert "net/mlx5: Expose steering dropped packets counter"
net/mlx5: DR, Add memory statistics for domain object
net/mlx5: DR, Add more info in domain dbg dump
net/mlx5: DR, Calculate sync threshold of each pool according to its type
net/mlx5: DR, Fix dumping of legacy modify_hdr in debug dump
====================
Link: https://lore.kernel.org/r/20230421013850.349646-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To be consistent with the other enum keys use OP_MOD
instead of OP_MODE.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Create a vnic devlink health reporter for PFs/VFs interfaces.
The reporter's diagnose callback displays the values of vNIC/vport
transport debug counters of PFs/VFs, as follows:
$ devlink health diagnose pci/0000:08:00.0 reporter vnic
vNIC env counters:
total_error_queues: 0 send_queue_priority_update_flow: 0
comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
invalid_command: 0 quota_exceeded_command: 0
nic_receive_steering_discard: 0
Moreover, add documentation on the reporter functionality and the
counters description.
While at it, expose the vNIC counters diagnose function to be used by
the downstream patch, which will reveal the counters for representor
interfaces.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Commit cited in "fixes" tag added bulk support for flow counters but it
didn't account that's also possible to query a counter using a non-base id
if the counter was allocated as bulk.
When a user performs a query, validate the flow counter id given in the
mailbox is inside the valid range taking bulk value into account.
Fixes: 208d70f562 ("IB/mlx5: Support flow counters offset for bulk counters")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://lore.kernel.org/r/79d7fbe291690128e44672418934256254d93115.1681377114.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Rename existing HCA capability relaxed_ordering_read to
relaxed_ordering_read_pci_enabled. This is in accordance with recent PRM
change to better describe the capability, as it's set only if both the
device supports relaxed ordering (RO) read and RO is enabled in PCI
config space.
In addition, add new HCA capability relaxed_ordering_read which is set
if the device supports RO read, regardless of RO in PCI config space.
This will be used in the following patch to allow RO in VFs and VMs.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Link: https://lore.kernel.org/r/caa0002fd8135086357dfcc368e2f5cc73b08480.1681131553.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
1) Vlad adds the support for linux bridge multicast offload support
Patches #1 through #9
Synopsis
Vlad Says:
==============
Implement support of bridge multicast offload in mlx5. Handle port object
attribute SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED notification to toggle multicast
offload and bridge snooping support on bridge. Handle port object
SWITCHDEV_OBJ_ID_PORT_MDB notification to attach a bridge port to MDB.
Steering architecture
Existing offload infrastructure relies on two levels of flow tables - bridge
ingress and egress. For multicast offload the architecture is extended with
additional layer of per-port multicast replication tables. Such tables filter
loopback traffic (so packets are not replicated to their source port) and pop
VLAN headers for "untagged" VLANs. The tables are referenced by the MDB rules in
egress table. MDB egress rule can point to multiple per-port multicast tables,
which causes matching multicast traffic to be replicated to all of them, and,
consecutively, to several bridge ports:
+--------+--+
+---------------------------------------> Port 1 | |
| +-^------+--+
| |
| |
+-----------------------------------------+ | +---------------------------+ |
| EGRESS table | | +--> PORT 1 multicast table | |
+----------------------------------+ +-----------------------------------------+ | | +---------------------------+ |
| INGRESS table | | | | | | | |
+----------------------------------+ | dst_mac=P1,vlan=X -> pop vlan, goto P1 +--+ | | FG0: | |
| | | dst_mac=P1,vlan=Y -> pop vlan, goto P1 | | | src_port=dst_port -> drop | |
| src_mac=M1,vlan=X -> goto egress +---> dst_mac=P2,vlan=X -> pop vlan, goto P2 +--+ | | FG1: | |
| ... | | dst_mac=P2,vlan=Y -> goto P2 | | | | VLAN X -> pop, goto port | |
| | | dst_mac=MDB1,vlan=Y -> goto mcast P1,P2 +-----+ | ... | |
+----------------------------------+ | | | | | VLAN Y -> pop, goto port +-------+
+-----------------------------------------+ | | | FG3: |
| | | matchall -> goto port |
| | | |
| | +---------------------------+
| |
| |
| | +--------+--+
+---------------------------------------> Port 2 | |
| +-^------+--+
| |
| |
| +---------------------------+ |
+--> PORT 2 multicast table | |
+---------------------------+ |
| | |
| FG0: | |
| src_port=dst_port -> drop | |
| FG1: | |
| VLAN X -> pop, goto port | |
| ... | |
| | |
| FG3: | |
| matchall -> goto port +-------+
| |
+---------------------------+
Patches overview:
- Patch 1 adds hardware definition bits for capabilities required to replicate
multicast packets to multiple per-port tables. These bits are used by
following patches to only attempt multicast offload if firmware and hardware
provide necessary support.
- Pathces 2-4 patches are preparations and refactoring.
- Patch 5 implements necessary infrastructure to toggle multicast offload
via SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED port object attribute notification.
This also enabled IGMP and MLD snooping.
- Patch 6 implements per-port multicast replication tables. It only supports
filtering of loopback packets.
- Patch 7 extends per-port multicast tables with VLAN pop support for 'untagged'
VLANs.
- Patch 8 handles SWITCHDEV_OBJ_ID_PORT_MDB port object notifications. It
creates MDB replication rules in egress table that can replicate packets to
multiple per-port multicast tables.
- Patch 9 adds tracepoints for MDB events.
==============
2) Parav Create a new allocation profile for SFs, to save on memory
3) Yevgeny provides some initial patches for upcoming software steering
support new pattern/arguments type of modify_header actions.
Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
The current modify_header object allows for having only limited number of
these FW objects, which means that we are limited in the number of offloaded
flows that require modify_header action.
As a preparation Yevgeny provides the following 4 patches:
- Patch 1: Add required mlx5_ifc HW bits
- Patch 2, 3: Add new WQE type and opcode that is required for pattern/arg
support and adds appropriate support in dr_send.c
- Patch 4: Add ICM pool for modify-header-pattern objects and implement
patterns cache, allowing patterns reuse for different flows
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmQ2LDIACgkQSD+KveBX
+j59UAf+NfLEq7i/gYOri1rLwXOrgPd13w64Oob79+pKVXzRqik/gBg4d0CKdc2X
Qafntudus7IvVrGJCM0vurw7nC75H9BMFh3dQW2u3uaR9m6i0k4tSKIAoJUFHMdj
dMgAJywYkCCXlBbGVhRE3yybal2EXfOSJ+Fr2tGeqpL4vRlg4kIaVuFGcGk2gMpq
gOahX6wtuop3ABzY3sqKCIQ1Q2jWx9AWWz+lEmw7HJGmHGugscYrgSIloHDnu1fz
Bt6uODn7GFr866rQr9+JemG0VyK8ahyisEFnX1oJ3642ZidPcDrbIrrtMWeEqo7X
1tcmY/wNti87xFbg3gL+DZekhHTHoQ==
=oIN8
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-04-11
1) Vlad adds the support for linux bridge multicast offload support
Patches #1 through #9
Synopsis
Vlad Says:
==============
Implement support of bridge multicast offload in mlx5. Handle port object
attribute SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED notification to toggle multicast
offload and bridge snooping support on bridge. Handle port object
SWITCHDEV_OBJ_ID_PORT_MDB notification to attach a bridge port to MDB.
Steering architecture
Existing offload infrastructure relies on two levels of flow tables - bridge
ingress and egress. For multicast offload the architecture is extended with
additional layer of per-port multicast replication tables. Such tables filter
loopback traffic (so packets are not replicated to their source port) and pop
VLAN headers for "untagged" VLANs. The tables are referenced by the MDB rules in
egress table. MDB egress rule can point to multiple per-port multicast tables,
which causes matching multicast traffic to be replicated to all of them, and,
consecutively, to several bridge ports:
+--------+--+
+---------------------------------------> Port 1 | |
| +-^------+--+
| |
| |
+-----------------------------------------+ | +---------------------------+ |
| EGRESS table | | +--> PORT 1 multicast table | |
+----------------------------------+ +-----------------------------------------+ | | +---------------------------+ |
| INGRESS table | | | | | | | |
+----------------------------------+ | dst_mac=P1,vlan=X -> pop vlan, goto P1 +--+ | | FG0: | |
| | | dst_mac=P1,vlan=Y -> pop vlan, goto P1 | | | src_port=dst_port -> drop | |
| src_mac=M1,vlan=X -> goto egress +---> dst_mac=P2,vlan=X -> pop vlan, goto P2 +--+ | | FG1: | |
| ... | | dst_mac=P2,vlan=Y -> goto P2 | | | | VLAN X -> pop, goto port | |
| | | dst_mac=MDB1,vlan=Y -> goto mcast P1,P2 +-----+ | ... | |
+----------------------------------+ | | | | | VLAN Y -> pop, goto port +-------+
+-----------------------------------------+ | | | FG3: |
| | | matchall -> goto port |
| | | |
| | +---------------------------+
| |
| |
| | +--------+--+
+---------------------------------------> Port 2 | |
| +-^------+--+
| |
| |
| +---------------------------+ |
+--> PORT 2 multicast table | |
+---------------------------+ |
| | |
| FG0: | |
| src_port=dst_port -> drop | |
| FG1: | |
| VLAN X -> pop, goto port | |
| ... | |
| | |
| FG3: | |
| matchall -> goto port +-------+
| |
+---------------------------+
Patches overview:
- Patch 1 adds hardware definition bits for capabilities required to replicate
multicast packets to multiple per-port tables. These bits are used by
following patches to only attempt multicast offload if firmware and hardware
provide necessary support.
- Pathces 2-4 patches are preparations and refactoring.
- Patch 5 implements necessary infrastructure to toggle multicast offload
via SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED port object attribute notification.
This also enabled IGMP and MLD snooping.
- Patch 6 implements per-port multicast replication tables. It only supports
filtering of loopback packets.
- Patch 7 extends per-port multicast tables with VLAN pop support for 'untagged'
VLANs.
- Patch 8 handles SWITCHDEV_OBJ_ID_PORT_MDB port object notifications. It
creates MDB replication rules in egress table that can replicate packets to
multiple per-port multicast tables.
- Patch 9 adds tracepoints for MDB events.
==============
2) Parav Create a new allocation profile for SFs, to save on memory
3) Yevgeny provides some initial patches for upcoming software steering
support new pattern/arguments type of modify_header actions.
Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
The current modify_header object allows for having only limited number of
these FW objects, which means that we are limited in the number of offloaded
flows that require modify_header action.
As a preparation Yevgeny provides the following 4 patches:
- Patch 1: Add required mlx5_ifc HW bits
- Patch 2, 3: Add new WQE type and opcode that is required for pattern/arg
support and adds appropriate support in dr_send.c
- Patch 4: Add ICM pool for modify-header-pattern objects and implement
patterns cache, allowing patterns reuse for different flows
* tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: DR, Add modify-header-pattern ICM pool
net/mlx5: DR, Prepare sending new WQE type
net/mlx5: Add new WQE for updating flow table
net/mlx5: Add mlx5_ifc bits for modify header argument
net/mlx5: DR, Set counter ID on the last STE for STEv1 TX
net/mlx5: Create a new profile for SFs
net/mlx5: Bridge, add tracepoints for multicast
net/mlx5: Bridge, implement mdb offload
net/mlx5: Bridge, support multicast VLAN pop
net/mlx5: Bridge, add per-port multicast replication tables
net/mlx5: Bridge, snoop igmp/mld packets
net/mlx5: Bridge, extract code to lookup parent bridge of port
net/mlx5: Bridge, move additional data structures to priv header
net/mlx5: Bridge, increase bridge tables sizes
net/mlx5: Add mlx5_ifc definitions for bridge multicast support
====================
Link: https://lore.kernel.org/r/20230412040752.14220-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type
via mapping table.
The mlx5 hardware can also identify and RSS hash IPSEC. This indicate
hash includes SPI (Security Parameters Index) as part of IPSEC hash.
Extend xdp core enum xdp_rss_hash_type with IPSEC hash type.
Fixes: bc8d405b1b ("net/mlx5e: Support RX XDP metadata")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132892548.340624.11185734579430124869.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add new WQE type: FLOW_TBL_ACCESS, which will be used for
writing modify header arguments.
This type has specific control segment and special data segment.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add enum value for modify-header argument object and mlx5_bits
for the related capabilities.
Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Create a new profile for SFs in order to disable the command cache.
Each function command cache consumes ~500KB of memory, when using a
large number of SFs this savings is notable on memory constarined
systems.
Use a new profile to provide for future differences between SFs and PFs.
The mr_cache not used for non-PF functions, so it is excluded from the
new profile.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Bodong Wang <bodong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
These new fields in QUERY_Q_COUNTER command allow us to access
another vport counters during the query command, which is specially
useful to query representor vports.
In addition also add the required caps to check if this capability
is actually supported.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/75c73a4a0e60f18c37b35a4a11ca2e2415e4a6f3.1679566038.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
mlx5 dynamic msix
This patch series adds support for dynamic msix vectors allocation in mlx5.
Eli Cohen Says:
================
The following series of patches modifies mlx5_core to work with the
dynamic MSIX API. Currently, mlx5_core allocates all the interrupt
vectors it needs and distributes them amongst the consumers. With the
introduction of dynamic MSIX support, which allows for allocation of
interrupts more than once, we now allocate vectors as we need them.
This allows other drivers running on top of mlx5_core to allocate
interrupt vectors for their own use. An example for this is mlx5_vdpa,
which uses these vectors to propagate interrupts directly from the
hardware to the vCPU [1].
As a preparation for using this series, a use after free issue is fixed
in lib/cpu_rmap.c and the allocator for rmap entries has been modified.
A complementary API for irq_cpu_rmap_add() has also been introduced.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/patch/?id=0f2bf1fcae96a83b8c5581854713c9fc3407556e
================
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmQeTIUACgkQSD+KveBX
+j7oCQgAx9yNHM4BZD2UfIx/P+W13v1B+xOds04Vezl9JlakoqvviPxm3vvuKkl+
j/8DdyoqMUbWV0j5XxgZ+GG91bc14jN1GQ+4fUf63SzA99vAGb9GJPV2aQt5roGh
JmMqI2utDfoz+29qtQ+kVchY5AN5AoPXSQH2zkEZmJaPUjYb9Dr/4IayL0JaViAw
S31QLHKkSJ8bL8Wc6Op1emNVV7eXs18f7IIjVs3sYOb3WJRPVpmdKneRqLgVYplf
Td40Gwobl1elpjEqSSRTJI5YUSR8gcAJlBqIwHeJzFFpO3Pnciopl761osNKKs/a
5ctES5DS6JHqqFGbWV1gKYcRMil3LA==
=9i8l
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-03-20
mlx5 dynamic msix
This patch series adds support for dynamic msix vectors allocation in mlx5.
Eli Cohen Says:
================
The following series of patches modifies mlx5_core to work with the
dynamic MSIX API. Currently, mlx5_core allocates all the interrupt
vectors it needs and distributes them amongst the consumers. With the
introduction of dynamic MSIX support, which allows for allocation of
interrupts more than once, we now allocate vectors as we need them.
This allows other drivers running on top of mlx5_core to allocate
interrupt vectors for their own use. An example for this is mlx5_vdpa,
which uses these vectors to propagate interrupts directly from the
hardware to the vCPU [1].
As a preparation for using this series, a use after free issue is fixed
in lib/cpu_rmap.c and the allocator for rmap entries has been modified.
A complementary API for irq_cpu_rmap_add() has also been introduced.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/patch/?id=0f2bf1fcae96a83b8c5581854713c9fc3407556e
================
* tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Provide external API for allocating vectors
net/mlx5: Use one completion vector if eth is disabled
net/mlx5: Refactor calculation of required completion vectors
net/mlx5: Move devlink registration before mlx5_load
net/mlx5: Use dynamic msix vectors allocation
net/mlx5: Refactor completion irq request/release code
net/mlx5: Improve naming of pci function vectors
net/mlx5: Use newer affinity descriptor
net/mlx5: Modify struct mlx5_irq to use struct msi_map
net/mlx5: Fix wrong comment
net/mlx5e: Coding style fix, add empty line
lib: cpu_rmap: Add irq_cpu_rmap_remove to complement irq_cpu_rmap_add
lib: cpu_rmap: Use allocator for rmap entries
lib: cpu_rmap: Avoid use after free on rmap->obj array entries
====================
Link: https://lore.kernel.org/r/20230324231341.29808-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Provide external API to be used by other drivers relying on mlx5_core,
for allocating MSIX vectors. An example for such a driver would be
mlx5_vdpa.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Downstream patches require devlink params to access the PTYS register,
move the needed functions from mlx5e to the core layer.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230314054234.267365-11-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Implement thermal zone support for mlx5 based HW. The NIC
uses temperature sensor provided by ASIC to report current temperature
to thermal core.
Signed-off-by: Sandipan Patra <spatra@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230314054234.267365-5-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Small cycle this time:
- Minor driver updates for hfi1, cxgb4, erdma, hns, irdma, mlx5, siw, mana
- inline CQE support for hns
- Have mlx5 display device error codes
- Pinned DMABUF support for irdma
- Continued rxe cleanups, particularly converting the MRs to use xarray
- Improvements to what can be cached in the mlx5 mkey cache
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCY/gPmgAKCRCFwuHvBreF
YW5IAP4xOAiTif4f87vD1twRU/ebq4VEX0r+C2NX5x5fwlCJrAEA7RLV8uG9Uii2
ez0BuWNxfajuvFHntnZ1E+7UDP0S8gk=
=CgUH
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Quite a small cycle this time, even with the rc8. I suppose everyone
went to sleep over xmas.
- Minor driver updates for hfi1, cxgb4, erdma, hns, irdma, mlx5, siw,
mana
- inline CQE support for hns
- Have mlx5 display device error codes
- Pinned DMABUF support for irdma
- Continued rxe cleanups, particularly converting the MRs to use
xarray
- Improvements to what can be cached in the mlx5 mkey cache"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (61 commits)
IB/mlx5: Extend debug control for CC parameters
IB/hfi1: Fix sdma.h tx->num_descs off-by-one errors
IB/hfi1: Fix math bugs in hfi1_can_pin_pages()
RDMA/irdma: Add support for dmabuf pin memory regions
RDMA/mlx5: Use query_special_contexts for mkeys
net/mlx5e: Use query_special_contexts for mkeys
net/mlx5: Change define name for 0x100 lkey value
net/mlx5: Expose bits for querying special mkeys
RDMA/rxe: Fix missing memory barriers in rxe_queue.h
RDMA/mana_ib: Fix a bug when the PF indicates more entries for registering memory on first packet
RDMA/rxe: Remove rxe_alloc()
RDMA/cma: Distinguish between sockaddr_in and sockaddr_in6 by size
Subject: RDMA/rxe: Handle zero length rdma
iw_cxgb4: Fix potential NULL dereference in c4iw_fill_res_cm_id_entry()
RDMA/mlx5: Use rdma_umem_for_each_dma_block()
RDMA/umem: Remove unused 'work' member from struct ib_umem
RDMA/irdma: Cap MSIX used to online CPUs + 1
RDMA/mlx5: Check reg_create() create for errors
RDMA/restrack: Correct spelling
RDMA/cxgb4: Fix potential null-ptr-deref in pass_establish()
...
This patch adds rtt_resp_dscp to the current debug controllability of
congestion control (CC) parameters.
rtt_resp_dscp can be read or written through debugfs.
If set, its value overwrites the DSCP of the generated RTT response.
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://lore.kernel.org/r/1dcc3440ee53c688f19f579a051ded81a2aaa70a.1676538714.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Synchronize the shared mlx5 branch with net:
- From Jiri: fixe a deadlock in mlx5_ib's netdev notifier unregister.
- From Mark and Patrisious: add IPsec RoCEv2 support.
- From Or: Rely on firmware to get special mkeys
* branch mlx5-next:
RDMA/mlx5: Use query_special_contexts for mkeys
net/mlx5e: Use query_special_contexts for mkeys
net/mlx5: Change define name for 0x100 lkey value
net/mlx5: Expose bits for querying special mkeys
net/mlx5: Configure IPsec steering for egress RoCEv2 traffic
net/mlx5: Configure IPsec steering for ingress RoCEv2 traffic
net/mlx5: Add IPSec priorities in RDMA namespaces
net/mlx5: Implement new destination type TABLE_TYPE
net/mlx5: Introduce new destination type TABLE_TYPE
RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister
net/mlx5e: Propagate an internal event in case uplink netdev changes
net/mlx5e: Fix trap event handling
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Change define of 0x100 lkey value from MLX5_INVALID_LKEY to be
MLX5_TERMINATE_SCATTER_LIST_LKEY as 0x100 is the value of
terminate_scatter_list_mkey.
Link: https://lore.kernel.org/r/3a116dc3fbae4cb6b76a63d27d418830b06ade0c.1673960981.git.leon@kernel.org
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add IPSec flow steering priorities in RDMA namespaces. This allows
adding tables/rules to forward RoCEv2 traffic to the IPSec crypto
tables in NIC_TX domain, and accept RoCEv2 traffic from NIC_RX domain.
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Implement new destination type to support flow transition between
different table types.
e.g. from NIC_RX to RDMA_RX or from RDMA_TX to NIC_TX.
The new destination is described in the tracepoint as follows:
"mlx5_fs_add_rule: rule=00000000d53cd0ed fte=0000000048a8a6ed index=0 sw_action=<> [dst] flow_table_type=7 id:262152"
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This new destination type supports flow transition between different
table types, e.g. from NIC_RX to RDMA_RX or from RDMA_TX to NIC_TX.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
No need to have dl_port which is tightly coupled with mlx5e code
in mlx5 core code. Move it to struct mlx5e_dev and loose
mlx5e_devlink_get_dl_port() helper.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In MultiPort E-Switch mode a single RDMA is created. This device has multiple
RDMA ports that represent the uplink ports that are connected to the E-Switch.
Account for this when creating the RDMA device so it has an additional port for
the non native uplink.
As a side effect of this patch, use shared fdb in multiport eswitch mode.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In a follow-up commit multiport eswitch mode will use a shared fdb.
In shared fdb there is a single eswitch fdb and traffic could come from any
port. to distinguish between the ports set a different metadata per uplink port.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Extend the action stats callback implementation to update stats for actions
that are associated with hw counters.
Note that the callback may be called from tc action utility or from tc
flower. Both apis expect the driver to return the stats difference from
the last update. As such, query the raw counter value and maintain
the diff from the last api call in the tc layer, instead of the fs_core
layer.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This series from Jiri solves a deadlock when removing a network namespace
with mlx5 devlink instance being in it.
The deadlock is between:
1) mlx5_ib->unregister_netdevice_notifier()
AND
2) mlx5_core->devlink_reload->cleanup_net()
To slove this introduced mlx5 netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmPkeZAACgkQSD+KveBX
+j6Digf/fTtMmV2I2GwKQJCza4+MAP8Nt9tKInj3x02AoNVXNwHupL72HWZiaKnB
YGvPAwjDvxPy2Ok1BsHJLyEOTZpZse8QtS/Sjzk00lovtOYzCwLdJfBrNnVRS5KV
Cz/dNtlQcpsAoErFSfmvraLhn7tMNrHMTDahzaNalDkO3wZYXUh+2VDwnXErQy+3
1HI9m2pGy8hQ3sNQTNhqcyY4mp1Qw3nTVIkE8c9E5TJcawVkk4xqlgQuT43nqcn5
H+CTXJTFyUMNkF8kNPTMvMoYfTYWhBqbZKuf+YDyQKwdf5IZyc1kuRIaqJNs5VjU
mUtwKHMk5apKLbE8rvmZlg/+geTlJA==
=Aqcc
-----END PGP SIGNATURE-----
Merge tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Saeed Mahameed says:
====================
mlx5-next-netdev-deadlock
This series from Jiri solves a deadlock when removing a network namespace
with mlx5 devlink instance being in it.
The deadlock is between:
1) mlx5_ib->unregister_netdevice_notifier()
AND
2) mlx5_core->devlink_reload->cleanup_net()
To slove this introduced mlx5 netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.
* tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister
net/mlx5e: Propagate an internal event in case uplink netdev changes
net/mlx5e: Fix trap event handling
net/mlx5: Introduce CQE error syndrome
====================
Link: https://lore.kernel.org/r/20230208005626.72930-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Whenever uplink netdev is set/cleared, propagate newly introduced event
to inform notifier blocks netdev was added/removed.
Move the set() helper to core.c from header, introduce clear() and
netdev_added_event_replay() helpers. The last one is going to be called
from rdma driver, so export it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, each core device has VF pages counter which stores number of
fw pages used by its VFs and SFs.
The current design led to a hang when performing firmware reset on DPU,
where the DPU PFs stalled in sriov unload flow due to waiting on release
of SFs pages instead of waiting on only VFs pages.
Thus, Add a separate counter for SF firmware pages, which will prevent
the stall scenario described above.
Fixes: 1958fc2f07 ("net/mlx5: SF, Add auxiliary device driver")
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, an independent page counter is used for tracking memory usage
for each function type such as VF, PF and host PF (DPU).
For better code-readibilty, use a single array that stores
the number of allocated memory pages for each function type.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In case a new string DB is added to the FW, the FW publishes an event
notifying the strings DB have updated.
Add support in driver for handling this event.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Commit 90e7cb78b8 ("net/mlx5: fix missing mutex_unlock in
mlx5_fw_fatal_reporter_err_work()") introduced another checking of
MLX5_DROP_HEALTH_NEW_WORK. At this point, the first check of
MLX5_DROP_HEALTH_NEW_WORK is redundant and so is the lock that
protects it.
Remove the lock and rename MLX5_DROP_HEALTH_NEW_WORK to reflect these
changes.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When device is capable of handling scaled ppm values for adjusting
frequency, conversion to ppb will not be done by the driver. Instead, the
scaled ppm value will be passed directly to the device for the frequency
adjustment operation.
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add CAP for crypto offload, do the simple initialization if hardware
supports it. Currently set log_dek_obj_range to 12, so 4k DEKs will be
created in one bulk allocation.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Change the naming of key type in DEK fields and macros, to be
consistent with the device spec.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add and extend structure layouts and defines for fast crypto key
update. This is a prerequisite to support bulk creation, key
modification and destruction, software wrapped DEK, and SYNC_CRYPTO
command.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Before this patch, the log_obj_range was defined inside
general_obj_in_cmd_hdr to support bulk allocation. However, we need to
modify/query one of the object in the bulk in later patch, so change
those fields to param bits for parameters specific for cmd header, and
add general_obj_create_param according to what was updated in spec.
We will also add general_obj_query_param for modify/query later.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Implicit ODP mkey doesn't have unique properties. It shares the same
properties as the order 18 cache entry. There is no need to devote a
special entry for that.
Link: https://lore.kernel.org/r/20230125222807.6921-3-michaelgur@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
1) From Rahul,
1.1) extended range for PTP adjtime and adjphase
1.2) adjphase function to support hardware-only offset control
2) From Roi, code cleanup to the TC module.
3) From Maor, TC support for Geneve and GRE with VF tunnel offload
4) Cleanups and minor updates.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmPIO6EACgkQSD+KveBX
+j4eKwf/YTa7bbbyWPEb7JE2mq0PNSFynQWnYAy+MCPdyxwl1ho2gVlhJ94qPkQ+
IvumOi9W+Yjy1Xie+k6s7W4UD7iAursASq27ipt0F0t5Gmb0G8yfMePPkoTTjz2s
bjwEqDRKYhVB7QynMUir2y71Vo0/7jd4bm9nPlh+MSiD7cYcC5gW3ExC6YoRw+5L
5ry5FpkJJsv+zfX6Z/p3vTQso5LRx+cvaVc3DDbYV4w8D0gf2YnbMBmJa1kSn6So
aJhcejPdD4hHXEiUT7QzPRjhBxXbEIMKE6dVzdHBMZ8Le3lz2Hrl5LvNx7IDdLJk
q3JSbT+V5xRRxfdIDaCdhUZ8VnMK+w==
=Spe7
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2023-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-01-18
1) From Rahul,
1.1) extended range for PTP adjtime and adjphase
1.2) adjphase function to support hardware-only offset control
2) From Roi, code cleanup to the TC module.
3) From Maor, TC support for Geneve and GRE with VF tunnel offload
4) Cleanups and minor updates.
* tag 'mlx5-updates-2023-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: Use read lock for eswitch get callbacks
net/mlx5e: Remove redundant allocation of spec in create indirect fwd group
net/mlx5e: Support Geneve and GRE with VF tunnel offload
net/mlx5: E-Switch, Fix typo for egress
net/mlx5e: Warn when destroying mod hdr hash table that is not empty
net/mlx5e: TC, Use common function allocating flow mod hdr or encap mod hdr
net/mlx5e: TC, Add tc prefix to attach/detach hdr functions
net/mlx5e: TC, Pass flow attr to attach/detach mod hdr functions
net/mlx5e: Add warning when log WQE size is smaller than log stride size
net/mlx5e: Fail with messages when params are not valid for XSK
net/mlx5: E-switch, Remove redundant comment about meta rules
net/mlx5: Add hardware extended range support for PTP adjtime and adjphase
net/mlx5: Add adjphase function to support hardware-only offset control
net/mlx5: Suppress error logging on UCTX creation
net/mlx5e: Suppress Send WQEBB room warning for PAGE_SIZE >= 16KB
====================
Link: https://lore.kernel.org/r/20230118183602.124323-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The MLX5E_LOCKED_FLOW flag is not checked anywhere now so remove it
entirely.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Capable hardware can use an extended range for offsetting the clock. An
extended range of [-200000,200000] is used instead of [-32768,32767] for
the delta/phase parameter of the adjtime/adjphase ptp_clock_info callbacks.
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Introduces CQE error syndrome bits which are inside qp_context_extension
and are used to report the reason the QP was moved to error state.
Useful for cases in which a CQE isn't generated, such as remote write
rkey violation.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://lore.kernel.org/r/f8359315f8130f6d2abe4b94409ac7802f54bce3.1672821186.git.leonro@nvidia.com
Reviewed-by: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Certain connection-based device-offload protocols (like TLS) use
per-connection HW objects to track the state, maintain the context, and
perform the offload properly. Some of these objects are created,
modified, and destroyed via FW commands. Under high connection rate,
this type of FW commands might continuously populate all slots of the FW
command interface and throttle it, while starving other critical control
FW commands.
Limit these throttle commands to using only up to a portion (half) of
the FW command interface slots. FW commands maximal rate is not hit, and
the same high rate is still reached when applying this limitation.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Enable initialization of DPU Management PF, which is a new loopback PF
designed for communication with BMC.
For now Management PF doesn't support nor require most upper layer
protocols so avoid them.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Fix SRIOV VST mode behavior to insert cvlan when a guest tag is already
present in the frame. Previous VST mode behavior was to drop packets or
override existing tag, depending on the device version.
In this patch we fix this behavior by correctly building the HW steering
rule with a push vlan action, or for older devices we ask the FW to stack
the vlan when a vlan is already present.
Fixes: 07bab95026 ("net/mlx5: E-Switch, Refactor eswitch ingress acl codes")
Fixes: dfcb1ed3c3 ("net/mlx5: E-Switch, Vport ingress/egress ACLs rules for VST mode")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
- Replace deprecated git://github.com link in MAINTAINERS. (Palmer Dabbelt)
- Simplify vfio/mlx5 with module_pci_driver() helper. (Shang XiaoJing)
- Drop unnecessary buffer from ACPI call. (Rafael Mendonca)
- Correct latent missing include issue in iova-bitmap and fix support
for unaligned bitmaps. Follow-up with better fix through refactor.
(Joao Martins)
- Rework ccw mdev driver to split private data from parent structure,
better aligning with the mdev lifecycle and allowing us to remove
a temporary workaround. (Eric Farman)
- Add an interface to get an estimated migration data size for a device,
allowing userspace to make informed decisions, ex. more accurately
predicting VM downtime. (Yishai Hadas)
- Fix minor typo in vfio/mlx5 array declaration. (Yishai Hadas)
- Simplify module and Kconfig through consolidating SPAPR/EEH code and
config options and folding virqfd module into main vfio module.
(Jason Gunthorpe)
- Fix error path from device_register() across all vfio mdev and sample
drivers. (Alex Williamson)
- Define migration pre-copy interface and implement for vfio/mlx5
devices, allowing portions of the device state to be saved while the
device continues operation, towards reducing the stop-copy state
size. (Jason Gunthorpe, Yishai Hadas, Shay Drory)
- Implement pre-copy for hisi_acc devices. (Shameer Kolothum)
- Fixes to mdpy mdev driver remove path and error path on probe.
(Shang XiaoJing)
- vfio/mlx5 fixes for incorrect return after copy_to_user() fault and
incorrect buffer freeing. (Dan Carpenter)
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmObfPgbHGFsZXgud2ls
bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiDogP/i9GuBKposvZpnfxXWwo
oNpKBZSOVMW8wgavNEuryMb+9WoouIghce8XU49MmONoP26kIh5TA14Zpi3XWkLK
K+NlpwicESvLeZVHU7f3R8meVqmPtlxIi59jE+CfEHB8BW2HIAsEdwdhkxMwus9C
nuiiK/2YYyQWOXYc4LAIkspMzjtGPy6Im5P6AED+dI+TFCEqJAM5qgOLJZFlk4a/
WwZY2xjVKOl6xf5VZXGw+v7fDgz2Ju+j4Bm3X5lx1HgiDrEH83MjXY5h67neAIVb
bXrfNLN++MiuO5niGTFMbUjGVUIFxsfmJzBnL9QrLsuj0JrGEKsu/1JEO78g0Km0
ZCChoJ6UyUOgxt6evEymUAZAAkbcKaaht2gdbAXW71tv9p1TripAbBKwVeah1bQp
SiHPqy9InKJlhaf+GbXL9eux1WVMfQ6FZccU16bNt7VaV2I8js85z/2gqVD0a5Mw
+gnwp5XMUFWNKlJrnc7uVCD0bDExwQhr75OP4rWjMNvvLi9hPXJ2cI2Sg+9OLzQw
vm/I+Df+FfXCuGAgX4Lxq76pqWlYGJH0Qxc14Ds6YoXqygBPz9yvTtuBv8mTHJzE
KdAl/6DmZZxZ/JFD9lPF80KRiAsJ6iNf6tPTWES7hfDBfIdgQ/DZbXridLWJPNoi
xLfaW19yrLTXWKSmR7G2Lsz4
=q9xs
-----END PGP SIGNATURE-----
Merge tag 'vfio-v6.2-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Replace deprecated git://github.com link in MAINTAINERS (Palmer
Dabbelt)
- Simplify vfio/mlx5 with module_pci_driver() helper (Shang XiaoJing)
- Drop unnecessary buffer from ACPI call (Rafael Mendonca)
- Correct latent missing include issue in iova-bitmap and fix support
for unaligned bitmaps. Follow-up with better fix through refactor
(Joao Martins)
- Rework ccw mdev driver to split private data from parent structure,
better aligning with the mdev lifecycle and allowing us to remove a
temporary workaround (Eric Farman)
- Add an interface to get an estimated migration data size for a
device, allowing userspace to make informed decisions, ex. more
accurately predicting VM downtime (Yishai Hadas)
- Fix minor typo in vfio/mlx5 array declaration (Yishai Hadas)
- Simplify module and Kconfig through consolidating SPAPR/EEH code and
config options and folding virqfd module into main vfio module (Jason
Gunthorpe)
- Fix error path from device_register() across all vfio mdev and sample
drivers (Alex Williamson)
- Define migration pre-copy interface and implement for vfio/mlx5
devices, allowing portions of the device state to be saved while the
device continues operation, towards reducing the stop-copy state size
(Jason Gunthorpe, Yishai Hadas, Shay Drory)
- Implement pre-copy for hisi_acc devices (Shameer Kolothum)
- Fixes to mdpy mdev driver remove path and error path on probe (Shang
XiaoJing)
- vfio/mlx5 fixes for incorrect return after copy_to_user() fault and
incorrect buffer freeing (Dan Carpenter)
* tag 'vfio-v6.2-rc1' of https://github.com/awilliam/linux-vfio: (42 commits)
vfio/mlx5: error pointer dereference in error handling
vfio/mlx5: fix error code in mlx5vf_precopy_ioctl()
samples: vfio-mdev: Fix missing pci_disable_device() in mdpy_fb_probe()
hisi_acc_vfio_pci: Enable PRE_COPY flag
hisi_acc_vfio_pci: Move the dev compatibility tests for early check
hisi_acc_vfio_pci: Introduce support for PRE_COPY state transitions
hisi_acc_vfio_pci: Add support for precopy IOCTL
vfio/mlx5: Enable MIGRATION_PRE_COPY flag
vfio/mlx5: Fallback to STOP_COPY upon specific PRE_COPY error
vfio/mlx5: Introduce multiple loads
vfio/mlx5: Consider temporary end of stream as part of PRE_COPY
vfio/mlx5: Introduce vfio precopy ioctl implementation
vfio/mlx5: Introduce SW headers for migration states
vfio/mlx5: Introduce device transitions of PRE_COPY
vfio/mlx5: Refactor to use queue based data chunks
vfio/mlx5: Refactor migration file state
vfio/mlx5: Refactor MKEY usage
vfio/mlx5: Refactor PD usage
vfio/mlx5: Enforce a single SAVE command at a time
vfio: Extend the device migration protocol with PRE_COPY
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmOS/ooACgkQrB3Eaf9P
W7cOVA/+L8rHwLe78DDz/PNESyShtTVCYBDF/ngYMV8AIvjSfPresMbFV3NKqO5E
3qbMl199QH2eWI7dhQaQ+edynSG0QCx5FmPai0UuHPLxATct1pNPJPpvBryO/4jC
ZouYBIVjdMbq6Y8vD2gJ8UtA7TZpncP0HYOKTvYyDL9kQ+nUmu9KUYxcEcNHL5w+
TjL9jJafR+GqczCRiwAoMKIFV7lUrTFzh7slfINNN5DVTuzN33H7Tp70z6IKOfVL
1LATlZv7mqpLVF6dQuMXOt6kd/BEBl1y4ZHTHow5nstJvwu99P96iKwEfIXuOvWK
fulhDU61eIik8D9QJWeM7TuZDbYewWI77plwVY/R/zRt0At4VLpq7I1m33CmLLMY
Fb5fMxJPkM8YAtDID+BknYPrSAcxo8ji04BWFrVqQ6InPmtGfnP83XSSkYfxY7FB
3hUfz4igsJpV5vrS1EFRhjklNwI+jY2yAvIggQtdkJ97ubSUY3E4ACfNqlJ5lJbv
2KqWnSKlG21F9ZTR68VzcQVhFIQF6j/EuQqro+4TQUIdZswcml2iK32zrel0rs9C
iAsgQQaMV9a2vEaScRZqdOJ4HENTbm9wD7Mso/i5vr+lnpr1ThKjQo8osU8YUlbC
SDTMeWRRos+esFML6SP+YZ7SM/qXMluou204x/llJ/VDMXQ5e8k=
=enQp
-----END PGP SIGNATURE-----
Merge tag 'ipsec-next-2022-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
ipsec-next 2022-12-09
1) Add xfrm packet offload core API.
From Leon Romanovsky.
2) Add xfrm packet offload support for mlx5.
From Leon Romanovsky and Raed Salem.
3) Fix a typto in a error message.
From Colin Ian King.
* tag 'ipsec-next-2022-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: (38 commits)
xfrm: Fix spelling mistake "oflload" -> "offload"
net/mlx5e: Open mlx5 driver to accept IPsec packet offload
net/mlx5e: Handle ESN update events
net/mlx5e: Handle hardware IPsec limits events
net/mlx5e: Update IPsec soft and hard limits
net/mlx5e: Store all XFRM SAs in Xarray
net/mlx5e: Provide intermediate pointer to access IPsec struct
net/mlx5e: Skip IPsec encryption for TX path without matching policy
net/mlx5e: Add statistics for Rx/Tx IPsec offloaded flows
net/mlx5e: Improve IPsec flow steering autogroup
net/mlx5e: Configure IPsec packet offload flow steering
net/mlx5e: Use same coding pattern for Rx and Tx flows
net/mlx5e: Add XFRM policy offload logic
net/mlx5e: Create IPsec policy offload tables
net/mlx5e: Generalize creation of default IPsec miss group and rule
net/mlx5e: Group IPsec miss handles into separate struct
net/mlx5e: Make clear what IPsec rx_err does
net/mlx5e: Flatten the IPsec RX add rule path
net/mlx5e: Refactor FTE setup code to be more clear
net/mlx5e: Move IPsec flow table creation to separate function
...
====================
Link: https://lore.kernel.org/r/20221209093310.4018731-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Range is a new flow destination type which allows matching on
a range of values instead of matching on a specific value.
Range flow destination has the following fields:
- hit_ft: flow table to forward the traffic in case of hit
- miss_ft: flow table to forward the traffic in case of miss
- field: which packet characteristic to match on
- min: minimal value for the selected field
- max: maximal value for the selected field
Note:
- In order to match, the value in the packet should meet
the following criteria: min <= value < max
- Currently, the only supported field type is L2 packet length
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Update full structure of match definer and add an ID of
the SELECT match definer type.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Downstream patch requires to get other function GENERAL2 caps while
mlx5_vport_get_other_func_cap() gets only one type of caps (general).
Rename it to represent this and introduce a generic implementation
of mlx5_vport_get_other_func_cap().
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce IFC related capabilities to enable setting VF to be able to
perform live migration. e.g.: to be migratable.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce ifc related stuff to enable PRE_COPY of VF during migration.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-2-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
MLX5_UMR_KLM_ALIGNMENT is in units of number of entries, while
MLX5_UMR_MTT_ALIGNMENT (generalized and renamed to
MLX5_UMR_FLEX_ALIGNMENT) is in byte units. This is misleading and
confusing.
Replace this KLM definition with one based on the generic definition.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Per the device spec, MLX5_UMR_MTT_ALIGNMENT is good not only for UMR MTT
entries, but for all other entries as well, like KLMs and KSMs.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Defines MLX5_UMR_MTT_MASK and MLX5_UMR_MTT_MIN_CHUNK_SIZE are not in
use. Remove them.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Per the device spec, MTTs/KLMs list in a UMR WQE must be aligned to 64B.
Per our SW design, the MTT/KLMs list would need alignment only if it's
too small, for example on PPC when PAGE_SIZE is 64KB, and only 4 pages
are needed to cover a MPWQE of size 256KB.
Padding, if needed, is taken into account when calculating the UMR WQE
fields (ds_cnt and xlt_octowords), however no entries are provided,
instead garbage is passed.
No real harm though, as these parts act as gaps between the RX MPWQEs
and not used by any of them. Hence, in practice, device does not try to
write any incoming packet to them. Still, prefer providing clean padding
marking the end of the list, and do not map garbage into the RQ memory
region.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Remove mlx5_priv.ctx_list and ctx_lock which are no longer used after
commit 601c10c89c ("net/mlx5: Delete custom device management logic").
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
While moving to new CMD API (quiet API), some pre-existing flows may call the new API
function that in case of error, returns the error instead of printing it as previously done.
For such flows we bring back the print but to tracepoint this time for sys admins to
have the ability to check for errors especially for commands using the new quiet API.
Tracepoint output example:
devlink-1333 [001] ..... 822.746922: mlx5_cmd: ACCESS_REG(0x805) op_mod(0x0) failed, status bad resource(0x5), syndrome (0xb06e1f), err(-22)
Fixes: f23519e542 ("net/mlx5: cmdif, Add new api for command execution")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
CQE compression feature improves performance by reducing PCI bandwidth
bottleneck on CQEs write.
Enhanced CQE compression introduced in ConnectX-6 and it aims to reduce
CPU utilization of SW side packets decompression by eliminating the
need to rewrite ownership bit, which is likely to cost a cache-miss, is
replaced by validity byte handled solely by HW.
Another advantage of the enhanced feature is that session packets are
available to SW as soon as a single CQE slot is filled, instead of
waiting for session to close, this improves packet latency from NIC to
host.
Performance:
Following are tested scenarios and reults comparing basic and enahnced
CQE compression.
setup: IXIA 100GbE connected directly to port 0 and port 1 of
ConnectX-6 Dx 100GbE dual port.
Case #1 RX only, single flow goes to single queue:
IRQ rate reduced by ~ 30%, CPU utilization improved by 2%.
Case #2 IP forwarding from port 1 to port 0 single flow goes to
single queue:
Avg latency improved from 60us to 21us, frame loss improved from 0.5% to 0.0%.
Case #3 IP forwarding from port 1 to port 0 Max Throughput IXIA sends
100%, 8192 UDP flows, goes to 24 queues:
Enhanced is equal or slightly better than basic.
Testing the basic compression feature with this patch shows there is
no perfrormance degradation of the basic compression feature.
Signed-off-by: Ofer Levi <oferle@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
- Small bug fixes in mlx5, efa, rxe, hns, irdma, erdma, siw
- rts tracing improvements
- Code improvements: strlscpy conversion, unused parameter, spelling
mistakes, unused variables, flex arrays
- restrack device details report for hns
- Simplify struct device initialization in SRP
- Eliminate the never-used service_mask support in IB CM
- Make rxe not print to the console for some kinds of network packets
- Asymetric paths and router support in the CM through netlink messages
- DMABUF importer support for mlx5devx umem's
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCYz9bgAAKCRCFwuHvBreF
YevoAP47J/svlOFlFtBhTVF79Ddtf+MMeqeVvLoHHQbCU5rUpAD+KUpTXAvwNcM9
dHwNXz9ctanP5397qusH0rxOKPo/EA4=
=lgSv
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Not a big list of changes this cycle, mostly small things. The new
MANA rdma driver should come next cycle along with a bunch of work on
rxe.
Summary:
- Small bug fixes in mlx5, efa, rxe, hns, irdma, erdma, siw
- rts tracing improvements
- Code improvements: strlscpy conversion, unused parameter, spelling
mistakes, unused variables, flex arrays
- restrack device details report for hns
- Simplify struct device initialization in SRP
- Eliminate the never-used service_mask support in IB CM
- Make rxe not print to the console for some kinds of network packets
- Asymetric paths and router support in the CM through netlink
messages
- DMABUF importer support for mlx5devx umem's"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (84 commits)
RDMA/rxe: Remove error/warning messages from packet receiver path
RDMA/usnic: fix set-but-not-unused variable 'flags' warning
IB/hfi1: Use skb_put_data() instead of skb_put/memcpy pair
RDMA/hns: Unified Log Printing Style
RDMA/hns: Replacing magic number with macros in apply_func_caps()
RDMA/hns: Repacing 'dseg_len' by macros in fill_ext_sge_inl_data()
RDMA/hns: Remove redundant 'max_srq_desc_sz' in caps
RDMA/hns: Remove redundant 'num_mtt_segs' and 'max_extend_sg'
RDMA/hns: Remove redundant 'phy_addr' in hns_roce_hem_list_find_mtt()
RDMA/hns: Remove redundant 'use_lowmem' argument from hns_roce_init_hem_table()
RDMA/hns: Remove redundant 'bt_level' for hem_list_alloc_item()
RDMA/hns: Remove redundant 'attr_mask' in modify_qp_init_to_init()
RDMA/hns: Remove unnecessary brackets when getting point
RDMA/hns: Remove unnecessary braces for single statement blocks
RDMA/hns: Cleanup for a spelling error of Asynchronous
IB/rdmavt: Add __init/__exit annotations to module init/exit funcs
RDMA/rxe: Remove redundant num_sge fields
RDMA/mlx5: Enable ATS support for MRs and umems
RDMA/mlx5: Add support for dmabuf to devx umem
RDMA/core: Add UVERBS_ATTR_RAW_FD
...
Start health poll at earlier stage, so if fw fatal issue occurred before
or during initialization commands such as init_hca or set_hca_cap the
poll health can detect and indicate that the driver is already in error
state.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the rx_oversize_pkts_buffer counter to ethtool statistics.
This counter exposes the number of dropped received packets due to
length which arrived to RQ and exceed software buffer size allocated by
the device for incoming traffic. It might imply that the device MTU is
larger than the software buffers size.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of passing the unaligned flag, pass an enum that indicates the
UMR mode. The next commit will add the third mode (KLM for certain
configurations of XSK), which will be added to this enum instead of
adding another bool flag everywhere.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UMR MTTs used in striding RQ have certain alignment requirements. While
it's guaranteed to work when UMR pages are aligned to the UMR page size,
in practice it works then UMR pages are aligned to 8 bytes. However,
it's still not enough flexibility for the unaligned mode of XSK. This
patch leverages KSM to map UMR pages without alignment requirements,
when unaligned XSK is active. The downside is that KSM entries are twice
as big as MTTs, which limits the maximum WQE size, so regular RQs and
aligned XSK continue using MTTs.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit allows striding RQ to determine MTT page size at runtime,
instead of sticking to the compile-time PAGE_SIZE. This functionality
will be used by a following commit that adjusts the MTT page size to the
XSK frame size.
Stick with PAGE_SIZE for XSK on legacy RQ, as frag_stride is not used in
data path, it only helps calculate how pages are partitioned into
fragments, and PAGE_SIZE will ensure each fragment starts at the
beginning of a new allocation unit (XSK frame).
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the capability that will allow the driver to determine the minimal
MTT page size to be able to map the smallest possible pages in XSK. The
older firmwares that don't have this capability default to 12 (i.e.
4096-byte pages).
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed says:
====================
updates from mlx5-next 2022-09-24
Updates form mlx5-next including[1]:
1) HW definitions and support for NPPS clock settings.
2) various cleanups
3) Enable hash mode by default for all NICs
4) page tracker and advanced virtualization HW definitions for vfio
[1] https://lore.kernel.org/netdev/20220907233636.388475-1-saeed@kernel.org/
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Remove from FPGA IFC file not-needed definitions
net/mlx5: Remove unused structs
net/mlx5: Remove unused functions
net/mlx5: detect and enable bypass port select flow table
net/mlx5: Lag, enable hash mode by default for all NICs
net/mlx5: Lag, set active ports if support bypass port select flow table
RDMA/mlx5: Don't set tx affinity when lag is in hash mode
net/mlx5: add IFC bits for bypassing port select flow table
net/mlx5: Add support for NPPS with real time mode
net/mlx5: Expose NPPS related registers
net/mlx5: Query ADV_VIRTUALIZATION capabilities
net/mlx5: Introduce ifc bits for page tracker
RDMA/mlx5: Move function mlx5_core_query_ib_ppcnt() to mlx5_ib
====================
Link: https://lore.kernel.org/all/20220927201906.234015-1-saeed@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move IP layout bits definitions to be close to the place that actually
uses it, together with removal extra defines that not in-use.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Remove structs which are no longer used in the driver:
mlx5dr_cmd_qp_create_attr
mlx5_fs_dr_ns
mlx5_pas
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Remove functions which are no longer used in the driver:
mlx5e_ipsec_is_tx_flow
mlx5_health_flush
get_cqe_enhanced_num_mini_cqes
get_cqe_l3_hdr_type
mlx5_health_flush
mlx5_fs_is_ipsec_flow
_mlx5_fs_is_outer_ipproto_flow
mlx5_fs_is_outer_tcp_flow
mlx5_fs_is_outer_udp_flow
mlx5_fs_is_vxlan_flow
mlx5_fs_is_outer_ipsec_flow
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
In hash mode, without setting tx affinity explicitly, the port select
flow table decides which port is used for the traffic.
If port_select_flow_table_bypass capability is supported and tx affinity
is set explicitly for QP/TIS, they will be added into the explicit affinity
table in FW to check which port is used for the traffic.
1. The overloaded explicit affinity table may affect performance.
To avoid this, do not set tx affinity explicitly by default.
2. The packets of the same flow need to be transmitted on the same port.
Because the packets of the same flow use different QPs in slow & fast
path, it shouldn't set tx affinity explicitly for these QPs.
Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
port_select_flow_table_bypass - When set, device supports
bypass port select flow table.
active_port - Bitmask indicates the current active ports
in PORT_SELECT_FT LAG.
MLX5_SET_HCA_CAP_OP_MODE_PORT_SELECTION - op_mod to operate
PORT_SELECTION_Capabilities.
Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add support for setting NPPS. NPPS is currently available in
REAL_TIME_CLOCK mode only. In addition allow the user to set the pulse
duration.
When NPPS pulse duration is not set explicitly by the user, driver set
it to 50% of the NPPS period.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add management capability bits indicating firmware may support N pulses
per second. Add corresponding fields in MTPPS register.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
MACsec EPN splits the packet number (PN) into two 32-bits fields,
epn_lsb (32 least significant bits (LSBs) of PN) and epn_msb (32
most significant bits (MSBs) of PN).
Epn_msb bits are managed by SW and for that HW is required to send
an object change event of type EPN event notifying the SW to update
the epn_msb in addition, once epn_msb is updated SW update HW with
the new epn_msb value for HW to perform replay protection.
To prevent HW from stopping while handling the event, SW manages
another bit for HW called epn_overlap, HW uses the latter to get
an indication regarding how to read the epn_msb value correctly
while still receiving packets.
Add epn event handling that updates the epn_overlap and epn_msb for
every 2^31 packets according to the following logic:
if epn_lsb crosses 2^31 (half sequence number wraparound) upon HW
relevant event, SW updates the esn_overlap value to OLD (value = 1).
When the epn_lsb crosses 2^32 (full sequence number wraparound)
upon HW relevant event, SW updates the esn_overlap to NEW
(value = 0) and increment the esn_msb.
When using MACsec EPN a salt and short secure channel id (ssci)
needs to be provided by the user, when offloading EPN need to pass
this salt and ssci to the HW to be used in the initial vector (IV)
calculations.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add ifc bits related to advanced steering operations (ASO) and general
object modify for macsec to use as part of offloading EPN and replay
protection features.
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix ifc fields name to be consistent with the device spec document.
Fixes: 8385c51ff5 ("net/mlx5: Introduce MACsec Connect-X offload hardware bits and structures")
Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Many bug fixes in several drivers:
- Fix misuse of the DMA API in rtrs
- Several irdma issues: hung task due to SQ flushing, incorrect capability
reporting to userspace, improper error handling for MW corners, touching
an uninitialized SGL for during invalidation.
- hns was using the wrong page size limits for the HW, an incorrect
calculation of wqe_shift causing WQE corruption, and mis computed
a timer id.
- Fix a crash in SRP triggered by blktests
- Fix compiler errors by calling virt_to_page() with the proper type in
siw
- Userspace triggerable deadlock in ODP
- mlx5 could use the wrong profile due to some driver loading races,
counters were not working in some device configurations, and a crash on
error unwind.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCYxtj4QAKCRCFwuHvBreF
YQNdAQDOAoXv3PCZikmyu4zmjzVdeUUXEig5RU3MgFdCimo99gEA8t+2/pHmnSTB
vn7cxuKMpJydAmLVFJPZxaOEuaBdegQ=
=/eYF
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Many bug fixes in several drivers:
- Fix misuse of the DMA API in rtrs
- Several irdma issues: hung task due to SQ flushing, incorrect
capability reporting to userspace, improper error handling for MW
corners, touching an uninitialized SGL for during invalidation.
- hns was using the wrong page size limits for the HW, an incorrect
calculation of wqe_shift causing WQE corruption, and mis computed a
timer id.
- Fix a crash in SRP triggered by blktests
- Fix compiler errors by calling virt_to_page() with the proper type
in siw
- Userspace triggerable deadlock in ODP
- mlx5 could use the wrong profile due to some driver loading races,
counters were not working in some device configurations, and a
crash on error unwind"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/irdma: Report RNR NAK generation in device caps
RDMA/irdma: Use s/g array in post send only when its valid
RDMA/irdma: Return correct WC error for bind operation failure
RDMA/irdma: Return error on MR deregister CQP failure
RDMA/irdma: Report the correct max cqes from query device
MAINTAINERS: Update maintainers of HiSilicon RoCE
RDMA/mlx5: Fix UMR cleanup on error flow of driver init
RDMA/mlx5: Set local port to one when accessing counters
RDMA/mlx5: Rely on RoCE fw cap instead of devlink when setting profile
IB/core: Fix a nested dead lock as part of ODP flow
RDMA/siw: Pass a pointer to virt_to_page()
RDMA/srp: Set scmnd->result only when scmnd is not NULL
RDMA/hns: Remove the num_qpc_timer variable
RDMA/hns: Fix wrong fixed value of qp->rq.wqe_shift
RDMA/hns: Fix supported page size
RDMA/cma: Fix arguments order in net device validation
RDMA/irdma: Fix drain SQ hang with no completion
RDMA/rtrs-srv: Pass the correct number of entries for dma mapped SGL
RDMA/rtrs-clt: Use the right sg_cnt after ib_dma_map_sg
Add new namespace for MACsec RX flows.
Encrypted MACsec packets should be first decrypted and stripped
from MACsec header and then continues with the kernel's steering
pipeline.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tx flow steering consists of two flow tables (FTs).
The first FT (crypto table) has two fixed rules:
One default miss rule so non MACsec offloaded packets bypass the MACSec
tables, another rule to make sure that MACsec key exchange (MKE) traffic
passes unencrypted as expected (matched of ethertype).
On each new MACsec offload flow, a new MACsec rule is added.
This rule is matched on metadata_reg_a (which contains the id of the
flow) and invokes the MACsec offload action on match.
The second FT (check table) has two fixed rules:
One rule for verifying that the previous offload actions were
finished successfully and packet need to be transmitted.
Another default rule for dropping packets that were failed in the
offload actions.
The MACsec FTs should be created on demand when the first MACsec rule is
added and destroyed when the last MACsec rule is deleted.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changed EGRESS_KERNEL namespace to EGRESS_IPSEC and add new
namespace for MACsec TX.
This namespace should be the last namespace for transmitted packets.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add MACsec offload related IFC structs, layouts and enumerations.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support MACsec offload (and maybe some other crypto features
in the future), generalize flow action parameters / defines to be used by
crypto offlaods other than IPsec.
The following changes made:
ipsec_obj_id field at flow action context was changed to crypto_obj_id,
intreduced a new crypto_type field where IPsec is the default zero type
for backward compatibility.
Action ipsec_decrypt was changed to crypto_decrypt.
Action ipsec_encrypt was changed to crypto_encrypt.
IPsec offload code was updated accordingly for backward compatibility.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
esp_id is no longer in used
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Query ADV_VIRTUALIZATION capabilities which provide information for
advanced virtualization related features.
Current capabilities refer to the page tracker object which is used for
tracking the pages that are dirtied by the device.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220905105852.26398-3-yishaih@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Introduce ifc related stuff to enable using page tracker.
A page tracker is a dirty page tracking object used by the device to
report the tracking log.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220905105852.26398-2-yishaih@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When the RDMA auxiliary driver probes, it sets its profile based on
devlink driverinit value. The latter might not be in sync with FW yet
(In case devlink reload is not performed), thus causing a mismatch
between RDMA driver and FW. This results in the following FW syndrome
when the RDMA driver tries to adjust RoCE state, which fails the probe:
"0xC1F678 | modify_nic_vport_context: roce_en set on a vport that
doesn't support roce"
To prevent this, select the PF profile based on FW RoCE capability
instead of relying on devlink driverinit value.
To provide backward compatibility of the RoCE disable feature, on older
FW's where roce_rw is not set (FW RoCE capability is read-only), keep
the current behavior e.g., rely on devlink driverinit value.
Fixes: fbfa97b4d7 ("net/mlx5: Disable roce at HCA level")
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://lore.kernel.org/r/cb34ce9a1df4a24c135cb804db87f7d2418bd6cc.1661763459.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
A huge patchset supporting vq resize using the
new vq reset capability.
Features, fixes, cleanups all over the place.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmL2F9APHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRp00QIAKpxyu+zCtrdDuh68DsNn1Cu0y0PXG336ySy
MA1ck/bv94MZBIbI/Bnn3T1jDmUqTFHJiwaGz/aZ5gGAplZiejhH5Ds3SYjHckaa
MKeJ4FTXin9RESP+bXhv4BgZ+ju3KHHkf1jw3TAdVKQ7Nma1u4E6f8nprYEi0TI0
7gLUYenqzS7X1+v9O3rEvPr7tSbAKXYGYpV82sSjHIb9YPQx5luX1JJIZade8A25
mTt5hG1dP1ugUm1NEBPQHjSvdrvO3L5Ahy0My2Bkd77+tOlNF4cuMPt2NS/6+Pgd
n6oMt3GXqVvw5RxZyY8dpknH5kofZhjgFyZXH0l+aNItfHUs7t0=
=rIo2
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
- A huge patchset supporting vq resize using the new vq reset
capability
- Features, fixes, and cleanups all over the place
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (88 commits)
vdpa/mlx5: Fix possible uninitialized return value
vdpa_sim_blk: add support for discard and write-zeroes
vdpa_sim_blk: add support for VIRTIO_BLK_T_FLUSH
vdpa_sim_blk: make vdpasim_blk_check_range usable by other requests
vdpa_sim_blk: check if sector is 0 for commands other than read or write
vdpa_sim: Implement suspend vdpa op
vhost-vdpa: uAPI to suspend the device
vhost-vdpa: introduce SUSPEND backend feature bit
vdpa: Add suspend operation
virtio-blk: Avoid use-after-free on suspend/resume
virtio_vdpa: support the arg sizes of find_vqs()
vhost-vdpa: Call ida_simple_remove() when failed
vDPA: fix 'cast to restricted le16' warnings in vdpa.c
vDPA: !FEATURES_OK should not block querying device config space
vDPA/ifcvf: support userspace to query features and MQ of a management device
vDPA/ifcvf: get_config_size should return a value no greater than dev implementation
vhost scsi: Allow user to control num virtqueues
vhost-scsi: Fix max number of virtqueues
vdpa/mlx5: Support different address spaces for control and data
vdpa/mlx5: Implement susupend virtqueue callback
...
Implement the suspend callback allowing to suspend the virtqueues so
they stop processing descriptors. This is required to allow to query a
consistent state of the virtqueue while live migration is taking place.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Message-Id: <20220714113927.85729-2-elic@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This PR includes a new RDMA driver for Alibaba Cloud hardware
- Bug fixes and small features for irdma, hns, siw, qedr, hfi1, mlx5
- General spelling/grammer fixes
- rdma cm can follow changes in neighbours for control packets
- Significant amounts of rxe fixes and spec compliance changes
- Use the modern NAPI API
- Use the bitmap API instead of open coding
- Performance improvements for rtrs
- Add the ERDMA driver for Alibaba cloud
- Fix a use after free bug in SRP
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCYuwAuAAKCRCFwuHvBreF
YcRDAQC41YJNs7xve7r62/E6M+o/AXiwXa+m8rGRvcP3mdilNAEAhdom6HskenMZ
/sopeBWF78M9plLvNzWkwukaqIwrXgM=
=abuq
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This cycle we got a new RDMA driver "ERDMA" for the Alibaba cloud
environment. Otherwise the changes are dominated by rxe fixes.
There is another RDMA driver on the list that might get merged next
cycle, 'MANA' for the Azure cloud environment.
Summary:
- Bug fixes and small features for irdma, hns, siw, qedr, hfi1, mlx5
- General spelling/grammer fixes
- rdma cm can follow changes in neighbours for control packets
- Significant amounts of rxe fixes and spec compliance changes
- Use the modern NAPI API
- Use the bitmap API instead of open coding
- Performance improvements for rtrs
- Add the ERDMA driver for Alibaba cloud
- Fix a use after free bug in SRP"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (99 commits)
RDMA/ib_srpt: Unify checking rdma_cm_id condition in srpt_cm_req_recv()
RDMA/rxe: Fix error unwind in rxe_create_qp()
RDMA/mlx5: Add missing check for return value in get namespace flow
RDMA/rxe: Split qp state for requester and completer
RDMA/rxe: Generate error completion for error requester QP state
RDMA/rxe: Update wqe_index for each wqe error completion
RDMA/srpt: Fix a use-after-free
RDMA/srpt: Introduce a reference count in struct srpt_device
RDMA/srpt: Duplicate port name members
IB/qib: Fix repeated "in" within comments
RDMA/erdma: Add driver to kernel build environment
RDMA/erdma: Add the ABI definitions
RDMA/erdma: Add the erdma module
RDMA/erdma: Add connection management (CM) support
RDMA/erdma: Add verbs implementation
RDMA/erdma: Add verbs header file
RDMA/erdma: Add event queue implementation
RDMA/erdma: Add cmdq implementation
RDMA/erdma: Add main include file
RDMA/erdma: Add the hardware related definitions
...
After replacing the MR cache with an Mkey cache, rename the variables and
functions to fit the new meaning.
Link: https://lore.kernel.org/r/20220726071911.122765-6-michaelgur@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add capability field which indicates the mask for wqe_counter which
connects between loopback CQE and the original WQE. With this connection
the driver can identify lost of the loopback CQE and reply PTP
synchronization with timestamp given in the original CQE.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add UID field to flow table attributes to allow creating flow tables
with a non default (zero) uid.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Expose the flow table ID to users. This will be used by downstream
patches to allow creating steering rules that point to a flow table ID.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Expose shared_object_to_user_object_allowed, this capability
means an object created with shared UID can point to any UID.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Use software VHCA id when it's supported by the firmware.
A unique id is allocated upon mlx5_mdev_init() and freed upon
mlx5_mdev_uninit(), as such it stays the same during the full life cycle
of the device including upon health recovery if occurred.
The conjunction of sw_vhca_id with sw_owner_id will be a global unique
id per function which uses mlx5_core.
The sw_vhca_id is set upon init_hca command and is used to specify the
VHCA that the NIC vport is affiliated with.
This functionality is needed upon migration of VM which is MPV based.
(i.e. multi port device).
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Introduce ifc related stuff to enable using software vhca id
functionality.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
As part of the flows invoked by mlx5_devlink_eswitch_mode_set() get to
mlx5_rescan_drivers_locked() which can call mlx5e_probe()/mlx5e_remove
and register/unregister mlx5e driver ports accordingly. This can lead to
deadlock once mlx5_devlink_eswitch_mode_set() will use devlink lock.
Use devl_port_register/unregister() instead of
devlink_port_register/unregister() and add devlink instance locks in the
driver paths to this function to have it locked while calling devl_ API
function.
If remove or probe were called by module init or module cleanup flows,
need to lock devlink just before calling devl_port_register(), otherwise
it is called by attach/detach or register/unregister flow and we can
have the flow locked. Added flag to distinguish between these cases.
This will be used by the downstream patch to invoke
mlx5_devlink_eswitch_mode_set() with devlink locked.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently, there are three eswitch modes, none, legacy and switchdev.
None is the default mode. Remove redundant none mode as eswitch mode
should always be either legacy mode or switchdev mode.
With this patch, there are two behavior changes:
1. Legacy becomes the default mode. When querying eswitch mode using
devlink, a valid mode is always returned.
2. When disabling sriov, the eswitch mode will not change, only vfs
are unloaded.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Expose ifc bits and add needed structure fields and methods to
support enhanced CQE compression feature.
The enhanced CQE compression feature improves cpu utiliziation with
better packet latency from nic to host.
Signed-off-by: Ofer Levi <oferle@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Remove not used MLX5_CAP_BITS_RW_MASK.
While at it, remove CAP_MASK, MLX5_CAP_OFF_CMDIF_CSUM
and MLX5_DEV_CAP_FLAG_*, since MLX5_CAP_BITS_RW_MASK
was their only user.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Attach flow meter to FTE with object id and index.
Use metadata register C5 to store the packet color meter result.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
total_q_under_processor_handle - number of queues in error state due to an
async error or errored command.
send_queue_priority_update_flow - number of QP/SQ priority/SL update
events.
cq_overrun - number of times CQ entered an error state due to an
overflow.
async_eq_overrun -number of time an EQ mapped to async events was
overrun.
comp_eq_overrun - number of time an EQ mapped to completion events was
overrun.
quota_exceeded_command - number of commands issued and failed due to quota
exceeded.
invalid_command - number of commands issued and failed dues to any reason
other than quota exceeded.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Added support for managing new type of ICM for devices that
support sw_owner_v2.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Added new fields for device memory capabilities, in order to
support creation of ICM memory for modify header patterns.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
mac vlan filter and stats support in mlx5 vdpa
irq hardening in virtio
performance improvements in virtio crypto
polling i/o support in virtio blk
ASID support in vhost
fixes, cleanups all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmKXBEwPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRp9pgIAJKdLbtrqDdDrPJRg5Xs2wsWMbGnXbKxm71Q
kGPfnkn15yjwiMJcdNO8O16qP5VGUzBWPhGc3X7U9KRU4EI2yEGLFv9hwRvdlY9h
4C7O4fBWTbuqTT6pwwAH9vT3/b6q8bMh4D3uD2P1zSwh8POqHm1kyRivXPPr4PUA
5TAc04EGcrFQpC9co3v8UmJ7EjLa/tgdUb7igTv5VHi7K2GNGMycSvHP/v7CFgZ6
paVYThMCrWOJkIIk8k6qhMKhAxkQ7lBrqhLBtfX9jyadLmq5hAQQu2D43PoZb2ed
IbNN/g5Sv9yaL7ERYzTRAEfuXVGPiDLWli2C5sHdz2IJDBeqQeU=
=Vlh6
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"vhost,virtio and vdpa features, fixes, and cleanups:
- mac vlan filter and stats support in mlx5 vdpa
- irq hardening in virtio
- performance improvements in virtio crypto
- polling i/o support in virtio blk
- ASID support in vhost
- fixes, cleanups all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (64 commits)
vdpa: ifcvf: set pci driver data in probe
vdpa/mlx5: Add RX MAC VLAN filter support
vdpa/mlx5: Remove flow counter from steering
vhost: rename vhost_work_dev_flush
vhost-test: drop flush after vhost_dev_cleanup
vhost-scsi: drop flush after vhost_dev_cleanup
vhost_vsock: simplify vhost_vsock_flush()
vhost_test: remove vhost_test_flush_vq()
vhost_net: get rid of vhost_net_flush_vq() and extra flush calls
vhost: flush dev once during vhost_dev_stop
vhost: get rid of vhost_poll_flush() wrapper
vhost-vdpa: return -EFAULT on copy_to_user() failure
vdpasim: Off by one in vdpasim_set_group_asid()
virtio: Directly use ida_alloc()/free()
virtio: use WARN_ON() to warning illegal status value
virtio: harden vring IRQ
virtio: allow to unbreak virtqueue
virtio-ccw: implement synchronize_cbs()
virtio-mmio: implement synchronize_cbs()
virtio-pci: implement synchronize_cbs()
...
Current release - new code bugs:
- af_packet: make sure to pull the MAC header, avoid skb panic in GSO
- ptp_clockmatrix: fix inverted logic in is_single_shot()
- netfilter: flowtable: fix missing FLOWI_FLAG_ANYSRC flag
- dt-bindings: net: adin: fix adi,phy-output-clock description syntax
- wifi: iwlwifi: pcie: rename CAUSE macro, avoid MIPS build warning
Previous releases - regressions:
- Revert "net: af_key: add check for pfkey_broadcast in function
pfkey_process"
- tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd
- nf_tables: disallow non-stateful expression in sets earlier
- nft_limit: clone packet limits' cost value
- nf_tables: double hook unregistration in netns path
- ping6: fix ping -6 with interface name
Previous releases - always broken:
- sched: fix memory barriers to prevent skbs from getting stuck
in lockless qdiscs
- neigh: set lower cap for neigh_managed_work rearming, avoid
constantly scheduling the probe work
- bpf: fix probe read error on big endian in ___bpf_prog_run()
- amt: memory leak and error handling fixes
Misc:
- ipv6: expand & rename accept_unsolicited_na to accept_untracked_na
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmKY9lMACgkQMUZtbf5S
IrtNvA//WCpG53NwSy8aV2X/0vkprVEuO8EQeIYhaw1R4KlVcqrITQcaLqkq/xL/
RUq6F/plftMSiuGRhTp/Sgbl0o0XgJkf769m4zQxz9NqWqgcw5kwJPu4Xq1nSM9t
/2qAFNDnXShxRiSYrI0qxQrmd0OUjtsibxsKRTSrrlvcd6zYfrx/7+QK5qpLMF9E
zJpBSYQm2R0RLGRith99G8w3WauhlprPaxyQ71ogQtBhTF+Eg7K+xEm2D5DKtyvj
7CLyrQtR0jyDBAt2ZPCh5D/yVPkNI1rigQ8m4uiW9DE6mk1DsxxY+DIOt5vQPBdR
x9Pq0qG54KS5sP18ABeNRQn4NWdkhVf/CcPkaRxHJdRs13mpQUATJRpZ3Ytd9Nt0
HW6Kby+zY6bdpUX8+UYdhcG6wbt0Lw8B+bSCjiqfE/CBbfUFA3L9/q/5Hk8Xbnxn
lCIk4asxQgpNhcZ+PAkZfFgE0GNDKnXDu1thO+q7/N9srZrrh9WQW5qoq5lexo8V
c01jRbPTKa64Gbvm+xDDGEwSl2uIRITtea284bL3q6lnI50n50dlLOAW0z5tmbEg
X9OHae5bMAdtvS5A1ForJaWA/Mj35ZqtGG5oj0WcGcLupVyec3rgaYaJtNvwgoDx
ptCQVIMLTAHXtZMohm0YrBizg0qbqmCd2c0/LB+3odX328YStJU=
=bWkn
-----END PGP SIGNATURE-----
Merge tag 'net-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and netfilter.
Current release - new code bugs:
- af_packet: make sure to pull the MAC header, avoid skb panic in GSO
- ptp_clockmatrix: fix inverted logic in is_single_shot()
- netfilter: flowtable: fix missing FLOWI_FLAG_ANYSRC flag
- dt-bindings: net: adin: fix adi,phy-output-clock description syntax
- wifi: iwlwifi: pcie: rename CAUSE macro, avoid MIPS build warning
Previous releases - regressions:
- Revert "net: af_key: add check for pfkey_broadcast in function
pfkey_process"
- tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd
- nf_tables: disallow non-stateful expression in sets earlier
- nft_limit: clone packet limits' cost value
- nf_tables: double hook unregistration in netns path
- ping6: fix ping -6 with interface name
Previous releases - always broken:
- sched: fix memory barriers to prevent skbs from getting stuck in
lockless qdiscs
- neigh: set lower cap for neigh_managed_work rearming, avoid
constantly scheduling the probe work
- bpf: fix probe read error on big endian in ___bpf_prog_run()
- amt: memory leak and error handling fixes
Misc:
- ipv6: expand & rename accept_unsolicited_na to accept_untracked_na"
* tag 'net-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (80 commits)
net/af_packet: make sure to pull mac header
net: add debug info to __skb_pull()
net: CONFIG_DEBUG_NET depends on CONFIG_NET
stmmac: intel: Add RPL-P PCI ID
net: stmmac: use dev_err_probe() for reporting mdio bus registration failure
tipc: check attribute length for bearer name
ice: fix access-beyond-end in the switch code
nfp: remove padding in nfp_nfdk_tx_desc
ax25: Fix ax25 session cleanup problems
net: usb: qmi_wwan: Add support for Cinterion MV31 with new baseline
sfc/siena: fix wrong tx channel offset with efx_separate_tx_channels
sfc/siena: fix considering that all channels have TX queues
socket: Don't use u8 type in uapi socket.h
net/sched: act_api: fix error code in tcf_ct_flow_table_fill_tuple_ipv6()
net: ping6: Fix ping -6 with interface name
macsec: fix UAF bug for real_dev
octeontx2-af: fix error code in is_valid_offset()
wifi: mac80211: fix use-after-free in chanctx code
bonding: guard ns_targets by CONFIG_IPV6
tcp: tcp_rtx_synack() can be called from process context
...
- Improvements to mlx5 vfio-pci variant driver, including support
for parallel migration per PF (Yishai Hadas)
- Remove redundant iommu_present() check (Robin Murphy)
- Ongoing refactoring to consolidate the VFIO driver facing API
to use vfio_device (Jason Gunthorpe)
- Use drvdata to store vfio_device among all vfio-pci and variant
drivers (Jason Gunthorpe)
- Remove redundant code now that IOMMU core manages group DMA
ownership (Jason Gunthorpe)
- Remove vfio_group from external API handling struct file ownership
(Jason Gunthorpe)
- Correct typo in uapi comments (Thomas Huth)
- Fix coccicheck detected deadlock (Wan Jiabing)
- Use rwsem to remove races and simplify code around container and
kvm association to groups (Jason Gunthorpe)
- Harden access to devices in low power states and use runtime PM to
enable d3cold support for unused devices (Abhishek Sahu)
- Fix dma_owner handling of fake IOMMU groups (Jason Gunthorpe)
- Set driver_managed_dma on vfio-pci variant drivers (Jason Gunthorpe)
- Pass KVM pointer directly rather than via notifier (Matthew Rosato)
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmKPvyMbHGFsZXgud2ls
bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsihegP/3XamiYsS0GuA7awAq/X
h9Jahb6kJ+sh0RXL1Gqzc9nxH5X9H/hBcL88VOV3GLwyOhNVNpVjQXGguL3aLaCE
zUrs0+AFEJb990y9H+VgwIDom5BIpgdZ2naG42bz9wUeVGg4daJnkMwOgXwIBzfx
IOddktN6UwuE+DyA57yqL93f+0cTrhYZx9R14sDoLR5lE4uGnbQwIknawEKVtoeR
rEPaCFptxPxCUbqoOSR0Y3bu6rUYSH4iiMZpMviqm2ak3aNn76gru3q4QAnI4gTd
l/w+2OJNFC0U7H5Cz7cdIn2StdJvfSkX0e753+qsFccFsViRCGdnW0Lht/xrYrFC
i8AJxkrq2/bs00LXs7kzcruaD8pJ2UPe2x2+nupHSEsj99K4NraeHRB2CC1uwj0d
gYliOSW5T3//wOpztK48s475VppgXeKWkXGoNY3JJlGjAPyd0vFrH8hRLhVZJ9uI
/eLh6hQnOJuCDz1rQrVNRk6cZi9R1Wpl5dvCBRLqjK519nm569aTlVBra+iNyUCQ
lU5/kN0ym8+X8CweE5ILPGiX2iEXBYMqv+Dm5yOimRUHRJZHYv900FX0GVEnCUCq
23sMDaeHS1hyDCQk//bd2Ig7xjh7mbh7CrKcdJ7pL5Gc/A1zkCXd54hvxViiGwQq
U5KIPTyJy+erpcpxjUApaoP2
=etEI
-----END PGP SIGNATURE-----
Merge tag 'vfio-v5.19-rc1' of https://github.com/awilliam/linux-vfio
Pull vfio updates from Alex Williamson:
- Improvements to mlx5 vfio-pci variant driver, including support for
parallel migration per PF (Yishai Hadas)
- Remove redundant iommu_present() check (Robin Murphy)
- Ongoing refactoring to consolidate the VFIO driver facing API to use
vfio_device (Jason Gunthorpe)
- Use drvdata to store vfio_device among all vfio-pci and variant
drivers (Jason Gunthorpe)
- Remove redundant code now that IOMMU core manages group DMA ownership
(Jason Gunthorpe)
- Remove vfio_group from external API handling struct file ownership
(Jason Gunthorpe)
- Correct typo in uapi comments (Thomas Huth)
- Fix coccicheck detected deadlock (Wan Jiabing)
- Use rwsem to remove races and simplify code around container and kvm
association to groups (Jason Gunthorpe)
- Harden access to devices in low power states and use runtime PM to
enable d3cold support for unused devices (Abhishek Sahu)
- Fix dma_owner handling of fake IOMMU groups (Jason Gunthorpe)
- Set driver_managed_dma on vfio-pci variant drivers (Jason Gunthorpe)
- Pass KVM pointer directly rather than via notifier (Matthew Rosato)
* tag 'vfio-v5.19-rc1' of https://github.com/awilliam/linux-vfio: (38 commits)
vfio: remove VFIO_GROUP_NOTIFY_SET_KVM
vfio/pci: Add driver_managed_dma to the new vfio_pci drivers
vfio: Do not manipulate iommu dma_owner for fake iommu groups
vfio/pci: Move the unused device into low power state with runtime PM
vfio/pci: Virtualize PME related registers bits and initialize to zero
vfio/pci: Change the PF power state to D0 before enabling VFs
vfio/pci: Invalidate mmaps and block the access in D3hot power state
vfio: Change struct vfio_group::container_users to a non-atomic int
vfio: Simplify the life cycle of the group FD
vfio: Fully lock struct vfio_group::container
vfio: Split up vfio_group_get_device_fd()
vfio: Change struct vfio_group::opened from an atomic to bool
vfio: Add missing locking for struct vfio_group::kvm
kvm/vfio: Fix potential deadlock problem in vfio
include/uapi/linux/vfio.h: Fix trivial typo - _IORW should be _IOWR instead
vfio/pci: Use the struct file as the handle not the vfio_group
kvm/vfio: Remove vfio_group from kvm
vfio: Change vfio_group_set_kvm() to vfio_file_set_kvm()
vfio: Change vfio_external_check_extension() to vfio_file_enforced_coherent()
vfio: Remove vfio_external_group_match_file()
...
ECE field should be after opt_param_mask in query qp output.
Fixes: 6b646a7e4a ("net/mlx5: Add ability to read and write ECE options")
Signed-off-by: Changcheng Liu <jerrliu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Implement the get_vq_stats calback of vdpa_config_ops to return the
statistics for a virtqueue.
The statistics are provided as vendor specific statistics where the
driver provides a pair of attribute name and attribute value.
Currently supported are received descriptors and completed descriptors.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Message-Id: <20220518133804.1075129-6-elic@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Multiport eswitch mode is a LAG mode that allows to add rules that
forward traffic to a specific physical port without being affected by LAG
affinity configuration.
This mode of operation is mutual exclusive with the other LAG modes used
by multipath and bonding.
To make the transition between the modes, we maintain a counter on the
number of rules specifying one of the uplink representors as the target
of mirred egress redirect action.
An example of such rule would be:
$ tc filter add dev enp8s0f0_0 prot all root flower dst_mac \
00:11:22:33:44:55 action mirred egress redirect dev enp8s0f0
If the reference count just grows to one and LAG is not in use, we
create the LAG in multiport eswitch mode. Other mode changes are not
allowed while in this mode. When the reference count reaches zero, we
destroy the LAG and let other modes be used if needed.
logic also changed such that if forwarding to some uplink destination
cannot be guaranteed, we fail the operation so the rule will eventually
be in software and not in hardware.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Take the wrapper version which picks default node into a header file.
This reduces the number of exported functions.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add syndrome of last command failure per command type to debugfs to ease
debugging of such failure.
last_failed_syndrome - last command failed syndrome returned by FW.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Expose mlx5_sriov_blocking_notifier_register / unregister APIs to let a
VF register to be notified for its enablement / disablement by the PF.
Upon VF probe it will call mlx5_sriov_blocking_notifier_register() with
its notifier block and upon VF remove it will call
mlx5_sriov_blocking_notifier_unregister() to drop its registration.
This can give a VF the ability to clean some resources upon disable
before that the command interface goes down and on the other hand sets
some stuff before that it's enabled.
This may be used by a VF which is migration capable in few cases.(e.g.
PF load/unload upon an health recovery).
Link: https://lore.kernel.org/r/20220510090206.90374-2-yishaih@nvidia.com
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Lag state has become very complicated with many modes, flags, types and
port selections methods and future work will add additional features.
Add a debugfs to query the current lag state. A new directory named "lag"
will be created under the mlx5 debugfs directory. As the driver has
debugfs per pci function the location will be: <debugfs>/mlx5/<BDF>/lag
For example:
/sys/kernel/debug/mlx5/0000:08:00.0/lag
The following files are exposed:
- state: Returns "active" or "disabled". If "active" it means hardware
lag is active.
- members: Returns the BDFs of all the members of lag object.
- type: Returns the type of the lag currently configured. Valid only
if hardware lag is active.
* "roce" - Members are bare metal PFs.
* "switchdev" - Members are in switchdev mode.
* "multipath" - ECMP offloads.
- port_sel_mode: Returns the egress port selection method, valid
only if hardware lag is active.
* "queue_affinity" - Egress port is selected by
the QP/SQ affinity.
* "hash" - Egress port is selected by hash done on
each packet. Controlled by: xmit_hash_policy of the
bond device.
- flags: Returns flags that are specific per lag @type. Valid only if
hardware lag is active.
* "shared_fdb" - "on" or "off", if "on" single FDB is used.
- mapping: Returns the mapping which is used to select egress port.
Valid only if hardware lag is active.
If @port_sel_mode is "hash" returns the active egress ports.
The hash result will select only active ports.
if @port_sel_mode is "queue_affinity" returns the mapping
between the configured port affinity of the QP/SQ and actual
egress port. For example:
* 1:1 - Mapping means if the configured affinity is port 1
traffic will egress via port 1.
* 1:2 - Mapping means if the configured affinity is port 1
traffic will egress via port 2. This can happen
if port 1 is down or in active/backup mode and port 1
is backup.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Increase the define MLX5_MAX_PORTS to 4 as the driver is ready
to support NICs with 4 ports.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Downstream patches will add support for hardware lag with
more than 2 ports. Add a way for users to query the number of lag ports.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, removing a device needs to get the driver interface lock before
doing any cleanup. If the driver is waiting in a loop for FW init, there
is no way to cancel the wait, instead the device cleanup waits for the
loop to conclude and release the lock.
To allow immediate response to remove device commands, check the TEARDOWN
flag while waiting for FW init, and exit the loop if it has been set.
Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>