Commit Graph

1916 Commits

Author SHA1 Message Date
Maher Sanalla
34efda1735 RDMA/mlx5: Enable ATS when allocating kernel MRs
When creating kernel MRs, it is not known in advance whether they will
be used for peer-to-peer transactions or for other use cases, since
address mapping is performed only after the MR is created.

Since peer-to-peer transactions benefit significantly from ATS
performance-wise, enable ATS on newly-allocated kernel MRs when
supported.
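
For illustration, a minimal sketch of where this lands in the mkey
context setup (the "ats" capability check and the surrounding code are
an assumption, not the literal patch):

  /* sketch: request ATS translation on new kernel mkeys when the
   * device supports it */
  if (MLX5_CAP_GEN(dev->mdev, ats))
          MLX5_SET(mkc, mkc, ma_translation_mode, 1);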

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Gal Shalom <galshalom@nvidia.com>
Link: https://patch.msgid.link/fafd4c9f14cf438d2882d88649c2947e1d05d0b4.1725273403.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:16:40 +03:00
Chris Mi
112e6e83a8 IB/mlx5: Fix UMR pd cleanup on error flow of driver init
The cited commit moves the pd deallocation from
mlx5r_umr_resource_cleanup() to the new function mlx5r_umr_cleanup(),
which breaks the fix from commit [1]: in the error flow, we hit the
panic shown in [2].

Fix it by checking the pd pointer and skipping the cleanup if it is
NULL.
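
A minimal sketch of the guard, assuming the cleanup shape implied by
the trace below:

  static void mlx5r_umr_cleanup(struct mlx5_ib_dev *dev)
  {
          /* in the error flow the UMR pd may never have been created */
          if (!dev->umrc.pd)
                  return;

          mlx5r_umr_resource_cleanup(dev);
          ib_dealloc_pd(dev->umrc.pd);
  }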

[1] RDMA/mlx5: Fix UMR cleanup on error flow of driver init
[2]
 [  347.567063] infiniband mlx5_0: Couldn't register device with driver model
 [  347.591382] BUG: kernel NULL pointer dereference, address: 0000000000000020
 [  347.593438] #PF: supervisor read access in kernel mode
 [  347.595176] #PF: error_code(0x0000) - not-present page
 [  347.596962] PGD 0 P4D 0
 [  347.601361] RIP: 0010:ib_dealloc_pd_user+0x12/0xc0 [ib_core]
 [  347.604171] RSP: 0018:ffff888106293b10 EFLAGS: 00010282
 [  347.604834] RAX: 0000000000000000 RBX: 000000000000000e RCX: 0000000000000000
 [  347.605672] RDX: ffff888106293ad0 RSI: 0000000000000000 RDI: 0000000000000000
 [  347.606529] RBP: 0000000000000000 R08: ffff888106293ae0 R09: ffff888106293ae0
 [  347.607379] R10: 0000000000000a06 R11: 0000000000000000 R12: 0000000000000000
 [  347.608224] R13: ffffffffa0704dc0 R14: 0000000000000001 R15: 0000000000000001
 [  347.609067] FS:  00007fdc720cd9c0(0000) GS:ffff88852c880000(0000) knlGS:0000000000000000
 [  347.610094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [  347.610727] CR2: 0000000000000020 CR3: 0000000103012003 CR4: 0000000000370eb0
 [  347.611421] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [  347.612113] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [  347.612804] Call Trace:
 [  347.613130]  <TASK>
 [  347.613417]  ? __die+0x20/0x60
 [  347.613793]  ? page_fault_oops+0x150/0x3e0
 [  347.614243]  ? free_msg+0x68/0x80 [mlx5_core]
 [  347.614840]  ? cmd_exec+0x48f/0x11d0 [mlx5_core]
 [  347.615359]  ? exc_page_fault+0x74/0x130
 [  347.615808]  ? asm_exc_page_fault+0x22/0x30
 [  347.616273]  ? ib_dealloc_pd_user+0x12/0xc0 [ib_core]
 [  347.616801]  mlx5r_umr_cleanup+0x23/0x90 [mlx5_ib]
 [  347.617365]  mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x36/0x40 [mlx5_ib]
 [  347.618025]  __mlx5_ib_add+0x96/0xd0 [mlx5_ib]
 [  347.618539]  mlx5r_probe+0xe9/0x310 [mlx5_ib]
 [  347.619032]  ? kernfs_add_one+0x107/0x150
 [  347.619478]  ? __mlx5_ib_add+0xd0/0xd0 [mlx5_ib]
 [  347.619984]  auxiliary_bus_probe+0x3e/0x90
 [  347.620448]  really_probe+0xc5/0x3a0
 [  347.620857]  __driver_probe_device+0x80/0x160
 [  347.621325]  driver_probe_device+0x1e/0x90
 [  347.621770]  __driver_attach+0xec/0x1c0
 [  347.622213]  ? __device_attach_driver+0x100/0x100
 [  347.622724]  bus_for_each_dev+0x71/0xc0
 [  347.623151]  bus_add_driver+0xed/0x240
 [  347.623570]  driver_register+0x58/0x100
 [  347.623998]  __auxiliary_driver_register+0x6a/0xc0
 [  347.624499]  ? driver_register+0xae/0x100
 [  347.624940]  ? 0xffffffffa0893000
 [  347.625329]  mlx5_ib_init+0x16a/0x1e0 [mlx5_ib]
 [  347.625845]  do_one_initcall+0x4a/0x2a0
 [  347.626273]  ? gcov_event+0x2e2/0x3a0
 [  347.626706]  do_init_module+0x8a/0x260
 [  347.627126]  init_module_from_file+0x8b/0xd0
 [  347.627596]  __x64_sys_finit_module+0x1ca/0x2f0
 [  347.628089]  do_syscall_64+0x4c/0x100

Fixes: 638420115c ("IB/mlx5: Create UMR QP just before first reg_mr occurs")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Link: https://patch.msgid.link/778c40c60287992da5d6ec92bb07b67f7bb5e6ef.1725273295.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-04 10:28:49 +03:00
Yue Haibing
53ffc09a3e RDMA/mlx5: Remove two unused declarations
Commit e6fb246cca ("RDMA/mlx5: Consolidate MR destruction to
mlx5_ib_dereg_mr()") removed mlx5_ib_free_implicit_mr() but left
the declaration.

Commit d98995b4bf ("net/mlx5: Reimplement write combining test")
removed mlx5_ib_test_wc() but left its declaration.

Remove the unused declarations.

Link: https://patch.msgid.link/r/20240816101358.881247-1-yuehaibing@huawei.com
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-19 15:10:36 -03:00
Yishai Hadas
ec7ad65309 RDMA/mlx5: Introduce GET_DATA_DIRECT_SYSFS_PATH ioctl
Introduce the 'GET_DATA_DIRECT_SYSFS_PATH' ioctl to return the sysfs
path of the affiliated 'data direct' device for a given device.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/403745463e0ef52adbef681ff09aa6a29a756352.1722512548.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-11 11:12:50 +03:00
Yishai Hadas
de8f847a51 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct
Add support for DMABUF MR registrations with the Data-direct device.

When userspace registers a DMABUF MR with the data direct bit set, the
algorithm below is followed (a sketch of step (1) appears after the
notes).

1) Obtain a pinned DMABUF umem from the IB core using the user input
parameters (FD, offset, length) and the DMA PF device.  The DMA PF
device is needed to allow the IOMMU to enable the DMA PF to access the
user buffer over PCI.

2) Create a KSM MKEY by setting its entries according to the user buffer
VA to IOVA mapping, with the MKEY being the data direct device-crossed
MKEY. This KSM MKEY is umrable and will be used as part of the MR cache.
The PD for creating it is the internal device 'data direct' kernel one.

3) Create a crossing MKEY that points to the KSM MKEY using the crossing
access mode.

4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs
managed on the mlx5_ib device.

5) Return the crossing MKEY to the user, created with its supplied PD.

Upon DMA PF unbind flow, the driver will revoke the KSM entries.
The final deregistration will occur under the hood once the application
deregisters its MKEY.

Notes:
- This version supports only the PINNED UMEM mode, so there is no
  dependency on ODP.
- The IOVA supplied by the application must be system page aligned due to
  HW translations of KSM.
- The crossing MKEY will not be umrable or part of the MR cache, as we
  cannot change its crossed (i.e. KSM) MKEY over UMR.
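
A sketch of step (1), assuming a pinned-dmabuf helper that takes an
explicit DMA device (names approximate the series, they are not quoted
from it):

  /* map the user buffer (fd, offset, length) through the DMA PF device */
  umem_dmabuf = ib_umem_dmabuf_get_pinned_with_dma_device(&dev->ib_dev,
                          data_direct_dma_device, offset, length, fd,
                          access_flags);
  if (IS_ERR(umem_dmabuf))
          return ERR_CAST(umem_dmabuf);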

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-11 11:12:50 +03:00
Yishai Hadas
3aa73c6b79 RDMA: Pass uverbs_attr_bundle as part of '.reg_user_mr_dmabuf' API
Pass uverbs_attr_bundle as part of '.reg_user_mr_dmabuf' API instead of
udata.

This enables passing some new ioctl attributes to the drivers, as will
be introduced in the next patches for mlx5 driver.

Change the involved drivers accordingly.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/9a25b2fc02443f7c36c2d93499ae25252b6afd40.1722512548.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-11 11:12:50 +03:00
Yishai Hadas
2e8e631d7a RDMA/mlx5: Add the initialization flow to utilize the 'data direct' device
Add the NET device initialization flow to utilize the 'data
direct' device.

When a NET mlx5_ib device is capable of 'data direct', the following
sequence of actions will occur:
- Find its affiliated 'data direct' VUID via a firmware command.
- Create its own private PD and 'data direct' mkey.
- Register to be notified when its 'data direct' driver is probed or removed.

The DMA device of the affiliated 'data direct' device, including the
private PD and the 'data direct' mkey, will be used later during MR
registrations that request the data direct functionality.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/b11fa87b2a65bce4db8d40341bb6cee490fa4d06.1722512548.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-11 11:12:49 +03:00
Yishai Hadas
6910e3660d RDMA/mlx5: Introduce the 'data direct' driver
Introduce the 'data direct' driver for a ConnectX-8 Data Direct device.

The 'data direct' driver functions as the affiliated DMA device for one
or more capable mlx5_ib devices. This DMA device, as the name suggests,
is used exclusively for DMA operations. It can be considered a DMA engine
managed by a PF/VF, lacking network capabilities and having minimal overall
capabilities.

Consequently, the DMA NIC PF will not be exposed to or directly used by
software applications. The driver will not have any direct interface or
interaction with the firmware (no command interface, no capabilities,
etc.). It will operate solely over PCI to enable its DMA functionality.

Registration and un-registration of the driver are handled as part of
the mlx5_ib initialization and exit processes, as the mlx5_ib devices
will effectively be its clients.

The driver will serve as the DMA device for accessing another PCI device
to achieve optimal performance (both on the same NUMA node, P2P access,
etc.).

Upon probing, it will read its VUID over PCI to handle mlx5_ib device
registrations with the same VUID.

Upon removal, it will notify its clients to allow them to clean up the
resources that were mmaped with its DMA device.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/b77edecfd476c3f445da96ab6aef499ae47b2829.1722512548.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-11 11:12:49 +03:00
Mark Bloch
0ea4ffb2bc RDMA/mlx5: Expose vhca id for all ports in multiport mode
In multiport mode, RDMA devices make it impossible for userspace to use
DEVX to discover vhca id values for ports beyond port 1. This patch
addresses the issue by exposing the vhca id of all ports.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://patch.msgid.link/41dea83aa51843aa4c067b4f73f28d64e51bd53c.1722331101.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-11 11:12:49 +03:00
Linus Torvalds
3d51520954 RDMA v6.11 merge window
Usual collection of small improvements and fixes:
 
 - Bug fixes and minor improvements in efa, irdma, mlx4, mlx5, rxe, hfi1,
   qib, ocrdma
 
 - bnxt_re support for MSN, which is a new retransmit logic
 
 - Initial mana support for RC qps
 
 - Use after free bug and cleanups in iwcm
 
 - Reduce resource usage in mlx5 when RDMA verbs features are not used
 
 - New verb to drain shared receive queues, similar to normal receive
   queues. This is necessary to allow ULPs a clean shutdown. Used in the
   iscsi rdma target
 
 - mlx5 support for more than 16 bits of doorbell indexes
 
 - Doorbell moderation support for bnxt_re
 
 - IB multi-plane support for mlx5
 
 - New EFA adaptor PCI IDs
 
 - RDMA_NAME_ASSIGN_TYPE_USER to hint to userspace that it shouldn't rename
   the device
 
 - A collection of hns bugs
 
 - Fix long standing bug in bnxt_re with incorrect endian handling of
   immediate data
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZpfvKQAKCRCFwuHvBreF
 YXomAP46gZpGv5mlMOAXePRuKq6glNZWl3pVuwuycnlmjQcEUQD/dhQbJz0rZKBr
 swuibPo83bFacfXJL7Wxd48m4G3EfgI=
 =1eXu
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Usual collection of small improvements and fixes:

   - Bug fixes and minor improvements in efa, irdma, mlx4, mlx5, rxe,
     hfi1, qib, ocrdma

   - bnxt_re support for MSN, which is a new retransmit logic

   - Initial mana support for RC qps

   - Use after free bug and cleanups in iwcm

   - Reduce resource usage in mlx5 when RDMA verbs features are not used

   - New verb to drain shared receive queues, similar to normal receive
     queues. This is necessary to allow ULPs a clean shutdown. Used in
     the iscsi rdma target

   - mlx5 support for more than 16 bits of doorbell indexes

   - Doorbell moderation support for bnxt_re

   - IB multi-plane support for mlx5

   - New EFA adaptor PCI IDs

   - RDMA_NAME_ASSIGN_TYPE_USER to hint to userspace that it shouldn't
     rename the device

   - A collection of hns bugs

   - Fix long standing bug in bnxt_re with incorrect endian handling of
     immediate data"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits)
  IB/hfi1: Constify struct flag_table
  RDMA/mana_ib: Set correct device into ib
  bnxt_re: Fix imm_data endianness
  RDMA: Fix netdev tracker in ib_device_set_netdev
  RDMA/hns: Fix mbx timing out before CMD execution is completed
  RDMA/hns: Fix insufficient extend DB for VFs.
  RDMA/hns: Fix undefined behavior caused by invalid max_sge
  RDMA/hns: Fix shift-out-bounds when max_inline_data is 0
  RDMA/hns: Fix missing pagesize and alignment check in FRMR
  RDMA/hns: Fix unmatch exception handling when init eq table fails
  RDMA/hns: Fix soft lockup under heavy CEQE load
  RDMA/hns: Check atomic wr length
  RDMA/ocrdma: Don't inline statistics functions
  RDMA/core: Introduce "name_assign_type" for an IB device
  RDMA/qib: Fix truncation compilation warnings in qib_verbs.c
  RDMA/qib: Fix truncation compilation warnings in qib_init.c
  RDMA/efa: Add EFA 0xefa3 PCI ID
  RDMA/mlx5: Support per-plane port IB counters by querying PPCNT register
  net/mlx5: mlx5_ifc update for accessing ppcnt register of plane ports
  RDMA/mlx5: Add plane index support when querying PTYS registers
  ...
2024-07-19 09:51:33 -07:00
Jakub Kicinski
dd3cd3ca69 aux-sysfs-irqs
Shay Says:
 ==========
 Introduce auxiliary bus IRQs sysfs
 
 Today, PCI PFs and VFs, which are anchored on the PCI bus, display their
 IRQ information in the <pci_device>/msi_irqs/<irq_num> sysfs files.  PCI
 subfunctions (SFs) are similar to PFs and VFs and these SFs are anchored
 on the auxiliary bus. However, these PCI SFs lack such IRQ information
 on the auxiliary bus, leaving users without visibility into which IRQs
 are used by the SFs. This absence makes it impossible to debug such
 situations and to understand the source of interrupts/SFs for
 performance tuning and debugging.
 
 Additionally, the SFs are multifunctional devices supporting RDMA,
 network devices, clocks, and more, similar to their peer PCI PFs and
 VFs. Therefore, it is desirable to have SFs' IRQ information available
 at the bus/device level.
 
 To overcome the above limitations, this short series extends the
 auxiliary bus to display IRQ information in sysfs, similar to that of
 PFs and VFs.
 
 It adds an 'irqs' directory under the auxiliary device and includes an
 <irq_num> sysfs file within it.
 
 For example:
 $ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
 50  51  52  53  54  55  56  57  58
 
 Patch summary:
 patch-1 adds auxiliary bus to support irqs used by auxiliary device
 patch-2 mlx5 driver exposes IRQs for PCI SF devices via the auxiliary
         bus
 
 ==========
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmaQTCYACgkQSD+KveBX
 +j7nRAgAhyi8mD93AjpoXX8onbK3ZyPnGwLToCs0NT3EzT0BIwNvDovQp4rhcs16
 3zVwvW+twVsbMuPYTpPVgcynpL6N0K/CoW+ubDGZaRIaf0nDmh4MY1wY/EUsVj8R
 FbeTi5L+9MyKvFbtO5d4cW1q7M0XVD3uR8Wle6PwvXZ1gcM59vsR1eml25NLTC8B
 Z9F9WKG+dFAni0ll/IL837Se3QQapRXtJQ3g6XbIcpXiMqgIrHZ9FyY0LvuWlQq4
 LsIPKh7RySATmAYXwwpsnfdrilvvMHsyjlAoeNHEJBsAUY+kpOIFFi6J5EB+/oyo
 jhBhlc4Al0vUXis9jGTysO7mVYVOUQ==
 =DTxS
 -----END PGP SIGNATURE-----

Merge tag 'aux-sysfs-irqs' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Saeed Mahameed says:

====================
aux-sysfs-irqs

Shay Says:
==========
Introduce auxiliary bus IRQs sysfs

Today, PCI PFs and VFs, which are anchored on the PCI bus, display their
IRQ information in the <pci_device>/msi_irqs/<irq_num> sysfs files.  PCI
subfunctions (SFs) are similar to PFs and VFs and these SFs are anchored
on the auxiliary bus. However, these PCI SFs lack such IRQ information
on the auxiliary bus, leaving users without visibility into which IRQs
are used by the SFs. This absence makes it impossible to debug such
situations and to understand the source of interrupts/SFs for
performance tuning and debugging.

Additionally, the SFs are multifunctional devices supporting RDMA,
network devices, clocks, and more, similar to their peer PCI PFs and
VFs. Therefore, it is desirable to have SFs' IRQ information available
at the bus/device level.

To overcome the above limitations, this short series extends the
auxiliary bus to display IRQ information in sysfs, similar to that of
PFs and VFs.

It adds an 'irqs' directory under the auxiliary device and includes an
<irq_num> sysfs file within it.

For example:
$ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
50  51  52  53  54  55  56  57  58

Patch summary:
patch-1 adds auxiliary bus to support irqs used by auxiliary device
patch-2 mlx5 driver exposes IRQs for PCI SF devices via the auxiliary
        bus
==========

* tag 'aux-sysfs-irqs' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Expose SFs IRQs
  driver core: auxiliary bus: show auxiliary device IRQs
  RDMA/mlx5: Add Qcounters req_transport_retries_exceeded/req_rnr_retries_exceeded
  net/mlx5: Reimplement write combining test
====================

Link: https://patch.msgid.link/20240711213140.256997-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15 12:42:45 -07:00
Mark Zhang
af48f95492 RDMA/core: Introduce "name_assign_type" for an IB device
The name_assign_type indicates how the name is provided. Currently
these types are supported:
- RDMA_NAME_ASSIGN_TYPE_UNKNOWN: Unknown or not set;
- RDMA_NAME_ASSIGN_TYPE_USER: Name is provided by the user; the
  user-created sub devices, rxe and siw devices have this type.

When filling nl device info, it is set in the new attribute
RDMA_NLDEV_ATTR_NAME_ASSIGN_TYPE. User-space tools like udev
"rdma_rename" could check this attribute to determine if this
device needs to be renamed or not.
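
A sketch of the nldev fill, assuming a u8 netlink attribute and a
name_assign_type field on the device:

  if (nla_put_u8(msg, RDMA_NLDEV_ATTR_NAME_ASSIGN_TYPE,
                 device->name_assign_type))
          return -EMSGSIZE;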

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/522591bef9a369cc8e5dcb77787e017bffee37fe.1719837610.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-07-04 07:59:53 +03:00
Mark Zhang
7a2210a57d RDMA/mlx5: Support per-plane port IB counters by querying PPCNT register
Support per-plane port counters by querying the PPCNT register with
the "extended port counters" group, as the query_vport_counter command
doesn't support plane ports.
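
A sketch of the register access; the group id constant is an
assumption, while mlx5_core_access_reg() is the standard
access-register helper:

  MLX5_SET(ppcnt_reg, in, local_port, port);
  MLX5_SET(ppcnt_reg, in, grp, MLX5_EXTENDED_PORT_COUNTERS_GROUP);
  err = mlx5_core_access_reg(mdev, in, sz, out, sz, MLX5_REG_PPCNT,
                             0, 0);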

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/06ffb582d67159b7def4654c8272d3d6e8bd2f2f.1718553901.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2024-07-01 15:38:05 +03:00
Mark Zhang
3b43399b29 RDMA/mlx5: Add plane index support when querying PTYS registers
Support the new "plane_ind" field when querying port PTYS registers.
This is needed when querying the rate of a plane port.

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/1f703c36306aa46917fcd88eadbb23b3e380d526.1718553901.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2024-07-01 15:38:05 +03:00
Mark Zhang
026a425990 RDMA/mlx5: Support plane device and driver APIs to add and delete it
This patch supports the driver APIs "add_sub_dev" and "del_sub_dev",
to add and delete a plane device respectively.
An mlx5 plane device is an RDMA SMI device; it provides the SMI
capability through user MAD for its parent, the logical multi-plane
aggregated device. For a plane port:
- It supports QP0 only;
- When adding a plane device, all plane ports are added;
- For some commands, like mad_ifc, both the plane_index and native
  portnum are needed;
- When querying or modifying a plane port context, the native portnum
  must be used, as the query/modify_hca_vport_context command doesn't
  support plane ports.
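
The assumed ops hookup, sketched as a fragment of the device ops table
(handler names are illustrative):

  .add_sub_dev = mlx5_ib_add_sub_dev,
  .del_sub_dev = mlx5_ib_del_sub_dev,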

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/e933cd0562aece181f8657af2ca0f5b387d0f14e.1718553901.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2024-07-01 15:38:05 +03:00
Mark Zhang
2a5db20fa5 RDMA/mlx5: Add support to multi-plane device and port
When multi-plane is supported, a logical port, which is an aggregation
of multiple physical plane ports, is exposed for data transmission.
Compared with a normal mlx5 IB port, this logical port supports all
functionalities except Subnet Management.

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/7e37c06c9cb243be9ac79930cd17053903785b95.1718553901.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2024-07-01 15:10:15 +03:00
Akiva Goldberger
589b844f1b RDMA/mlx5: Send UAR page index as ioctl attribute
Add UAR page index as a driver ioctl attribute to increase the number of
supported indices, previously limited to 16 bits by mlx5_ib_create_cq
struct.

Link: https://lore.kernel.org/r/0e18b34d7ec3b1ae02d694b0d545aed7413c0ef7.1719512393.git.leon@kernel.org
Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-27 16:28:22 -03:00
Akiva Goldberger
dd6d7f8574 RDMA: Pass entire uverbs attr bundle to create cq function
Changes the create_cq verb signature by sending the entire uverbs attr
bundle as a parameter. This allows drivers to send driver specific attrs
through ioctl for the create_cq verb and access them in their driver
specific code.

Also adds a new enum value for driver specific ioctl attributes for
methods already supporting UHW.
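
The resulting ops-signature change, per the description above:

  - int (*create_cq)(struct ib_cq *cq, const struct ib_cq_init_attr *attr,
  -                  struct ib_udata *udata);
  + int (*create_cq)(struct ib_cq *cq, const struct ib_cq_init_attr *attr,
  +                  struct uverbs_attr_bundle *attrs);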

Link: https://lore.kernel.org/r/ed147343987c0d43fd391c1b2f85e2f425747387.1719512393.git.leon@kernel.org
Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-27 16:28:21 -03:00
Patrisious Haddad
36ab7ada64 RDMA/mlx5: Add check for srq max_sge attribute
The max_sge attribute is passed by the user and is inserted and used
unchecked, so verify that the value doesn't exceed the maximum allowed
value before using it.
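
A sketch of the added validation; the limit helper is hypothetical and
stands in for however the driver derives the device bound:

  if (init_attr->attr.max_sge > max_srq_sge(dev))
          return -EINVAL;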

Fixes: e126ba97db ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://lore.kernel.org/r/277ccc29e8d57bfd53ddeb2ac633f2760cf8cdd0.1716900410.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-21 10:19:36 -03:00
Yishai Hadas
81497c148b RDMA/mlx5: Fix unwind flow as part of mlx5_ib_stage_init_init
Fix unwind flow as part of mlx5_ib_stage_init_init to use the correct
goto upon an error.

Fixes: 758ce14aee ("RDMA/mlx5: Implement MACsec gid addition and deletion")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://lore.kernel.org/r/aa40615116eda14ec9eca21d52017d632ea89188.1716900410.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-21 10:19:36 -03:00
Jason Gunthorpe
2e4c02fdec RDMA/mlx5: Ensure created mkeys always have a populated rb_key
cacheable and mmkey.rb_key together are used by mlx5_revoke_mr() to put
the MR/mkey back into the cache. In all cases they should be set
correctly.

alloc_cacheable_mr() was setting cacheable but not filling rb_key,
resulting in cache_ent_find_and_store() bucketing them all into a
0-length entry.

implicit_get_child_mr()/mlx5_ib_alloc_implicit_mr() failed to set
cacheable or rb_key at all, so the cache was not working at all for
implicit ODP.
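
The likely shape of the fix in the allocation paths (a sketch, not the
literal diff):

  /* make sure mlx5_revoke_mr() can re-bucket this MR on dereg */
  mr->mmkey.cacheable = true;
  mr->mmkey.rb_key = rb_key;      /* previously left zero-initialized */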

Cc: stable@vger.kernel.org
Fixes: 8c1185fef6 ("RDMA/mlx5: Change check for cacheable mkeys")
Fixes: dd1b913fb0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7778c02dfa0999a30d6746c79a23dd7140a9c729.1716900410.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-21 10:19:36 -03:00
Jason Gunthorpe
f637040c33 RDMA/mlx5: Follow rb_key.ats when creating new mkeys
When a cache ent already exists but doesn't have any mkeys in it, the
cache will automatically create a new one based on the specification
in the ent->rb_key.

ent->ats was missed when creating the new key and so ma_translation_mode
was not being set even though the ent requires it.
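
The likely one-line shape of the fix, sketched:

  /* honor the ATS spec of the entry when re-creating its mkeys */
  MLX5_SET(mkc, mkc, ma_translation_mode, !!ent->rb_key.ats);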

Cc: stable@vger.kernel.org
Fixes: 73d09b2fe8 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/7c5613458ecb89fbe5606b7aa4c8d990bdea5b9a.1716900410.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-21 10:19:36 -03:00
Jason Gunthorpe
c1eb251259 RDMA/mlx5: Remove extra unlock on error path
The commit below lifted the locking out of this function but left this
error-path unlock behind, resulting in unbalanced locking. Remove the
missed unlock too.

Cc: stable@vger.kernel.org
Fixes: 627122280c ("RDMA/mlx5: Add work to remove temporary entries from the cache")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/78090c210c750f47219b95248f9f782f34548bb1.1716900410.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-21 10:19:36 -03:00
Leon Romanovsky
b7161db2d9 Merge branch 'mlx5-next' into wip/leon-for-next
The req_transport_retries_exceeded counter shows the number of times
the requester detected that transport retries were exceeded.

The req_rnr_retries_exceeded counter shows the number of times the
requester detected that RNR NAK retries were exceeded.

Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 18:54:04 +03:00
Patrisious Haddad
b339e0a39d RDMA/mlx5: Add Qcounters req_transport_retries_exceeded/req_rnr_retries_exceeded
The req_transport_retries_exceeded counter shows the number of times
the requester detected that transport retries were exceeded.

The req_rnr_retries_exceeded counter shows the number of times the
requester detected that RNR NAK retries were exceeded.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://lore.kernel.org/r/250466af94f4989d638fab168e246035530e912f.1718301543.git.leon@kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 18:53:23 +03:00
Chiara Meiohas
a4e540119b RDMA/mlx5: Set mkeys for dmabuf at PAGE_SIZE
Set the mkey for dmabuf at PAGE_SIZE to support any SGL
after a move operation.

ib_umem_find_best_pgsz() returns 0 on error, so it is
incorrect to check the returned page_size against PAGE_SIZE.
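
A sketch of the fixed call site, assuming alloc_cacheable_mr() grows an
explicit page-size argument:

  /* dmabuf mkeys are built at PAGE_SIZE so any SGL produced after a
   * move operation can still be expressed */
  mr = alloc_cacheable_mr(pd, &umem_dmabuf->umem, virt_addr,
                          access_flags, PAGE_SIZE);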

Fixes: 90da7dc820 ("RDMA/mlx5: Support dma-buf based userspace memory region")
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/1e2289b9133e89f273a4e68d459057d032cbc2ce.1718301631.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 18:39:29 +03:00
Jianbo Liu
5895e70f2e IB/mlx5: Allocate resources just before first QP/SRQ is created
Previously, all IB dev resources were initialized on driver load. As
they are not always used, move the initialization to the time when
they are needed.

To be more specific, move PD (p0) and CQ (c0) initialization to the
time when the first SRQ is created, and move SRQ (s0 and s1)
initialization to the time the first QP is created. To avoid concurrent
creations, two new mutexes are also added.

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Link: https://lore.kernel.org/r/98c3e53a8cc0bdfeb6dec6e5bb8b037d78ab00d8.1717409369.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 18:37:57 +03:00
Jianbo Liu
638420115c IB/mlx5: Create UMR QP just before first reg_mr occurs
UMR QP is not used in some cases, so move the QP and its CQ creation
from the driver load flow to the time the first reg_mr occurs, that is,
when the MR interfaces are first called; a sketch of the lazy-create
gate follows the link below.

The initialization of dev->umrc.pd and dev->umrc.lock is still done in
driver load because the pd is needed for mlx5_mkey_cache_init and the
lock is reused to protect against concurrent creation.

When testing 4GB memory registration latency with rtool [1] and 8
threads in parallel, a minor performance degradation (<5% for the max
latency) is seen for the first reg_mr with this change.

Link: https://github.com/paravmellanox/rtool [1]
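
A sketch of the lazy-create gate; the lock type and helper name are
assumptions, mlx5r_umr_resource_init() is the existing init routine:

  static int mlx5r_umr_resource_get(struct mlx5_ib_dev *dev)
  {
          int err = 0;

          mutex_lock(&dev->umrc.lock);
          if (!dev->umrc.qp)      /* first reg_mr creates the QP/CQ */
                  err = mlx5r_umr_resource_init(dev);
          mutex_unlock(&dev->umrc.lock);
          return err;
  }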

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Link: https://lore.kernel.org/r/55d3c4f8a542fd974d8a4c5816eccfb318a59b38.1717409369.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 18:37:50 +03:00
Jianbo Liu
d98995b4bf net/mlx5: Reimplement write combining test
The write combining test was previously implemented in the mlx5_ib
driver. It opens a UD QP and posts NOP WQEs, and uses the BlueFlame
doorbell. When BlueFlame is used, WQEs get written directly to a PCI
BAR of the device (in addition to memory) so that the device handles
them without having to access memory.

In this test, the WQEs written in memory are different from the ones
written to the BlueFlame, which request a CQE update. By checking the
completion reports posted on the CQ, we can know whether BlueFlame
succeeded or not. Write combining must be supported if BlueFlame
succeeds, as its register is written using write combining.

This patch reimplements the test in the same way, but using only a
pair of SQ and CQ. It is moved to mlx5_core as a general feature used
by both mlx5_core and mlx5_ib.

Besides, save the write combining test result of the PCI function, so
that its thousands of child functions, such as SFs, can query it
without paying the time and resource penalty themselves. The test
function is called only after failing to get the cached result. With
this enhancement, the thousands of SFs of a PF attached to the same
driver no longer need to perform the WC check explicitly, as it has
already been done in the system. This saves several commands per SF,
thereby speeding up SF creation, and also saves completion EQ creation.
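
An illustrative shape of the caching; all names here are assumptions,
not the patch's API:

  bool mlx5_wc_supported(struct mlx5_core_dev *mdev)
  {
          /* run the SQ/CQ BlueFlame test once per PCI function... */
          if (mdev->wc_state == MLX5_WC_STATE_UNINITIALIZED)
                  mlx5_core_test_wc(mdev);
          /* ...then child functions (SFs) just read the cached result */
          return mdev->wc_state == MLX5_WC_STATE_SUPPORTED;
  }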

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/4ff5a8cc4c5b5b0d98397baa45a5019bcdbf096e.1717409369.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 18:33:59 +03:00
Linus Torvalds
25f4874662 RDMA v6.10 merge window
Normal set of driver updates and small fixes:
 
 - Small improvements and fixes for erdma, efa, hfi1, bnxt_re
 
 - Fix a UAF crash after module unload on leaking restrack entry
 
 - Continue adding full RDMA support in mana with support for EQs, GID's
   and CQs
 
 - Improvements to the mkey cache in mlx5
 
 - DSCP traffic class support in hns and several bug fixes
 
 - Cap the maximum number of MADs in the receive queue to avoid OOM
 
 - Another batch of rxe bug fixes from large scale testing
 
 - __iowrite64_copy() optimizations for write combining MMIO memory
 
 - Remove NULL checks before dev_put/hold()
 
 - EFA support for receive with immediate
 
 - Fix a recent memleaking regression in a cma error path
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZkeo2gAKCRCFwuHvBreF
 YbuNAQChzGmS4F0JAn5Wj0CDvkZghELqtvzEb92SzqcgdyQafAD/fC7f23LJ4OsO
 1ZIaQEZu7j9DVg5PKFZ7WfdXjGTKqwA=
 =QRXg
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Aside from the usual things this has an arch update for
  __iowrite64_copy() used by the RDMA drivers.

  This API was intended to generate large 64 byte MemWr TLPs on PCI.
  These days most processors can do this by just repeating writel() in
  a loop. S390 and some new ARM64 designs require a special helper to
  get such TLPs generated.

   - Small improvements and fixes for erdma, efa, hfi1, bnxt_re

   - Fix a UAF crash after module unload on leaking restrack entry

   - Continue adding full RDMA support in mana with support for EQs,
     GID's and CQs

   - Improvements to the mkey cache in mlx5

   - DSCP traffic class support in hns and several bug fixes

   - Cap the maximum number of MADs in the receive queue to avoid OOM

   - Another batch of rxe bug fixes from large scale testing

   - __iowrite64_copy() optimizations for write combining MMIO memory

   - Remove NULL checks before dev_put/hold()

   - EFA support for receive with immediate

   - Fix a recent memleaking regression in a cma error path"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (70 commits)
  RDMA/cma: Fix kmemleak in rdma_core observed during blktests nvme/rdma use siw
  RDMA/IPoIB: Fix format truncation compilation errors
  bnxt_re: avoid shift undefined behavior in bnxt_qplib_alloc_init_hwq
  RDMA/efa: Support QP with unsolicited write w/ imm. receive
  IB/hfi1: Remove generic .ndo_get_stats64
  IB/hfi1: Do not use custom stat allocator
  RDMA/hfi1: Use RMW accessors for changing LNKCTL2
  RDMA/mana_ib: implement uapi for creation of rnic cq
  RDMA/mana_ib: boundary check before installing cq callbacks
  RDMA/mana_ib: introduce a helper to remove cq callbacks
  RDMA/mana_ib: create and destroy RNIC cqs
  RDMA/mana_ib: create EQs for RNIC CQs
  RDMA/core: Remove NULL check before dev_{put, hold}
  RDMA/ipoib: Remove NULL check before dev_{put, hold}
  RDMA/mlx5: Remove NULL check before dev_{put, hold}
  RDMA/mlx5: Track DCT, DCI and REG_UMR QPs as driver_detail resources.
  RDMA/core: Add an option to display driver-specific QPs in the rdmatool
  RDMA/efa: Add shutdown notifier
  RDMA/mana_ib: Fix missing ret value
  IB/mlx5: Use __iowrite64_copy() for write combining stores
  ...
2024-05-18 13:04:15 -07:00
Jules Irenge
82e966130d RDMA/mlx5: Remove NULL check before dev_{put, hold}
Coccinelle reports a warning:

WARNING: NULL check before dev_{put, hold} functions is not needed

The reason is that dev_{put, hold} call netdev_{put, hold}, which
check for NULL, so there is no need to check before using
dev_{put, hold}.
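
The transformation, sketched:

  /* before */
  if (ndev)
          dev_put(ndev);

  /* after: dev_put() is already a no-op for a NULL argument */
  dev_put(ndev);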

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Link: https://lore.kernel.org/r/ZjGC4qXrOwZE0aHi@octinomon.home
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-02 17:45:59 +03:00
Chiara Meiohas
fd3af5e218 RDMA/mlx5: Track DCT, DCI and REG_UMR QPs as driver_detail resources.
Allow the user to see driver-specific QPs (the "driver_detail" QPs)
through the rdmatool, when requested.

When creating DCT, DCI and REG_UMR QPs, we designate them as
driver_detail resources.

When filling the QP info for the rdma tool, for the driver_detail QPs:
- the QP type is IB_QPT_DRIVER
- the subtype is a string with the QP name ("DCT", "DCI", "REG_UMR")

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://lore.kernel.org/r/452432d7d0917f053a80a893a614169857fe3b10.1713268997.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-30 11:19:37 +03:00
Jason Gunthorpe
ef302283dd IB/mlx5: Use __iowrite64_copy() for write combining stores
mlx5 has a built-in self-test at driver startup to evaluate if the
platform supports write combining to generate a 64 byte PCIe TLP or
not. This has proven necessary because a lot of common scenarios end up
with broken write combining (especially inside virtual machines) and
there is no other way to learn this information.

This self test has been consistently failing on new ARM64 CPU
designs (specifically with NVIDIA Grace's implementation of Neoverse
V2). The C loop around writeq() generates some pretty terrible ARM64
assembly, but historically this has worked on a lot of existing ARM64
CPUs till now.

We see it succeed about 1 time in 10,000 on the worst affected
systems. The CPU architects speculate that the load instructions
interspersed with the stores make the WC buffers statistically flush
too often, and thus the generation of large TLPs becomes infrequent.
This makes the boot up test unreliable in that it indicates no
write-combining, however userspace would be fine since it uses an ST4
instruction.

Further, S390 has similar issues where only the special
zpci_memcpy_toio() will actually generate large TLPs, and the open
coded loop does not trigger it at all.

Fix both ARM64 and S390 by switching to __iowrite64_copy() which now
provides architecture-specific variants that have a high chance of
generating a large TLP with write combining. x86 continues to use a
similar writeq loop in the generic __iowrite64_copy().
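
Usage shape for a 64-byte write-combined store (the destination name
is illustrative; note the count is in 8-byte units):

  __iowrite64_copy(bf_reg, wqe, 64 / 8);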

Fixes: 11f552e217 ("IB/mlx5: Test write combining support")
Link: https://lore.kernel.org/r/6-v3-1893cd8b9369+1925-mlx5_arm_wc_jgg@nvidia.com
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 17:11:20 -03:00
Or Har-Toov
2ca7e93bc9 RDMA/mlx5: Adding remote atomic access flag to updatable flags
Currently IB_ACCESS_REMOTE_ATOMIC is blocked from being updated via
UMR although in some cases it should be possible. These cases are
checked in the mlx5r_umr_can_reconfig() function.

Fixes: ef3642c4f5 ("RDMA/mlx5: Fix error unwinds for rereg_mr")
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Link: https://lore.kernel.org/r/24dac73e2fa48cb806f33a932d97f3e402a5ea2c.1712140377.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-08 13:34:54 +03:00
Or Har-Toov
8c1185fef6 RDMA/mlx5: Change check for cacheable mkeys
umem can be NULL for user application mkeys in some cases. Therefore
umem can't be used for checking if the mkey is cacheable; instead,
check a flag that indicates it. Also make sure that all mkeys which
are not returned to the cache will be destroyed.

Fixes: dd1b913fb0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Link: https://lore.kernel.org/r/2690bc5c6896bcb937f89af16a1ff0343a7ab3d0.1712140377.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-08 13:34:51 +03:00
Or Har-Toov
0611a8e8b4 RDMA/mlx5: Uncacheable mkey has neither rb_key nor cache_ent
As some mkeys can't be modified with UMR due to some UMR limitations,
like the size of translation that can be updated, not all user mkeys can
be cached.

Fixes: dd1b913fb0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Link: https://lore.kernel.org/r/f2742dd934ed73b2d32c66afb8e91b823063880c.1712140377.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-08 13:34:47 +03:00
Michael Guralnik
be121ffb38 RDMA/mlx5: Fix port number for counter query in multi-port configuration
Set the correct port when querying PPCNT in a multi-port configuration.
Distinguish between cases where switchdev mode was enabled and a
multi-port configuration, and don't overwrite the queried port to 1 in
the multi-port case.

Fixes: 74b30b3ad5 ("RDMA/mlx5: Set local port to one when accessing counters")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/9bfcc8ade958b760a51408c3ad654a01b11f7d76.1712134988.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-08 13:33:10 +03:00
Linus Torvalds
6207b37eb5 RDMA v6.9
Very small update this cycle:
 
 - Minor code improvements in fi, rxe, ipoib, mana, cxgb4, mlx5, irdma,
   rxe, rtrs, mana
 
 - Simplify the hns hem mechanism
 
 - Fix EFA's MSI-X allocation in resource constrained configurations
 
 - Fix a KASAN splat in srpt
 
 - Narrow hns's congestion control selection to QPs granularity and allow
   userspace to select it
 
 - Solve a parallel module loading race between the CM module and a driver
   module
 
 - Flexible array cleanup
 
 - Dump hns's SCC Context to 'rdma res' for debugging
 
 - Make mana build page lists for HW objects that require a 0 offset
   correctly
 
 - Stuck CM ID debugging
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZfgzdQAKCRCFwuHvBreF
 YbS7AQDLy6uJ/1dgrZQ4efcyQDs6H93LG4jWZKoA7F9Oho+MFQEAsQM/UL4nj18O
 T6vHl30N0Ee0aOCqET7HBbnFGKEADAE=
 =KxUj
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Very small update this cycle:

   - Minor code improvements in fi, rxe, ipoib, mana, cxgb4, mlx5,
     irdma, rxe, rtrs, mana

   - Simplify the hns hem mechanism

   - Fix EFA's MSI-X allocation in resource constrained configurations

   - Fix a KASAN splat in srpt

   - Narrow hns's congestion control selection to QPs granularity and
     allow userspace to select it

   - Solve a parallel module loading race between the CM module and a
     driver module

   - Flexible array cleanup

   - Dump hns's SCC Context to 'rdma res' for debugging

   - Make mana build page lists for HW objects that require a 0 offset
     correctly

   - Stuck CM ID debugging"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (29 commits)
  RDMA/cm: add timeout to cm_destroy_id wait
  RDMA/mana_ib: Use virtual address in dma regions for MRs
  RDMA/mana_ib: Fix bug in creation of dma regions
  RDMA/hns: Append SCC context to the raw dump of QPC
  RDMA/uverbs: Avoid -Wflex-array-member-not-at-end warnings
  RDMA/hns: Support userspace configuring congestion control algorithm with QP granularity
  RDMA/rtrs-clt: Check strnlen return len in sysfs mpath_policy_store()
  RDMA/uverbs: Remove flexible arrays from struct *_filter
  RDMA/device: Fix a race between mad_client and cm_client init
  RDMA/hns: Fix mis-modifying default congestion control algorithm
  RDMA/rxe: Remove unused 'iova' parameter from rxe_mr_init_user
  RDMA/srpt: Do not register event handler until srpt device is fully setup
  RDMA/irdma: Remove duplicate assignment
  RDMA/efa: Limit EQs to available MSI-X vectors
  RDMA/mlx5: Delete unused mlx5_ib_copy_pas prototype
  RDMA/cxgb4: Delete unused c4iw_ep_redirect prototype
  RDMA/mana_ib: Introduce mana_ib_install_cq_cb helper function
  RDMA/mana_ib: Introduce mana_ib_get_netdev helper function
  RDMA/mana_ib: Introduce mdev_to_gc helper function
  RDMA/hns: Simplify 'struct hns_roce_hem' allocation
  ...
2024-03-18 15:34:03 -07:00
Yishai Hadas
be551ee157 RDMA/mlx5: Relax DEVX access upon modify commands
Relax DEVX access upon modify commands to be UVERBS_ACCESS_READ.

The kernel doesn't need to protect what firmware protects, or what
causes no damage to anyone but the user.

As firmware needs to protect itself from parallel access to the same
object, don't block parallel modify/query commands on the same object
on the kernel side.

This change will allow user space applications to run parallel updates
to different entries in the same bulk object.
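
The likely shape of the uapi change, sketched on the modify method's
handle attribute:

  UVERBS_ATTR_IDR(MLX5_IB_ATTR_DEVX_OBJ_MODIFY_HANDLE,
                  MLX5_IB_OBJECT_DEVX_OBJ,
                  UVERBS_ACCESS_READ,     /* was UVERBS_ACCESS_WRITE */
                  UA_MANDATORY),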

Tested-by: Tamar Mashiah <tmashiah@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/7407d5ed35dc427c1097699e12b49c01e1073406.1706433934.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-01-31 11:15:39 +02:00
Mark Zhang
43fdbd1402 IB/mlx5: Don't expose debugfs entries for RRoCE general parameters if not supported
debugfs entries for RRoCE general CC parameters must be exposed only
when they are supported; otherwise, accessing them may produce a
syndrome error in the kernel log, for example:

$ cat /sys/kernel/debug/mlx5/0000:08:00.1/cc_params/rtt_resp_dscp
cat: '/sys/kernel/debug/mlx5/0000:08:00.1/cc_params/rtt_resp_dscp': Invalid argument
$ dmesg
 mlx5_core 0000:08:00.1: mlx5_cmd_out_err:805:(pid 1253): QUERY_CONG_PARAMS(0x824) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x325a82), err(-22)
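
A sketch of gating the file creation; the capability helper is
hypothetical:

  if (mlx5_ib_cc_general_params_supported(dev))
          debugfs_create_file("rtt_resp_dscp", 0600, cc_dir, dev,
                              &fops);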

Fixes: 66fb1d5df6 ("IB/mlx5: Extend debug control for CC parameters")
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/e7ade70bad52b7468bdb1de4d41d5fad70c8b71c.1706433934.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-01-31 11:15:29 +02:00
Leon Romanovsky
4d5e86a566 RDMA/mlx5: Fix fortify source warning while accessing Eth segment
------------[ cut here ]------------
 memcpy: detected field-spanning write (size 56) of single field "eseg->inline_hdr.start" at /var/lib/dkms/mlnx-ofed-kernel/5.8/build/drivers/infiniband/hw/mlx5/wr.c:131 (size 2)
 WARNING: CPU: 0 PID: 293779 at /var/lib/dkms/mlnx-ofed-kernel/5.8/build/drivers/infiniband/hw/mlx5/wr.c:131 mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib]
 Modules linked in: 8021q garp mrp stp llc rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) mlx5_core(OE) pci_hyperv_intf mlxdevm(OE) mlx_compat(OE) tls mlxfw(OE) psample nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c nfnetlink mst_pciconf(OE) knem(OE) vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd irqbypass cuse nfsv3 nfs fscache netfs xfrm_user xfrm_algo ipmi_devintf ipmi_msghandler binfmt_misc crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 snd_pcsp aesni_intel crypto_simd cryptd snd_pcm snd_timer joydev snd soundcore input_leds serio_raw evbug nfsd auth_rpcgss nfs_acl lockd grace sch_fq_codel sunrpc drm efi_pstore ip_tables x_tables autofs4 psmouse virtio_net net_failover failover floppy
  [last unloaded: mlx_compat(OE)]
 CPU: 0 PID: 293779 Comm: ssh Tainted: G           OE      6.2.0-32-generic #32~22.04.1-Ubuntu
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib]
 Code: 0c 01 00 a8 01 75 25 48 8b 75 a0 b9 02 00 00 00 48 c7 c2 10 5b fd c0 48 c7 c7 80 5b fd c0 c6 05 57 0c 03 00 01 e8 95 4d 93 da <0f> 0b 44 8b 4d b0 4c 8b 45 c8 48 8b 4d c0 e9 49 fb ff ff 41 0f b7
 RSP: 0018:ffffb5b48478b570 EFLAGS: 00010046
 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: ffffb5b48478b628 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000000 R12: ffffb5b48478b5e8
 R13: ffff963a3c609b5e R14: ffff9639c3fbd800 R15: ffffb5b480475a80
 FS:  00007fc03b444c80(0000) GS:ffff963a3dc00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000556f46bdf000 CR3: 0000000006ac6003 CR4: 00000000003706f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  ? show_regs+0x72/0x90
  ? mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib]
  ? __warn+0x8d/0x160
  ? mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib]
  ? report_bug+0x1bb/0x1d0
  ? handle_bug+0x46/0x90
  ? exc_invalid_op+0x19/0x80
  ? asm_exc_invalid_op+0x1b/0x20
  ? mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib]
  mlx5_ib_post_send_nodrain+0xb/0x20 [mlx5_ib]
  ipoib_send+0x2ec/0x770 [ib_ipoib]
  ipoib_start_xmit+0x5a0/0x770 [ib_ipoib]
  dev_hard_start_xmit+0x8e/0x1e0
  ? validate_xmit_skb_list+0x4d/0x80
  sch_direct_xmit+0x116/0x3a0
  __dev_xmit_skb+0x1fd/0x580
  __dev_queue_xmit+0x284/0x6b0
  ? _raw_spin_unlock_irq+0xe/0x50
  ? __flush_work.isra.0+0x20d/0x370
  ? push_pseudo_header+0x17/0x40 [ib_ipoib]
  neigh_connected_output+0xcd/0x110
  ip_finish_output2+0x179/0x480
  ? __smp_call_single_queue+0x61/0xa0
  __ip_finish_output+0xc3/0x190
  ip_finish_output+0x2e/0xf0
  ip_output+0x78/0x110
  ? __pfx_ip_finish_output+0x10/0x10
  ip_local_out+0x64/0x70
  __ip_queue_xmit+0x18a/0x460
  ip_queue_xmit+0x15/0x30
  __tcp_transmit_skb+0x914/0x9c0
  tcp_write_xmit+0x334/0x8d0
  tcp_push_one+0x3c/0x60
  tcp_sendmsg_locked+0x2e1/0xac0
  tcp_sendmsg+0x2d/0x50
  inet_sendmsg+0x43/0x90
  sock_sendmsg+0x68/0x80
  sock_write_iter+0x93/0x100
  vfs_write+0x326/0x3c0
  ksys_write+0xbd/0xf0
  ? do_syscall_64+0x69/0x90
  __x64_sys_write+0x19/0x30
  do_syscall_64+0x59/0x90
  ? do_user_addr_fault+0x1d0/0x640
  ? exit_to_user_mode_prepare+0x3b/0xd0
  ? irqentry_exit_to_user_mode+0x9/0x20
  ? irqentry_exit+0x43/0x50
  ? exc_page_fault+0x92/0x1b0
  entry_SYSCALL_64_after_hwframe+0x72/0xdc
 RIP: 0033:0x7fc03ad14a37
 Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
 RSP: 002b:00007ffdf8697fe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: 0000000000008024 RCX: 00007fc03ad14a37
 RDX: 0000000000008024 RSI: 0000556f46bd8270 RDI: 0000000000000003
 RBP: 0000556f46bb1800 R08: 0000000000007fe3 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
 R13: 0000556f46bc66b0 R14: 000000000000000a R15: 0000556f46bb2f50
  </TASK>
 ---[ end trace 0000000000000000 ]---

Link: https://lore.kernel.org/r/8228ad34bd1a25047586270f7b1fb4ddcd046282.1706433934.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2024-01-31 11:15:17 +02:00
Alexey Dobriyan
a400073ce3 RDMA/mlx5: Delete unused mlx5_ib_copy_pas prototype
mlx5_ib_copy_pas() doesn't exist anymore.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Link: https://lore.kernel.org/r/a2cb861e-d11e-4567-8a73-73763d1dc199@p183
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-01-25 12:06:57 +02:00
Linus Torvalds
bf9ca811bb RDMA v6.8 merge window
Small cycle, with some typical driver updates
 
 - General code tidying in siw, hfi1, irdma, usnic, hns, rtrs and bnxt_re
 
 - Many small siw cleanups without an overarching theme
 
  - Debugfs stats for hns
 
  - Fix a TX queue timeout in IPoIB and missed locking of the mcast list
 
  - Support more features of P7 devices in bnxt_re including a new work
    submission protocol
 
  - CQ interrupts for MANA
 
  - netlink stats for erdma
 
  - EFA multipath PCI support
 
  - Fix Incorrect MR invalidation in iser
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZaCQDQAKCRCFwuHvBreF
 YemEAQCTGebv0k2hbocDOmKml5awt8j9aDJX3aO7Zpfi0AYUtwEAzk+kgN4yAo+B
 Vinvpu171zry+QvmGJsXv2mtZkXH6QY=
 =HT3p
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Small cycle, with some typical driver updates:

   - General code tidying in siw, hfi1, irdma, usnic, hns, rtrs and
     bnxt_re

   - Many small siw cleanups without an overarching theme

   - Debugfs stats for hns

   - Fix a TX queue timeout in IPoIB and missed locking of the mcast
     list

   - Support more features of P7 devices in bnxt_re including a new work
     submission protocol

   - CQ interrupts for MANA

   - netlink stats for erdma

   - EFA multipath PCI support

   - Fix Incorrect MR invalidation in iser"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits)
  RDMA/bnxt_re: Fix error code in bnxt_re_create_cq()
  RDMA/efa: Add EFA query MR support
  IB/iser: Prevent invalidating wrong MR
  RDMA/erdma: Add hardware statistics support
  RDMA/erdma: Introduce dma pool for hardware responses of CMDQ requests
  IB/iser: iscsi_iser.h: fix kernel-doc warning and spellos
  RDMA/mana_ib: Add CQ interrupt support for RAW QP
  RDMA/mana_ib: query device capabilities
  RDMA/mana_ib: register RDMA device with GDMA
  RDMA/bnxt_re: Fix the sparse warnings
  RDMA/bnxt_re: Fix the offset for GenP7 adapters for user applications
  RDMA/bnxt_re: Share a page to expose per CQ info with userspace
  RDMA/bnxt_re: Add UAPI to share a page with user space
  IB/ipoib: Fix mcast list locking
  RDMA/mlx5: Expose register c0 for RDMA device
  net/mlx5: E-Switch, expose eswitch manager vport
  net/mlx5: Manage ICM type of SW encap
  RDMA/mlx5: Support handling of SW encap ICM area
  net/mlx5: Introduce indirect-sw-encap ICM properties
  RDMA/bnxt_re: Adds MSN table capability for Gen P7 adapters
  ...
2024-01-12 13:52:21 -08:00
Mark Bloch
d727d27db5 RDMA/mlx5: Expose register c0 for RDMA device
This patch introduces improvements for matching egress traffic sent by the
local device. When applicable, all egress traffic from the local vport is
now tagged with the provided value. This enhancement is particularly useful
for FDB steering purposes.

The primary focus of this update is facilitating the transmission of
traffic from the hypervisor to a VF. To achieve this, one must initiate an
SQ on the hypervisor and subsequently create a rule in the FDB that matches
on the eswitch manager vport and the SQN of the aforementioned SQ.

The SQN can be obtained from the opened SQ, and the eswitch manager
vport match can be substituted with the register c0 value exposed by
this patch.
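
A sketch of the substituted match, assuming the standard misc2 metadata
register fields:

  MLX5_SET(fte_match_set_misc2, misc2_m, metadata_reg_c_0, U32_MAX);
  MLX5_SET(fte_match_set_misc2, misc2_v, metadata_reg_c_0, reg_c0_value);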

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/aa4120a91c98ff1c44f1213388c744d4cb0324d6.1701871118.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-12 09:04:07 +02:00
Shun Hao
a429ec96c0 RDMA/mlx5: Support handling of SW encap ICM area
Add a new type for this ICM area. The user can now allocate/deallocate
SW encap ICM memory of the new type, to store the encap header data
which is managed by SW.

Signed-off-by: Shun Hao <shunh@nvidia.com>
Link: https://lore.kernel.org/r/546fe43fc700240709e30acf7713ec6834d652bd.1701871118.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-12 09:03:57 +02:00
Christian Brauner
3652117f85 eventfd: simplify eventfd_signal()
Ever since the eventfd type was introduced back in 2007 in commit
e1ad7468c7 ("signal/timer/event: eventfd core") the eventfd_signal()
function only ever passed 1 as a value for @n. There's no point in
keeping that additional argument.
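
The caller-side change is mechanical:

  -       eventfd_signal(ctx->eventfd, 1);
  +       eventfd_signal(ctx->eventfd);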

Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org
Acked-by: Xu Yilun <yilun.xu@intel.com>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl
Acked-by: Eric Farman <farman@linux.ibm.com>  # s390
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-28 14:08:38 +01:00
George Kennedy
2ef422f063 IB/mlx5: Fix init stage error handling to avoid double free of same QP and UAF
In the unlikely event that workqueue allocation fails and returns NULL in
mlx5_mkey_cache_init(), delete the call to
mlx5r_umr_resource_cleanup() (which frees the QP) in
mlx5_ib_stage_post_ib_reg_umr_init().  This will avoid attempted double
free of the same QP when __mlx5_ib_add() does its cleanup.
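
A sketch of the fixed error path:

  ret = mlx5_mkey_cache_init(dev);
  if (ret)
          mlx5_ib_warn(dev, "mr cache init failed %d\n", ret);
  /* no mlx5r_umr_resource_cleanup() here: on failure __mlx5_ib_add()
   * unwinds the stages and would otherwise free the UMR QP twice */
  return ret;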

Resolves a splat:

   Syzkaller reported a UAF in ib_destroy_qp_user

   workqueue: Failed to create a rescuer kthread for wq "mkey_cache": -EINTR
   infiniband mlx5_0: mlx5_mkey_cache_init:981:(pid 1642):
   failed to create work queue
   infiniband mlx5_0: mlx5_ib_stage_post_ib_reg_umr_init:4075:(pid 1642):
   mr cache init failed -12
   ==================================================================
   BUG: KASAN: slab-use-after-free in ib_destroy_qp_user (drivers/infiniband/core/verbs.c:2073)
   Read of size 8 at addr ffff88810da310a8 by task repro_upstream/1642

   Call Trace:
   <TASK>
   kasan_report (mm/kasan/report.c:590)
   ib_destroy_qp_user (drivers/infiniband/core/verbs.c:2073)
   mlx5r_umr_resource_cleanup (drivers/infiniband/hw/mlx5/umr.c:198)
   __mlx5_ib_add (drivers/infiniband/hw/mlx5/main.c:4178)
   mlx5r_probe (drivers/infiniband/hw/mlx5/main.c:4402)
   ...
   </TASK>

   Allocated by task 1642:
   __kmalloc (./include/linux/kasan.h:198 mm/slab_common.c:1026
   mm/slab_common.c:1039)
   create_qp (./include/linux/slab.h:603 ./include/linux/slab.h:720
   ./include/rdma/ib_verbs.h:2795 drivers/infiniband/core/verbs.c:1209)
   ib_create_qp_kernel (drivers/infiniband/core/verbs.c:1347)
   mlx5r_umr_resource_init (drivers/infiniband/hw/mlx5/umr.c:164)
   mlx5_ib_stage_post_ib_reg_umr_init (drivers/infiniband/hw/mlx5/main.c:4070)
   __mlx5_ib_add (drivers/infiniband/hw/mlx5/main.c:4168)
   mlx5r_probe (drivers/infiniband/hw/mlx5/main.c:4402)
   ...

   Freed by task 1642:
   __kmem_cache_free (mm/slub.c:1826 mm/slub.c:3809 mm/slub.c:3822)
   ib_destroy_qp_user (drivers/infiniband/core/verbs.c:2112)
   mlx5r_umr_resource_cleanup (drivers/infiniband/hw/mlx5/umr.c:198)
   mlx5_ib_stage_post_ib_reg_umr_init (drivers/infiniband/hw/mlx5/main.c:4076
   drivers/infiniband/hw/mlx5/main.c:4065)
   __mlx5_ib_add (drivers/infiniband/hw/mlx5/main.c:4168)
   mlx5r_probe (drivers/infiniband/hw/mlx5/main.c:4402)
   ...

Fixes: 04876c12c1 ("RDMA/mlx5: Move init and cleanup of UMR to umr.c")
Link: https://lore.kernel.org/r/1698170518-4006-1-git-send-email-george.kennedy@oracle.com
Suggested-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-31 11:16:05 -03:00
Moshe Shemesh
a53e215f90 RDMA/mlx5: Fix mkey cache WQ flush
The cited patch tries to ensure there are no pending works on the mkey
cache workqueue by disabling the addition of new works and calling
flush_workqueue(). But this workqueue also has delayed works which
might still be waiting out their delay before being queued.

Add cancel_delayed_work() for the delayed works that are waiting to be
queued, and then the flush_workqueue() will flush all works which are
already queued and running.
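
A minimal sketch of the resulting sequence (the per-entry iteration is
left out; names are taken from the surrounding code paths):

  /* for each cache entry: drop delayed works still waiting out their
   * delay, so they can no longer queue themselves later */
  cancel_delayed_work(&ent->dwork);

  /* then flush the works that are already queued and running */
  flush_workqueue(dev->cache.wq);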

Fixes: 374012b004 ("RDMA/mlx5: Fix mkey cache possible deadlock on cleanup")
Link: https://lore.kernel.org/r/b8722f14e7ed81452f791764a26d2ed4cfa11478.1698256179.git.leon@kernel.org
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-31 10:57:49 -03:00
Jason Gunthorpe
162e348024 Merge tag 'v6.6' into rdma.git for-next
Resolve conflict by taking the spin_lock hunk from for-next:

 https://lore.kernel.org/r/20230928113851.5197a1ec@canb.auug.org.au

Required for the next patch.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-31 10:54:48 -03:00
Patrisious Haddad
02e7d139e5 RDMA/mlx5: Change the key being sent for MPV device affiliation
Change the key that we send from the IB driver to the EN driver
regarding the MPV device affiliation, since at that stage the IB device
is not yet initialized, so its index would be zero for different IB
devices and cause wrong associations between unrelated master and
slave devices.

Instead use a unique value from inside the core device which is already
initialized at this stage.

Fixes: 0d293714ac ("RDMA/mlx5: Send events from IB driver about device affiliation state")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://lore.kernel.org/r/ac7e66357d963fc68d7a419515180212c96d137d.1697705185.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-10-24 17:10:58 +03:00
Patrisious Haddad
c1336bb4aa IB/mlx5: Fix rdma counter binding for RAW QP
Previously, when we had a RAW QP, we bound a counter to it when it
moved to INIT state, using the counter context inside the RQC.

But when we tried to modify that counter later, in RTS state, we used
modify QP, which tries to change the counter inside the QPC instead of
the RQC.

Now we correctly modify the counter set_id inside the RQC instead of
the QPC for the RAW QP.

Fixes: d14133dd41 ("IB/mlx5: Support set qp counter")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/2e5ab6713784a8fe997d19c508187a0dfecf2dfc.1696847964.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-10-15 11:04:01 +03:00
Leon Romanovsky
16419098e8 IPsec packet offload support in multiport RoCE devices
This series from Patrisious extends mlx5 to support IPsec packet offload
in multiport devices (MPV, see [1] for more details).

These devices have single flow steering logic and two netdev
interfaces, which require extra logic to manage IPsec configurations,
as they are performed on netdevs.

Thanks

[1] https://lore.kernel.org/linux-rdma/20180104152544.28919-1-leon@kernel.org/

Link: https://lore.kernel.org/all/20231002083832.19746-1-leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

* mlx5-next: (576 commits)
  net/mlx5: Handle IPsec steering upon master unbind/bind
  net/mlx5: Configure IPsec steering for ingress RoCEv2 MPV traffic
  net/mlx5: Configure IPsec steering for egress RoCEv2 MPV traffic
  net/mlx5: Add create alias flow table function to ipsec roce
  net/mlx5: Implement alias object allow and create functions
  net/mlx5: Add alias flow table bits
  net/mlx5: Store devcom pointer inside IPsec RoCE
  net/mlx5: Register mlx5e priv to devcom in MPV mode
  RDMA/mlx5: Send events from IB driver about device affiliation state
  net/mlx5: Introduce ifc bits for migration in a chunk mode
  Linux 6.6-rc3
  ...
2023-10-04 21:21:49 +03:00
Leon Romanovsky
c99a7457e5 RDMA/mlx5: Remove not-used cache disable flag
During execution of mlx5_mkey_cache_cleanup(), there is a guarantee
that MRs are not registered and/or destroyed. It means that we don't
need the newly introduced cache disable flag.

Fixes: 374012b004 ("RDMA/mlx5: Fix mkey cache possible deadlock on cleanup")
Link: https://lore.kernel.org/r/c7e9c9f98c8ae4a7413d97d9349b29f5b0a23dbe.1695921626.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-10-02 14:32:44 +03:00
Patrisious Haddad
0d293714ac RDMA/mlx5: Send events from IB driver about device affiliation state
Send blocking events from IB driver whenever the device is done being
affiliated or if it is removed from an affiliation.

This is useful since now the EN driver can register to those events
and know when a device is affiliated or not.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/a7491c3e483cfd8d962f5f75b9a25f253043384a.1695296682.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-10-02 11:20:59 +03:00
Patrisious Haddad
4f4db19089 IB/mlx5: Adjust mlx5 rate mapping to support 800Gb
Adjust the mlx5 function which maps the speed rate from IB spec values
to internal driver values, so it is able to handle speeds up to 800Gb.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/301c803d8486b0df8aefad3fb3cc10dc58671985.1695204156.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-09-26 12:38:54 +03:00
Patrisious Haddad
b28ad32442 IB/mlx5: Rename 400G_8X speed to comply to naming convention
Rename the 400G_8X speed to comply with the naming convention.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/ac98447cac8379a43fbdb36d56e5fb2b741a97ff.1695204156.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-09-26 12:38:50 +03:00
Patrisious Haddad
948f0bf5ad IB/mlx5: Add support for 800G_8X lane speed
Add a check for 800G_8X speed when querying PTYS and report it back
correctly when needed.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/26fd0b6e1fac071c3eb779657bb3d8ba47f47c4f.1695204156.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-09-26 12:38:46 +03:00
Or Har-Toov
561b4a3ac6 IB/mlx5: Expose XDR speed through MAD
Under MAD query port, report NDR speed when NDR is supported in the
port capability mask.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/d30bdec2a66a8a2edd1d84ee61453c58cf346b43.1695204156.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-09-26 12:38:43 +03:00
Shay Drory
57e7071683 RDMA/mlx5: Implement mkeys management via LIFO queue
Currently, mkeys are managed via an xarray. This implementation leads
to a degradation in cases where many MRs are deregistered in parallel,
due to the xarray's internal implementation; for example, deregistering
1M MRs via 64 threads takes ~15% more time [1].

Hence, implement mkeys management via a LIFO queue, which solves the
degradation.

[1]
2.8us in kernel v5.19 compared to 3.2us in kernel v6.4
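
A minimal sketch of the idea, assuming a fixed-size pool guarded by a
spinlock (struct and function names are illustrative, not the driver's
actual code):

  struct mkey_stack {
      spinlock_t lock;
      u32 *mkeys;          /* backing array, sized at init */
      unsigned int top;    /* number of stored mkeys */
  };

  /* push a freed mkey; the most recently freed one is reused first */
  static void mkey_stack_push(struct mkey_stack *s, u32 mkey)
  {
      spin_lock(&s->lock);
      s->mkeys[s->top++] = mkey;
      spin_unlock(&s->lock);
  }

  static bool mkey_stack_pop(struct mkey_stack *s, u32 *mkey)
  {
      bool found = false;

      spin_lock(&s->lock);
      if (s->top) {
          *mkey = s->mkeys[--s->top];
          found = true;
      }
      spin_unlock(&s->lock);
      return found;
  }

Unlike an xarray, push/pop only touch the top-of-stack slot, so
parallel deregistration does not pay for index-structure maintenance.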

Signed-off-by: Shay Drory <shayd@nvidia.com>
Link: https://lore.kernel.org/r/fde3d4cfab0f32f0ccb231cd113298256e1502c5.1695283384.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-09-26 12:36:18 +03:00
Shay Drory
374012b004 RDMA/mlx5: Fix mkey cache possible deadlock on cleanup
Fix the deadlock by refactoring the MR cache cleanup flow to flush the
workqueue without holding the rb_lock.
This adds a race between cache cleanup and creation of new entries,
which we solve by denying creation of new entries after cache cleanup
has started.
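
A sketch of the reordered cleanup; the lock and field names follow the
lockdep report below, and the disable flag is the one this patch
introduces:

  mutex_lock(&dev->cache.rb_lock);
  dev->cache.disable = true;        /* deny creation of new entries */
  mutex_unlock(&dev->cache.rb_lock);

  flush_workqueue(dev->cache.wq);   /* flush without holding rb_lock */

  mutex_lock(&dev->cache.rb_lock);
  /* ... tear down the RB-tree entries ... */
  mutex_unlock(&dev->cache.rb_lock);
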

Lockdep:
WARNING: possible circular locking dependency detected
 [ 2785.326074 ] 6.2.0-rc6_for_upstream_debug_2023_01_31_14_02 #1 Not tainted
 [ 2785.339778 ] ------------------------------------------------------
 [ 2785.340848 ] devlink/53872 is trying to acquire lock:
 [ 2785.341701 ] ffff888124f8c0c8 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0xc8/0x900
 [ 2785.343403 ]
 [ 2785.343403 ] but task is already holding lock:
 [ 2785.344464 ] ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib]
 [ 2785.346273 ]
 [ 2785.346273 ] which lock already depends on the new lock.
 [ 2785.346273 ]
 [ 2785.347720 ]
 [ 2785.347720 ] the existing dependency chain (in reverse order) is:
 [ 2785.349003 ]
 [ 2785.349003 ] -> #1 (&dev->cache.rb_lock){+.+.}-{3:3}:
 [ 2785.350160 ]        __mutex_lock+0x14c/0x15c0
 [ 2785.350962 ]        delayed_cache_work_func+0x2d1/0x610 [mlx5_ib]
 [ 2785.352044 ]        process_one_work+0x7c2/0x1310
 [ 2785.352879 ]        worker_thread+0x59d/0xec0
 [ 2785.353636 ]        kthread+0x28f/0x330
 [ 2785.354370 ]        ret_from_fork+0x1f/0x30
 [ 2785.355135 ]
 [ 2785.355135 ] -> #0 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}:
 [ 2785.356515 ]        __lock_acquire+0x2d8a/0x5fe0
 [ 2785.357349 ]        lock_acquire+0x1c1/0x540
 [ 2785.358121 ]        __flush_work+0xe8/0x900
 [ 2785.358852 ]        __cancel_work_timer+0x2c7/0x3f0
 [ 2785.359711 ]        mlx5_mkey_cache_cleanup+0xfb/0x250 [mlx5_ib]
 [ 2785.360781 ]        mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x16/0x30 [mlx5_ib]
 [ 2785.361969 ]        __mlx5_ib_remove+0x68/0x120 [mlx5_ib]
 [ 2785.362960 ]        mlx5r_remove+0x63/0x80 [mlx5_ib]
 [ 2785.363870 ]        auxiliary_bus_remove+0x52/0x70
 [ 2785.364715 ]        device_release_driver_internal+0x3c1/0x600
 [ 2785.365695 ]        bus_remove_device+0x2a5/0x560
 [ 2785.366525 ]        device_del+0x492/0xb80
 [ 2785.367276 ]        mlx5_detach_device+0x1a9/0x360 [mlx5_core]
 [ 2785.368615 ]        mlx5_unload_one_devl_locked+0x5a/0x110 [mlx5_core]
 [ 2785.369934 ]        mlx5_devlink_reload_down+0x292/0x580 [mlx5_core]
 [ 2785.371292 ]        devlink_reload+0x439/0x590
 [ 2785.372075 ]        devlink_nl_cmd_reload+0xaef/0xff0
 [ 2785.372973 ]        genl_family_rcv_msg_doit.isra.0+0x1bd/0x290
 [ 2785.374011 ]        genl_rcv_msg+0x3ca/0x6c0
 [ 2785.374798 ]        netlink_rcv_skb+0x12c/0x360
 [ 2785.375612 ]        genl_rcv+0x24/0x40
 [ 2785.376295 ]        netlink_unicast+0x438/0x710
 [ 2785.377121 ]        netlink_sendmsg+0x7a1/0xca0
 [ 2785.377926 ]        sock_sendmsg+0xc5/0x190
 [ 2785.378668 ]        __sys_sendto+0x1bc/0x290
 [ 2785.379440 ]        __x64_sys_sendto+0xdc/0x1b0
 [ 2785.380255 ]        do_syscall_64+0x3d/0x90
 [ 2785.381031 ]        entry_SYSCALL_64_after_hwframe+0x46/0xb0
 [ 2785.381967 ]
 [ 2785.381967 ] other info that might help us debug this:
 [ 2785.381967 ]
 [ 2785.383448 ]  Possible unsafe locking scenario:
 [ 2785.383448 ]
 [ 2785.384544 ]        CPU0                    CPU1
 [ 2785.385383 ]        ----                    ----
 [ 2785.386193 ]   lock(&dev->cache.rb_lock);
 [ 2785.386940 ]				lock((work_completion)(&(&ent->dwork)->work));
 [ 2785.388327 ]				lock(&dev->cache.rb_lock);
 [ 2785.389425 ]   lock((work_completion)(&(&ent->dwork)->work));
 [ 2785.390414 ]
 [ 2785.390414 ]  *** DEADLOCK ***
 [ 2785.390414 ]
 [ 2785.391579 ] 6 locks held by devlink/53872:
 [ 2785.392341 ]  #0: ffffffff84c17a50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40
 [ 2785.393630 ]  #1: ffff888142280218 (&devlink->lock_key){+.+.}-{3:3}, at: devlink_get_from_attrs_lock+0x12d/0x2d0
 [ 2785.395324 ]  #2: ffff8881422d3c38 (&dev->lock_key){+.+.}-{3:3}, at: mlx5_unload_one_devl_locked+0x4a/0x110 [mlx5_core]
 [ 2785.397322 ]  #3: ffffffffa0e59068 (mlx5_intf_mutex){+.+.}-{3:3}, at: mlx5_detach_device+0x60/0x360 [mlx5_core]
 [ 2785.399231 ]  #4: ffff88810e3cb0e8 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x8d/0x600
 [ 2785.400864 ]  #5: ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib]

Fixes: b958451783 ("RDMA/mlx5: Change the cache structure to an RB-tree")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-09-26 12:33:53 +03:00
Shay Drory
dab994bcc6 RDMA/mlx5: Fix NULL string error
checkpatch is complaining about a NULL string; change it to 'Unknown'.

Fixes: 37aa5c36aa ("IB/mlx5: Add UARs write-combining and non-cached mapping")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Link: https://lore.kernel.org/r/8638e5c14fadbde5fa9961874feae917073af920.1695203958.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-09-26 12:29:44 +03:00
Hamdan Igbaria
2fad8f06a5 RDMA/mlx5: Fix mutex unlocking on error flow for steering anchor creation
The mutex was not unlocked on some of the error flows.
Move the unlock location to cover all the error flow scenarios.
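
An illustrative shape of the fix, as a generic goto-unlock pattern
(not the function's exact code):

  mutex_lock(&lock);
  obj = create_something(dev);
  if (IS_ERR(obj)) {
      err = PTR_ERR(obj);
      goto unlock;        /* previously returned with the lock held */
  }
  /* ... later failure paths also jump to unlock ... */
  err = 0;
  unlock:
  mutex_unlock(&lock);    /* single exit covers every error flow */
  return err;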

Fixes: e1f4a52ac1 ("RDMA/mlx5: Create an indirect flow table for steering anchor")
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Hamdan Igbaria <hamdani@nvidia.com>
Link: https://lore.kernel.org/r/1244a69d783da997c0af0b827c622eb00495492e.1695203958.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-09-26 12:29:40 +03:00
Michael Guralnik
4f14c6c021 RDMA/mlx5: Fix assigning access flags to cache mkeys
After the change to use dynamic cache structure, new cache entries
can be added and the mkey allocation can no longer assume that all
mkeys created for the cache have access_flags equal to zero.

Example of a flow that exposes the issue:
A user registers an MR with RO on an HCA that cannot UMR RO, so the
mkey is created outside of the cache. When the user deregisters the
MR, a new cache entry is created to store mkeys with RO.

Later, the user registers 2 MRs with RO. The first MR is reused from
the new cache entry. When we try to get the second mkey from the cache
we see the entry is empty, so we go to the MR cache mkey allocation
flow, which would have allocated an mkey with no access flags,
resulting in the user getting an MR without RO.

Fixes: dd1b913fb0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/8a802700b82def3ace3f77cd7a9ad9d734af87e7.1695203958.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-09-26 12:29:33 +03:00
Linus Torvalds
f7e97ce269 v6.6 merge window RDMA pull request
Many small changes across the subsystem, some highlights:
 
 - Usual driver cleanups in qedr, siw, erdma, hfi1, mlx4/5, irdma, mthca,
   hns, and bnxt_re
 
 - siw now works over tunnel and other netdevs with a MAC address by
   removing assumptions about a MAC/GID from the connection manager
 
 - "Doorbell Pacing" for bnxt_re - this is a best effort scheme to allow
   userspace to slow down the doorbell rings if the HW gets full
 
 - irdma egress VLAN priority, better QP/WQ sizing
 
 - rxe bug fixes in queue draining and srq resizing
 
 - Support more ethernet speed options in the core layer
 
 - DMABUF support for bnxt_re
 
 - Multi-stage MTT support for erdma to allow much bigger MR registrations
 
 - An irdma fix with a CVE that came in too late to go to -rc, missing
   bounds checking for 0 length MRs
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZPEqkAAKCRCFwuHvBreF
 YZrNAPoCBfU+VjCKNr2yqF7s52os5ZdBV7Uuh4txHcXWW9H7GAD/f19i2u62fzNu
 C27jj4cztemMBb8mgwyxPw/wLg7NLwY=
 =pC6k
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Many small changes across the subystem, some highlights:

   - Usual driver cleanups in qedr, siw, erdma, hfi1, mlx4/5, irdma,
     mthca, hns, and bnxt_re

   - siw now works over tunnel and other netdevs with a MAC address by
     removing assumptions about a MAC/GID from the connection manager

   - "Doorbell Pacing" for bnxt_re - this is a best effort scheme to
     allow userspace to slow down the doorbell rings if the HW gets full

   - irdma egress VLAN priority, better QP/WQ sizing

   - rxe bug fixes in queue draining and srq resizing

   - Support more ethernet speed options in the core layer

   - DMABUF support for bnxt_re

   - Multi-stage MTT support for erdma to allow much bigger MR
     registrations

   - An irdma fix with a CVE that came in too late to go to -rc, missing
     bounds checking for 0 length MRs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (87 commits)
  IB/hfi1: Reduce printing of errors during driver shut down
  RDMA/hfi1: Move user SDMA system memory pinning code to its own file
  RDMA/hfi1: Use list_for_each_entry() helper
  RDMA/mlx5: Fix trailing */ formatting in block comment
  RDMA/rxe: Fix redundant break statement in switch-case.
  RDMA/efa: Fix wrong resources deallocation order
  RDMA/siw: Call llist_reverse_order in siw_run_sq
  RDMA/siw: Correct wrong debug message
  RDMA/siw: Balance the reference of cep->kref in the error path
  Revert "IB/isert: Fix incorrect release of isert connection"
  RDMA/bnxt_re: Fix kernel doc errors
  RDMA/irdma: Prevent zero-length STAG registration
  RDMA/erdma: Implement hierarchical MTT
  RDMA/erdma: Refactor the storage structure of MTT entries
  RDMA/erdma: Renaming variable names and field names of struct erdma_mem
  RDMA/hns: Support hns HW stats
  RDMA/hns: Dump whole QP/CQ/MR resource in raw
  RDMA/irdma: Add missing kernel-doc in irdma_setup_umode_qp()
  RDMA/mlx4: Copy union directly
  RDMA/irdma: Drop unused kernel push code
  ...
2023-09-01 16:49:33 -07:00
Jakub Kicinski
3c5066c6b0 Merge branch 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Leon Romanovsky says:

====================
mlx5 MACsec RoCEv2 support

From Patrisious:

This series extends the previously added MACsec offload support to
also cover RoCE traffic.

In order to achieve that, we need to configure MACsec with offload
between the two endpoints, like below:

REMOTE_MAC=10:70:fd:43:71:c0

* ip addr add 1.1.1.1/16 dev eth2
* ip link set dev eth2 up
* ip link add link eth2 macsec0 type macsec encrypt on
* ip macsec offload macsec0 mac
* ip macsec add macsec0 tx sa 0 pn 1 on key 00 dffafc8d7b9a43d5b9a3dfbbf6a30c16
* ip macsec add macsec0 rx port 1 address $REMOTE_MAC
* ip macsec add macsec0 rx port 1 address $REMOTE_MAC sa 0 pn 1 on key 01 ead3664f508eb06c40ac7104cdae4ce5
* ip addr add 10.1.0.1/16 dev macsec0
* ip link set dev macsec0 up

And in a similar manner on the other machine, while noting the keys order
would be reversed and the MAC address of the other machine.

RDMA traffic is separated through relevant GID entries, and in case
of an IP ambiguity issue - meaning we have a physical GID and a MACsec
GID with the same IP/GID - we disable our physical GID in order
to force the user to only use the MACsec GID.

v0: https://lore.kernel.org/netdev/20230813064703.574082-1-leon@kernel.org/

* 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  RDMA/mlx5: Handles RoCE MACsec steering rules addition and deletion
  net/mlx5: Add RoCE MACsec steering infrastructure in core
  net/mlx5: Configure MACsec steering for ingress RoCEv2 traffic
  net/mlx5: Configure MACsec steering for egress RoCEv2 traffic
  IB/core: Reorder GID delete code for RoCE
  net/mlx5: Add MACsec priorities in RDMA namespaces
  RDMA/mlx5: Implement MACsec gid addition and deletion
  net/mlx5: Maintain fs_id xarray per MACsec device inside macsec steering
  net/mlx5: Remove netdevice from MACsec steering
  net/mlx5e: Move MACsec flow steering and statistics database from ethernet to core
  net/mlx5e: Rename MACsec flow steering functions/parameters to suit core naming style
  net/mlx5: Remove dependency of macsec flow steering on ethernet
  net/mlx5e: Move MACsec flow steering operations to be used as core library
  macsec: add functions to get macsec real netdevice and check offload
====================

Link: https://lore.kernel.org/r/20230821073833.59042-1-leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-24 11:32:18 -07:00
Rohit Chavan
d3c2245754 RDMA/mlx5: Fix trailing */ formatting in block comment
Resolved a formatting issue where the trailing */ in a block comment
was placed on the same line instead of on a separate line.

Signed-off-by: Rohit Chavan <roheetchavan@gmail.com>
Link: https://lore.kernel.org/r/20230822120451.8215-1-roheetchavan@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-08-22 17:27:16 +03:00
Patrisious Haddad
58dbd6428a RDMA/mlx5: Handles RoCE MACsec steering rules addition and deletion
Add RoCE MACsec rules when a gid is added for the MACsec netdevice and
handle their cleanup when the gid is removed or the MACsec SA is deleted.
Also support alias IP for the MACsec device, as long as we don't have
more IPs than the gid table can hold.
In addition handle the case where a gid is added but there are still no
SAs added for the MACsec device, so the rules are added later on when
the SAs are added.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-08-20 12:35:24 +03:00
Patrisious Haddad
758ce14aee RDMA/mlx5: Implement MACsec gid addition and deletion
Handle the MACsec IP ambiguity issue: mlx5 hw can't support
programming both the MACsec and the physical gid when they have the
same IP address, because it wouldn't know to whom to steer the traffic.
Hence, in such a case we delete the physical gid from the hw gid table,
which then causes all traffic sent over it to fail, and we'll only be
able to send traffic over the MACsec gid.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-08-20 12:35:24 +03:00
Maher Sanalla
f14c1a14e6 net/mlx5: Allocate completion EQs dynamically
This commit enables dynamic allocation of EQs at runtime, allowing
for more flexibility in managing completion EQs and reducing the memory
overhead of driver load. Whenever a CQ is created for a given vector
index, the driver looks up whether there is an already mapped
completion EQ for that vector; if so, it is utilized. Otherwise, a new
EQ is allocated on demand and then utilized for the CQ completion
events.

Add a protection lock to the EQ table to protect from concurrent EQ
creation attempts.

While at it, replace mlx5_vector2irqn()/mlx5_vector2eqn() with
mlx5_comp_eqn_get() and mlx5_comp_irqn_get() which will allocate an
EQ on demand if no EQ is found for the given vector.
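
A sketch of the lookup-or-create flow (the table layout and helper
names are assumptions):

  static struct mlx5_eq_comp *comp_eqn_get(struct mlx5_eq_table *table,
                                           int vecidx)
  {
      struct mlx5_eq_comp *eq;

      mutex_lock(&table->comp_lock);   /* the new protection lock */
      eq = xa_load(&table->comp_eqs, vecidx);
      if (!eq)
          eq = create_comp_eq(table, vecidx);  /* allocate on demand and
                                                * map it to this vector */
      mutex_unlock(&table->comp_lock);
      return eq;
  }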

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-07 10:53:52 -07:00
Maher Sanalla
674dd4e2e0 net/mlx5: Rename mlx5_comp_vectors_count() to mlx5_comp_vectors_max()
To accurately represent its purpose, rename the function that retrieves
the value of maximum vectors from mlx5_comp_vectors_count() to
mlx5_comp_vectors_max().

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-07 10:53:51 -07:00
Shetu Ayalew
f0ff2a2dd0 IB/mlx5: Add HW counter called rx_dct_connect
The rx_dct_connect counter shows the number of received connection
requests for the associated DCTs.

Signed-off-by: Shetu Ayalew <shetu@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://lore.kernel.org/r/01cd24cd7f591734741309921fdc01fc770d84a8.1690121941.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-07-31 11:40:32 +03:00
Ruan Jinjie
a45e5f1859 RDMA/mlx: Remove unnecessary variable initializations
Remove unnecessary variable initializations.

Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Link: https://lore.kernel.org/r/20230728065139.3411703-1-ruanjinjie@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-07-31 10:05:23 +03:00
Yuanyuan Zhong
52b4bdd28c RDMA/mlx5: align MR mem allocation size to power-of-two
The MR memory allocation requests extra bytes to guarantee that there
is enough space to find the memory aligned to MLX5_UMR_ALIGN.

For power-of-two sizes, the alignment can be guaranteed by kmalloc()
according to commit 59bb47985c ("mm, sl[aou]b: guarantee natural
alignment for kmalloc(power-of-two)").

So if the target alignment is a power of two and adding the extra
bytes crosses a power-of-two boundary, use the next power of two as
the allocation size.
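
A sketch of the sizing rule, using the helpers from <linux/log2.h>
(MLX5_UMR_ALIGN is the power-of-two target here; the surrounding code
is illustrative):

  size_t want = size + MLX5_UMR_ALIGN - 1; /* slack to realign the pointer */

  /* if the slack pushes the request past the next power-of-two
   * boundary, round the whole allocation up so that kmalloc()
   * guarantees natural alignment */
  if (is_power_of_2(MLX5_UMR_ALIGN) &&
      roundup_pow_of_two(size) != roundup_pow_of_two(want))
      want = roundup_pow_of_two(want);

  buf = kmalloc(want, GFP_KERNEL);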

Signed-off-by: Yuanyuan Zhong <yzhong@purestorage.com>
Link: https://lore.kernel.org/r/20230629213248.3184245-2-yzhong@purestorage.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-07-12 15:06:43 +03:00
Linus Torvalds
7ede5f78a0 v6.5 merge window RDMA pull request
This cycle saw a focus on rxe and bnxt_re drivers:
 
 - Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma
 
 - rxe uses workqueues instead of tasklets
 
 - rxe has better compliance around access checks for MRs and rereg_mr
 
 - mana supports the 'v2' FW interface for RX coalescing
 
 - hfi1 bug fix for stale cache entries in its MR cache
 
 - mlx5 bug fix to handle FW failures when destroying QPs
 
 - erdma HW has a new doorbell allocation mechanism for uverbs that is
   secure
 
 - Lots of small cleanups and rework in bnxt_re
    * Use the common mmap functions
    * Support disassociation
    * Improve FW command flow
 
 - bnxt_re support for "low latency push", this allows a packet
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZJxq+wAKCRCFwuHvBreF
 YSN+AQDKAmWdKCqmc3boMIq5wt4h7yYzdW47LpzGarOn5Hf+UgEA6mpPJyRqB43C
 CNXYIbASl/LLaWzFvxCq/AYp6tzuog4=
 =I8ju
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This cycle saw a focus on rxe and bnxt_re drivers:

   - Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma

   - rxe uses workqueues instead of tasklets

   - rxe has better compliance around access checks for MRs and rereg_mr

   - mana supports the 'v2' FW interface for RX coalescing

   - hfi1 bug fix for stale cache entries in its MR cache

   - mlx5 bug fix to handle FW failures when destroying QPs

   - erdma HW has a new doorbell allocation mechanism for uverbs that is
     secure

   - Lots of small cleanups and rework in bnxt_re:
       - Use the common mmap functions
       - Support disassociation
       - Improve FW command flow
       - support for 'low latency push'"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits)
  RDMA/bnxt_re: Fix an IS_ERR() vs NULL check
  RDMA/bnxt_re: Fix spelling mistake "priviledged" -> "privileged"
  RDMA/bnxt_re: Remove duplicated include in bnxt_re/main.c
  RDMA/bnxt_re: Refactor code around bnxt_qplib_map_rc()
  RDMA/bnxt_re: Remove incorrect return check from slow path
  RDMA/bnxt_re: Enable low latency push
  RDMA/bnxt_re: Reorg the bar mapping
  RDMA/bnxt_re: Move the interface version to chip context structure
  RDMA/bnxt_re: Query function capabilities from firmware
  RDMA/bnxt_re: Optimize the bnxt_re_init_hwrm_hdr usage
  RDMA/bnxt_re: Add disassociate ucontext support
  RDMA/bnxt_re: Use the common mmap helper functions
  RDMA/bnxt_re: Initialize opcode while sending message
  RDMA/cma: Remove NULL check before dev_{put, hold}
  RDMA/rxe: Simplify cq->notify code
  RDMA/rxe: Fixes mr access supported list
  RDMA/bnxt_re: optimize the parameters passed to helper functions
  RDMA/bnxt_re: remove redundant cmdq_bitmap
  RDMA/bnxt_re: use firmware provided max request timeout
  RDMA/bnxt_re: cancel all control path command waiters upon error
  ...
2023-06-29 21:01:17 -07:00
Jason Gunthorpe
5f004bcaee Linux 6.4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmSYzfYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/ucH/iOM/1Py/fSg0qSs
 7NJ4XXlourT5zrnRMom3cm3d9gYqgTzgvKFL3kjMEexTRVYbhlcO4ZPRsiry8zxF
 ToGX+V8tDMqb8WSdFHzkljRY+zDRyfEUDMlTzROAD9DunLmQtkJKyrggkeGdjkpP
 OyfGqKpwlLXZRAXBil/U8Mx9MHdjJubloZwghLZr33VdUZa68+JJ9l6w163Oe/ET
 K264NM0wxN/kvN57JvePgqMccQwpINylg8IhRI+XelgczjUXeJBsOA8TDv4bDN4Q
 bjCLhkWbIaZtTYqvOXa/kD0T8wd7KETsMBQN8YzyDh6W0GmAlJjTawyAhA6jA5in
 x3uz2W8=
 =L3zp
 -----END PGP SIGNATURE-----

Merge tag 'v6.4' into rdma.git for-next

Linux 6.4

Resolve conflicts between rdma rc and next in rxe_cq matching linux-next:

drivers/infiniband/sw/rxe/rxe_cq.c:
  https://lore.kernel.org/r/20230622115246.365d30ad@canb.auug.org.au

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-27 14:06:29 -03:00
Jakub Kicinski
173780ff18 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/linux/mlx5/driver.h
  617f5db1a6 ("RDMA/mlx5: Fix affinity assignment")
  dc13180824 ("net/mlx5: Enable devlink port for embedded cpu VF vports")
https://lore.kernel.org/all/20230613125939.595e50b8@canb.auug.org.au/

tools/testing/selftests/net/mptcp/mptcp_join.sh
  47867f0a7e ("selftests: mptcp: join: skip check if MIB counter not supported")
  425ba80312 ("selftests: mptcp: join: support RM_ADDR for used endpoints or not")
  45b1a1227a ("mptcp: introduces more address related mibs")
  0639fa230a ("selftests: mptcp: add explicit check for new mibs")
https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-15 22:19:41 -07:00
Mark Bloch
617f5db1a6 RDMA/mlx5: Fix affinity assignment
The cited commit aimed to ensure that Virtual Functions (VFs) assign a
queue affinity to a Queue Pair (QP) to distribute traffic when
the LAG master creates a hardware LAG. If the affinity was set while
the hardware was not in LAG, the firmware would ignore the affinity value.

However, this commit unintentionally assigned an affinity to QPs on the LAG
master's VPORT even if the RDMA device was not marked as LAG-enabled.
In most cases, this was not an issue because when the hardware entered
hardware LAG configuration, the RDMA device of the LAG master would be
destroyed and a new one would be created, marked as LAG-enabled.

The problem arises when a user configures Equal-Cost Multipath (ECMP).
In ECMP mode, traffic can be directed to different physical ports based on
the queue affinity, which is intended for use by VPORTS other than the
E-Switch manager. ECMP mode is supported only if both E-Switch managers are
in switchdev mode and the appropriate route is configured via IP. In this
configuration, the RDMA device is not destroyed, and we retain the RDMA
device that is not marked as LAG-enabled.

To ensure correct behavior, Send Queues (SQs) opened by the E-Switch
manager through verbs should be assigned strict affinity. This means they
will only be able to communicate through the native physical port
associated with the E-Switch manager. This will prevent the firmware from
assigning affinity and will not allow the SQs to be remapped in case of
failover.

Fixes: 802dcc7fc5 ("RDMA/mlx5: Support TX port affinity for VF drivers in LAG mode")
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/425b05f4da840bc684b0f7e8ebf61aeb5cef09b0.1685960567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11 11:27:17 +03:00
Patrisious Haddad
2de43f5b51 RDMA/mlx5: Fix Q-counters query in LAG mode
Previously we used the core device associated with the IB device in
order to do the Q-counters query to the FW, but in LAG mode it is
possible that the core device isn't the one that created this VF.

Hence instead of using the core device to query the Q-counters
we use the ESW core device which is guaranteed to be that of the VF.

Fixes: d22467a71e ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/778d7d7a24892348d0bdef17d2e5f9e044717e86.1685960567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11 11:25:57 +03:00
Patrisious Haddad
e80ef13948 RDMA/mlx5: Remove vport Q-counters dependency on normal Q-counters
Previously, the Q-counters initialization assumed that the vport
Q-counters structures and the normal Q-counters structures are
identical in size; hence, when a Q-counter was added to the normal
Q-counters structure but not to the vport Q-counters struct, it would
lead to that counter's name being NULL in switchdev mode, which could
cause the kernel crash below.

Break the dependency between those two structures and always use the
appropriate struct size, in order to remove the assumption that both
structure sizes are equal.

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 20c64a067 P4D 20c64a067 PUD 20152b067 PMD 0
 Oops: 0000 [#1] SMP
 CPU: 19 PID: 11717 Comm: devlink Tainted: G           OE      6.2.0_mlnx #1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:strlen+0x0/0x20
 Code: 66 2e 0f 1f 84 00 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 48 89 f8 74 10 48 83 c7 01 80 3f 00 75 f7 48 29 c7 48 89
 RSP: 0018:ffffc9000318b618 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000002c00
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: 0000000000000000 R08: ffff888211918110 R09: ffff888211918000
 R10: 000000000000001e R11: ffff888211918000 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: ffff8881038ec250
 FS:  00007fa53342fe80(0000) GS:ffff88885fcc0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000002042b2003 CR4: 0000000000770ee0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  <TASK>
  kernfs_name_hash+0x12/0x80
  kernfs_find_ns+0x35/0xb0
  kernfs_remove_by_name_ns+0x46/0xc0
  remove_files.isra.1+0x30/0x70
  internal_create_group+0x253/0x380
  internal_create_groups.part.4+0x3e/0xa0
  setup_port+0x27a/0x8c0 [ib_core]
  ib_setup_port_attrs+0x9d/0x300 [ib_core]
  ib_register_device+0x48e/0x550 [ib_core]
  __mlx5_ib_add+0x2b/0x80 [mlx5_ib]
  mlx5_ib_vport_rep_load+0x141/0x360 [mlx5_ib]
  mlx5_esw_offloads_rep_load+0x48/0xa0 [mlx5_core]
  esw_offloads_enable+0x41e/0xd10 [mlx5_core]
  mlx5_eswitch_enable_locked+0x1e3/0x340 [mlx5_core]
  ? __cond_resched+0x15/0x30
  mlx5_devlink_eswitch_mode_set+0x204/0x3c0 [mlx5_core]
  devlink_nl_cmd_eswitch_set_doit+0x8d/0x100
  genl_family_rcv_msg_doit.isra.19+0xea/0x110
  genl_rcv_msg+0x19b/0x290
  ? devlink_nl_cmd_region_read_dumpit+0x760/0x760
  ? devlink_nl_cmd_port_param_get_doit+0x30/0x30
  ? devlink_put+0x50/0x50
  ? genl_get_cmd_both+0x60/0x60
  netlink_rcv_skb+0x54/0x100
  genl_rcv+0x24/0x40
  netlink_unicast+0x1be/0x2a0
  netlink_sendmsg+0x361/0x4d0
  sock_sendmsg+0x30/0x40
  __sys_sendto+0x11a/0x150
  ? handle_mm_fault+0x101/0x2b0
  ? do_user_addr_fault+0x21d/0x720
  __x64_sys_sendto+0x24/0x30
  do_syscall_64+0x34/0x80
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7fa533611cba
 Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c
 RSP: 002b:00007ffdb6a898a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
 RAX: ffffffffffffffda RBX: 0000000000daab00 RCX: 00007fa533611cba
 RDX: 0000000000000038 RSI: 0000000000daab00 RDI: 0000000000000003
 RBP: 0000000000daa910 R08: 00007fa533822000 R09: 000000000000000c
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
  </TASK>
 Modules linked in: rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) mlx_compat(OE) mlxfw(OE) memtrack(OE) pci_hyperv_intf nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat dns_resolver nf_nat br_netfilter nfs bridge stp llc lockd grace fscache netfs rfkill overlay iTCO_wdt iTCO_vendor_support kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel i2c_i801 sunrpc lpc_ich sha512_ssse3 pcspkr i2c_smbus mfd_core drm sch_fq_codel i2c_core ip_tables fuse crc32c_intel serio_raw virtio_net net_failover failover [last unloaded: mlxfw]
 CR2: 0000000000000000
 ---[ end trace 0000000000000000 ]---
 RIP: 0010:strlen+0x0/0x20
 Code: 66 2e 0f 1f 84 00 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 48 89 f8 74 10 48 83 c7 01 80 3f 00 75 f7 48 29 c7 48 89
 RSP: 0018:ffffc9000318b618 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000002c00
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: 0000000000000000 R08: ffff888211918110 R09: ffff888211918000
 R10: 000000000000001e R11: ffff888211918000 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: ffff8881038ec250
 FS:  00007fa53342fe80(0000) GS:ffff88885fcc0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000002042b2003 CR4: 0000000000770ee0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Kernel panic - not syncing: Fatal exception
 Kernel Offset: disabled
 ---[ end Kernel panic - not syncing: Fatal exception ]---

Fixes: d22467a71e ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/016777b7f16eb6bb178999ff59097d0c0f91f68a.1685960567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11 11:25:44 +03:00
Patrisious Haddad
c2ea687e5e RDMA/mlx5: Fix Q-counters per vport allocation
Previously, Q-counters data was allocated on the PF for all of the
available vports; however, that isn't necessary, since each VF or SF
has a Q-counter allocated for itself.

So we only need to allocate two counters data structures: one for the
device counters, and one shared by all the other vports to expose the
representors, since they only need to read from it, mainly to
determine counter numbers and names.

This in turn also solves a bug we previously had where we couldn't
switch the device to switchdev mode when there were more than 128
SFs/VFs configured, since that is the maximum number of Q-counters
available for a single port.

Fixes: d22467a71e ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/f54671df16e2227a069b229b33b62cd9ee24c475.1685960567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11 11:25:39 +03:00
Mark Bloch
e1f4a52ac1 RDMA/mlx5: Create an indirect flow table for steering anchor
A misbehaved user can create a steering anchor that points to a kernel
flow table and then destroy the anchor without freeing the associated
STC. This creates a problem as the kernel can't destroy the flow
table since there is still a reference to it. As a result, this can
exhaust all available flow table resources, preventing other users from
using the RDMA device.

To prevent this problem, a solution is implemented where a special flow
table with two steering rules is created when a user creates a steering
anchor for the first time. The rules include one that drops all traffic
and another that points to the kernel flow table. If the steering anchor
is destroyed, only the rule pointing to the kernel's flow table is removed.
Any traffic reaching the special flow table after that is dropped.

Since the special flow table is not destroyed when the steering anchor
is destroyed, the problem above cannot occur. The remaining resources
are only destroyed when the RDMA device is destroyed, which happens
after all DEVX objects are freed, including the STCs, thus mitigating
the issue.

Fixes: 0c6ab0ca9a ("RDMA/mlx5: Expose steering anchor to userspace")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://lore.kernel.org/r/b4a88a871d651fa4e8f98d552553c1cfe9ba2cd6.1685960567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11 11:25:34 +03:00
Maher Sanalla
ee4d269ecc RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions
Delay drop data is initiated for PFs that have the rq_delay_drop
capability and are in the roce profile.

However, PFs with the RAW ethernet profile do not initiate delay drop
data on function load, causing a kernel panic if delay drop struct
members are accessed later on, in case a dropless RQ is created.

Thus, stage the delay drop initialization as part of the RAW ethernet
PF loading process.

Fixes: b5ca15ad7e ("IB/mlx5: Add proper representors support")
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://lore.kernel.org/r/2e9d386785043d48c38711826eb910315c1de141.1685960567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11 11:25:29 +03:00
Patrisious Haddad
22664c06e9 RDMA/mlx5: Return the firmware result upon destroying QP/RQ
Previously, when destroying a QP/RQ, the result of the firmware
destruction function was ignored and upper layers weren't informed
about the failure. This in turn could lead to various problems, since
when the upper layer isn't aware of the failure it continues its
operation thinking that the related QP/RQ was successfully destroyed
while it actually wasn't, which could lead to the below kernel WARN.

Currently, we return the correct firmware destruction status to upper
layers; in the case of an RQ that is mlx5_ib_destroy_wq(), which was
already capable of handling RQ destruction failure, and in the case of
a QP it is destroy_qp_common(), which now actually warns upon qp
destruction failure.

WARNING: CPU: 3 PID: 995 at drivers/infiniband/core/rdma_core.c:940 uverbs_destroy_ufile_hw+0xcb/0xe0 [ib_uverbs]
Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_umad ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core overlay mlx5_core fuse
CPU: 3 PID: 995 Comm: python3 Not tainted 5.16.0-rc5+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:uverbs_destroy_ufile_hw+0xcb/0xe0 [ib_uverbs]
Code: 41 5c 41 5d 41 5e e9 44 34 f0 e0 48 89 df e8 4c 77 ff ff 49 8b 86 10 01 00 00 48 85 c0 74 a1 4c 89 e7 ff d0 eb 9a 0f 0b eb c1 <0f> 0b be 04 00 00 00 48 89 df e8 b6 f6 ff ff e9 75 ff ff ff 90 0f
RSP: 0018:ffff8881533e3e78 EFLAGS: 00010287
RAX: ffff88811b2cf3e0 RBX: ffff888106209700 RCX: 0000000000000000
RDX: ffff888106209780 RSI: ffff8881533e3d30 RDI: ffff888109b101a0
RBP: 0000000000000001 R08: ffff888127cb381c R09: 0de9890000000009
R10: ffff888127cb3800 R11: 0000000000000000 R12: ffff888106209780
R13: ffff888106209750 R14: ffff888100f20660 R15: 0000000000000000
FS:  00007f8be353b740(0000) GS:ffff88852c980000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8bd5b117c0 CR3: 000000012cd8a004 CR4: 0000000000370ea0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ib_uverbs_close+0x1a/0x90 [ib_uverbs]
 __fput+0x82/0x230
 task_work_run+0x59/0x90
 exit_to_user_mode_prepare+0x138/0x140
 syscall_exit_to_user_mode+0x1d/0x50
 ? __x64_sys_close+0xe/0x40
 do_syscall_64+0x4a/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f8be3ae0abb
Code: 03 00 00 00 0f 05 48 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 83 43 f9 ff 8b 7c 24 0c 41 89 c0 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 0c e8 c1 43 f9 ff 8b 44
RSP: 002b:00007ffdb51909c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000557bb7f7c020 RCX: 00007f8be3ae0abb
RDX: 0000557bb7c74010 RSI: 0000557bb7f14ca0 RDI: 0000000000000005
RBP: 0000557bb7fbd598 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000557bb7fbd5b8
R13: 0000557bb7fbd5a8 R14: 0000000000001000 R15: 0000557bb7f7c020
 </TASK>

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://lore.kernel.org/r/c6df677f931d18090bafbe7f7dbb9524047b7d9b.1685953497.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11 11:21:46 +03:00
Patrisious Haddad
afff248998 RDMA/mlx5: Handle DCT QP logic separately from low level QP interface
Previously, when destroying a DCT, if the firmware function for the
destruction failed, the common resource would have been destroyed
either way, since it was destroyed before the firmware object.
This leads to the kernel warning "refcount_t: underflow", which
indicates a possible use-after-free, and is triggered when we try to
destroy the common resource for the second time and execute
refcount_dec_and_test(&common->refcount).

So, let's fix the destruction order by factoring out the DCT QP logic
into a separate XArray database.

refcount_t: underflow; use-after-free.
WARNING: CPU: 8 PID: 1002 at lib/refcount.c:28 refcount_warn_saturate+0xd8/0xe0
Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core overlay mlx5_core fuse
CPU: 8 PID: 1002 Comm: python3 Not tainted 5.16.0-rc5+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:refcount_warn_saturate+0xd8/0xe0
Code: ff 48 c7 c7 18 f5 23 82 c6 05 60 70 ff 00 01 e8 d0 0a 45 00 0f 0b c3 48 c7 c7 c0 f4 23 82 c6 05 4c 70 ff 00 01 e8 ba 0a 45 00 <0f> 0b c3 0f 1f 44 00 00 8b 07 3d 00 00 00 c0 74 12 83 f8 01 74 13
RSP: 0018:ffff8881221d3aa8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8881313e8d40 RCX: ffff88852cc1b5c8
RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff88852cc1b5c0
RBP: ffff888100f70000 R08: ffff88853ffd1ba8 R09: 0000000000000003
R10: 00000000fffff000 R11: 3fffffffffffffff R12: 0000000000000246
R13: ffff888100f71fa0 R14: ffff8881221d3c68 R15: 0000000000000020
FS:  00007efebbb13740(0000) GS:ffff88852cc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005611aac29f80 CR3: 00000001313de004 CR4: 0000000000370ea0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 destroy_resource_common+0x6e/0x95 [mlx5_ib]
 mlx5_core_destroy_rq_tracked+0x38/0xbe [mlx5_ib]
 mlx5_ib_destroy_wq+0x22/0x80 [mlx5_ib]
 ib_destroy_wq_user+0x1f/0x40 [ib_core]
 uverbs_free_wq+0x19/0x40 [ib_uverbs]
 destroy_hw_idr_uobject+0x18/0x50 [ib_uverbs]
 uverbs_destroy_uobject+0x2f/0x190 [ib_uverbs]
 uobj_destroy+0x3c/0x80 [ib_uverbs]
 ib_uverbs_cmd_verbs+0x3e4/0xb80 [ib_uverbs]
 ? uverbs_free_wq+0x40/0x40 [ib_uverbs]
 ? ip_list_rcv+0xf7/0x120
 ? netif_receive_skb_list_internal+0x1b6/0x2d0
 ? task_tick_fair+0xbf/0x450
 ? __handle_mm_fault+0x11fc/0x1450
 ib_uverbs_ioctl+0xa4/0x110 [ib_uverbs]
 __x64_sys_ioctl+0x3e4/0x8e0
 ? handle_mm_fault+0xb9/0x210
 do_syscall_64+0x3d/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7efebc0be17b
Code: 0f 1e fa 48 8b 05 1d ad 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ed ac 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe71813e78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe71813fb8 RCX: 00007efebc0be17b
RDX: 00007ffe71813fa0 RSI: 00000000c0181b01 RDI: 0000000000000005
RBP: 00007ffe71813f80 R08: 00005611aae96020 R09: 000000000000004f
R10: 00007efebbf9ffa0 R11: 0000000000000246 R12: 00007ffe71813f80
R13: 00007ffe71813f4c R14: 00005611aae2eca0 R15: 00007efeae6c89d0
 </TASK>

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4470888466c8a898edc9833286967529cc5f3c0d.1685953497.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11 11:21:40 +03:00
Leon Romanovsky
2ecfd94616 RDMA/mlx5: Reduce QP table exposure
driver.h is a common header for the whole mlx5 code base, but struct
mlx5_qp_table is used only in the mlx5_ib driver. So move that struct
to be under the sole responsibility of mlx5_ib.

Link: https://lore.kernel.org/r/bec0dc1158e795813b135d1143147977f26bf668.1685953497.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-06-11 11:21:28 +03:00
Shay Drory
222dd18583 {net/RDMA}/mlx5: introduce lag_for_each_peer
Introduce a generic API to iterate over all the devices which are part
of the LAG. This API replaces mlx5_lag_get_peer_mdev(), which retrieved
only a single peer device from the lag.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-07 14:00:42 -07:00
Shay Drory
962825e534 RDMA/mlx5: Free second uplink ib port
The cited patch introduced an ib port for the slave device uplink in
the multiport eswitch case. However, nothing was done with this ib
port when unloading.
Unload the new ib port properly.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-07 14:00:42 -07:00
Linus Torvalds
af3877265d v6.4 merge window RDMA pull request
Usual wide collection of unrelated items in drivers:
 
 - Driver bug fixes and treewide cleanups in hfi1, siw, qib, mlx5, rxe,
   usnic, bnxt_re, ocrdma, iser
    * Unnecessary NULL checks
    * kmap obsolescence
    * pci_enable_pcie_error_reporting() obsolescence
    * Unused variables and macros
    * trace event related warnings
    * casting warnings
 
 - Code cleanups for irdma and erdma
 
 - EFA reporting of 128 byte PCIe TLP support
 
 - mlx5 more aggressively uses the out of order HW feature
 
 - Big rework of how state machines and tasks work in rxe
 
 - Fix a syzkaller found crash netdev refcount leak in siw
 
 - bnxt_re revises their HW description header
 
 - Congestion control for bnxt_re
 
 - Use mmu_notifiers more safely in hfi1
 
 - mlx5 gets better support for PCIe relaxed ordering inside VMs
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZEva5wAKCRCFwuHvBreF
 YZFmAQC9T3b/XQ3bRknYciuzbatC98o9xB0FTqmEFYGj+Y2lVAD9EEVe3HKfHfi3
 t/GxXYB5r22oxg5bgsblZfEdEdTVCg8=
 =akMm
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Usual wide collection of unrelated items in drivers:

   - Driver bug fixes and treewide cleanups in hfi1, siw, qib, mlx5,
     rxe, usnic, bnxt_re, ocrdma, iser:
       - remove unnecessary NULL checks
       - kmap obsolescence
       - pci_enable_pcie_error_reporting() obsolescence
       - unused variables and macros
       - trace event related warnings
       - casting warnings

   - Code cleanups for irdma and erdma

   - EFA reporting of 128 byte PCIe TLP support

   - mlx5 more aggressively uses the out of order HW feature

   - Big rework of how state machines and tasks work in rxe

   - Fix a syzkaller found crash netdev refcount leak in siw

   - bnxt_re revises their HW description header

   - Congestion control for bnxt_re

   - Use mmu_notifiers more safely in hfi1

   - mlx5 gets better support for PCIe relaxed ordering inside VMs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (81 commits)
  RDMA/efa: Add rdma write capability to device caps
  RDMA/mlx5: Use correct device num_ports when modify DC
  RDMA/irdma: Drop spurious WQ_UNBOUND from alloc_ordered_workqueue() call
  RDMA/rxe: Fix spinlock recursion deadlock on requester
  RDMA/mlx5: Fix flow counter query via DEVX
  RDMA/rxe: Protect QP state with qp->state_lock
  RDMA/rxe: Move code to check if drained to subroutine
  RDMA/rxe: Remove qp->req.state
  RDMA/rxe: Remove qp->comp.state
  RDMA/rxe: Remove qp->resp.state
  RDMA/mlx5: Allow relaxed ordering read in VFs and VMs
  net/mlx5: Update relaxed ordering read HCA capabilities
  RDMA/mlx5: Check pcie_relaxed_ordering_enabled() in UMR
  RDMA/mlx5: Remove pcie_relaxed_ordering_enabled() check for RO write
  RDMA: Add ib_virt_dma_to_page()
  RDMA/rxe: Fix the error "trying to register non-static key in rxe_cleanup_task"
  RDMA/irdma: Slightly optimize irdma_form_ah_cm_frame()
  RDMA/rxe: Fix incorrect TASKLET_STATE_SCHED check in rxe_task.c
  IB/hfi1: Place struct mmu_rb_handler on cache line start
  IB/hfi1: Fix bugs with non-PAGE_SIZE-end multi-iovec user SDMA requests
  ...
2023-04-29 17:21:24 -07:00
Mark Zhang
746aa3c8cb RDMA/mlx5: Use correct device num_ports when modify DC
Just like other QP types, when modifying DC, the port_num should be
compared with dev->num_ports, instead of HCA_CAP.num_ports. Otherwise,
a multi-port vHCA on DC may not work.

Fixes: 776a3906b6 ("IB/mlx5: Add support for DC target QP")
Link: https://lore.kernel.org/r/20230420013906.1244185-1-markzhang@nvidia.com
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-04-21 12:36:47 -03:00
Mark Bloch
3e358ea861 RDMA/mlx5: Fix flow counter query via DEVX
The commit cited in the "fixes" tag added bulk support for flow
counters, but it didn't account for the fact that it's also possible
to query a counter using a non-base id if the counter was allocated as
bulk.

When a user performs a query, validate that the flow counter id given
in the mailbox is inside the valid range, taking the bulk value into
account.
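
The validation reduces to a range check; a minimal sketch (names
assumed):

  /* a counter id is valid if it falls inside the bulk that was
   * allocated starting at base_id */
  static bool counter_id_in_range(u32 id, u32 base_id, u32 bulk_size)
  {
      return id >= base_id && id - base_id < bulk_size;
  }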

Fixes: 208d70f562 ("IB/mlx5: Support flow counters offset for bulk counters")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://lore.kernel.org/r/79d7fbe291690128e44672418934256254d93115.1681377114.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-04-18 08:47:10 +03:00
Avihai Horon
bd4ba605c4 RDMA/mlx5: Allow relaxed ordering read in VFs and VMs
According to PCIe spec, Enable Relaxed Ordering value in the VF's PCI
config space is wired to 0 and PF relaxed ordering (RO) setting should
be applied to the VF. In QEMU (and maybe others), when assigning VFs,
the RO bit in PCI config space is not emulated properly and is always
set to 0.

Therefore, pcie_relaxed_ordering_enabled() always returns 0 for VFs and
VMs and thus MKeys can't be created with RO read even if the PF supports
it.

pcie_relaxed_ordering_enabled() check was added to avoid a syndrome when
creating a MKey with relaxed ordering (RO) enabled when the driver's
relaxed_ordering_read_pci_enabled HCA capability is out of sync with FW.
With the new relaxed_ordering_read capability this can't happen, as it's
set regardless of RO value in PCI config space and thus can't change
during runtime.

Hence, to allow RO read in VFs and VMs, use the new HCA capability
relaxed_ordering_read without checking pcie_relaxed_ordering_enabled().
The old capability checks are kept for backward compatibility with older
FWs.
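
A sketch of the resulting capability check (the helper shape is an
assumption; the capability names are the ones from this series):

  static bool ro_read_supported(struct mlx5_core_dev *mdev)
  {
      /* new cap: set whenever the device supports RO read, independent
       * of the RO bit in PCI config space, so usable in VFs and VMs */
      if (MLX5_CAP_GEN(mdev, relaxed_ordering_read))
          return true;

      /* backward compatibility with older FW */
      return MLX5_CAP_GEN(mdev, relaxed_ordering_read_pci_enabled) &&
             pcie_relaxed_ordering_enabled(mdev->pdev);
  }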

Allowing RO in VFs and VMs is valuable since it can greatly improve
performance on some setups. For example, testing throughput of a VF on
an AMD EPYC 7763 and ConnectX-6 Dx setup showed roughly 60% performance
improvement.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Link: https://lore.kernel.org/r/e7048640d66c341a8fa0465e099926e7989184bc.1681131553.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-04-16 13:29:26 +03:00
Avihai Horon
ccbbfe0682 net/mlx5: Update relaxed ordering read HCA capabilities
Rename existing HCA capability relaxed_ordering_read to
relaxed_ordering_read_pci_enabled. This is in accordance with recent PRM
change to better describe the capability, as it's set only if both the
device supports relaxed ordering (RO) read and RO is enabled in PCI
config space.

In addition, add new HCA capability relaxed_ordering_read which is set
if the device supports RO read, regardless of RO in PCI config space.
This will be used in the following patch to allow RO in VFs and VMs.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Link: https://lore.kernel.org/r/caa0002fd8135086357dfcc368e2f5cc73b08480.1681131553.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-04-16 13:29:19 +03:00
Avihai Horon
d43b020b0f RDMA/mlx5: Check pcie_relaxed_ordering_enabled() in UMR
relaxed_ordering_read HCA capability is set if both the device supports
relaxed ordering (RO) read and RO is set in PCI config space.

RO in PCI config space can change during runtime. This will change the
value of relaxed_ordering_read HCA capability in FW, but the driver will
not see it since it queries the capabilities only once.

This can lead to the following scenario:
1. RO in PCI config space is enabled.
2. User creates MKey without RO.
3. RO in PCI config space is disabled.
   As a result, relaxed_ordering_read HCA capability is turned off in FW
   but remains on in driver copy of the capabilities.
4. User requests to reconfig the MKey with RO via UMR.
5. Driver will try to reconfig the MKey with RO read although it
   shouldn't (as relaxed_ordering_read HCA capability is really off).

To fix this, check pcie_relaxed_ordering_enabled() before setting RO
read in UMR.
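
A minimal sketch of that gating (the MLX5_UMR_RO_READ flag name is
hypothetical; pcie_relaxed_ordering_enabled() is the real PCI helper):

  /* Only request RO read over UMR if PCI config space still has
   * relaxed ordering enabled at this moment. */
  if (MLX5_CAP_GEN(dev->mdev, relaxed_ordering_read) &&
      pcie_relaxed_ordering_enabled(dev->mdev->pdev))
          flags |= MLX5_UMR_RO_READ;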

Fixes: 896ec97353 ("RDMA/mlx5: Set mkey relaxed ordering by UMR with ConnectX-7")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Link: https://lore.kernel.org/r/8d39eb8317e7bed1a354311a20ae707788fd94ed.1681131553.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-04-16 13:29:14 +03:00
Avihai Horon
ed4b0661cc RDMA/mlx5: Remove pcie_relaxed_ordering_enabled() check for RO write
pcie_relaxed_ordering_enabled() check was added to avoid a syndrome when
creating a MKey with relaxed ordering (RO) enabled when the driver's
relaxed_ordering_{read,write} HCA capabilities are out of sync with FW.

While this can happen with relaxed_ordering_read, it can't happen with
relaxed_ordering_write as it's set if the device supports RO write,
regardless of RO in PCI config space, and thus can't change during
runtime.

Therefore, drop the pcie_relaxed_ordering_enabled() check for
relaxed_ordering_write while keeping it for relaxed_ordering_read.
Doing so will also allow the usage of RO write in VFs and VMs (where RO
in PCI config space is not reported/emulated properly).

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Link: https://lore.kernel.org/r/7e8f55e31572c1702d69cae015a395d3a824a38a.1681131553.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-04-16 13:29:07 +03:00
Tom Rix
081c27b3bc RDMA/mlx5: Remove unused num_alloc_xa_entries variable
clang with W=1 reports
drivers/infiniband/hw/mlx5/devx.c:1996:6: error: variable
  'num_alloc_xa_entries' set but not used [-Werror,-Wunused-but-set-variable]
        int num_alloc_xa_entries = 0;
            ^
This variable is not used so remove it.

Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20230330153607.1838750-1-trix@redhat.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-04-03 15:46:47 +03:00
Patrisious Haddad
d22467a71e RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics
Previously, for switchdev, only per-device counters were supported.

Now counters are allocated for switchdev per port, which also includes
the ports that belong to VF representors, in order to expose them to
users through the rdma tool, allowing the host to track the VFs'
statistics through their representors' counters.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://lore.kernel.org/r/ea31e1103c125cd27931ba213f307cde30d2eaed.1679566038.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-29 10:09:23 +03:00
Leon Romanovsky
602fb42057 Enable IB out-of-order by default in mlx5
This series from Or changes the default of the IB out-of-order feature
and allows RDMA users to decide whether they need to wait for
completion of all segments, or whether it is enough to wait for the
last segment's completion only.

Thanks

Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-23 10:22:32 +02:00
Or Har-Toov
742948cc02 RDMA/mlx5: Disable out-of-order in integrity enabled QPs
Set retry_mode to GO_BACK_N when a QP is created with the INTEGRITY_EN
flag, because out-of-order is not supported when doing HW offload of
signature operations.
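
A rough sketch, assuming the PRM-style qpc accessors (the retry-mode
constant's spelling is approximate):

  /* Out-of-order completions can't be combined with signature
   * offload, so force go-back-N retransmission for such QPs. */
  if (init_attr->create_flags & IB_QP_CREATE_INTEGRITY_EN)
          MLX5_SET(qpc, qpc, retry_mode, MLX5_QP_RM_GO_BACK_N);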

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/362de42cdc7a541afa5b1fd0ec6ae706061764a2.1679230449.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-23 10:21:13 +02:00
Maher Sanalla
88c9483faf IB/mlx5: Add support for 400G_8X lane speed
Currently, when the driver queries PTYS to report which link speed is
being used on its RoCE ports, it does not check the case of having
400Gbps transmitted over 8 lanes. Thus it fails to report said speed
and instead defaults to reporting 10G over 4 lanes.

Add a check for the said speed when querying PTYS and report it back
correctly when needed.
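
An illustrative fragment of the PTYS-to-IB translation (the enum
spelling of the 400G/8-lane mode is approximate):

  switch (eth_proto_oper) {
  case MLX5E_400GAUI_8:                   /* 400Gbps over 8 lanes */
          *active_width = IB_WIDTH_8X;    /* 8 lanes */
          *active_speed = IB_SPEED_HDR;   /* 50Gbps per lane */
          break;
  }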

Fixes: 08e8676f16 ("IB/mlx5: Add support for 50Gbps per lane link modes")
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/ec9040548d119d22557d6a4b4070d6f421701fd4.1678973994.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-20 09:51:17 +02:00
Rohit Chavan
c4526fe2e4 RDMA/mlx5: Coding style fix reported by checkpatch
Block comments should align the * on each line on line 2849
Avoid line continuations in quoted strings on line 3848

Signed-off-by: Rohit Chavan <roheetchavan@gmail.com>
Link: https://lore.kernel.org/r/20230319100847.5566-1-roheetchavan@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-19 14:03:34 +02:00
Linus Torvalds
8cbd92339d v6.3 RDMA pull request
Small cycle this time:
 
 - Minor driver updates for hfi1, cxgb4, erdma, hns, irdma, mlx5, siw, mana
 
 - inline CQE support for hns
 
 - Have mlx5 display device error codes
 
 - Pinned DMABUF support for irdma
 
 - Continued rxe cleanups, particularly converting the MRs to use xarray
 
 - Improvements to what can be cached in the mlx5 mkey cache
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCY/gPmgAKCRCFwuHvBreF
 YW5IAP4xOAiTif4f87vD1twRU/ebq4VEX0r+C2NX5x5fwlCJrAEA7RLV8uG9Uii2
 ez0BuWNxfajuvFHntnZ1E+7UDP0S8gk=
 =CgUH
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Quite a small cycle this time, even with the rc8. I suppose everyone
  went to sleep over xmas.

   - Minor driver updates for hfi1, cxgb4, erdma, hns, irdma, mlx5, siw,
     mana

   - inline CQE support for hns

   - Have mlx5 display device error codes

   - Pinned DMABUF support for irdma

   - Continued rxe cleanups, particularly converting the MRs to use
     xarray

   - Improvements to what can be cached in the mlx5 mkey cache"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (61 commits)
  IB/mlx5: Extend debug control for CC parameters
  IB/hfi1: Fix sdma.h tx->num_descs off-by-one errors
  IB/hfi1: Fix math bugs in hfi1_can_pin_pages()
  RDMA/irdma: Add support for dmabuf pin memory regions
  RDMA/mlx5: Use query_special_contexts for mkeys
  net/mlx5e: Use query_special_contexts for mkeys
  net/mlx5: Change define name for 0x100 lkey value
  net/mlx5: Expose bits for querying special mkeys
  RDMA/rxe: Fix missing memory barriers in rxe_queue.h
  RDMA/mana_ib: Fix a bug when the PF indicates more entries for registering memory on first packet
  RDMA/rxe: Remove rxe_alloc()
  RDMA/cma: Distinguish between sockaddr_in and sockaddr_in6 by size
  Subject: RDMA/rxe: Handle zero length rdma
  iw_cxgb4: Fix potential NULL dereference in c4iw_fill_res_cm_id_entry()
  RDMA/mlx5: Use rdma_umem_for_each_dma_block()
  RDMA/umem: Remove unused 'work' member from struct ib_umem
  RDMA/irdma: Cap MSIX used to online CPUs + 1
  RDMA/mlx5: Check reg_create() create for errors
  RDMA/restrack: Correct spelling
  RDMA/cxgb4: Fix potential null-ptr-deref in pass_establish()
  ...
2023-02-24 15:11:03 -08:00
Linus Torvalds
3822a7c409 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X bit.
 
 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.
 
 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes
 
 - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
   does perform some memcg maintenance and cleanup work.
 
 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".  These filters provide users
   with finer-grained control over DAMOS's actions.  SeongJae has also done
   some DAMON cleanup work.
 
 - Kairui Song adds a series ("Clean up and fixes for swap").
 
 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".
 
 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series.  It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.
 
 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".
 
 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".
 
 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".
 
 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".
 
 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series "mm:
   support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
   PTEs".
 
 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".
 
 - Sergey Senozhatsky has improved zsmalloc's memory utilization with his
   series "zsmalloc: make zspage chain size configurable".
 
 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.  The previous BPF-based approach had
   shortcomings.  See "mm: In-kernel support for memory-deny-write-execute
   (MDWE)".
 
 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
 
 - T.J.  Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".
 
 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a per-node
   basis.  See the series "Introduce per NUMA node memory error
   statistics".
 
 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage during
   compaction".
 
 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".
 
 - Christoph Hellwig has removed block_device_operations.rw_page() in ths
   series "remove ->rw_page".
 
 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".
 
 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier functions".
 
 - Some pagemap cleanup and generalization work in Mike Rapoport's series
   "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
   "fixups for generic implementation of pfn_valid()"
 
 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
 
 - Jason Gunthorpe rationalized the GUP system's interface to the rest of
   the kernel in the series "Simplify the external interface for GUP".
 
 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface.  To support this, we'll temporarily be
   printing warnings when people use the debugfs interface.  See the series
   "mm/damon: deprecate DAMON debugfs interface".
 
 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.
 
 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".
 
 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
 jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
 DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
 =MlGs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
   F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X
   bit.

 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.

 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes

 - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
   which does perform some memcg maintenance and cleanup work.

 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".

   These filters provide users with finer-grained control over DAMOS's
   actions. SeongJae has also done some DAMON cleanup work.

 - Kairui Song adds a series ("Clean up and fixes for swap").

 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".

 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.

 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".

 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".

 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".

 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".

 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series
   "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
   swap PTEs".

 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".

 - Sergey Senozhatsky has improved zsmalloc's memory utilization with
   his series "zsmalloc: make zspage chain size configurable".

 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.

   The previous BPF-based approach had shortcomings. See "mm: In-kernel
   support for memory-deny-write-execute (MDWE)".

 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

 - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".

 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a
   per-node basis. See the series "Introduce per NUMA node memory error
   statistics".

 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage
   during compaction".

 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".

 - Christoph Hellwig has removed block_device_operations.rw_page() in
   ths series "remove ->rw_page".

 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".

 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier
   functions".

 - Some pagemap cleanup and generalization work in Mike Rapoport's
   series "mm, arch: add generic implementation of pfn_valid() for
   FLATMEM" and "fixups for generic implementation of pfn_valid()"

 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

 - Jason Gunthorpe rationalized the GUP system's interface to the rest
   of the kernel in the series "Simplify the external interface for
   GUP".

 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface. To support this, we'll temporarily be
   printing warnings when people use the debugfs interface. See the
   series "mm/damon: deprecate DAMON debugfs interface".

 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.

 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".

 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
2023-02-23 17:09:35 -08:00
Edward Srouji
66fb1d5df6 IB/mlx5: Extend debug control for CC parameters
This patch adds rtt_resp_dscp to the current debug controllability of
congestion control (CC) parameters.
rtt_resp_dscp can be read or written through debugfs.
If set, its value overwrites the DSCP of the generated RTT response.
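
A minimal sketch of wiring up such a knob (illustrative; the real
driver routes this through its CC parameter machinery rather than a
bare debugfs file):

  /* Expose the 8-bit DSCP override for generated RTT responses. */
  debugfs_create_u8("rtt_resp_dscp", 0644, cc_debugfs_dir,
                    &cc_params->rtt_resp_dscp);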

Signed-off-by: Edward Srouji <edwards@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://lore.kernel.org/r/1dcc3440ee53c688f19f579a051ded81a2aaa70a.1676538714.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-02-19 11:50:59 +02:00
Jason Gunthorpe
5ef17179da Merge mlx5-next into rdma.git for-next
Synchronize the shared mlx5 branch with net:
 - From Jiri: fix a deadlock in mlx5_ib's netdev notifier unregister.
 - From Mark and Patrisious: add IPsec RoCEv2 support.
 - From Or: Rely on firmware to get special mkeys

* branch mlx5-next:
  RDMA/mlx5: Use query_special_contexts for mkeys
  net/mlx5e: Use query_special_contexts for mkeys
  net/mlx5: Change define name for 0x100 lkey value
  net/mlx5: Expose bits for querying special mkeys
  net/mlx5: Configure IPsec steering for egress RoCEv2 traffic
  net/mlx5: Configure IPsec steering for ingress RoCEv2 traffic
  net/mlx5: Add IPSec priorities in RDMA namespaces
  net/mlx5: Implement new destination type TABLE_TYPE
  net/mlx5: Introduce new destination type TABLE_TYPE
  RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister
  net/mlx5e: Propagate an internal event in case uplink netdev changes
  net/mlx5e: Fix trap event handling

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-02-17 16:24:14 -04:00
Or Har-Toov
594cac11ab RDMA/mlx5: Use query_special_contexts for mkeys
Use query_special_contexts to get the correct value of mkeys such as
null_mkey, terminate_scatter_list_mkey and dump_fill_mkey, as FW will
change them in certain configurations.

Link: https://lore.kernel.org/r/000236f0a9487d48809f87bcc3620a3964b2d3d3.1673960981.git.leon@kernel.org
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-02-17 16:22:23 -04:00
Or Har-Toov
a419bfb763 net/mlx5: Change define name for 0x100 lkey value
Change the define of the 0x100 lkey value from MLX5_INVALID_LKEY to
MLX5_TERMINATE_SCATTER_LIST_LKEY, as 0x100 is the value of
terminate_scatter_list_mkey.

Link: https://lore.kernel.org/r/3a116dc3fbae4cb6b76a63d27d418830b06ade0c.1673960981.git.leon@kernel.org
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-02-17 16:22:22 -04:00
Jason Gunthorpe
50a542a8ac RDMA/mlx5: Use rdma_umem_for_each_dma_block()
Replace an open coding of rdma_umem_for_each_dma_block() with the proper
function.
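
For reference, the helper-based form looks roughly like this (variable
names are illustrative):

  struct ib_block_iter biter;

  rdma_umem_for_each_dma_block(umem, &biter, page_size)
          pas[i++] = cpu_to_be64(rdma_block_iter_dma_address(&biter));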

Fixes: b3d47ebd49 ("RDMA/mlx5: Use mlx5_umr_post_send_wait() to update MR pas")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v1-c13a5b88359b+556d0-mlx5_umem_block_jgg@nvidia.com
Reviewed-by: Devesh Sharma <devesh.s.sharma@oracle.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-02-15 13:21:47 +02:00
Mark Bloch
27f9e0ccb6 net/mlx5: Lag, Add single RDMA device in multiport mode
In MultiPort E-Switch mode a single RDMA device is created. This device
has multiple RDMA ports that represent the uplink ports that are
connected to the E-Switch. Account for this when creating the RDMA
device so it has an additional port for the non-native uplink.

As a side effect of this patch, use shared fdb in multiport eswitch mode.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-14 14:08:25 -08:00
Suren Baghdasaryan
1c71222e5f mm: replace vma->vm_flags direct modifications with modifier calls
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.
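
An illustrative before/after of the conversion (vm_flags_set() and its
siblings are introduced by this series):

  /* Before: direct modification, invisible to vma locking checks */
  vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;

  /* After: tracked modification through the new helpers */
  vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);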

[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09 16:51:39 -08:00
Jiri Pirko
dca55da0a1 RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister
When removing a network namespace that has an mlx5 devlink instance in
it, the following call chain is performed:

cleanup_net (takes down_read(&pernet_ops_rwsem)
devlink_pernet_pre_exit()
devlink_reload()
mlx5_devlink_reload_down()
mlx5_unload_one_devl_locked()
mlx5_detach_device()
del_adev()
mlx5r_remove()
__mlx5_ib_remove()
mlx5_ib_roce_cleanup()
mlx5_remove_netdev_notifier()
unregister_netdevice_notifier (takes down_write(&pernet_ops_rwsem)

This deadlocks.

Resolve this by converting to register_netdevice_notifier_dev_net()
which does not take pernet_ops_rwsem and moves the notifier block around
according to netdev it takes as arg.

Use previously introduced netdev added/removed events to track uplink
netdev to be used for register_netdevice_notifier_dev_net() purposes.
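
A sketch of the per-netdev registration (the roce.nb/roce.nn field
names are illustrative):

  /* Takes no pernet_ops_rwsem; the core moves the notifier along
   * with the netdev when it changes network namespaces. */
  err = register_netdevice_notifier_dev_net(uplink_netdev,
                                            &port->roce.nb,
                                            &port->roce.nn);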

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-08 20:40:57 -08:00
Dan Carpenter
8e6e49ccf1 RDMA/mlx5: Check reg_create() create for errors
The reg_create() can fail.  Check for errors before dereferencing it.
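
A sketch of the fix (the reg_create() signature shown here is
approximate):

  mr = reg_create(pd, umem, iova, access_flags, page_size, false);
  if (IS_ERR(mr))
          return mr;      /* bail out before touching mr->... */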

Fixes: dd1b913fb0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/r/Y+ERYy4wN0LsKsm+@kili
Reviewed-by: Devesh Sharma <devesh.s.sharma@oracle.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-02-07 12:27:49 +02:00
Leon Romanovsky
85f9e38a5a RDMA/mlx5: Remove impossible check of mkey cache cleanup failure
mlx5_mkey_cache_cleanup() can't fail and can be changed to be void.

Link: https://lore.kernel.org/r/1acd9528995d083114e7dec2a2afc59436406583.1675328463.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-02-06 15:44:26 +02:00
Leon Romanovsky
828cf5936b RDMA/mlx5: Fix MR cache debugfs error in IB representors mode
Block MR cache debugfs creation for the IB representor flow, as the MR
cache shouldn't be used at all in that mode. As part of this change,
add the missing debugfs cleanup in the error path too.

This change fixes the following debugfs errors:

 bond0: (slave enp8s0f1): Enslaving as a backup interface with an up link
 mlx5_core 0000:08:00.0: lag map: port 1:1 port 2:1
 mlx5_core 0000:08:00.0: shared_fdb:1 mode:queue_affinity
 mlx5_core 0000:08:00.0: Operation mode is single FDB
 debugfs: Directory '2' with parent '/' already present!
...
 debugfs: Directory '22' with parent '/' already present!

Fixes: 73d09b2fe8 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/482a78c54acbcfa1742a0e06a452546428900ffa.1675328463.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-02-06 15:44:10 +02:00
Michael Guralnik
627122280c RDMA/mlx5: Add work to remove temporary entries from the cache
The non-cache mkeys are stored in the cache only to shorten application
restart time. Don't store them longer than needed.

Configure cache entries that store non-cache MRs as temporary entries.
If 30 seconds have passed and no user has reclaimed the temporarily
cached mkeys, an asynchronous work will destroy the mkey entries.
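
A sketch of the reaping pattern (the wq/dwork field names are
illustrative):

  /* Re-arm a delayed reaper whenever a temporary entry holds mkeys;
   * if no one reclaims them within the timeout, destroy them. */
  queue_delayed_work(cache->wq, &ent->dwork, msecs_to_jiffies(30 * 1000));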

Link: https://lore.kernel.org/r/20230125222807.6921-7-michaelgur@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-01-27 13:15:23 -04:00
Michael Guralnik
dd1b913fb0 RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow
Currently, when deregistering an MR, if the mkey doesn't belong to a
cache entry, it will be destroyed. As a result, the restart of
applications with many non-cached mkeys is not efficient since all the
mkeys are destroyed and then recreated. This process takes a long time
(for 100,000 MRs, it is ~20 seconds for dereg and ~28 seconds for
re-reg).

To shorten the restart runtime, insert all cacheable mkeys to the cache.
If there is no fitting entry to the mkey properties, create a temporary
entry that fits it.

After a predetermined timeout, the cache entries will shrink to the
initial high limit.

The mkeys will still be in the cache when consuming them again after an
application restart. Therefore, the registration will be much faster
(for 100,000 MRs, it is ~4 seconds for dereg and ~5 seconds for re-reg).

The temporary cache entries created to store the non-cache mkeys are not
exposed through sysfs like the default cache entries.

Link: https://lore.kernel.org/r/20230125222807.6921-6-michaelgur@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-01-27 13:04:10 -04:00
Michael Guralnik
73d09b2fe8 RDMA/mlx5: Introduce mlx5r_cache_rb_key
Switch from using the mkey order to using the new struct as the key to
the RB tree of cache entries.

The key consists of all the mkey properties that UMR operations can't
modify. This key is used to define the cache entries and to search for
and create cache mkeys.
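
A sketch of the shape of such a key (the field list is approximate):

  struct mlx5r_cache_rb_key {
          u8 ats:1;
          unsigned int access_mode;
          unsigned int access_flags;
          unsigned int ndescs;
  };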

Link: https://lore.kernel.org/r/20230125222807.6921-5-michaelgur@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-01-27 13:04:09 -04:00
Michael Guralnik
b958451783 RDMA/mlx5: Change the cache structure to an RB-tree
Currently, the cache structure is a static linear array, so its size
is limited to the number of entries in it and is not expandable. The
entries are dedicated to mkeys of size 2^x and no access_flags. Mkeys
with different properties are not cacheable.

In this patch, we change the cache structure to an RB-tree. This will
allow extending the cache to support more entries with different mkey
properties.

Link: https://lore.kernel.org/r/20230125222807.6921-4-michaelgur@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-01-27 13:04:09 -04:00
Aharon Landau
18b1746bdd RDMA/mlx5: Remove implicit ODP cache entry
Implicit ODP mkey doesn't have unique properties. It shares the same
properties as the order 18 cache entry. There is no need to devote a
special entry for that.

Link: https://lore.kernel.org/r/20230125222807.6921-3-michaelgur@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-01-27 13:04:09 -04:00
Aharon Landau
a2a88b8e22 RDMA/mlx5: Don't keep umrable 'page_shift' in cache entries
mkc.log_page_size can be changed using UMR. Therefore, don't treat it as a
cache entry property.

Remove it from struct mlx5_cache_ent.

All cache mkeys will be created with default PAGE_SHIFT, and updated with
the needed page_shift using UMR when passing them to a user.

Link: https://lore.kernel.org/r/20230125222807.6921-2-michaelgur@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-01-27 13:04:09 -04:00
Patrisious Haddad
8067fd8b26 RDMA/mlx5: Print error syndrome in case of fatal QP errors
Print syndromes in case of fatal QP events. This is helpful for upper
level debugging, as there may be no CQEs.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/edc794f622a33e4ee12d7f5d218d1a59aa7c6af5.1672821186.git.leonro@nvidia.com
Reviewed-by: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-01-15 12:23:15 +02:00
Mark Zhang
312b8f79eb RDMA/mlx: Calling qp event handler in workqueue context
Move the call of the QP event handler from atomic to workqueue context,
so that the handler is able to block. This is needed by the following
patches.
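
A hedged sketch of the handoff pattern (the struct, handle_qp_event()
and qp_event_wq are hypothetical names):

  struct mlx5_qp_event_work {
          struct work_struct work;
          struct mlx5_core_qp *qp;
          int type;
  };

  static void qp_event_work_fn(struct work_struct *work)
  {
          struct mlx5_qp_event_work *w =
                  container_of(work, struct mlx5_qp_event_work, work);

          handle_qp_event(w->qp, w->type);        /* may block now */
          kfree(w);
  }

  /* In the atomic notifier: defer instead of calling the handler. */
  struct mlx5_qp_event_work *w = kmalloc(sizeof(*w), GFP_ATOMIC);

  if (w) {
          w->qp = qp;
          w->type = type;
          INIT_WORK(&w->work, qp_event_work_fn);
          queue_work(qp_event_wq, &w->work);
  }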

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://lore.kernel.org/r/0cd17b8331e445f03942f4bb28d447f24ac5669d.1672821186.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-01-15 12:23:10 +02:00
Maor Gottlieb
8de8482fe5 RDMA/mlx5: Fix validation of max_rd_atomic caps for DC
Currently, when modifying DC, we validate the max_rd_atomic user
attribute against the RC caps instead of validating it against the DC
caps. RC and DC QP types have different device limitations.

This can cause userspace created DC QPs to malfunction.
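
A sketch of validating against the DC limits (the capability field name
follows mlx5 PRM conventions but is quoted from memory):

  if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC &&
      attr->max_rd_atomic >
              (1 << MLX5_CAP_GEN(dev->mdev, log_max_ra_req_dc)))
          return -EINVAL;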

Fixes: c32a4f296e ("IB/mlx5: Add support for DC Initiator QP")
Link: https://lore.kernel.org/r/0c5aee72cea188c3bb770f4207cce7abc9b6fc74.1672231736.git.leonro@nvidia.com
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-01-01 09:40:35 +02:00
Shay Drory
38b50aa444 RDMA/mlx5: Fix mlx5_ib_get_hw_stats when used for device
Currently, when mlx5_ib_get_hw_stats() is used for a device
(port_num = 0), there is special handling in order to use the correct
counters, but port_num is passed down the stack without any change.
Also, some functions assume that port_num >= 1. As a result, the
following oops can occur.

 BUG: unable to handle page fault for address: ffff89510294f1a8
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0
 Oops: 0002 [#1] SMP
 CPU: 8 PID: 1382 Comm: devlink Tainted: G W          6.1.0-rc4_for_upstream_base_2022_11_10_16_12 #1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:_raw_spin_lock+0xc/0x20
 Call Trace:
  <TASK>
  mlx5_ib_get_native_port_mdev+0x73/0xe0 [mlx5_ib]
  do_get_hw_stats.constprop.0+0x109/0x160 [mlx5_ib]
  mlx5_ib_get_hw_stats+0xad/0x180 [mlx5_ib]
  ib_setup_device_attrs+0xf0/0x290 [ib_core]
  ib_register_device+0x3bb/0x510 [ib_core]
  ? atomic_notifier_chain_register+0x67/0x80
  __mlx5_ib_add+0x2b/0x80 [mlx5_ib]
  mlx5r_probe+0xb8/0x150 [mlx5_ib]
  ? auxiliary_match_id+0x6a/0x90
  auxiliary_bus_probe+0x3c/0x70
  ? driver_sysfs_add+0x6b/0x90
  really_probe+0xcd/0x380
  __driver_probe_device+0x80/0x170
  driver_probe_device+0x1e/0x90
  __device_attach_driver+0x7d/0x100
  ? driver_allows_async_probing+0x60/0x60
  ? driver_allows_async_probing+0x60/0x60
  bus_for_each_drv+0x7b/0xc0
  __device_attach+0xbc/0x200
  bus_probe_device+0x87/0xa0
  device_add+0x404/0x940
  ? dev_set_name+0x53/0x70
  __auxiliary_device_add+0x43/0x60
  add_adev+0x99/0xe0 [mlx5_core]
  mlx5_attach_device+0xc8/0x120 [mlx5_core]
  mlx5_load_one_devl_locked+0xb2/0xe0 [mlx5_core]
  devlink_reload+0x133/0x250
  devlink_nl_cmd_reload+0x480/0x570
  ? devlink_nl_pre_doit+0x44/0x2b0
  genl_family_rcv_msg_doit.isra.0+0xc2/0x110
  genl_rcv_msg+0x180/0x2b0
  ? devlink_nl_cmd_region_read_dumpit+0x540/0x540
  ? devlink_reload+0x250/0x250
  ? devlink_put+0x50/0x50
  ? genl_family_rcv_msg_doit.isra.0+0x110/0x110
  netlink_rcv_skb+0x54/0x100
  genl_rcv+0x24/0x40
  netlink_unicast+0x1f6/0x2c0
  netlink_sendmsg+0x237/0x490
  sock_sendmsg+0x33/0x40
  __sys_sendto+0x103/0x160
  ? handle_mm_fault+0x10e/0x290
  ? do_user_addr_fault+0x1c0/0x5f0
  __x64_sys_sendto+0x25/0x30
  do_syscall_64+0x3d/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

Fix it by setting port_num to 1 in order to get the device status, and
remove the unused variable.

Fixes: aac4492ef2 ("IB/mlx5: Update counter implementation for dual port RoCE")
Link: https://lore.kernel.org/r/98b82994c3cd3fa593b8a75ed3f3901e208beb0f.1672231736.git.leonro@nvidia.com
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-01-01 09:40:01 +02:00
Linus Torvalds
ab425febda v6.2 merge window pull request
Usual size of updates, a new driver, and most of the bulk focusing on rxe:
 
 - Usual typos, style, and language updates
 
 - Driver updates for mlx5, irdma, siw, rtrs, srp, hfi1, hns, erdma, mlx4, srp
 
 - Lots of RXE updates
   * Improve reply error handling for bad MR operations
   * Code tidying
   * Debug printing uses common loggers
   * Remove half implemented RD related stuff
   * Support IBA's recently defined Atomic Write and Flush operations
 
 - erdma support for atomic operations
 
 - New driver "mana" for Ethernet HW available in Azure VMs. This driver
   only supports DPDK
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCY5eIggAKCRCFwuHvBreF
 YeX7AP9+l5Y9J48OmK7y/YgADNo9g05agXp3E8EuUDmBU+PREgEAigdWaJVf2oea
 IctVja0ApLW5W+wsFt8Qh+V4PMiYTAM=
 =Q5V+
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Usual size of updates, a new driver, and most of the bulk focusing on
  rxe:

   - Usual typos, style, and language updates

   - Driver updates for mlx5, irdma, siw, rtrs, srp, hfi1, hns, erdma,
     mlx4, srp

   - Lots of RXE updates:
      * Improve reply error handling for bad MR operations
      * Code tidying
      * Debug printing uses common loggers
      * Remove half implemented RD related stuff
      * Support IBA's recently defined Atomic Write and Flush operations

   - erdma support for atomic operations

   - New driver 'mana' for Ethernet HW available in Azure VMs. This
     driver only supports DPDK"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (122 commits)
  IB/IPoIB: Fix queue count inconsistency for PKEY child interfaces
  RDMA: Add missed netdev_put() for the netdevice_tracker
  RDMA/rxe: Enable RDMA FLUSH capability for rxe device
  RDMA/cm: Make QP FLUSHABLE for supported device
  RDMA/rxe: Implement flush completion
  RDMA/rxe: Implement flush execution in responder side
  RDMA/rxe: Implement RC RDMA FLUSH service in requester side
  RDMA/rxe: Extend rxe packet format to support flush
  RDMA/rxe: Allow registering persistent flag for pmem MR only
  RDMA/rxe: Extend rxe user ABI to support flush
  RDMA: Extend RDMA kernel verbs ABI to support flush
  RDMA: Extend RDMA user ABI to support flush
  RDMA/rxe: Fix incorrect responder length checking
  RDMA/rxe: Fix oops with zero length reads
  RDMA/mlx5: Remove not-used IB_FLOW_SPEC_IB define
  RDMA/hns: Fix XRC caps on HIP08
  RDMA/hns: Fix error code of CMD
  RDMA/hns: Fix page size cap from firmware
  RDMA/hns: Fix PBL page MTR find
  RDMA/hns: Fix AH attr queried by query_qp
  ...
2022-12-14 09:27:13 -08:00
Zhu Yanjun
6cfe7bd0df RDMA/mlx5: Remove not-used IB_FLOW_SPEC_IB define
IB_FLOW_SPEC_IB is not used in mlx5 and can be deleted.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://lore.kernel.org/r/20221208101954.687960-1-yanjun.zhu@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-12-08 11:33:18 +02:00
Li Zhijian
6978837ce4 RDMA/mlx5: no need to kfree NULL pointer
Jumping to the 'free' label, which kfree()s 'in', is not needed even
though kfree(NULL) is safe. Return the error code directly to simplify
the code.

1973 free:
1974         kfree(in);
1975         return err;

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Link: https://lore.kernel.org/r/20221203033714.25870-1-lizhijian@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-12-04 15:34:52 +02:00
Tariq Toukan
02648b4b09 net/mlx5: Generalize name of UMR alignment definition
Per the device spec, MLX5_UMR_MTT_ALIGNMENT is good not only for UMR MTT
entries, but for all other entries as well, like KLMs and KSMs.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-29 21:09:43 -08:00
Arumugam Kolappan
abef378c43 RDMA/mlx5: Change debug log level for remote access error syndromes
The mlx5 driver dumps the entire CQE buffer by default for a few syndromes.
Some syndromes are expected due to the application behavior [ex:
MLX5_CQE_SYNDROME_REMOTE_ACCESS_ERR, MLX5_CQE_SYNDROME_REMOTE_OP_ERR and
MLX5_CQE_SYNDROME_LOCAL_PROT_ERR]. Hence, for these syndromes, the patch
converts the log level from KERN_WARNING to KERN_DEBUG. This enables the
application to get the CQE buffer dump by changing to KERN_DEBUG level
as and when needed.

Suggested-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Arumugam Kolappan <aru.kolappan@oracle.com>
Link: https://lore.kernel.org/r/1667287664-19377-1-git-send-email-aru.kolappan@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-11-06 20:31:35 +02:00
Linus Torvalds
e08466a7c0 v6.1 merge window pull request
- Small bug fixes in mlx5, efa, rxe, hns, irdma, erdma, siw
 
 - rtrs tracing improvements
 
 - Code improvements: strscpy conversion, unused parameter, spelling
   mistakes, unused variables, flex arrays
 
 - restrack device details report for hns
 
 - Simplify struct device initialization in SRP
 
 - Eliminate the never-used service_mask support in IB CM
 
 - Make rxe not print to the console for some kinds of network packets
 
 - Asymmetric paths and router support in the CM through netlink messages
 
 - DMABUF importer support for mlx5devx umem's
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCYz9bgAAKCRCFwuHvBreF
 YevoAP47J/svlOFlFtBhTVF79Ddtf+MMeqeVvLoHHQbCU5rUpAD+KUpTXAvwNcM9
 dHwNXz9ctanP5397qusH0rxOKPo/EA4=
 =lgSv
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Not a big list of changes this cycle, mostly small things. The new
  MANA rdma driver should come next cycle along with a bunch of work on
  rxe.

  Summary:

   - Small bug fixes in mlx5, efa, rxe, hns, irdma, erdma, siw

   - rtrs tracing improvements

   - Code improvements: strscpy conversion, unused parameter, spelling
     mistakes, unused variables, flex arrays

   - restrack device details report for hns

   - Simplify struct device initialization in SRP

   - Eliminate the never-used service_mask support in IB CM

   - Make rxe not print to the console for some kinds of network packets

   - Asymmetric paths and router support in the CM through netlink
     messages

   - DMABUF importer support for mlx5devx umem's"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (84 commits)
  RDMA/rxe: Remove error/warning messages from packet receiver path
  RDMA/usnic: fix set-but-not-unused variable 'flags' warning
  IB/hfi1: Use skb_put_data() instead of skb_put/memcpy pair
  RDMA/hns: Unified Log Printing Style
  RDMA/hns: Replacing magic number with macros in apply_func_caps()
  RDMA/hns: Repacing 'dseg_len' by macros in fill_ext_sge_inl_data()
  RDMA/hns: Remove redundant 'max_srq_desc_sz' in caps
  RDMA/hns: Remove redundant 'num_mtt_segs' and 'max_extend_sg'
  RDMA/hns: Remove redundant 'phy_addr' in hns_roce_hem_list_find_mtt()
  RDMA/hns: Remove redundant 'use_lowmem' argument from hns_roce_init_hem_table()
  RDMA/hns: Remove redundant 'bt_level' for hem_list_alloc_item()
  RDMA/hns: Remove redundant 'attr_mask' in modify_qp_init_to_init()
  RDMA/hns: Remove unnecessary brackets when getting point
  RDMA/hns: Remove unnecessary braces for single statement blocks
  RDMA/hns: Cleanup for a spelling error of Asynchronous
  IB/rdmavt: Add __init/__exit annotations to module init/exit funcs
  RDMA/rxe: Remove redundant num_sge fields
  RDMA/mlx5: Enable ATS support for MRs and umems
  RDMA/mlx5: Add support for dmabuf to devx umem
  RDMA/core: Add UVERBS_ATTR_RAW_FD
  ...
2022-10-07 12:05:29 -07:00
Jason Gunthorpe
33331a728c Linux 6.0
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmM5/fMeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG+Y0H/jG0HQjUHfIOGjPQ
 T33eaF5POoZKzZmsTNITbtya7B3/OTZZRGdSq9B8t5tElyPnZrdIfaXds17mCI8y
 5DJUEQGdv6fqRttYLGKf6rk1kzABhhaTS8n9BgDFEsmvdqwG4AV6dLQr3HL09gTV
 Lu+Is86RPwmpgH0Z9u7DyFCF8yLjefyu0vl6eFm/SXjCE8gQM/LZQSi9mv5/YDxa
 MVKlsZKIkQm6P8a6r8wbGedKYBped4+4gYedsf/IcS0lvKdLIs/P7YgR5wKklSbM
 tqECvBOmKq1Fwj/oxH+fx0KLX/ZD1xxQwoQd+a9DOSo+BuPBt6KGojYT9gQRyFJp
 R7gyUCo=
 =2UOD
 -----END PGP SIGNATURE-----

Merge tag 'v6.0' into rdma.git for-next

Trivial merge conflicts against rdma.git for-rc resolved matching
linux-next:
            drivers/infiniband/hw/hns/hns_roce_hw_v2.c
            drivers/infiniband/hw/hns/hns_roce_main.c

https://lore.kernel.org/r/20220929124005.105149-1-broonie@kernel.org

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-10-06 19:48:45 -03:00
Jakub Kicinski
0d5bfebf74 Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Saeed Mahameed says:

====================
updates from mlx5-next 2022-09-24

Updates from mlx5-next, including[1]:

1) HW definitions and support for NPPS clock settings.

2) various cleanups

3) Enable hash mode by default for all NICs

4) page tracker and advanced virtualization HW definitions for vfio

[1] https://lore.kernel.org/netdev/20220907233636.388475-1-saeed@kernel.org/

* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Remove from FPGA IFC file not-needed definitions
  net/mlx5: Remove unused structs
  net/mlx5: Remove unused functions
  net/mlx5: detect and enable bypass port select flow table
  net/mlx5: Lag, enable hash mode by default for all NICs
  net/mlx5: Lag, set active ports if support bypass port select flow table
  RDMA/mlx5: Don't set tx affinity when lag is in hash mode
  net/mlx5: add IFC bits for bypassing port select flow table
  net/mlx5: Add support for NPPS with real time mode
  net/mlx5: Expose NPPS related registers
  net/mlx5: Query ADV_VIRTUALIZATION capabilities
  net/mlx5: Introduce ifc bits for page tracker
  RDMA/mlx5: Move function mlx5_core_query_ib_ppcnt() to mlx5_ib
====================

Link: https://lore.kernel.org/all/20220927201906.234015-1-saeed@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28 19:20:49 -07:00
Liu, Changcheng
a83bb5df2a RDMA/mlx5: Don't set tx affinity when lag is in hash mode
In hash mode, without setting tx affinity explicitly, the port select
flow table decides which port is used for the traffic.
If port_select_flow_table_bypass capability is supported and tx affinity
is set explicitly for QP/TIS, they will be added into the explicit affinity
table in FW to check which port is used for the traffic.
1. The overloaded explicit affinity table may affect performance.
   To avoid this, do not set tx affinity explicitly by default.
2. The packets of the same flow need to be transmitted on the same port.
   Because the packets of the same flow use different QPs in slow & fast
   path, tx affinity shouldn't be set explicitly for these QPs.

Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-09-27 12:50:27 -07:00
Jason Gunthorpe
72b2f7608a RDMA/mlx5: Enable ATS support for MRs and umems
For mlx5, if ATS is enabled in the PCI config, the device will use ATS
requests for only certain DMA operations. This has to be opted into by
the SW side based on the mkey or umem settings.

ATS slows down the PCI performance, so it should only be set in cases when
it is needed. All of these cases revolve around optimizing PCI P2P
transfers and avoiding bad cases where the bus just doesn't work.

Link: https://lore.kernel.org/r/4-v1-bd147097458e+ede-umem_dmabuf_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-09-27 10:15:24 -03:00
Jason Gunthorpe
9af859c58d RDMA/mlx5: Add support for dmabuf to devx umem
This is modeled after the similar EFA enablement in commit
66f4817b57 ("RDMA/efa: Add support for dmabuf memory regions").

Like EFA there is no support for revocation so we simply call the
ib_umem_dmabuf_get_pinned() to obtain a umem instead of the normal
ib_umem_get().  Everything else stays the same.
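
A sketch of the selection (the attribute plumbing is illustrative; the
two umem getters are the core APIs named above):

  if (dmabuf_fd >= 0) {
          struct ib_umem_dmabuf *umem_dmabuf;

          umem_dmabuf = ib_umem_dmabuf_get_pinned(&dev->ib_dev, addr,
                                                  size, dmabuf_fd, access);
          if (IS_ERR(umem_dmabuf))
                  return PTR_ERR(umem_dmabuf);
          obj->umem = &umem_dmabuf->umem;
  } else {
          obj->umem = ib_umem_get(&dev->ib_dev, addr, size, access);
  }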

Link: https://lore.kernel.org/r/3-v1-bd147097458e+ede-umem_dmabuf_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-09-27 10:15:24 -03:00
Leon Romanovsky
70d1b1a7f8 Merge branch 'mlx5-vfio' into mlx5-next
Merge net/mlx5 dependencies for device DMA logging.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2022-09-27 12:56:24 +03:00
Daisuke Matsuda
e866025b3b RDMA/mlx5: Remove duplicate assignment in umr_rereg_pas()
The same value is assigned to 'mr->ibmr.length'. Remove the redundant one.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://lore.kernel.org/r/20220908083058.3993700-1-matsuda-daisuke@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-09-08 11:53:41 +03:00
Leon Romanovsky
d324a46be3 Merge branch 'mlx5-next' into wip/leon-for-next
Perform merge of Mellanox shared branch.

* mlx5-next:
  RDMA/mlx5: Move function mlx5_core_query_ib_ppcnt() to mlx5_ib
2022-09-05 15:09:55 +03:00
Chris Mi
8a2dd123f1 RDMA/mlx5: Move function mlx5_core_query_ib_ppcnt() to mlx5_ib
This patch doesn't change any functionality, but moves one function
to mlx5_ib because it is not used by mlx5_core.

The actual fix is in the next patch.

Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Chris Mi <cmi@nvidia.com>
Link: https://lore.kernel.org/r/fd47b9138412bd94ed30f838026cbb4cf3878150.1661763871.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-09-05 15:09:03 +03:00
Bodong Wang
b021d82e25 IB/mlx5: Support querying eswitch functions from DEVX
Querying eswitch functions returns information about the external host
PF (if it exists). It can be used to check if DEVX is running on the ECPF.

Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Link: https://lore.kernel.org/r/4265925178ab3224dc1d3e3784bb312d808edca5.1661763785.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-09-05 14:54:39 +03:00
Maor Gottlieb
9b7d4be967 RDMA/mlx5: Fix UMR cleanup on error flow of driver init
The cited commit removed the checks from the UMR cleanup flow that
verified whether the resources were created. This could lead to a
null-ptr-deref in case of a failure in the mlx5_ib_stage_ib_reg_init
stage.

Fix it by adding a new state to the UMR machinery that records whether
the resources were created, and check it in the UMR cleanup flow before
destroying the resources.
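
A sketch of the guarded cleanup (the state field and enum value are
assumed names; the umrc members mirror the usual pd/cq/qp trio):

  void mlx5r_umr_resource_cleanup(struct mlx5_ib_dev *dev)
  {
          if (dev->umrc.state == MLX5R_UMR_STATE_UNINIT)
                  return;         /* init never ran; nothing to free */

          ib_destroy_qp(dev->umrc.qp);
          ib_free_cq(dev->umrc.cq);
          ib_dealloc_pd(dev->umrc.pd);
  }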

Fixes: 04876c12c1 ("RDMA/mlx5: Move init and cleanup of UMR to umr.c")
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://lore.kernel.org/r/4cfa61386cf202e9ce330e8d228ce3b25a36326e.1661763459.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-09-05 14:49:57 +03:00
Chris Mi
74b30b3ad5 RDMA/mlx5: Set local port to one when accessing counters
When accessing the Ports Performance Counters Register (PPCNT), the
local port must be one if it is a Function-Per-Port HCA, i.e., when
HCA_CAP.num_ports is 1.

The offending patch can change the local port to other values
when accessing PPCNT after enabling switchdev mode. The following
syndrome will be printed:

 # cat /sys/class/infiniband/rdmap4s0f0/ports/2/counters/*
 # dmesg
 mlx5_core 0000:04:00.0: mlx5_cmd_check:756:(pid 12450): ACCESS_REG(0x805) op_mod(0x1) failed, status bad parameter(0x3), syndrome (0x1e5585)

Fix it by setting local port to one for Function-Per-Port HCA.

Fixes: 210b1f7807 ("IB/mlx5: When not in dual port RoCE mode, use provided port as native")
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Chris Mi <cmi@nvidia.com>
Link: https://lore.kernel.org/r/6c5086c295c76211169e58dbd610fb0402360bab.1661763459.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-09-05 14:49:53 +03:00
Maher Sanalla
9ca05b0f27 RDMA/mlx5: Rely on RoCE fw cap instead of devlink when setting profile
When the RDMA auxiliary driver probes, it sets its profile based on the
devlink driverinit value. The latter might not be in sync with FW yet
(in case devlink reload was not performed), thus causing a mismatch
between the RDMA driver and FW. This results in the following FW
syndrome when the RDMA driver tries to adjust the RoCE state, which
fails the probe:

"0xC1F678 | modify_nic_vport_context: roce_en set on a vport that
doesn't support roce"

To prevent this, select the PF profile based on FW RoCE capability
instead of relying on devlink driverinit value.
To provide backward compatibility of the RoCE disable feature on older
FWs where roce_rw is not set (FW RoCE capability is read-only), keep
the current behavior, e.g., rely on the devlink driverinit value.

Fixes: fbfa97b4d7 ("net/mlx5: Disable roce at HCA level")
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://lore.kernel.org/r/cb34ce9a1df4a24c135cb804db87f7d2418bd6cc.1661763459.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-09-05 14:49:46 +03:00
Daisuke Matsuda
ca7ef7adad IB/mlx5: Remove duplicate header inclusion related to ODP
rdma/ib_umem.h and rdma/ib_verbs.h are included by rdma/ib_umem_odp.h.
This patch removes the redundant entries.

Link: https://lore.kernel.org/r/20220823025131.862811-1-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-08-23 11:22:13 +03:00
Mark Bloch
4b83c3caf2 RDMA/mlx5: Use the proper number of ports
The cited commit allowed the driver to operate over HCAs that have
4 physical ports. Use the number of ports of the RDMA device in the for
loop instead of using the struct size.

Fixes: 4cd14d44b1 ("net/mlx5: Support devices with more than 2 ports")
Link: https://lore.kernel.org/r/a54a56c2ede16044a29d119209b35189c662ac72.1659944855.git.leonro@nvidia.com
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-08-16 16:14:19 +03:00
Aharon Landau
13ad1125b9 RDMA/mlx5: Don't compare mkey tags in DEVX indirect mkey
According to the IB spec:
If the CI supports the Base Memory Management Extensions defined in this
specification, the L_Key format must consist of:
 - a 24 bit index in the most significant bits of the R_Key, and
 - an 8 bit key in the least significant bits of the R_Key.
Through a successful Allocate L_Key verb invocation, the CI must let the
consumer own the key portion of the returned R_Key.

Therefore, when creating a mkey using DEVX, the consumer is allowed to
change the key part. The kernel should compare only the index part of a
R_Key to determine equality with another R_Key.
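
A sketch of an index-only comparison (the mask matches the 24/8 split
described above):

  /* Ignore the consumer-owned low 8 key bits; compare the index. */
  static bool mkey_index_equal(u32 key1, u32 key2)
  {
          return (key1 & 0xffffff00) == (key2 & 0xffffff00);
  }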

Adding capability in order not to break backward compatibility.

Fixes: 534fd7aac5 ("IB/mlx5: Manage indirection mkey upon DEVX flow for ODP")
Link: https://lore.kernel.org/r/3d669aacea85a3a15c3b3b953b3eaba3f80ef9be.1659255945.git.leonro@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-08-16 15:21:52 +03:00
Jason Gunthorpe
19d6214ad6 IB/mlx5: Call io_stop_wc() after writing to WC MMIO
This new function is defined only on ARM and serves to guarantee a barrier
in the WC operation. The barrier means that another run of this loop will
not combine with the stores this loop created.

On x86 this is happening implicitly because of the spin_unlock().
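
A sketch of the intended use (loop and buffer names are illustrative;
io_stop_wc() is the real barrier):

  for (i = 0; i < wqe_size; i += 8)
          __raw_writeq(*wqe++, bf_addr + i);      /* WC stores */
  io_stop_wc();   /* later stores won't combine with the above */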

Link: https://lore.kernel.org/r/0-v1-c5dade92f363+11-mlx5_io_stop_wc_jgg@nvidia.com
Suggested-by: Pavel Shamis <Pavel.Shamis@arm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-08-16 15:08:31 +03:00
Maor Gottlieb
c9776457bd RDMA/mlx5: Add missing check for return value in get namespace flow
Add a missing check for the return value when calling
mlx5_ib_ft_type_to_namespace, even though it can't really fail in this
specific call.

Fixes: 52438be441 ("RDMA/mlx5: Allow inserting a steering rule to the FDB")
Link: https://lore.kernel.org/r/7b9ceda217d9368a51dc47a46b769bad4af9ac92.1659256069.git.leonro@nvidia.com
Reviewed-by: Itay Aveksis <itayav@nvidia.com>
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-08-02 13:57:27 -03:00
Aharon Landau
0113780870 RDMA/mlx5: Rename the mkey cache variables and functions
After replacing the MR cache with an Mkey cache, rename the variables and
functions to fit the new meaning.

Link: https://lore.kernel.org/r/20220726071911.122765-6-michaelgur@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-07-27 14:45:48 -03:00
Aharon Landau
6b75338695 RDMA/mlx5: Store in the cache mkeys instead of mrs
Currently, the driver stores mlx5_ib_mr struct in the cache entries,
although the only use of the cached MR is the mkey. Store only the mkey in
the cache.

Link: https://lore.kernel.org/r/20220726071911.122765-5-michaelgur@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-07-27 14:45:48 -03:00
Aharon Landau
19591f134c RDMA/mlx5: Store the number of in_use cache mkeys instead of total_mrs
total_mrs is used only to calculate the number of mkeys currently in
use. To simplify things, replace it with a new member called "in_use" and
directly store the number of mkeys currently in use.

Link: https://lore.kernel.org/r/20220726071911.122765-4-michaelgur@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-07-27 14:45:48 -03:00