Commit Graph

1196 Commits

Author SHA1 Message Date
Zhu Yanjun
edc4ef0e01 RDMA/rxe: Fix the warning "__rxe_cleanup+0x12c/0x170 [rdma_rxe]"
The Call Trace is as below:
"
  <TASK>
  ? show_regs.cold+0x1a/0x1f
  ? __rxe_cleanup+0x12c/0x170 [rdma_rxe]
  ? __warn+0x84/0xd0
  ? __rxe_cleanup+0x12c/0x170 [rdma_rxe]
  ? report_bug+0x105/0x180
  ? handle_bug+0x46/0x80
  ? exc_invalid_op+0x19/0x70
  ? asm_exc_invalid_op+0x1b/0x20
  ? __rxe_cleanup+0x12c/0x170 [rdma_rxe]
  ? __rxe_cleanup+0x124/0x170 [rdma_rxe]
  rxe_destroy_qp.cold+0x24/0x29 [rdma_rxe]
  ib_destroy_qp_user+0x118/0x190 [ib_core]
  rdma_destroy_qp.cold+0x43/0x5e [rdma_cm]
  rtrs_cq_qp_destroy.cold+0x1d/0x2b [rtrs_core]
  rtrs_srv_close_work.cold+0x1b/0x31 [rtrs_server]
  process_one_work+0x21d/0x3f0
  worker_thread+0x4a/0x3c0
  ? process_one_work+0x3f0/0x3f0
  kthread+0xf0/0x120
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x22/0x30
  </TASK>
"
When too many rdma resources are allocated, rxe needs more time to
handle these rdma resources. Sometimes with the current timeout, rxe
can not release the rdma resources correctly.

Compared with other rdma drivers, a bigger timeout is used.

Fixes: 215d0a755e ("RDMA/rxe: Stop lookup of partially built objects")
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20250110160927.55014-1-yanjun.zhu@linux.dev
Tested-by: Joe Klein <joe.klein812@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-01-14 06:43:53 -05:00
Yuyu Li
62f5d59d1a RDMA/siw: Remove deliver net device event
Since the netdev events of link status is now handled in ib_core,
remove the related code in drivers.

In addition, remove sdev->state as it is only used in siw_query_port(),
and it can be replaced by ib_get_curr_port_state().

Signed-off-by: Yuyu Li <liyuyu6@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-12-25 03:39:46 -05:00
Yuyu Li
958152336c RDMA/rxe: Remove deliver net device event
Since the netdev events of link status is now handled in ib_core,
remove the related code in drivers.

In addition, remove the setting of port->attr.state in rxe_port_up()
and rxe_port_down(), as it is only used in rxe_query_port(), and it
can be replaced by ib_get_curr_port_state().

Signed-off-by: Yuyu Li <liyuyu6@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-12-25 02:35:47 -05:00
Zhu Yanjun
2ac5415022 RDMA/rxe: Remove the direct link to net_device
The similar patch in siw is in the link:
https://git.kernel.org/rdma/rdma/c/16b87037b48889

This problem also occurred in RXE. The following analyze this problem.
In the following Call Traces:
"
BUG: KASAN: slab-use-after-free in dev_get_flags+0x188/0x1d0 net/core/dev.c:8782
Read of size 4 at addr ffff8880554640b0 by task kworker/1:4/5295

CPU: 1 UID: 0 PID: 5295 Comm: kworker/1:4 Not tainted
6.12.0-rc3-syzkaller-00399-g9197b73fd7bb #0
Hardware name: Google Compute Engine/Google Compute Engine,
BIOS Google 09/13/2024
Workqueue: infiniband ib_cache_event_task
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:377 [inline]
 print_report+0x169/0x550 mm/kasan/report.c:488
 kasan_report+0x143/0x180 mm/kasan/report.c:601
 dev_get_flags+0x188/0x1d0 net/core/dev.c:8782
 rxe_query_port+0x12d/0x260 drivers/infiniband/sw/rxe/rxe_verbs.c:60
 __ib_query_port drivers/infiniband/core/device.c:2111 [inline]
 ib_query_port+0x168/0x7d0 drivers/infiniband/core/device.c:2143
 ib_cache_update+0x1a9/0xb80 drivers/infiniband/core/cache.c:1494
 ib_cache_event_task+0xf3/0x1e0 drivers/infiniband/core/cache.c:1568
 process_one_work kernel/workqueue.c:3229 [inline]
 process_scheduled_works+0xa65/0x1850 kernel/workqueue.c:3310
 worker_thread+0x870/0xd30 kernel/workqueue.c:3391
 kthread+0x2f2/0x390 kernel/kthread.c:389
 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 </TASK>
"

1). In the link [1],

"
 infiniband syz2: set down
"

This means that on 839.350575, the event ib_cache_event_task was sent andi
queued in ib_wq.

2). In the link [1],

"
 team0 (unregistering): Port device team_slave_0 removed
"

It indicates that before 843.251853, the net device should be freed.

3). In the link [1],

"
 BUG: KASAN: slab-use-after-free in dev_get_flags+0x188/0x1d0
"

This means that on 850.559070, this slab-use-after-free problem occurred.

In all, on 839.350575, the event ib_cache_event_task was sent and queued
in ib_wq,

before 843.251853, the net device veth was freed.

on 850.559070, this event was executed, and the mentioned freed net device
was called. Thus, the above call trace occurred.

[1] https://syzkaller.appspot.com/x/log.txt?x=12e7025f980000

Reported-by: syzbot+4b87489410b4efd181bf@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=4b87489410b4efd181bf
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20241220222325.2487767-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-12-24 04:36:40 -05:00
Bernard Metzler
16b87037b4 RDMA/siw: Remove direct link to net_device
Do not manage a per device direct link to net_device. Rely
on associated ib_devices net_device management, not doubling
the effort locally. A badly managed local link to net_device
was causing a 'KASAN: slab-use-after-free' exception during
siw_query_port() call.

Fixes: bdcf26bf9b ("rdma/siw: network and RDMA core interface")
Reported-by: syzbot+4b87489410b4efd181bf@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=4b87489410b4efd181bf
Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Link: https://patch.msgid.link/20241212151848.564872-1-bmt@zurich.ibm.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-12-19 05:18:37 -05:00
zhenwei pi
db03b70969 RDMA/rxe: Fix mismatched max_msg_sz
User mode queries max_msg_sz as 0x800000 by command 'ibv_devinfo -v',
however ibv_post_send/ibv_post_recv has a limit of 2^31. Fix this
mismatched information.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Fixes: f605f26ea1 ("RDMA/rxe: Protect QP state with qp->state_lock")
Fixes: 5bf944f241 ("RDMA/rxe: Add error messages")
Link: https://patch.msgid.link/20241216121953.765331-1-pizhenwei@bytedance.com
Review-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-12-17 09:35:33 -05:00
Linus Torvalds
2a163a4cea RDMA v6.13 merge window pull request
Seveal fixes scattered across the drivers and a few new features:
 
 - Minor updates and bug fixes to hfi1, efa, iopob, bnxt, hns
 
 - Force disassociate the userspace FD when hns does an async reset
 
 - bnxt new features for optimized modify QP to skip certain stayes, CQ
   coalescing, better debug dumping
 
 - mlx5 new data placement ordering feature
 
 - Faster destruction of mlx5 devx HW objects
 
 - Improvements to RDMA CM mad handling
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZz4ENwAKCRCFwuHvBreF
 YQYQAP9R54r5J1Iylg+zqhCc+e/9oveuuZbfLvy/EJiEpmdprQEAgPs1RrB0z7U6
 1xrVStUKNPhGd5XeVVZGkIV0zYv6Tw4=
 =V5xI
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Seveal fixes scattered across the drivers and a few new features:

   - Minor updates and bug fixes to hfi1, efa, iopob, bnxt, hns

   - Force disassociate the userspace FD when hns does an async reset

   - bnxt new features for optimized modify QP to skip certain stayes,
     CQ coalescing, better debug dumping

   - mlx5 new data placement ordering feature

   - Faster destruction of mlx5 devx HW objects

   - Improvements to RDMA CM mad handling"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (51 commits)
  RDMA/bnxt_re: Correct the sequence of device suspend
  RDMA/bnxt_re: Use the default mode of congestion control
  RDMA/bnxt_re: Support different traffic class
  IB/cm: Rework sending DREQ when destroying a cm_id
  IB/cm: Do not hold reference on cm_id unless needed
  IB/cm: Explicitly mark if a response MAD is a retransmission
  RDMA/mlx5: Move events notifier registration to be after device registration
  RDMA/bnxt_re: Cache MSIx info to a local structure
  RDMA/bnxt_re: Refurbish CQ to NQ hash calculation
  RDMA/bnxt_re: Refactor NQ allocation
  RDMA/bnxt_re: Fail probe early when not enough MSI-x vectors are reserved
  RDMA/hns: Fix different dgids mapping to the same dip_idx
  RDMA/bnxt_re: Add set_func_resources support for P5/P7 adapters
  RDMA/bnxt_re: Enhance RoCE SRIOV resource configuration design
  bnxt_en: Add support for RoCE sriov configuration
  RDMA/hns: Fix NULL pointer derefernce in hns_roce_map_mr_sg()
  RDMA/hns: Fix out-of-order issue of requester when setting FENCE
  RDMA/nldev: Add IB device and net device rename events
  RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation
  RDMA/core: Move ib_uverbs_file struct to uverbs_types.h
  ...
2024-11-22 20:03:57 -08:00
Liu Jian
775e6d3c8f RDMA/rxe: Set queue pair cur_qp_state when being queried
Same with commit e375b9c929 ("RDMA/cxgb4: Set queue pair state when
 being queried"). The API for ib_query_qp requires the driver to set
cur_qp_state on return, add the missing set.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Link: https://patch.msgid.link/20241031092019.2138467-1-liujian56@huawei.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-11-04 06:53:06 -05:00
Zhu Yanjun
ea4c990fa9 RDMA/rxe: Fix the qp flush warnings in req
When the qp is in error state, the status of WQEs in the queue should be
set to error. Or else the following will appear.

[  920.617269] WARNING: CPU: 1 PID: 21 at drivers/infiniband/sw/rxe/rxe_comp.c:756 rxe_completer+0x989/0xcc0 [rdma_rxe]
[  920.617744] Modules linked in: rnbd_client(O) rtrs_client(O) rtrs_core(O) rdma_ucm rdma_cm iw_cm ib_cm crc32_generic rdma_rxe ip6_udp_tunnel udp_tunnel ib_uverbs ib_core loop brd null_blk ipv6
[  920.618516] CPU: 1 PID: 21 Comm: ksoftirqd/1 Tainted: G           O       6.1.113-storage+ #65
[  920.618986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[  920.619396] RIP: 0010:rxe_completer+0x989/0xcc0 [rdma_rxe]
[  920.619658] Code: 0f b6 84 24 3a 02 00 00 41 89 84 24 44 04 00 00 e9 2a f7 ff ff 39 ca bb 03 00 00 00 b8 0e 00 00 00 48 0f 45 d8 e9 15 f7 ff ff <0f> 0b e9 cb f8 ff ff 41 bf f5 ff ff ff e9 08 f8 ff ff 49 8d bc 24
[  920.620482] RSP: 0018:ffff97b7c00bbc38 EFLAGS: 00010246
[  920.620817] RAX: 0000000000000000 RBX: 000000000000000c RCX: 0000000000000008
[  920.621183] RDX: ffff960dc396ebc0 RSI: 0000000000005400 RDI: ffff960dc4e2fbac
[  920.621548] RBP: 0000000000000000 R08: 0000000000000001 R09: ffffffffac406450
[  920.621884] R10: ffffffffac4060c0 R11: 0000000000000001 R12: ffff960dc4e2f800
[  920.622254] R13: ffff960dc4e2f928 R14: ffff97b7c029c580 R15: 0000000000000000
[  920.622609] FS:  0000000000000000(0000) GS:ffff960ef7d00000(0000) knlGS:0000000000000000
[  920.622979] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  920.623245] CR2: 00007fa056965e90 CR3: 00000001107f1000 CR4: 00000000000006e0
[  920.623680] Call Trace:
[  920.623815]  <TASK>
[  920.623933]  ? __warn+0x79/0xc0
[  920.624116]  ? rxe_completer+0x989/0xcc0 [rdma_rxe]
[  920.624356]  ? report_bug+0xfb/0x150
[  920.624594]  ? handle_bug+0x3c/0x60
[  920.624796]  ? exc_invalid_op+0x14/0x70
[  920.624976]  ? asm_exc_invalid_op+0x16/0x20
[  920.625203]  ? rxe_completer+0x989/0xcc0 [rdma_rxe]
[  920.625474]  ? rxe_completer+0x329/0xcc0 [rdma_rxe]
[  920.625749]  rxe_do_task+0x80/0x110 [rdma_rxe]
[  920.626037]  rxe_requester+0x625/0xde0 [rdma_rxe]
[  920.626310]  ? rxe_cq_post+0xe2/0x180 [rdma_rxe]
[  920.626583]  ? do_complete+0x18d/0x220 [rdma_rxe]
[  920.626812]  ? rxe_completer+0x1a3/0xcc0 [rdma_rxe]
[  920.627050]  rxe_do_task+0x80/0x110 [rdma_rxe]
[  920.627285]  tasklet_action_common.constprop.0+0xa4/0x120
[  920.627522]  handle_softirqs+0xc2/0x250
[  920.627728]  ? sort_range+0x20/0x20
[  920.627942]  run_ksoftirqd+0x1f/0x30
[  920.628158]  smpboot_thread_fn+0xc7/0x1b0
[  920.628334]  kthread+0xd6/0x100
[  920.628504]  ? kthread_complete_and_exit+0x20/0x20
[  920.628709]  ret_from_fork+0x1f/0x30
[  920.628892]  </TASK>

Fixes: ae720bdb70 ("RDMA/rxe: Generate error completion for error requester QP state")
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20241025152036.121417-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-10-30 14:22:03 +02:00
Showrya M N
4e1e3dd88a RDMA/siw: Add sendpage_ok() check to disable MSG_SPLICE_PAGES
While running ISER over SIW, the initiator machine encounters a warning
from skb_splice_from_iter() indicating that a slab page is being used in
send_page. To address this, it is better to add a sendpage_ok() check
within the driver itself, and if it returns 0, then MSG_SPLICE_PAGES flag
should be disabled before entering the network stack.

A similar issue has been discussed for NVMe in this thread:
https://lore.kernel.org/all/20240530142417.146696-1-ofir.gal@volumez.com/

  WARNING: CPU: 0 PID: 5342 at net/core/skbuff.c:7140 skb_splice_from_iter+0x173/0x320
  Call Trace:
   tcp_sendmsg_locked+0x368/0xe40
   siw_tx_hdt+0x695/0xa40 [siw]
   siw_qp_sq_process+0x102/0xb00 [siw]
   siw_sq_resume+0x39/0x110 [siw]
   siw_run_sq+0x74/0x160 [siw]
   kthread+0xd2/0x100
   ret_from_fork+0x34/0x40
   ret_from_fork_asm+0x1a/0x30

Link: https://patch.msgid.link/r/20241007125835.89942-1-showrya@chelsio.com
Signed-off-by: Showrya M N <showrya@chelsio.com>
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-10-11 13:55:53 -03:00
Shen Lichuan
e012316d83 RDMA/rdmavt: Convert to use ERR_CAST()
As opposed to open-code, using the ERR_CAST macro clearly indicates that
this is a pointer to an error value and a type conversion was performed.

Signed-off-by: Shen Lichuan <shenlichuan@vivo.com>
Link: https://patch.msgid.link/20240828082720.33231-1-shenlichuan@vivo.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-02 10:09:42 +03:00
zhenwei pi
444948ee12 RDMA/rxe: Fix __bth_set_resv6a
__bth_set_resv6a is used to clear BIT [24, 29] of rxe_bth::qpn, the
wrong expression leads other BITs into 1.

Link: https://patch.msgid.link/r/20240822065223.1117056-4-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-23 11:42:38 -03:00
zhenwei pi
938aa9a333 RDMA/rxe: Fix misspelling of 'rmda'
Fix 'rmda' into 'RDMA'.

Link: https://patch.msgid.link/r/20240822065223.1117056-3-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-23 11:42:38 -03:00
zhenwei pi
c87c5f47ff RDMA/rxe: Use sizeof instead of hard code number
Use 'sizeof(union rdma_network_hdr)' instead of hard code GRH length
for GSI and UD.

Link: https://patch.msgid.link/r/20240822065223.1117056-2-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-23 11:42:38 -03:00
Showrya M N
60dc7fcafe RDMA/siw: Remove NETDEV_GOING_DOWN event handler
Toggling link while running NVME-oF over siw hits a kernel panic
due to race condition within siw_handler and ib_destroy_qp().
The IB_EVENT_PORT_ERR event can alone handle destroying qps.
therefore remove unwanted processing in siw.

Suggested-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Showrya M N <showrya@chelsio.com>
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Link: https://lore.kernel.org/r/20240724085428.3813-1-showrya@chelsio.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-07-29 11:43:47 +03:00
Linus Torvalds
3d51520954 RDMA v6.11 merge window
Usual collection of small improvements and fixes:
 
 - Bug fixes and minor improvments in efa, irdma, mlx4, mlx5, rxe, hf1,
   qib, ocrdma
 
 - bnxt_re support for MSN, which is a new retransmit logic
 
 - Initial mana support for RC qps
 
 - Use after free bug and cleanups in iwcm
 
 - Reduce resource usage in mlx5 when RDMA verbs features are not used
 
 - New verb to drain shared recieve queues, similar to normal recieve
   queues. This is necessary to allow ULPs a clean shutdown. Used in the
   iscsi rdma target
 
 - mlx5 support for more than 16 bits of doorbell indexes
 
 - Doorbell moderation support for bnxt_re
 
 - IB multi-plane support for mlx5
 
 - New EFA adaptor PCI IDs
 
 - RDMA_NAME_ASSIGN_TYPE_USER to hint to userspace that it shouldn't rename
   the device
 
 - A collection of hns bugs
 
 - Fix long standing bug in bnxt_re with incorrect endian handling of
   immediate data
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZpfvKQAKCRCFwuHvBreF
 YXomAP46gZpGv5mlMOAXePRuKq6glNZWl3pVuwuycnlmjQcEUQD/dhQbJz0rZKBr
 swuibPo83bFacfXJL7Wxd48m4G3EfgI=
 =1eXu
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Usual collection of small improvements and fixes:

   - Bug fixes and minor improvments in efa, irdma, mlx4, mlx5, rxe,
     hf1, qib, ocrdma

   - bnxt_re support for MSN, which is a new retransmit logic

   - Initial mana support for RC qps

   - Use after free bug and cleanups in iwcm

   - Reduce resource usage in mlx5 when RDMA verbs features are not used

   - New verb to drain shared recieve queues, similar to normal recieve
     queues. This is necessary to allow ULPs a clean shutdown. Used in
     the iscsi rdma target

   - mlx5 support for more than 16 bits of doorbell indexes

   - Doorbell moderation support for bnxt_re

   - IB multi-plane support for mlx5

   - New EFA adaptor PCI IDs

   - RDMA_NAME_ASSIGN_TYPE_USER to hint to userspace that it shouldn't
     rename the device

   - A collection of hns bugs

   - Fix long standing bug in bnxt_re with incorrect endian handling of
     immediate data"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits)
  IB/hfi1: Constify struct flag_table
  RDMA/mana_ib: Set correct device into ib
  bnxt_re: Fix imm_data endianness
  RDMA: Fix netdev tracker in ib_device_set_netdev
  RDMA/hns: Fix mbx timing out before CMD execution is completed
  RDMA/hns: Fix insufficient extend DB for VFs.
  RDMA/hns: Fix undifined behavior caused by invalid max_sge
  RDMA/hns: Fix shift-out-bounds when max_inline_data is 0
  RDMA/hns: Fix missing pagesize and alignment check in FRMR
  RDMA/hns: Fix unmatch exception handling when init eq table fails
  RDMA/hns: Fix soft lockup under heavy CEQE load
  RDMA/hns: Check atomic wr length
  RDMA/ocrdma: Don't inline statistics functions
  RDMA/core: Introduce "name_assign_type" for an IB device
  RDMA/qib: Fix truncation compilation warnings in qib_verbs.c
  RDMA/qib: Fix truncation compilation warnings in qib_init.c
  RDMA/efa: Add EFA 0xefa3 PCI ID
  RDMA/mlx5: Support per-plane port IB counters by querying PPCNT register
  net/mlx5: mlx5_ifc update for accessing ppcnt register of plane ports
  RDMA/mlx5: Add plane index support when querying PTYS registers
  ...
2024-07-19 09:51:33 -07:00
Mark Zhang
af48f95492 RDMA/core: Introduce "name_assign_type" for an IB device
The name_assign_type indicates how the name is provided. Currently
these types are supported:
- RDMA_NAME_ASSIGN_TYPE_UNKNOWN: Unknown or not set;
- RDMA_NAME_ASSIGN_TYPE_USER: Name is provided by the user; The
  user-created sub device, rxe and siw device has this type.

When filling nl device info, it is set in the new attribute
RDMA_NLDEV_ATTR_NAME_ASSIGN_TYPE. User-space tools like udev
"rdma_rename" could check this attribute to determine if this
device needs to be renamed or not.

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/522591bef9a369cc8e5dcb77787e017bffee37fe.1719837610.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-07-04 07:59:53 +03:00
Akiva Goldberger
dd6d7f8574 RDMA: Pass entire uverbs attr bundle to create cq function
Changes the create_cq verb signature by sending the entire uverbs attr
bundle as a parameter. This allows drivers to send driver specific attrs
through ioctl for the create_cq verb and access them in their driver
specific code.

Also adds a new enum value for driver specific ioctl attributes for
methods already supporting UHW.

Link: https://lore.kernel.org/r/ed147343987c0d43fd391c1b2f85e2f425747387.1719512393.git.leon@kernel.org
Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-27 16:28:21 -03:00
Honggang LI
4adcaf969d RDMA/rxe: Don't set BTH_ACK_MASK for UC or UD QPs
BTH_ACK_MASK bit is used to indicate that an acknowledge
(for this packet) should be scheduled by the responder.
Both UC and UD QPs are unacknowledged, so don't set
BTH_ACK_MASK for UC or UD QPs.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Honggang LI <honggangli@163.com>
Link: https://lore.kernel.org/r/20240624020348.494338-1-honggangli@163.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-26 10:53:29 -03:00
Honggang LI
f67ac0061c RDMA/rxe: Fix responder length checking for UD request packets
According to the IBA specification:
If a UD request packet is detected with an invalid length, the request
shall be an invalid request and it shall be silently dropped by
the responder. The responder then waits for a new request packet.

commit 689c5421bf ("RDMA/rxe: Fix incorrect responder length checking")
defers responder length check for UD QPs in function `copy_data`.
But it introduces a regression issue for UD QPs.

When the packet size is too large to fit in the receive buffer.
`copy_data` will return error code -EINVAL. Then `send_data_in`
will return RESPST_ERR_MALFORMED_WQE. UD QP will transfer into
ERROR state.

Fixes: 689c5421bf ("RDMA/rxe: Fix incorrect responder length checking")
Signed-off-by: Honggang LI <honggangli@163.com>
Link: https://lore.kernel.org/r/20240523094617.141148-1-honggangli@163.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-09 11:51:32 +03:00
Honggang LI
03fa18a992 RDMA/rxe: Fix data copy for IB_SEND_INLINE
For RDMA Send and Write with IB_SEND_INLINE, the memory buffers
specified in sge list will be placed inline in the Send Request.

The data should be copied by CPU from the virtual addresses of
corresponding sge list DMA addresses.

Cc: stable@kernel.org
Fixes: 8d7c7c0eeb ("RDMA: Add ib_virt_dma_to_page()")
Signed-off-by: Honggang LI <honggangli@163.com>
Link: https://lore.kernel.org/r/20240516095052.542767-1-honggangli@163.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-30 16:20:18 +03:00
Steven Rostedt (Google)
2c92ca849f tracing/treewide: Remove second parameter of __assign_str()
With the rework of how the __string() handles dynamic strings where it
saves off the source string in field in the helper structure[1], the
assignment of that value to the trace event field is stored in the helper
value and does not need to be passed in again.

This means that with:

  __string(field, mystring)

Which use to be assigned with __assign_str(field, mystring), no longer
needs the second parameter and it is unused. With this, __assign_str()
will now only get a single parameter.

There's over 700 users of __assign_str() and because coccinelle does not
handle the TRACE_EVENT() macro I ended up using the following sed script:

  git grep -l __assign_str | while read a ; do
      sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
      mv /tmp/test-file $a;
  done

I then searched for __assign_str() that did not end with ';' as those
were multi line assignments that the sed script above would fail to catch.

Note, the same updates will need to be done for:

  __assign_str_len()
  __assign_rel_str()
  __assign_rel_str_len()

I tested this with both an allmodconfig and an allyesconfig (build only for both).

[1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/

Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts.
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for
Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>	# xfs
Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-22 20:14:47 -04:00
Linus Torvalds
25f4874662 RDMA v6.10 merge window
Normal set of driver updates and small fixes:
 
 - Small improvements and fixes for erdma, efa, hfi1, bnxt_re
 
 - Fix a UAF crash after module unload on leaking restrack entry
 
 - Continue adding full RDMA support in mana with support for EQs, GID's
   and CQs
 
 - Improvements to the mkey cache in mlx5
 
 - DSCP traffic class support in hns and several bug fixes
 
 - Cap the maximum number of MADs in the receive queue to avoid OOM
 
 - Another batch of rxe bug fixes from large scale testing
 
 - __iowrite64_copy() optimizations for write combining MMIO memory
 
 - Remove NULL checks before dev_put/hold()
 
 - EFA support for receive with immediate
 
 - Fix a recent memleaking regression in a cma error path
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZkeo2gAKCRCFwuHvBreF
 YbuNAQChzGmS4F0JAn5Wj0CDvkZghELqtvzEb92SzqcgdyQafAD/fC7f23LJ4OsO
 1ZIaQEZu7j9DVg5PKFZ7WfdXjGTKqwA=
 =QRXg
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Aside from the usual things this has an arch update for
  __iowrite64_copy() used by the RDMA drivers.

  This API was intended to generate large 64 byte MemWr TLPs on PCI.
  These days most processors had done this by just repeating writel() in
  a loop. S390 and some new ARM64 designs require a special helper to
  get this to generate.

   - Small improvements and fixes for erdma, efa, hfi1, bnxt_re

   - Fix a UAF crash after module unload on leaking restrack entry

   - Continue adding full RDMA support in mana with support for EQs,
     GID's and CQs

   - Improvements to the mkey cache in mlx5

   - DSCP traffic class support in hns and several bug fixes

   - Cap the maximum number of MADs in the receive queue to avoid OOM

   - Another batch of rxe bug fixes from large scale testing

   - __iowrite64_copy() optimizations for write combining MMIO memory

   - Remove NULL checks before dev_put/hold()

   - EFA support for receive with immediate

   - Fix a recent memleaking regression in a cma error path"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (70 commits)
  RDMA/cma: Fix kmemleak in rdma_core observed during blktests nvme/rdma use siw
  RDMA/IPoIB: Fix format truncation compilation errors
  bnxt_re: avoid shift undefined behavior in bnxt_qplib_alloc_init_hwq
  RDMA/efa: Support QP with unsolicited write w/ imm. receive
  IB/hfi1: Remove generic .ndo_get_stats64
  IB/hfi1: Do not use custom stat allocator
  RDMA/hfi1: Use RMW accessors for changing LNKCTL2
  RDMA/mana_ib: implement uapi for creation of rnic cq
  RDMA/mana_ib: boundary check before installing cq callbacks
  RDMA/mana_ib: introduce a helper to remove cq callbacks
  RDMA/mana_ib: create and destroy RNIC cqs
  RDMA/mana_ib: create EQs for RNIC CQs
  RDMA/core: Remove NULL check before dev_{put, hold}
  RDMA/ipoib: Remove NULL check before dev_{put, hold}
  RDMA/mlx5: Remove NULL check before dev_{put, hold}
  RDMA/mlx5: Track DCT, DCI and REG_UMR QPs as diver_detail resources.
  RDMA/core: Add an option to display driver-specific QPs in the rdmatool
  RDMA/efa: Add shutdown notifier
  RDMA/mana_ib: Fix missing ret value
  IB/mlx5: Use __iowrite64_copy() for write combining stores
  ...
2024-05-18 13:04:15 -07:00
Bob Pearson
1a633bdc8f RDMA/rxe: Let destroy qp succeed with stuck packet
In some situations a sent packet may get queued in the NIC longer than
than timeout of a ULP. Currently if this happens the ULP may try to reset
the link by destroying the qp and setting up an alternate connection but
will fail because the rxe driver is waiting for the packet to finish
getting sent and be returned to the skb destructor function where the qp
reference holding things up will be dropped. This patch modifies the way
that the qp is passed to the destructor to pass the qp index and not a qp
pointer.  Then the destructor will attempt to lookup the qp from its index
and if it fails exit early. This requires taking a reference on the struct
sock rather than the qp allowing the qp to be destroyed while the sk is
still around waiting for the packet to finish.

Link: https://lore.kernel.org/r/20240329145513.35381-15-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:55:57 -03:00
Bob Pearson
9cc6290991 RDMA/rxe: Get rid of pkt resend on err
Currently the rxe_driver detects packet drops by ip_local_out() which
occur before the packet is sent on the wire and attempts to resend
them. This is redundant with the usual retry mechanism which covers
packets that get dropped in transit to or from the remote node.

The way this is implemented is not robust since it sets need_req_skb and
waits for the number of local skbs outstanding for this qp to drop below a
low water mark. This is racy since the skb may be sent to the destructor
before the requester can set the need_req_skb flag. This will cause a
deadlock in the send path for that qp.

This patch removes this mechanism since the normal retry path will correct
the error and resend the packet and it makes no difference if the packet
is dropped locally or later.

Link: https://lore.kernel.org/r/20240329145513.35381-14-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:34 -03:00
Bob Pearson
55bec1c440 RDMA/rxe: Make rxe_loopback match rxe_send behavior
The rxe send path currently counts the number of skbs outstanding between
the rxe driver and the ethernet driver to prevent too many packets to
accumulate waiting to send. This patch makes the local loopback path
behave the same way. The loopback path forwards the packets to the receive
path which will eventually call kfree_skb on all packets and drop the qp
references. This makes the loopback path more useful for software testing.

Link: https://lore.kernel.org/r/20240329145513.35381-13-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
8776618dbb RDMA/rxe: Fix incorrect rxe_put in error path
In rxe_send() a ref is taken on the qp to keep it alive until the
kfree_skb() has a chance to call the skb destructor rxe_skb_tx_dtor()
which drops the reference. If the packet has an incorrect protocol the
error path just calls kfree_skb() which will call the destructor which
will drop the ref. Currently the driver also calls rxe_put() which is
incorrect. Additionally since the packets sent to rxe_send() are under the
control of the driver and it only ever produces IPV4 or IPV6 packets the
simplest fix is to remove all the code in this block.

Link: https://lore.kernel.org/r/20240329145513.35381-12-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Fixes: 9eb7f8e44d ("IB/rxe: Move refcounting earlier in rxe_send()")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
23bc06af54 RDMA/rxe: Don't call direct between tasks
Replace calls to rxe_run_task() with rxe_sched_task().  This prevents the
tasks from all running on the same cpu.

This change slightly reduces performance for single qp send and write
benchmarks in loopback mode but greatly improves the performance with
multiple qps because if run task is used all the work tends to be
performed on one cpu. For actual on the wire benchmarks there is no
noticeable performance change.

Link: https://lore.kernel.org/r/20240329145513.35381-11-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
3d807a3ebc RDMA/rxe: Don't call rxe_requester from rxe_completer
Instead of rescheduling rxe_requester from rxe_completer() just extend the
duration of rxe_sender() by one pass. Setting run_requester_again forces
rxe_completer() to return 0 which will cause rxe_sender() to be called at
least one more time.

Link: https://lore.kernel.org/r/20240329145513.35381-10-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
4891f4fed0 RDMA/rxe: Don't schedule rxe_completer from rxe_requester
Now that rxe_completer() is always called serially after rxe_requester()
there is no reason to schedule rxe_completer() from rxe_requester().

Link: https://lore.kernel.org/r/20240329145513.35381-9-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
cd8aaddf0d RDMA/rxe: Remove save/rollback_state in rxe_requester
Now that req.task and comp.task are merged it is no longer necessary to
call save_state() before calling rxe_xmit_pkt() and rollback_state() if
rxe_xmit_pkt() fails. This was done originally to prevent races between
rxe_completer() and rxe_requester() which now cannot happen.

Link: https://lore.kernel.org/r/20240329145513.35381-8-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
67f57892f9 RDMA/rxe: Merge request and complete tasks
Currently the rxe driver has three work queue tasks per qp.  These are the
req.task, comp.task and resp.task which call rxe_requester(),
rxe_completer() and rxe_responder() respectively directly or on work
queues. Each of these subroutines checks to see if there is work to be
performed on the send queue or on the response packet queue or the request
packet queue and will run until there is no work remaining or yield the
cpu and reschedule itself until there is no work remaining.

This commit combines the req.task and comp.task into a single send.task
and renames the resp.task to the recv.task. The combined send.task calls
rxe_requester() and rxe_completer() serially and continues until all work
on both the send queue and the response packet queue are done.

In various benchmarks the performance is either improved or left the
same. At high scale there is a significant reduction in the load on the
cpu.

This is the first step in combining these two tasks. Once they are
serialized cross rescheduling of req.task and comp.task can be more
efficiently handled by just letting the send.task continue to run. This
will be done in the next several patches.

Link: https://lore.kernel.org/r/20240329145513.35381-7-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
ff30e45376 RDMA/rxe: Remove redundant scheduling of rxe_completer
In rxe_post_send_kernel() if the qp is in the error state after posting
the work requests the rxe_completer() task is scheduled.

But, the only way to move the qp into the error state is to call
rxe_qp_error() which also schedules the rxe_completer() task to drain the
queues. Calling it a second time has no effect. This commit removes the
redundant call.

Link: https://lore.kernel.org/r/20240329145513.35381-6-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
b703374837 RDMA/rxe: Allow good work requests to be executed
A previous commit incorrectly added an 'if(!err)' before scheduling the
requester task in rxe_post_send_kernel(). But if there were send wrs
successfully added to the send queue before a bad wr they might never get
executed.

This commit fixes this by scheduling the requester task if any wqes were
successfully posted in rxe_post_send_kernel() in rxe_verbs.c.

Link: https://lore.kernel.org/r/20240329145513.35381-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Fixes: 5bf944f241 ("RDMA/rxe: Add error messages")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:32 -03:00
Bob Pearson
2b23b60973 RDMA/rxe: Fix seg fault in rxe_comp_queue_pkt
In rxe_comp_queue_pkt() an incoming response packet skb is enqueued to the
resp_pkts queue and then a decision is made whether to run the completer
task inline or schedule it. Finally the skb is dereferenced to bump a 'hw'
performance counter. This is wrong because if the completer task is
already running in a separate thread it may have already processed the skb
and freed it which can cause a seg fault.  This has been observed
infrequently in testing at high scale.

This patch fixes this by changing the order of enqueuing the packet until
after the counter is accessed.

Link: https://lore.kernel.org/r/20240329145513.35381-4-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Fixes: 0b1e5b99a4 ("IB/rxe: Add port protocol stats")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:32 -03:00
Zhu Yanjun
dfcdb38b21 RDMA/rxe: Return the correct errno
In the function __rxe_add_to_pool, the function xa_alloc_cyclic is
called. The return value of the function xa_alloc_cyclic is as below:
"
 Return: 0 if the allocation succeeded without wrapping.  1 if the
 allocation succeeded after wrapping, -ENOMEM if memory could not be
 allocated or -EBUSY if there are no free entries in @limit.
"
But now the function __rxe_add_to_pool only returns -EINVAL. All the
returned error value should be returned to the caller.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://lore.kernel.org/r/20240408142142.792413-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-11 14:47:26 +03:00
Yanjun.Zhu
481047d7e8 RDMA/rxe: Fix the problem "mutex_destroy missing"
When a mutex lock is not used any more, the function mutex_destroy
should be called to mark the mutex lock uninitialized.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Yanjun.Zhu <yanjun.zhu@linux.dev>
Link: https://lore.kernel.org/r/20240314065140.27468-1-yanjun.zhu@linux.dev
Reviewed-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01 14:36:41 +03:00
Guoqing Jiang
aafe4cc509 RDMA/rxe: Remove unused 'iova' parameter from rxe_mr_init_user
This one is not needed since commit 954afc5a8f ("RDMA/rxe:
Use members of generic struct in rxe_mr").

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20240202124144.16033-1-guoqing.jiang@linux.dev
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-02-04 13:05:43 +02:00
Li Zhijian
190a7affee RDMA/rxe: Remove rxe_info from rxe_set_mtu
commit 9ac01f434a ("RDMA/rxe: Extend dbg log messages to err and info")
newly added this info. But it did only show null device when
the rdma_rxe is being loaded because dev_name(rxe->ib_dev->dev)
has not yet been assigned at the moment:

"(null): rxe_set_mtu: Set mtu to 1024"

Remove it to silent this message, check the mtu from it backend link
instead if needed.

CC: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Link: https://lore.kernel.org/r/20240109083253.3629967-2-lizhijian@fujitsu.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-01-25 11:49:50 +02:00
Li Zhijian
6482718086 RDMA/rxe: Improve newline in printing messages
Previously rxe_{dbg,info,err}() macros are appended built-in newline,
but some users will add redundant newline sometimes. So remove the
built-in newline for these macros.

In terms of rxe_{dbg,info,err}_xxx() macros, because they don't have
built-in newline, append newline when using them.

CC: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Reviewed-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Link: https://lore.kernel.org/r/20240109083253.3629967-1-lizhijian@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-01-25 11:49:50 +02:00
Guoqing Jiang
b7a2768a1c RDMA/siw: Call orq_get_current if possible
Use orq_get_current() in siw_orq_empty().

Link: https://lore.kernel.org/r/20231203092655.28102-5-guoqing.jiang@linux.dev
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04 20:13:30 -04:00
Guoqing Jiang
0b988c1bee RDMA/siw: Set qp_state in siw_query_qp
Run test_query_rc_qp against siw failed since siw didn't set qp_state
accordingly. To address it, introduce siw_qp_state_to_ib_qp_state
which convert SIW_QP_STATE_IDLE to IB_QPS_INIT which is similar as
in cxgb4.

rdma-core# ./build/bin/run_tests.py --dev siw0 tests.test_qp.QPTest.test_query_rc_qp -v
test_query_rc_qp (tests.test_qp.QPTest)
Queries an RC QP after creation. Verifies that its properties are as ... FAIL

======================================================================
FAIL: test_query_rc_qp (tests.test_qp.QPTest)
Queries an RC QP after creation. Verifies that its properties are as
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gjiang/rdma-core/tests/test_qp.py", line 284, in test_query_rc_qp
    self.query_qp_common_test(e.IBV_QPT_RC)
  File "/home/gjiang/rdma-core/tests/test_qp.py", line 265, in query_qp_common_test
    self.verify_qp_attrs(caps, e.IBV_QPS_INIT, qp_init_attr, qp_attr)
  File "/home/gjiang/rdma-core/tests/test_qp.py", line 239, in verify_qp_attrs
    self.assertEqual(state, attr.qp_state)
AssertionError: <ibv_qp_state.IBV_QPS_INIT: 1> != 0

----------------------------------------------------------------------
Ran 1 test in 0.057s

FAILED (failures=1)

Link: https://lore.kernel.org/r/20231203092655.28102-4-guoqing.jiang@linux.dev
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04 20:13:30 -04:00
Guoqing Jiang
51ac45a663 RDMA/siw: Reduce memory usage of struct siw_rx_stream
We can reduce the memory of the struct by move some of it's
member.

Before,

	/* size: 144, cachelines: 3, members: 17 */
	/* sum members: 124, holes: 3, sum holes: 12 */
	/* sum bitfield members: 7 bits (0 bytes) */
	/* padding: 7 */
	/* bit_padding: 1 bits */

After

	/* size: 128, cachelines: 2, members: 17 */
	/* padding: 3 */
	/* bit_padding: 1 bits */

Link: https://lore.kernel.org/r/20231203092655.28102-3-guoqing.jiang@linux.dev
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04 20:13:30 -04:00
Guoqing Jiang
84de14baf8 RDMA/siw: Move tx_cpu ahead
We can reduce one cacheline for the usage of struct siw_qp.

Before,

	/* size: 1928, cachelines: 31, members: 38 */
	/* sum members: 1920, holes: 2, sum holes: 8 */
	/* paddings: 4, sum paddings: 13 */
	/* forced alignments: 3 */

after

	/* size: 1920, cachelines: 30, members: 38 */
	/* paddings: 4, sum paddings: 13 */
	/* forced alignments: 3 */

Link: https://lore.kernel.org/r/20231203092655.28102-2-guoqing.jiang@linux.dev
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-12-04 20:13:30 -04:00
Guoqing Jiang
79844118d6 RDMA/siw: Update comments for siw_qp_sq_process
There is no siw_sq_work_handler in code, change it with siw_tx_thread
since siw_run_sq -> siw_sq_resume -> siw_qp_sq_process.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-18-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:14 +02:00
Guoqing Jiang
d9a5b48681 RDMA/siw: Introduce siw_destroy_cep_sock
Add one helper to simplify code a bit.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310091735.oG7bTvLR-lkp@intel.com/`
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-17-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:14 +02:00
Guoqing Jiang
788bbf4c2f RDMA/siw: Only check attrs->cap.max_send_wr in siw_create_qp
We can just check max_send_wr here given both max_send_wr and
max_recv_wr are defined as u32 type, and we also need to ensure
num_sqe (derived from max_send_wr) shouldn't be zero.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-16-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:14 +02:00
Guoqing Jiang
3beced14d1 RDMA/siw: Fix typo
Replace ORRQ with ORQ.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-15-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:14 +02:00
Guoqing Jiang
a410a73278 RDMA/siw: Remove siw_sk_save_upcalls
Let's move it into siw_sk_assign_cm_upcalls, then we only
need to get sk_callback_lock once.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-14-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:14 +02:00
Guoqing Jiang
77b59bd932 RDMA/siw: Cleanup siw_accept
With the initialization of rv and the two added label, we can
simplifiy code a bit.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-13-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:14 +02:00
Guoqing Jiang
08456d4db7 RDMA/siw: Introduce siw_free_cm_id
Factor out a helper to simplify code.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310091656.JlrmcNXB-lkp@intel.com/
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-12-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:14 +02:00
Guoqing Jiang
b5c9154320 RDMA/siw: Introduce siw_cep_set_free_and_put
Add the helper which can be used in some places.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-11-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:14 +02:00
Guoqing Jiang
25680c1f26 RDMA/siw: Add one parameter to siw_destroy_cpulist
With that we can reuse it in siw_init_cpulist.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-10-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:13 +02:00
Guoqing Jiang
6a343cc3bf RDMA/siw: Introduce SIW_STAG_MAX_INDEX
Add the macro to remove magic number in the code.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-9-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:13 +02:00
Guoqing Jiang
60d2136db8 RDMA/siw: Factor out siw_rx_data helper
Remove the redundant code given they share the same logic.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-8-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:13 +02:00
Guoqing Jiang
065186d228 RDMA/siw: No need to check term_info.valid before call siw_send_terminate
Remove the redundate checking since siw_send_terminate check it inside.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-7-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:13 +02:00
Guoqing Jiang
659da08ed8 RDMA/siw: Remove rcu from siw_qp
Remove it since it is not used.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-6-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:13 +02:00
Guoqing Jiang
d248960941 RDMA/siw: Remove goto lable in siw_mmap
Let's remove it since the failure case only falls through
to the useless label.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-5-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:13 +02:00
Guoqing Jiang
2109ddf032 RDMA/siw: Use iov.iov_len in kernel_sendmsg
We can pass iov.iov_len here.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-4-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:13 +02:00
Guoqing Jiang
a2b64565e8 RDMA/siw: Introduce siw_update_skb_rcvd
There are some places share the same logic, factor a common
helper for it.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-3-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:13 +02:00
Guoqing Jiang
3a179fe34a RDMA/siw: Introduce siw_get_page
Add the wrapper function to get either pbl page or umem page.

Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20231113115726.12762-2-guoqing.jiang@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-15 15:58:13 +02:00
Eric Biggers
9aac6c05a5 RDMA/siw: Use crypto_shash_digest() in siw_qp_prepare_tx()
Simplify siw_qp_prepare_tx() by using crypto_shash_digest() instead of
an init+update+final sequence.  This should also improve performance.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20231029045839.154071-1-ebiggers@kernel.org
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-13 10:35:12 +02:00
Bernard Metzler
476b7c7e00 RDMA/siw: Use ib_umem_get() to pin user pages
Abandon siw private code to pin user pages during user
memory registration, but use ib_umem_get() instead.
This will help maintaining the driver in case of changes
to the memory subsystem.

Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Link: https://lore.kernel.org/r/20231104075643.195186-1-bmt@zurich.ibm.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-11-13 10:14:00 +02:00
Jason Gunthorpe
162e348024 Merge tag 'v6.6' into rdma.git for-next
Resolve conflict by taking the spin_lock hunk from for-next:

 https://lore.kernel.org/r/20230928113851.5197a1ec@canb.auug.org.au

Required for the next patch.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-31 10:54:48 -03:00
Kees Cook
0bc018b7a7 RDMA/siw: Annotate struct siw_pbl with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle[1], add __counted_by for struct siw_pbl.

[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci

Cc: Bernard Metzler <bmt@zurich.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230929180431.3005464-4-keescook@chromium.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-10-02 14:44:54 +03:00
Bernard Metzler
53a3f77704 RDMA/siw: Fix connection failure handling
In case immediate MPA request processing fails, the newly
created endpoint unlinks the listening endpoint and is
ready to be dropped. This special case was not handled
correctly by the code handling the later TCP socket close,
causing a NULL dereference crash in siw_cm_work_handler()
when dereferencing a NULL listener. We now also cancel
the useless MPA timeout, if immediate MPA request
processing fails.

This patch furthermore simplifies MPA processing in general:
Scheduling a useless TCP socket read in sk_data_ready() upcall
is now surpressed, if the socket is already moved out of
TCP_ESTABLISHED state.

Fixes: 6c52fdc244 ("rdma/siw: connection management")
Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Link: https://lore.kernel.org/r/20230905145822.446263-1-bmt@zurich.ibm.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-09-11 14:53:23 +03:00
Krzysztof Kozlowski
3ec648c631 IB: Use capital "OR" for multiple licenses in SPDX
Documentation/process/license-rules.rst and checkpatch expect the SPDX
identifier syntax for multiple licenses to use capital "OR".  Correct it
to keep consistent format and avoid copy-paste issues.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230823092912.122674-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-09-11 14:14:00 +03:00
Linus Torvalds
f7e97ce269 v6.6 merge window RDMA pull request
Many small changes across the subystem, some highlights:
 
 - Usual driver cleanups in qedr, siw, erdma, hfi1, mlx4/5, irdma, mthca,
   hns, and bnxt_re
 
 - siw now works over tunnel and other netdevs with a MAC address by
   removing assumptions about a MAC/GID from the connection manager
 
 - "Doorbell Pacing" for bnxt_re - this is a best effort scheme to allow
   userspace to slow down the doorbell rings if the HW gets full
 
 - irdma egress VLAN priority, better QP/WQ sizing
 
 - rxe bug fixes in queue draining and srq resizing
 
 - Support more ethernet speed options in the core layer
 
 - DMABUF support for bnxt_re
 
 - Multi-stage MTT support for erdma to allow much bigger MR registrations
 
 - A irdma fix with a CVE that came in too late to go to -rc, missing
   bounds checking for 0 length MRs
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZPEqkAAKCRCFwuHvBreF
 YZrNAPoCBfU+VjCKNr2yqF7s52os5ZdBV7Uuh4txHcXWW9H7GAD/f19i2u62fzNu
 C27jj4cztemMBb8mgwyxPw/wLg7NLwY=
 =pC6k
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Many small changes across the subystem, some highlights:

   - Usual driver cleanups in qedr, siw, erdma, hfi1, mlx4/5, irdma,
     mthca, hns, and bnxt_re

   - siw now works over tunnel and other netdevs with a MAC address by
     removing assumptions about a MAC/GID from the connection manager

   - "Doorbell Pacing" for bnxt_re - this is a best effort scheme to
     allow userspace to slow down the doorbell rings if the HW gets full

   - irdma egress VLAN priority, better QP/WQ sizing

   - rxe bug fixes in queue draining and srq resizing

   - Support more ethernet speed options in the core layer

   - DMABUF support for bnxt_re

   - Multi-stage MTT support for erdma to allow much bigger MR
     registrations

   - A irdma fix with a CVE that came in too late to go to -rc, missing
     bounds checking for 0 length MRs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (87 commits)
  IB/hfi1: Reduce printing of errors during driver shut down
  RDMA/hfi1: Move user SDMA system memory pinning code to its own file
  RDMA/hfi1: Use list_for_each_entry() helper
  RDMA/mlx5: Fix trailing */ formatting in block comment
  RDMA/rxe: Fix redundant break statement in switch-case.
  RDMA/efa: Fix wrong resources deallocation order
  RDMA/siw: Call llist_reverse_order in siw_run_sq
  RDMA/siw: Correct wrong debug message
  RDMA/siw: Balance the reference of cep->kref in the error path
  Revert "IB/isert: Fix incorrect release of isert connection"
  RDMA/bnxt_re: Fix kernel doc errors
  RDMA/irdma: Prevent zero-length STAG registration
  RDMA/erdma: Implement hierarchical MTT
  RDMA/erdma: Refactor the storage structure of MTT entries
  RDMA/erdma: Renaming variable names and field names of struct erdma_mem
  RDMA/hns: Support hns HW stats
  RDMA/hns: Dump whole QP/CQ/MR resource in raw
  RDMA/irdma: Add missing kernel-doc in irdma_setup_umode_qp()
  RDMA/mlx4: Copy union directly
  RDMA/irdma: Drop unused kernel push code
  ...
2023-09-01 16:49:33 -07:00
Rohit Chavan
6812e06999 RDMA/rxe: Fix redundant break statement in switch-case.
Removed unreachable break statement after return.

Signed-off-by: Rohit Chavan <roheetchavan@gmail.com>
Link: https://lore.kernel.org/r/20230822091304.7312-1-roheetchavan@gmail.com
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-08-22 17:23:00 +03:00
Guoqing Jiang
9dfccb6d0d RDMA/siw: Call llist_reverse_order in siw_run_sq
We can call the function to get fifo list.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20230821133255.31111-4-guoqing.jiang@linux.dev
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-08-22 17:05:12 +03:00
Guoqing Jiang
bee024d204 RDMA/siw: Correct wrong debug message
We need to print num_sle first then pbl->max_buf per the condition.
Also replace mem->pbl with pbl while at it.

Fixes: 303ae1cdfd ("rdma/siw: application interface")
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20230821133255.31111-3-guoqing.jiang@linux.dev
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-08-22 17:05:12 +03:00
Guoqing Jiang
b056327bee RDMA/siw: Balance the reference of cep->kref in the error path
The siw_connect can go to err in below after cep is allocated successfully:

1. If siw_cm_alloc_work returns failure. In this case socket is not
assoicated with cep so siw_cep_put can't be called by siw_socket_disassoc.
We need to call siw_cep_put twice since cep->kref is increased once after
it was initialized.

2. If siw_cm_queue_work can't find a work, which means siw_cep_get is not
called in siw_cm_queue_work, so cep->kref is increased twice by siw_cep_get
and when associate socket with cep after it was initialized. So we need to
call siw_cep_put three times (one in siw_socket_disassoc).

3. siw_send_mpareqrep returns error, this scenario is similar as 2.

So we need to remove one siw_cep_put in the error path.

Fixes: 6c52fdc244 ("rdma/siw: connection management")
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20230821133255.31111-2-guoqing.jiang@linux.dev
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-08-22 17:05:12 +03:00
Bob Pearson
5d122db2ff RDMA/rxe: Fix incomplete state save in rxe_requester
If a send packet is dropped by the IP layer in rxe_requester()
the call to rxe_xmit_packet() can fail with err == -EAGAIN.
To recover, the state of the wqe is restored to the state before
the packet was sent so it can be resent. However, the routines
that save and restore the state miss a significnt part of the
variable state in the wqe, the dma struct which is used to process
through the sge table. And, the state is not saved before the packet
is built which modifies the dma struct.

Under heavy stress testing with many QPs on a fast node sending
large messages to a slow node dropped packets are observed and
the resent packets are corrupted because the dma struct was not
restored. This patch fixes this behavior and allows the test cases
to succeed.

Fixes: 3050b99850 ("IB/rxe: Fix race condition between requester and completer")
Link: https://lore.kernel.org/r/20230721200748.4604-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-07-31 15:24:12 -03:00
Bob Pearson
cc28f35115 RDMA/rxe: Fix rxe_modify_srq
This patch corrects an error in rxe_modify_srq where if the
caller changes the srq size the actual new value is not returned
to the caller since it may be larger than what is requested.
Additionally it open codes the subroutine rcv_wqe_size() which
adds very little value, and makes some whitespace changes.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20230620140142.9452-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-07-31 15:24:12 -03:00
Bob Pearson
5993b75d0b RDMA/rxe: Fix unsafe drain work queue code
If create_qp does not fully succeed it is possible for qp cleanup
code to attempt to drain the send or recv work queues before the
queues have been created causing a seg fault. This patch checks
to see if the queues exist before attempting to drain them.

Link: https://lore.kernel.org/r/20230620135519.9365-3-rpearsonhpe@gmail.com
Reported-by: syzbot+2da1965168e7dbcba136@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-rdma/00000000000012d89205fe7cfe00@google.com/raw
Fixes: 49dc9c1f0c ("RDMA/rxe: Cleanup reset state handling in rxe_resp.c")
Fixes: fbdeb828a2 ("RDMA/rxe: Cleanup error state handling in rxe_comp.c")
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-07-31 15:24:12 -03:00
Bob Pearson
e0ba8ff467 RDMA/rxe: Move work queue code to subroutines
This patch:
	- Moves code to initialize a qp send work queue to a
	  subroutine named rxe_init_sq.
	- Moves code to initialize a qp recv work queue to a
	  subroutine named rxe_init_rq.
	- Moves initialization of qp request and response packet
	  queues ahead of work queue initialization so that cleanup
	  of a qp if it is not fully completed can successfully
	  attempt to drain the packet queues without a seg fault.
	- Makes minor whitespace cleanups.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20230620135519.9365-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-07-31 15:24:12 -03:00
Bernard Metzler
91f36237b4 RDMA/siw: Fix tx thread initialization.
Immediately removing the siw module after insertion may
crash in siw_stop_tx_thread(), if the according thread did
not yet had a chance to initialize its wait queue and
siw_stop_tx_thread() tries to wakeup that thread. Initializing
the threads state before spwaning it fixes it.

Reported-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Link: https://lore.kernel.org/r/20230728114418.124328-1-bmt@zurich.ibm.com
Tested-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-07-31 10:05:23 +03:00
Chuck Lever
bad5b6e34f RDMA/siw: Fabricate a GID on tun and loopback devices
LOOPBACK and NONE (tunnel) devices have all-zero MAC addresses.
Currently, siw_device_create() falls back to copying the IB device's
name in those cases, because an all-zero MAC address breaks the RDMA
core address resolution mechanism.

However, at the point when siw_device_create() constructs a GID, the
ib_device::name field is uninitialized, leaving the MAC address to
remain in an all-zero state.

Fabricate a random artificial GID for such devices, and ensure this
artificial GID is returned for all device query operations.

Link: https://lore.kernel.org/r/168960673260.3007.12378736853793339110.stgit@manet.1015granger.net
Reported-by: Tom Talpey <tom@talpey.com>
Fixes: a2d36b02c1 ("RDMA/siw: Enable siw on tunnel devices")
Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-07-21 16:00:18 -03:00
Julia Lawall
9191df0029 RDMA/siw: use vmalloc_array and vcalloc
Use vmalloc_array and vcalloc to protect against
multiplication overflows.

The changes were done using the following Coccinelle
semantic patch:

// <smpl>
@initialize:ocaml@
@@

let rename alloc =
  match alloc with
    "vmalloc" -> "vmalloc_array"
  | "vzalloc" -> "vcalloc"
  | _ -> failwith "unknown"

@@
    size_t e1,e2;
    constant C1, C2;
    expression E1, E2, COUNT, x1, x2, x3;
    typedef u8;
    typedef __u8;
    type t = {u8,__u8,char,unsigned char};
    identifier alloc = {vmalloc,vzalloc};
    fresh identifier realloc = script:ocaml(alloc) { rename alloc };
@@

(
      alloc(x1*x2*x3)
|
      alloc(C1 * C2)
|
      alloc((sizeof(t)) * (COUNT), ...)
|
-     alloc((e1) * (e2))
+     realloc(e1, e2)
|
-     alloc((e1) * (COUNT))
+     realloc(COUNT, e1)
|
-     alloc((E1) * (E2))
+     realloc(E1, E2)
)
// </smpl>

Link: https://lore.kernel.org/r/20230627144339.144478-15-Julia.Lawall@inria.fr
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-07-21 15:52:09 -03:00
Christophe JAILLET
5c719d7aef RDMA/rxe: Fix an error handling path in rxe_bind_mw()
All errors go to the error handling path, except this one. Be consistent
and also branch to it.

Fixes: 02ed253770 ("RDMA/rxe: Introduce rxe access supported flags")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/43698d8a3ed4e720899eadac887427f73d7ec2eb.1689623735.git.christophe.jaillet@wanadoo.fr
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-07-18 15:22:43 +03:00
Linus Torvalds
7ede5f78a0 v6.5 merge window RDMA pull request
This cycle saw a focus on rxe and bnxt_re drivers:
 
 - Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma
 
 - rxe uses workqueues instead of tasklets
 
 - rxe has better compliance around access checks for MRs and rereg_mr
 
 - mana supportst he 'v2' FW interface for RX coalescing
 
 - hfi1 bug fix for stale cache entries in its MR cache
 
 - mlx5 buf fix to handle FW failures when destroying QPs
 
 - erdma HW has a new doorbell allocation mechanism for uverbs that is
   secure
 
 - Lots of small cleanups and rework in bnxt_re
    * Use the common mmap functions
    * Support disassociation
    * Improve FW command flow
 
 - bnxt_re support for "low latency push", this allows a packet
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZJxq+wAKCRCFwuHvBreF
 YSN+AQDKAmWdKCqmc3boMIq5wt4h7yYzdW47LpzGarOn5Hf+UgEA6mpPJyRqB43C
 CNXYIbASl/LLaWzFvxCq/AYp6tzuog4=
 =I8ju
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This cycle saw a focus on rxe and bnxt_re drivers:

   - Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma

   - rxe uses workqueues instead of tasklets

   - rxe has better compliance around access checks for MRs and rereg_mr

   - mana supportst he 'v2' FW interface for RX coalescing

   - hfi1 bug fix for stale cache entries in its MR cache

   - mlx5 buf fix to handle FW failures when destroying QPs

   - erdma HW has a new doorbell allocation mechanism for uverbs that is
     secure

   - Lots of small cleanups and rework in bnxt_re:
       - Use the common mmap functions
       - Support disassociation
       - Improve FW command flow
       - support for 'low latency push'"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits)
  RDMA/bnxt_re: Fix an IS_ERR() vs NULL check
  RDMA/bnxt_re: Fix spelling mistake "priviledged" -> "privileged"
  RDMA/bnxt_re: Remove duplicated include in bnxt_re/main.c
  RDMA/bnxt_re: Refactor code around bnxt_qplib_map_rc()
  RDMA/bnxt_re: Remove incorrect return check from slow path
  RDMA/bnxt_re: Enable low latency push
  RDMA/bnxt_re: Reorg the bar mapping
  RDMA/bnxt_re: Move the interface version to chip context structure
  RDMA/bnxt_re: Query function capabilities from firmware
  RDMA/bnxt_re: Optimize the bnxt_re_init_hwrm_hdr usage
  RDMA/bnxt_re: Add disassociate ucontext support
  RDMA/bnxt_re: Use the common mmap helper functions
  RDMA/bnxt_re: Initialize opcode while sending message
  RDMA/cma: Remove NULL check before dev_{put, hold}
  RDMA/rxe: Simplify cq->notify code
  RDMA/rxe: Fixes mr access supported list
  RDMA/bnxt_re: optimize the parameters passed to helper functions
  RDMA/bnxt_re: remove redundant cmdq_bitmap
  RDMA/bnxt_re: use firmware provided max request timeout
  RDMA/bnxt_re: cancel all control path command waiters upon error
  ...
2023-06-29 21:01:17 -07:00
Linus Torvalds
3a8a670eee Networking changes for 6.5.
Core
 ----
 
  - Rework the sendpage & splice implementations. Instead of feeding
    data into sockets page by page extend sendmsg handlers to support
    taking a reference on the data, controlled by a new flag called
    MSG_SPLICE_PAGES. Rework the handling of unexpected-end-of-file
    to invoke an additional callback instead of trying to predict what
    the right combination of MORE/NOTLAST flags is.
    Remove the MSG_SENDPAGE_NOTLAST flag completely.
 
  - Implement SCM_PIDFD, a new type of CMSG type analogous to
    SCM_CREDENTIALS, but it contains pidfd instead of plain pid.
 
  - Enable socket busy polling with CONFIG_RT.
 
  - Improve reliability and efficiency of reporting for ref_tracker.
 
  - Auto-generate a user space C library for various Netlink families.
 
 Protocols
 ---------
 
  - Allow TCP to shrink the advertised window when necessary, prevent
    sk_rcvbuf auto-tuning from growing the window all the way up to
    tcp_rmem[2].
 
  - Use per-VMA locking for "page-flipping" TCP receive zerocopy.
 
  - Prepare TCP for device-to-device data transfers, by making sure
    that payloads are always attached to skbs as page frags.
 
  - Make the backoff time for the first N TCP SYN retransmissions
    linear. Exponential backoff is unnecessarily conservative.
 
  - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO).
 
  - Avoid waking up applications using TLS sockets until we have
    a full record.
 
  - Allow using kernel memory for protocol ioctl callbacks, paving
    the way to issuing ioctls over io_uring.
 
  - Add nolocalbypass option to VxLAN, forcing packets to be fully
    encapsulated even if they are destined for a local IP address.
 
  - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
    in-kernel ECMP implementation (e.g. Open vSwitch) select the same
    link for all packets. Support L4 symmetric hashing in Open vSwitch.
 
  - PPPoE: make number of hash bits configurable.
 
  - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
    (ipconfig).
 
  - Add layer 2 miss indication and filtering, allowing higher layers
    (e.g. ACL filters) to make forwarding decisions based on whether
    packet matched forwarding state in lower devices (bridge).
 
  - Support matching on Connectivity Fault Management (CFM) packets.
 
  - Hide the "link becomes ready" IPv6 messages by demoting their
    printk level to debug.
 
  - HSR: don't enable promiscuous mode if device offloads the proto.
 
  - Support active scanning in IEEE 802.15.4.
 
  - Continue work on Multi-Link Operation for WiFi 7.
 
 BPF
 ---
 
  - Add precision propagation for subprogs and callbacks. This allows
    maintaining verification efficiency when subprograms are used,
    or in fact passing the verifier at all for complex programs,
    especially those using open-coded iterators.
 
  - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
    assumed the length is always equal to the amount of written data.
    But some protos allow passing a NULL buffer to discover what
    the output buffer *should* be, without writing anything.
 
  - Accept dynptr memory as memory arguments passed to helpers.
 
  - Add routing table ID to bpf_fib_lookup BPF helper.
 
  - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands.
 
  - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
    maps as read-only).
 
  - Show target_{obj,btf}_id in tracing link fdinfo.
 
  - Addition of several new kfuncs (most of the names are self-explanatory):
    - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
      bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
      and bpf_dynptr_clone().
    - bpf_task_under_cgroup()
    - bpf_sock_destroy() - force closing sockets
    - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
 
 Netfilter
 ---------
 
  - Relax set/map validation checks in nf_tables. Allow checking
    presence of an entry in a map without using the value.
 
  - Increase ip_vs_conn_tab_bits range for 64BIT builds.
 
  - Allow updating size of a set.
 
  - Improve NAT tuple selection when connection is closing.
 
 Driver API
 ----------
 
  - Integrate netdev with LED subsystem, to allow configuring HW
    "offloaded" blinking of LEDs based on link state and activity
    (i.e. packets coming in and out).
 
  - Support configuring rate selection pins of SFP modules.
 
  - Factor Clause 73 auto-negotiation code out of the drivers, provide
    common helper routines.
 
  - Add more fool-proof helpers for managing lifetime of MDIO devices
    associated with the PCS layer.
 
  - Allow drivers to report advanced statistics related to Time Aware
    scheduler offload (taprio).
 
  - Allow opting out of VF statistics in link dump, to allow more VFs
    to fit into the message.
 
  - Split devlink instance and devlink port operations.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Synopsys EMAC4 IP support (stmmac)
    - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
    - Marvell 88E6250 7 port switches
    - Microchip LAN8650/1 Rev.B0 PHYs
    - MediaTek MT7981/MT7988 built-in 1GE PHY driver
 
  - WiFi:
    - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
    - Realtek RTL8723DS (SDIO variant)
    - Realtek RTL8851BE
 
  - CAN:
    - Fintek F81604
 
 Drivers
 -------
 
  - Ethernet NICs:
    - Intel (100G, ice):
      - support dynamic interrupt allocation
      - use meta data match instead of VF MAC addr on slow-path
    - nVidia/Mellanox:
      - extend link aggregation to handle 4, rather than just 2 ports
      - spawn sub-functions without any features by default
    - OcteonTX2:
      - support HTB (Tx scheduling/QoS) offload
      - make RSS hash generation configurable
      - support selecting Rx queue using TC filters
    - Wangxun (ngbe/txgbe):
      - add basic Tx/Rx packet offloads
      - add phylink support (SFP/PCS control)
    - Freescale/NXP (enetc):
      - report TAPRIO packet statistics
    - Solarflare/AMD:
      - support matching on IP ToS and UDP source port of outer header
      - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
      - add devlink dev info support for EF10
 
  - Virtual NICs:
    - Microsoft vNIC:
      - size the Rx indirection table based on requested configuration
      - support VLAN tagging
    - Amazon vNIC:
      - try to reuse Rx buffers if not fully consumed, useful for ARM
        servers running with 16kB pages
    - Google vNIC:
      - support TCP segmentation of >64kB frames
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - enable USXGMII (88E6191X)
    - Microchip:
     - lan966x: add support for Egress Stage 0 ACL engine
     - lan966x: support mapping packet priority to internal switch
       priority (based on PCP or DSCP)
 
  - Ethernet PHYs:
    - Broadcom PHYs:
      - support for Wake-on-LAN for BCM54210E/B50212E
      - report LPI counter
    - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
    - Micrel PHYs: receive timestamp in the frame (LAN8841)
    - Realtek PHYs: support optional external PHY clock
    - Altera TSE PCS: merge the driver into Lynx PCS which it is
      a variant of
 
  - CAN: Kvaser PCIEcan:
    - support packet timestamping
 
  - WiFi:
    - Intel (iwlwifi):
      - major update for new firmware and Multi-Link Operation (MLO)
      - configuration rework to drop test devices and split
        the different families
      - support for segmented PNVM images and power tables
      - new vendor entries for PPAG (platform antenna gain) feature
    - Qualcomm 802.11ax (ath11k):
      - Multiple Basic Service Set Identifier (MBSSID) and
        Enhanced MBSSID Advertisement (EMA) support in AP mode
      - support factory test mode
    - RealTek (rtw89):
      - add RSSI based antenna diversity
      - support U-NII-4 channels on 5 GHz band
    - RealTek (rtl8xxxu):
      - AP mode support for 8188f
      - support USB RX aggregation for the newer chips
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmSbJM4ACgkQMUZtbf5S
 IrtoDhAAhEim1+LBIKf4lhPcVdZ2p/TkpnwTz5jsTwSeRBAxTwuNJ2fQhFXg13E3
 MnRq6QaEp8G4/tA/gynLvQop+FEZEnv+horP0zf/XLcC8euU7UrKdrpt/4xxdP07
 IL/fFWsoUGNO+L9LNaHwBo8g7nHvOkPscHEBHc2Xrvzab56TJk6vPySfLqcpKlNZ
 CHWDwTpgRqNZzSKiSpoMVd9OVMKUXcPYHpDmfEJ5l+e8vTXmZzOLHrSELHU5nP5f
 mHV7gxkDCTshoGcaed7UTiOvgu1p6E5EchDJxiLaSUbgsd8SZ3u4oXwRxgj33RK/
 fB2+UaLrRt/DdlHvT/Ph8e8Ygu77yIXMjT49jsfur/zVA0HEA2dFb7V6QlsYRmQp
 J25pnrdXmE15llgqsC0/UOW5J1laTjII+T2T70UOAqQl4LWYAQDG4WwsAqTzU0KY
 dueydDouTp9XC2WYrRUEQxJUzxaOaazskDUHc5c8oHp/zVBT+djdgtvVR9+gi6+7
 yy4elI77FlEEqL0ItdU/lSWINayAlPLsIHkMyhSGKX0XDpKjeycPqkNx4UterXB/
 JKIR5RBWllRft+igIngIkKX0tJGMU0whngiw7d1WLw25wgu4sB53hiWWoSba14hv
 tXMxwZs5iGaPcT38oRVMZz8I1kJM4Dz3SyI7twVvi4RUut64EG4=
 =9i4I
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking changes from Jakub Kicinski:
 "WiFi 7 and sendpage changes are the biggest pieces of work for this
  release. The latter will definitely require fixes but I think that we
  got it to a reasonable point.

  Core:

   - Rework the sendpage & splice implementations

     Instead of feeding data into sockets page by page extend sendmsg
     handlers to support taking a reference on the data, controlled by a
     new flag called MSG_SPLICE_PAGES

     Rework the handling of unexpected-end-of-file to invoke an
     additional callback instead of trying to predict what the right
     combination of MORE/NOTLAST flags is

     Remove the MSG_SENDPAGE_NOTLAST flag completely

   - Implement SCM_PIDFD, a new type of CMSG type analogous to
     SCM_CREDENTIALS, but it contains pidfd instead of plain pid

   - Enable socket busy polling with CONFIG_RT

   - Improve reliability and efficiency of reporting for ref_tracker

   - Auto-generate a user space C library for various Netlink families

  Protocols:

   - Allow TCP to shrink the advertised window when necessary, prevent
     sk_rcvbuf auto-tuning from growing the window all the way up to
     tcp_rmem[2]

   - Use per-VMA locking for "page-flipping" TCP receive zerocopy

   - Prepare TCP for device-to-device data transfers, by making sure
     that payloads are always attached to skbs as page frags

   - Make the backoff time for the first N TCP SYN retransmissions
     linear. Exponential backoff is unnecessarily conservative

   - Create a new MPTCP getsockopt to retrieve all info
     (MPTCP_FULL_INFO)

   - Avoid waking up applications using TLS sockets until we have a full
     record

   - Allow using kernel memory for protocol ioctl callbacks, paving the
     way to issuing ioctls over io_uring

   - Add nolocalbypass option to VxLAN, forcing packets to be fully
     encapsulated even if they are destined for a local IP address

   - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
     in-kernel ECMP implementation (e.g. Open vSwitch) select the same
     link for all packets. Support L4 symmetric hashing in Open vSwitch

   - PPPoE: make number of hash bits configurable

   - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
     (ipconfig)

   - Add layer 2 miss indication and filtering, allowing higher layers
     (e.g. ACL filters) to make forwarding decisions based on whether
     packet matched forwarding state in lower devices (bridge)

   - Support matching on Connectivity Fault Management (CFM) packets

   - Hide the "link becomes ready" IPv6 messages by demoting their
     printk level to debug

   - HSR: don't enable promiscuous mode if device offloads the proto

   - Support active scanning in IEEE 802.15.4

   - Continue work on Multi-Link Operation for WiFi 7

  BPF:

   - Add precision propagation for subprogs and callbacks. This allows
     maintaining verification efficiency when subprograms are used, or
     in fact passing the verifier at all for complex programs,
     especially those using open-coded iterators

   - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
     assumed the length is always equal to the amount of written data.
     But some protos allow passing a NULL buffer to discover what the
     output buffer *should* be, without writing anything

   - Accept dynptr memory as memory arguments passed to helpers

   - Add routing table ID to bpf_fib_lookup BPF helper

   - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands

   - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
     maps as read-only)

   - Show target_{obj,btf}_id in tracing link fdinfo

   - Addition of several new kfuncs (most of the names are
     self-explanatory):
      - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
        bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
        and bpf_dynptr_clone().
      - bpf_task_under_cgroup()
      - bpf_sock_destroy() - force closing sockets
      - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs

  Netfilter:

   - Relax set/map validation checks in nf_tables. Allow checking
     presence of an entry in a map without using the value

   - Increase ip_vs_conn_tab_bits range for 64BIT builds

   - Allow updating size of a set

   - Improve NAT tuple selection when connection is closing

  Driver API:

   - Integrate netdev with LED subsystem, to allow configuring HW
     "offloaded" blinking of LEDs based on link state and activity
     (i.e. packets coming in and out)

   - Support configuring rate selection pins of SFP modules

   - Factor Clause 73 auto-negotiation code out of the drivers, provide
     common helper routines

   - Add more fool-proof helpers for managing lifetime of MDIO devices
     associated with the PCS layer

   - Allow drivers to report advanced statistics related to Time Aware
     scheduler offload (taprio)

   - Allow opting out of VF statistics in link dump, to allow more VFs
     to fit into the message

   - Split devlink instance and devlink port operations

  New hardware / drivers:

   - Ethernet:
      - Synopsys EMAC4 IP support (stmmac)
      - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
      - Marvell 88E6250 7 port switches
      - Microchip LAN8650/1 Rev.B0 PHYs
      - MediaTek MT7981/MT7988 built-in 1GE PHY driver

   - WiFi:
      - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
      - Realtek RTL8723DS (SDIO variant)
      - Realtek RTL8851BE

   - CAN:
      - Fintek F81604

  Drivers:

   - Ethernet NICs:
      - Intel (100G, ice):
         - support dynamic interrupt allocation
         - use meta data match instead of VF MAC addr on slow-path
      - nVidia/Mellanox:
         - extend link aggregation to handle 4, rather than just 2 ports
         - spawn sub-functions without any features by default
      - OcteonTX2:
         - support HTB (Tx scheduling/QoS) offload
         - make RSS hash generation configurable
         - support selecting Rx queue using TC filters
      - Wangxun (ngbe/txgbe):
         - add basic Tx/Rx packet offloads
         - add phylink support (SFP/PCS control)
      - Freescale/NXP (enetc):
         - report TAPRIO packet statistics
      - Solarflare/AMD:
         - support matching on IP ToS and UDP source port of outer
           header
         - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
         - add devlink dev info support for EF10

   - Virtual NICs:
      - Microsoft vNIC:
         - size the Rx indirection table based on requested
           configuration
         - support VLAN tagging
      - Amazon vNIC:
         - try to reuse Rx buffers if not fully consumed, useful for ARM
           servers running with 16kB pages
      - Google vNIC:
         - support TCP segmentation of >64kB frames

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - enable USXGMII (88E6191X)
      - Microchip:
         - lan966x: add support for Egress Stage 0 ACL engine
         - lan966x: support mapping packet priority to internal switch
           priority (based on PCP or DSCP)

   - Ethernet PHYs:
      - Broadcom PHYs:
         - support for Wake-on-LAN for BCM54210E/B50212E
         - report LPI counter
      - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
      - Micrel PHYs: receive timestamp in the frame (LAN8841)
      - Realtek PHYs: support optional external PHY clock
      - Altera TSE PCS: merge the driver into Lynx PCS which it is a
        variant of

   - CAN: Kvaser PCIEcan:
      - support packet timestamping

   - WiFi:
      - Intel (iwlwifi):
         - major update for new firmware and Multi-Link Operation (MLO)
         - configuration rework to drop test devices and split the
           different families
         - support for segmented PNVM images and power tables
         - new vendor entries for PPAG (platform antenna gain) feature
      - Qualcomm 802.11ax (ath11k):
         - Multiple Basic Service Set Identifier (MBSSID) and Enhanced
           MBSSID Advertisement (EMA) support in AP mode
         - support factory test mode
      - RealTek (rtw89):
         - add RSSI based antenna diversity
         - support U-NII-4 channels on 5 GHz band
      - RealTek (rtl8xxxu):
         - AP mode support for 8188f
         - support USB RX aggregation for the newer chips"

* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
  net: scm: introduce and use scm_recv_unix helper
  af_unix: Skip SCM_PIDFD if scm->pid is NULL.
  net: lan743x: Simplify comparison
  netlink: Add __sock_i_ino() for __netlink_diag_dump().
  net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
  Revert "af_unix: Call scm_recv() only after scm_set_cred()."
  phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
  libceph: Partially revert changes to support MSG_SPLICE_PAGES
  net: phy: mscc: fix packet loss due to RGMII delays
  net: mana: use vmalloc_array and vcalloc
  net: enetc: use vmalloc_array and vcalloc
  ionic: use vmalloc_array and vcalloc
  pds_core: use vmalloc_array and vcalloc
  gve: use vmalloc_array and vcalloc
  octeon_ep: use vmalloc_array and vcalloc
  net: usb: qmi_wwan: add u-blox 0x1312 composition
  perf trace: fix MSG_SPLICE_PAGES build error
  ipvlan: Fix return value of ipvlan_queue_xmit()
  netfilter: nf_tables: fix underflow in chain reference counter
  netfilter: nf_tables: unbind non-anonymous set if rule construction fails
  ...
2023-06-28 16:43:10 -07:00
Linus Torvalds
6e17c6de3d - Yosry Ahmed brought back some cgroup v1 stats in OOM logs.
- Yosry has also eliminated cgroup's atomic rstat flushing.
 
 - Nhat Pham adds the new cachestat() syscall.  It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability.
 
 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning.
 
 - Lorenzo Stoakes has done some maintanance work on the get_user_pages()
   interface.
 
 - Liam Howlett continues with cleanups and maintenance work to the maple
   tree code.  Peng Zhang also does some work on maple tree.
 
 - Johannes Weiner has done some cleanup work on the compaction code.
 
 - David Hildenbrand has contributed additional selftests for
   get_user_pages().
 
 - Thomas Gleixner has contributed some maintenance and optimization work
   for the vmalloc code.
 
 - Baolin Wang has provided some compaction cleanups,
 
 - SeongJae Park continues maintenance work on the DAMON code.
 
 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting.
 
 - Christoph Hellwig has some cleanups for the filemap/directio code.
 
 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the provided
   APIs rather than open-coding accesses.
 
 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings.
 
 - John Hubbard has a series of fixes to the MM selftesting code.
 
 - ZhangPeng continues the folio conversion campaign.
 
 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock.
 
 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
   128 to 8.
 
 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management.
 
 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code.
 
 - Vishal Moola also has done some folio conversion work.
 
 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
 joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
 J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
 =B7yQ
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:

 - Yosry Ahmed brought back some cgroup v1 stats in OOM logs

 - Yosry has also eliminated cgroup's atomic rstat flushing

 - Nhat Pham adds the new cachestat() syscall. It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability

 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning

 - Lorenzo Stoakes has done some maintanance work on the
   get_user_pages() interface

 - Liam Howlett continues with cleanups and maintenance work to the
   maple tree code. Peng Zhang also does some work on maple tree

 - Johannes Weiner has done some cleanup work on the compaction code

 - David Hildenbrand has contributed additional selftests for
   get_user_pages()

 - Thomas Gleixner has contributed some maintenance and optimization
   work for the vmalloc code

 - Baolin Wang has provided some compaction cleanups,

 - SeongJae Park continues maintenance work on the DAMON code

 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting

 - Christoph Hellwig has some cleanups for the filemap/directio code

 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the
   provided APIs rather than open-coding accesses

 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings

 - John Hubbard has a series of fixes to the MM selftesting code

 - ZhangPeng continues the folio conversion campaign

 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock

 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
   from 128 to 8

 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management

 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code

 - Vishal Moola also has done some folio conversion work

 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch

* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
  mm/hugetlb: remove hugetlb_set_page_subpool()
  mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
  hugetlb: revert use of page_cache_next_miss()
  Revert "page cache: fix page_cache_next/prev_miss off by one"
  mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
  mm: memcg: rename and document global_reclaim()
  mm: kill [add|del]_page_to_lru_list()
  mm: compaction: convert to use a folio in isolate_migratepages_block()
  mm: zswap: fix double invalidate with exclusive loads
  mm: remove unnecessary pagevec includes
  mm: remove references to pagevec
  mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
  mm: remove struct pagevec
  net: convert sunrpc from pagevec to folio_batch
  i915: convert i915_gpu_error to use a folio_batch
  pagevec: rename fbatch_count()
  mm: remove check_move_unevictable_pages()
  drm: convert drm_gem_put_pages() to use a folio_batch
  i915: convert shmem_sg_free_table() to use a folio_batch
  scatterlist: add sg_set_folio()
  ...
2023-06-28 10:28:11 -07:00
Linus Torvalds
af96134dc8 RCU pull request for v6.5
This pull contains the following branches:
 
 doc.2023.05.10a: Documentation updates
 
 fixes.2023.05.11a: Miscellaneous fixes, perhaps most notably:
 
 o	Remove RCU_NONIDLE().  The new visibility of most of the idle
 	loop to RCU has obsoleted this API.
 
 o	Make the RCU_SOFTIRQ callback-invocation time limit also apply
 	to the rcuc kthreads that invoke callbacks for CONFIG_PREEMPT_RT.
 
 o	Add a jiffies-based callback-invocation time limit to handle
 	long-running callbacks.  (The local_clock() function is only
 	invoked once per 32 callbacks due to its high overhead.)
 
 o	Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs,
 	which fixes a bug that can occur on systems with non-contiguous
 	CPU numbering.
 
 kvfree.2023.05.10a: kvfree_rcu updates
 
 o	Eliminate the single-argument variant of k[v]free_rcu() now
 	that all uses have been converted to k[v]free_rcu_mightsleep().
 
 o	Add WARN_ON_ONCE() checks for k[v]free_rcu*() freeing callbacks
 	too soon.  Yes, this is closing the barn door after the horse
 	has escaped, but Murphy says that there will be more horses.
 
 nocb.2023.05.11a: Callback-offloading updates
 
 o	Fix a number of bugs involving the shrinker and lazy callbacks.
 
 rcu-tasks.2023.05.10a: Tasks RCU updates
 
 torture.2023.05.15a: Torture-test updates
 
 rcu-urgent.2023.06.06a: Urgent SRCU fix (already pulled)
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmSUuukTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jLB5EACWArBYSbXh9kx6RP3LRkOd//fQWuqx
 z/RmHjMx3a2uIQpsbeAj+jrgHYzSOi7Afdnx2s0gUIWGjpF4d+e31eco9xTQtWIs
 A3/pXUlcTyaPXEZh5ro763UyBF/K003TAdo7EZAScTfDNp2knqGdEOyXTOXiAULX
 GH922kIqg0chbYaWocLY3g5mXeEm+kGY8GrDAB7/B3jHgoyylXzmSULDP4GQV7hw
 DkM0GOlc3TSzHonnNS6j1xboqY4HhWIDkBrD4Oh5P//ttMpb1b6gs1zEyjCQcNBe
 a6fnNF+0dUwANIZKroPn/L1uTGsEUhmLFkVK+XIuAit97yWI6t+aRH6TzHHYmkpu
 wVmLxv/FbJohP7ArWaI8l0gNl0vkli3ZgQXnRvSpCqIFR93AWVMeZsDTGOcLUdry
 AZEnuGXHnc9UB0KGOIras0o/EQezKq57JUV2bBZjl/GIDc3qiaJKnBhHysPc1iuE
 UfP052vCaoZxO3U/FrObQhjLZnstKBYHj8WolxMjIyNMlRIvDro6O1WG4+mjeLDP
 xdrjKGstsJh80CYDei+vJBXsbszhxv8yV4hCQX9JcDl3RjEqOOxgKUnAaP2mm02O
 MX33P3MZvSsHGoxkJpXDSlkQlbNqDBMIjZXbZLRF4o8fPhVmQU/4QlJN0iFOoXaQ
 1qqGrerEzfn0Jw==
 =3LCd
 -----END PGP SIGNATURE-----

Merge tag 'rcu.2023.06.22a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:
 "Documentation updates

  Miscellaneous fixes, perhaps most notably:

   - Remove RCU_NONIDLE(). The new visibility of most of the idle loop
     to RCU has obsoleted this API.

   - Make the RCU_SOFTIRQ callback-invocation time limit also apply to
     the rcuc kthreads that invoke callbacks for CONFIG_PREEMPT_RT.

   - Add a jiffies-based callback-invocation time limit to handle
     long-running callbacks. (The local_clock() function is only invoked
     once per 32 callbacks due to its high overhead.)

   - Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs, which
     fixes a bug that can occur on systems with non-contiguous CPU
     numbering.

  kvfree_rcu updates:

   - Eliminate the single-argument variant of k[v]free_rcu() now that
     all uses have been converted to k[v]free_rcu_mightsleep().

   - Add WARN_ON_ONCE() checks for k[v]free_rcu*() freeing callbacks too
     soon. Yes, this is closing the barn door after the horse has
     escaped, but Murphy says that there will be more horses.

  Callback-offloading updates:

   - Fix a number of bugs involving the shrinker and lazy callbacks.

  Tasks RCU updates

  Torture-test updates"

* tag 'rcu.2023.06.22a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (32 commits)
  torture: Remove duplicated argument -enable-kvm for ppc64
  doc/rcutorture: Add description of rcutorture.stall_cpu_block
  rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale
  rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup()
  rcutorture: Correct name of use_softirq module parameter
  locktorture: Add long_hold to adjust lock-hold delays
  rcu/nocb: Make shrinker iterate only over NOCB CPUs
  rcu-tasks: Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs
  rcu: Make rcu_cpu_starting() rely on interrupts being disabled
  rcu: Mark rcu_cpu_kthread() accesses to ->rcu_cpu_has_work
  rcu: Mark additional concurrent load from ->cpu_no_qs.b.exp
  rcu: Employ jiffies-based backstop to callback time limit
  rcu: Check callback-invocation time limit for rcuc kthreads
  rcu: Remove RCU_NONIDLE()
  rcu: Add more RCU files to kernel-api.rst
  rcu-tasks: Clarify the cblist_init_generic() function's pr_info() output
  rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()
  rcu/nocb: Recheck lazy callbacks under the ->nocb_lock from shrinker
  rcu/nocb: Fix shrinker race against callback enqueuer
  rcu/nocb: Protect lazy shrinker against concurrent (de-)offloading
  ...
2023-06-27 10:37:01 -07:00
Jason Gunthorpe
5f004bcaee Linux 6.4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmSYzfYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/ucH/iOM/1Py/fSg0qSs
 7NJ4XXlourT5zrnRMom3cm3d9gYqgTzgvKFL3kjMEexTRVYbhlcO4ZPRsiry8zxF
 ToGX+V8tDMqb8WSdFHzkljRY+zDRyfEUDMlTzROAD9DunLmQtkJKyrggkeGdjkpP
 OyfGqKpwlLXZRAXBil/U8Mx9MHdjJubloZwghLZr33VdUZa68+JJ9l6w163Oe/ET
 K264NM0wxN/kvN57JvePgqMccQwpINylg8IhRI+XelgczjUXeJBsOA8TDv4bDN4Q
 bjCLhkWbIaZtTYqvOXa/kD0T8wd7KETsMBQN8YzyDh6W0GmAlJjTawyAhA6jA5in
 x3uz2W8=
 =L3zp
 -----END PGP SIGNATURE-----

Merge tag 'v6.4' into rdma.git for-next

Linux 6.4

Resolve conflicts between rdma rc and next in rxe_cq matching linux-next:

drivers/infiniband/sw/rxe/rxe_cq.c:
  https://lore.kernel.org/r/20230622115246.365d30ad@canb.auug.org.au

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-27 14:06:29 -03:00
David Howells
f8dd95b29d tcp_bpf, smc, tls, espintcp, siw: Reduce MSG_SENDPAGE_NOTLAST usage
As MSG_SENDPAGE_NOTLAST is being phased out along with sendpage(), don't
use it further in than the sendpage methods, but rather translate it to
MSG_MORE and use that instead.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: Bernard Metzler <bmt@zurich.ibm.com>
cc: Jason Gunthorpe <jgg@ziepe.ca>
cc: Leon Romanovsky <leon@kernel.org>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Sitnicki <jakub@cloudflare.com>
cc: David Ahern <dsahern@kernel.org>
cc: Karsten Graul <kgraul@linux.ibm.com>
cc: Wenjia Zhang <wenjia@linux.ibm.com>
cc: Jan Karcher <jaka@linux.ibm.com>
cc: "D. Wythe" <alibuda@linux.alibaba.com>
cc: Tony Lu <tonylu@linux.alibaba.com>
cc: Wen Gu <guwen@linux.alibaba.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: Steffen Klassert <steffen.klassert@secunet.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/20230623225513.2732256-2-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24 15:50:12 -07:00
Bob Pearson
350b6dd4b2 RDMA/rxe: Simplify cq->notify code
The flags parameter to the request notify verb is a bitmask. But, rxe
driver treats cq->notify as an int. If someone ever set both the
IB_CQ_SOLICITED and the IB_CQ_NEXT_COMP bits rxe_cq_post could fail to
generate a completion event. This patch treats the notify flags as a bit
mask consistently and can handle the above case correctly.

Link: https://lore.kernel.org/r/20230612162244.20038-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-20 12:46:18 -03:00
Bob Pearson
ee678e5dff RDMA/rxe: Fixes mr access supported list
A recent patch incorrectly did not include IB_ACCESS_RELAXED_ORDERING in
the list of supported access flags for the rxe driver. The driver actually
does nothing related to relaxed ordering but it causes no problems to
include it as supported but with no effect. This change caused ib_send_bw
and friends to not run correctly.

The correct approach is for the driver to allow any of the optional access
flags and otherwise ignore them. This patch adds IB_ACCESS_OPTIONAL to the
list of rxe supported flags.

Fixes: 02ed253770 ("RDMA/rxe: Introduce rxe access supported flags")
Link: https://lore.kernel.org/r/20230613171654.19334-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-20 11:42:39 -03:00
Jakub Kicinski
173780ff18 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/linux/mlx5/driver.h
  617f5db1a6 ("RDMA/mlx5: Fix affinity assignment")
  dc13180824 ("net/mlx5: Enable devlink port for embedded cpu VF vports")
https://lore.kernel.org/all/20230613125939.595e50b8@canb.auug.org.au/

tools/testing/selftests/net/mptcp/mptcp_join.sh
  47867f0a7e ("selftests: mptcp: join: skip check if MIB counter not supported")
  425ba80312 ("selftests: mptcp: join: support RM_ADDR for used endpoints or not")
  45b1a1227a ("mptcp: introduces more address related mibs")
  0639fa230a ("selftests: mptcp: add explicit check for new mibs")
https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-15 22:19:41 -07:00
Bob Pearson
0c7e314a63 RDMA/rxe: Fix rxe_cq_post
A recent patch replaced a tasklet execution of cq->comp_handler by a
direct call. While this made sense it let changes to cq->notify state be
unprotected and assumed that the cq completion machinery and the ulp done
callbacks were reentrant. The result is that in some cases completion
events can be lost. This patch moves the cq->comp_handler call inside of
the spinlock in rxe_cq_post which solves both issues. This is compatible
with the matching code in the request notify verb.

Fixes: 78b26a3353 ("RDMA/rxe: Remove tasklet call from rxe_cq.c")
Link: https://lore.kernel.org/r/20230612155032.17036-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-14 14:12:43 -03:00
Lorenzo Stoakes
4c630f3074 mm/gup: remove vmas parameter from pin_user_pages()
We are now in a position where no caller of pin_user_pages() requires the
vmas parameter at all, so eliminate this parameter from the function and
all callers.

This clears the way to removing the vmas parameter from GUP altogether.

Link: https://lkml.kernel.org/r/195a99ae949c9f5cb589d2222b736ced96ec199a.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>	[qib]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com>	[drivers/media]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:26 -07:00
Bob Pearson
c3e1bf626e RDMA/rxe: Send last wqe reached event on qp cleanup
The IBA requires:
	o11-5.2.5: If the HCA supports SRQ, for RC and UD service,
	the CI shall generate a Last WQE Reached Affiliated Asynchronous
	Event on a QP that is in the Error State and is associated with
	an SRQ when either:
		• a CQE is generated for the last WQE, or
		• the QP gets in the Error State and there are no more
		  WQEs on the RQ.

This patch implements this behavior in flush_recv_queue() which is called
as a result of rxe_qp_error() being called whenever the qp is put into the
error state. The rxe responder executes SRQ WQEs directly from the SRQ so
there are never more WQES on the RQ.

Link: https://lore.kernel.org/r/20230602164229.9277-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-09 14:06:23 -03:00
Zhu Yanjun
2a62b6210c RDMA/rxe: Fix the use-before-initialization error of resp_pkts
In the following:

  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:88 [inline]
   dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
   assign_lock_key kernel/locking/lockdep.c:982 [inline]
   register_lock_class+0xdb6/0x1120 kernel/locking/lockdep.c:1295
   __lock_acquire+0x10a/0x5df0 kernel/locking/lockdep.c:4951
   lock_acquire kernel/locking/lockdep.c:5691 [inline]
   lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5656
   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
   _raw_spin_lock_irqsave+0x3d/0x60 kernel/locking/spinlock.c:162
   skb_dequeue+0x20/0x180 net/core/skbuff.c:3639
   drain_resp_pkts drivers/infiniband/sw/rxe/rxe_comp.c:555 [inline]
   rxe_completer+0x250d/0x3cc0 drivers/infiniband/sw/rxe/rxe_comp.c:652
   rxe_qp_do_cleanup+0x1be/0x820 drivers/infiniband/sw/rxe/rxe_qp.c:761
   execute_in_process_context+0x3b/0x150 kernel/workqueue.c:3473
   __rxe_cleanup+0x21e/0x370 drivers/infiniband/sw/rxe/rxe_pool.c:233
   rxe_create_qp+0x3f6/0x5f0 drivers/infiniband/sw/rxe/rxe_verbs.c:583

This is a use-before-initialization problem.

It happens because rxe_qp_do_cleanup is called during error unwind before
the struct has been fully initialized.

Move the initialization of the skb earlier.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20230602035408.741534-1-yanjun.zhu@intel.com
Reported-by: syzbot+eba589d8f49c73d356da@syzkaller.appspotmail.com
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-09 13:22:58 -03:00
Bob Pearson
544c7f62cf RDMA/rxe: Implement rereg_user_mr
Implement the two easy cases of ib_rereg_user_mr.

Link: https://lore.kernel.org/r/20230530221334.89432-7-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-09 13:18:53 -03:00
Bob Pearson
86a3fb55bc RDMA/rxe: Let rkey == lkey for local access
In order to conform to other drivers stop using rkey == 0 as an indication
that there are no remote access flags set.  Set rkey == lkey by default
for all MRs.

Link: https://lore.kernel.org/r/20230530221334.89432-6-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-09 13:18:53 -03:00
Bob Pearson
02ed253770 RDMA/rxe: Introduce rxe access supported flags
Introduce supported bit masks for setting the access attributes of MWs,
MRs, and QPs. Check these when attributes are set.

Link: https://lore.kernel.org/r/20230530221334.89432-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-09 13:18:53 -03:00
Bob Pearson
425e1c9018 RDMA/rxe: Fix access checks in rxe_check_bind_mw
The subroutine rxe_check_bind_mw() in rxe_mw.c performs checks on the mw
access flags before they are set so they always succeed.  This patch
instead checks the access flags passed in the send wqe.

Fixes: 32a577b4c3 ("RDMA/rxe: Add support for bind MW work requests")
Link: https://lore.kernel.org/r/20230530221334.89432-4-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-09 13:18:52 -03:00
Bob Pearson
2a129958bd RDMA//rxe: Optimize send path in rxe_resp.c
Bypass calling check_rkey() in rxe_resp.c for non-rdma messages.

Link: https://lore.kernel.org/r/20230530221334.89432-3-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-09 13:18:52 -03:00
Bob Pearson
d11442c6bd RDMA/rxe: Rename IB_ACCESS_REMOTE
Rename IB_ACCESS_REMOTE to RXE_ACCESS_REMOTE and move to rxe_verbs.h as an
enum instead of a #define. Shouldn't use IB_xxx for rxe symbols.

Link: https://lore.kernel.org/r/20230530221334.89432-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-09 13:18:52 -03:00
Jakub Kicinski
a03a91bd68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/sfc/tc.c
  622ab65634 ("sfc: fix error unwinds in TC offload")
  b6583d5e9e ("sfc: support TC decap rules matching on enc_src_port")

net/mptcp/protocol.c
  5b825727d0 ("mptcp: add annotations around msk->subflow accesses")
  e76c8ef5cc ("mptcp: refactor mptcp_stream_accept()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01 15:38:26 -07:00