linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-08-27 15:36:48 +00:00

Author	SHA1	Message	Date
Pavel Begunkov	5288b9e28f	io_uring: open code io_req_cqe_overflow() A preparation patch, just open code io_req_cqe_overflow(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-16 12:38:36 -06:00
Jens Axboe	16256648cd	io_uring/fdinfo: get rid of dumping credentials It's a faily obscure feature, and registered credentials would for that mostly be a static thing. Don't bother including code to dump the personalities indices. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-16 12:33:28 -06:00
Jens Axboe	9a10926627	io_uring/fdinfo: only compile if CONFIG_PROC_FS is set Rather than wrap fdinfo.c in one big if, handle it on the Makefile side instead. io_uring.c already conditionally sets fops->fdinfo() anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-16 12:33:02 -06:00
Jens Axboe	3de7361f7c	Merge branch 'io_uring-6.15' into for-6.16/io_uring Merge in 6.15 io_uring fixes, mostly so that the fdinfo changes can get easily extended without causing merge conflicts. * io_uring-6.15: io_uring/fdinfo: grab ctx->uring_lock around io_uring_show_fdinfo() io_uring/memmap: don't use page_address() on a highmem page io_uring/uring_cmd: fix hybrid polling initialization issue io_uring/sqpoll: Increase task_work submission batch size io_uring: ensure deferred completions are flushed for multishot io_uring: always arm linked timeouts prior to issue io_uring/fdinfo: annotate racy sq/cq head/tail reads io_uring: fix 'sync' handling of io_fallback_tw() io_uring: don't duplicate flushing in io_req_post_cqe	2025-05-16 12:31:19 -06:00
Jakub Kicinski	bebd7b2626	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.15-rc7). Conflicts: tools/testing/selftests/drivers/net/hw/ncdevmem.c `97c4e094a4` ("tests/ncdevmem: Fix double-free of queue array") `2f1a805f32` ("selftests: ncdevmem: Implement devmem TCP TX") https://lore.kernel.org/20250514122900.1e77d62d@canb.auug.org.au Adjacent changes: net/core/devmem.c net/core/devmem.h `0afc44d8cd` ("net: devmem: fix kernel panic when netlink socket close after module unload") `bd61848900` ("net: devmem: Implement TX path") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-15 11:28:30 -07:00
Jens Axboe	d871198ee4	io_uring/fdinfo: grab ctx->uring_lock around io_uring_show_fdinfo() Not everything requires locking in there, which is why the 'has_lock' variable exists. But enough does that it's a bit unwieldy to manage. Wrap the whole thing in a ->uring_lock trylock, and just return with no output if we fail to grab it. The existing trylock() will already have greatly diminished utility/output for the failure case. This fixes an issue with reading the SQE fields, if the ring is being actively resized at the same time. Reported-by: Jann Horn <jannh@google.com> Fixes: `79cfe9e59c` ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-14 07:15:28 -06:00
Pavel Begunkov	2b61bb1d9a	io_uring/kbuf: unify legacy buf provision and removal Combine IORING_OP_PROVIDE_BUFFERS and IORING_OP_REMOVE_BUFFERS ->issue(), so that we can deduplicate ring locking and list lookups. This way we further reduce code for legacy provided buffers. Locking is also separated from buffer related handling, which makes it a bit simpler with label jumps. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f61af131622ad4337c2fb9f7c453d5b0102c7b90.1747150490.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-13 14:45:55 -06:00
Pavel Begunkov	c724e80123	io_uring/kbuf: refactor __io_remove_buffers __io_remove_buffers used for two purposes, the first is removing buffers for non ring based lists, which implies that it can be called multiple times for the same list. And the second is for destroying lists, which is not perfectly reentrable for ring based lists. It's confusing, so just have a helper for the legacy pbuf buffer removal, make sure it's not called for ring pbuf, and open code all ring pbuf destruction into io_put_bl(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0ae416b099d311ad23f285cea02f2c94c8ae9a6c.1747150490.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-13 14:45:55 -06:00
Pavel Begunkov	52a05d0cf8	io_uring/kbuf: don't compute size twice on prep The size in prep is calculated by io_provide_buffers_prep(), so remove the recomputation a few lines after. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7c97206561b74fce245cb22449c6082d2e066844.1747150490.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-13 14:45:55 -06:00
Pavel Begunkov	4e9fda29d6	io_uring/kbuf: drop extra vars in io_register_pbuf_ring bl and free_bl variables in io_register_pbuf_ring() always point to the same list since we started to reallocate the pre-existent list. Drop free_bl. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d45c3342d74c9030f99376c777a4b3d59089074d.1747150490.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-13 14:45:55 -06:00
Pavel Begunkov	1724849072	io_uring/kbuf: use mem_is_zero() Make use of mem_is_zero() for reserved fields checking. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/11fe27b7a831329bcdb4ea087317ef123ba7c171.1747150490.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-13 14:45:55 -06:00
Pavel Begunkov	475a8d3037	io_uring/kbuf: account ring io_buffer_list memory Follow the non-ringed pbuf struct io_buffer_list allocations and account it against the memcg. There is low chance of that being an actual problem as ring provided buffer should either pin user memory or allocate it, which is already accounted. Cc: stable@vger.kernel.org # 6.1 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3985218b50d341273cafff7234e1a7e6d0db9808.1747150490.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-13 14:45:47 -06:00
Mina Almasry	bd61848900	net: devmem: Implement TX path Augment dmabuf binding to be able to handle TX. Additional to all the RX binding, we also create tx_vec needed for the TX path. Provide API for sendmsg to be able to send dmabufs bound to this device: - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from. - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf. Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY implementation, while disabling instances where MSG_ZEROCOPY falls back to copying. We additionally pipe the binding down to the new zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems instead of the traditional page netmems. We also special case skb_frag_dma_map to return the dma-address of these dmabuf net_iovs instead of attempting to map pages. The TX path may release the dmabuf in a context where we cannot wait. This happens when the user unbinds a TX dmabuf while there are still references to its netmems in the TX path. In that case, the netmems will be put_netmem'd from a context where we can't unmap the dmabuf, Resolve this by making __net_devmem_dmabuf_binding_free schedule_work'd. Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat of the implementation came from devmem TCP RFC v1[1], which included the TX path, but Stan did all the rebasing on top of netmem/net_iov. Cc: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-05-13 11:12:48 +02:00
Mina Almasry	03e96b8c11	netmem: add niov->type attribute to distinguish different net_iov types Later patches in the series adds TX net_iovs where there is no pp associated, so we can't rely on niov->pp->mp_ops to tell what is the type of the net_iov. Add a type enum to the net_iov which tells us the net_iov type. Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250508004830.4100853-2-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-05-13 11:12:48 +02:00
Jens Axboe	f446c6311e	io_uring/memmap: don't use page_address() on a highmem page For older/32-bit systems with highmem, don't assume that the pages in a mapped region are always going to be mapped. If io_region_init_ptr() finds that the pages are coalescable, also check if the first page is a HighMem page or not. If it is, fall through to the usual vmap() mapping rather than attempt to get the unmapped page address. Cc: stable@vger.kernel.org Fixes: `c4d0ac1c15` ("io_uring/memmap: optimise single folio regions") Link: https://lore.kernel.org/all/681fe2fb.050a0220.f2294.001a.GAE@google.com/ Reported-by: syzbot+5b8c4abafcb1d791ccfc@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/681fed0a.050a0220.f2294.001c.GAE@google.com/ Reported-by: syzbot+6456a99dfdc2e78c4feb@syzkaller.appspotmail.com Tested-by: syzbot+6456a99dfdc2e78c4feb@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-12 09:27:41 -06:00
Pavel Begunkov	8fb7aee055	io_uring: drain based on allocates reqs Don't rely on CQ sequence numbers for draining, as it has become messy and needs cq_extra adjustments. Instead, base it on the number of allocated requests and only allow flushing when all requests are in the drain list. As a result, cq_extra is gone, no overhead for its accounting in aux cqe posting, less bloating as it was inlined before, and it's in general simpler than trying to track where we should bump it and where it should be put back like in cases of overflow. Also, it'll likely help with cleaning and unifying some of the CQ posting helpers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/46ece1e34320b046c06fee2498d6b4cd12a700f2.1746788718.git.asml.silence@gmail.com Link: https://lore.kernel.org/r/24497b04b004bceada496033d3c9d09ff8e81ae9.1746944903.git.asml.silence@gmail.com [axboe: fold in fix from link2] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-12 07:52:52 -06:00
hexue	63166b815d	io_uring/uring_cmd: fix hybrid polling initialization issue Modify the check for whether the timer is initialized during IO transfer when passthrough is used with hybrid polling, to ensure that it's always setup correctly. Cc: stable@vger.kernel.org Fixes: `01ee194d1a` ("io_uring: add support for hybrid IOPOLL") Signed-off-by: hexue <xue01.he@samsung.com> Link: https://lore.kernel.org/r/20250512052025.293031-1-xue01.he@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-12 07:17:02 -06:00
Pavel Begunkov	63de899cb6	io_uring: count allocated requests Keep track of the number requests a ring currently has allocated (and not freed), it'll be needed in the next patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c8f8308294dc2a1cb8925d984d937d4fc14ab5d4.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-09 08:01:02 -06:00
Pavel Begunkov	b0c8a6401f	io_uring: open code io_account_cq_overflow() io_account_cq_overflow() doesn't help explaining what's going on in there, and it'll become even smaller with following patches, so open code it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e4333fa0d371f519e52a71148ebdffed4b8d3aa9.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-09 08:01:02 -06:00
Pavel Begunkov	19a94da447	io_uring: consolidate drain seq checking We check sequences when queuing drained requests as well when flushing them. Instead, always queue and immediately try to flush, so that all seq handling can be kept contained in the flushing code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d4651f742e671af5b3216581e539ea5d31bc7125.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-09 08:01:02 -06:00
Pavel Begunkov	e91e4f692f	io_uring: remove drain prealloc checks Currently io_drain_req() has two steps. The first is fast path checking sequence numbers. The second is allocations, rechecking and actual queuing. Further simplify it by removing the first step. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d06e89ed07611993d7bf89182de2300858379bd.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-09 08:01:01 -06:00
Pavel Begunkov	05b334110f	io_uring: simplify drain ret passing "ret" in io_drain_req() is only used in one place, remove it and pass -ENOMEM directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ece724b77e66e6caabcc215e0032ee7ff140f289.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-09 08:01:01 -06:00
Pavel Begunkov	fde04c7e27	io_uring: fix spurious drain flushing io_queue_deferred() is not tolerant to spurious calls not completing some requests. You can have an inflight drain-marked request and another request that came after and got queued into the drain list. Now, if io_queue_deferred() is called before the first request completes, it'll check the 2nd req with req_need_defer(), find that there is no drain flag set, and queue it for execution. To make io_queue_deferred() work, it should at least check sequences for the first request, and then we need also need to check if there is another drain request creating another bubble. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/972bde11b7d4ef25b3f5e3fd34f80e4d2aa345b8.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-09 08:01:01 -06:00
Pavel Begunkov	f979c20547	io_uring: account drain memory to cgroup Account drain allocations against memcg. It's not a big problem as each such allocation is paired with a request, which is accounted, but it's nicer to follow the limits more closely. Cc: stable@vger.kernel.org # 6.1 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f8dfdbd755c41fd9c75d12b858af07dfba5bbb68.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-09 08:01:01 -06:00
Pavel Begunkov	81a22c86ec	io_uring: add lockdep asserts to io_add_aux_cqe io_add_aux_cqe() can only be called for rings with uring_lock protected completion queues, add a couple of assertions in regards to that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c010eab7b94a187c00a9d46d8b67bf7fcad18af4.1746788592.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-09 07:58:55 -06:00
Pavel Begunkov	28b8cd864d	io_uring/net: move CONFIG_NET guards to Makefile Instruct Makefile to never try to compile net.c without CONFIG_NET and kill ifdefs in the file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f466400e20c3f536191bfd559b1f3cd2a2ab5a1e.1746788579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-09 07:58:47 -06:00
Long Li	6ae4308116	io_uring: update parameter name in io_pin_pages function declaration Rename first parameter in io_pin_pages from ubuf to uaddr for consistency between declaration and implementation. Signed-off-by: Long Li <leo.lilong@huawei.com> Link: https://lore.kernel.org/r/20250509063015.3799255-1-leo.lilong@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-09 07:58:22 -06:00
Gabriel Krisman Bertazi	92835cebab	io_uring/sqpoll: Increase task_work submission batch size Our QA team reported a 10%-23%, throughput reduction on an io_uring sqpoll testcase doing IO to a null_blk, that I traced back to a reduction of the device submission queue depth utilization. It turns out that, after commit `af5d68f889` ("io_uring/sqpoll: manage task_work privately"), we capped the number of task_work entries that can be completed from a single spin of sqpoll to only 8 entries, before the sqpoll goes around to (potentially) sleep. While this cap doesn't drive the submission side directly, it impacts the completion behavior, which affects the number of IO queued by fio per sqpoll cycle on the submission side, and io_uring ends up seeing less ios per sqpoll cycle. As a result, block layer plugging is less effective, and we see more time spent inside the block layer in profilings charts, and increased submission latency measured by fio. There are other places that have increased overhead once sqpoll sleeps more often, such as the sqpoll utilization calculation. But, in this microbenchmark, those were not representative enough in perf charts, and their removal didn't yield measurable changes in throughput. The major overhead comes from the fact we plug less, and less often, when submitting to the block layer. My benchmark is: fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \ --invalidate=1 --time_based --ramp_time=10 --group_reporting=1 \ --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \ --rw=randread --numjobs=1 --sqthread_poll In one machine, tested on top of Linux 6.15-rc1, we have the following baseline: READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec With this patch: READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec which is a 15% improvement in measured bandwidth. The average submission latency is noticeably lowered too. As measured by fio: Baseline: lat (usec): min=20, max=241, avg=99.81, stdev=3.38 Patched: lat (usec): min=26, max=226, avg=86.48, stdev=4.82 If we look at blktrace, we can also see the plugging behavior is improved. In the baseline, we end up limited to plugging 8 requests in the block layer regardless of the device queue depth size, while after patching we can drive more io, and we manage to utilize the full device queue. In the baseline, after a stabilization phase, an ordinary submission looks like: 254,0 1 49942 0.016028795 5977 U N [iou-sqp-5976] 7 After patching, I see consistently more requests per unplug. 254,0 1 4996 0.001432872 3145 U N [iou-sqp-3144] 32 Ideally, the cap size would at least be the deep enough to fill the device queue, but we can't predict that behavior, or assume all IO goes to a single device, and thus can't guess the ideal batch size. We also don't want to let the tw run unbounded, though I'm not sure it would really be a problem. Instead, let's just give it a more sensible value that will allow for more efficient batching. I've tested with different cap values, and initially proposed to increase the cap to 1024. Jens argued it is too big of a bump and I observed that, with 32, I'm no longer able to observe this bottleneck in any of my machines. Fixes: `af5d68f889` ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20250508181203.3785544-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-09 07:56:53 -06:00
Jens Axboe	687b2bae0e	io_uring: ensure deferred completions are flushed for multishot Multishot normally uses io_req_post_cqe() to post completions, but when stopping it, it may finish up with a deferred completion. This is fine, except if another multishot event triggers before the deferred completions get flushed. If this occurs, then CQEs may get reordered in the CQ ring, as new multishot completions get posted before the deferred ones are flushed. This can cause confusion on the application side, if strict ordering is required for the use case. When multishot posting via io_req_post_cqe(), flush any pending deferred completions first, if any. Cc: stable@vger.kernel.org # 6.1+ Reported-by: Norman Maurer <norman_maurer@apple.com> Reported-by: Christian Mazakas <christian.mazakas@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-07 07:55:15 -06:00
Pavel Begunkov	35adea1d01	io_uring: move io_req_put_rsrc_nodes() It'd be nice to hide details of how rsrc nodes are used by a request from rsrc.c, specifically which request fields store them, and what bits are signifying if there is a node in a request. It rather belong to generic request handling, so move the helper to io_uring.c. While doing so remove clearing of ->buf_node as it's controlled by REQ_F_BUF_NODE and doesn't require zeroing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bb73fb42baf825edb39344365aff48cdfdd4c692.1746533789.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 10:11:23 -06:00
Pavel Begunkov	9c2ff3f9b5	io_uring: remove io_preinit_req() Apart from setting ->ctx, io_preinit_req() zeroes a bunch of fields of a request, from which only ->file_node is mandatory. Remove the function and zero the entire request on first allocation. With that, we also need to initialise ->ctx every time, which might be a good thing for performance as now we're likely overwriting the entire cache line, and so it can write combined and avoid RMW. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ba5485dc913f1e275862ce88f5169d4ac4a33836.1746533807.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 10:11:23 -06:00
Pavel Begunkov	78967aabf6	io_uring/timeout: don't export link t-out disarm helper [__]io_disarm_linked_timeout() are only used inside timeout.c. so confine them inside the file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1eb200911255e643bf252a8e65fb2c787340cf18.1746533800.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 10:11:23 -06:00
Pavel Begunkov	a5c98e9424	io_uring/zcrx: dmabuf backed zerocopy receive Add support for dmabuf backed zcrx areas. To use it, the user should pass IORING_ZCRX_AREA_DMABUF in the struct io_uring_zcrx_area_reg flags field and pass a dmabuf fd in the dmabuf_fd field. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20bb1890e60a82ec945ab36370d1fd54be414ab6.1746097431.git.asml.silence@gmail.com Link: https://lore.kernel.org/io-uring/6e37db97303212bbd8955f9501cf99b579f8aece.1746547722.git.asml.silence@gmail.com [axboe: fold in fixup] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 10:11:00 -06:00
Keith Busch	02040353f4	io_uring: enable per-io write streams Allow userspace to pass a per-I/O write stream in the SQE: __u8 write_stream; The __u8 type matches the size the filesystems and block layer support. Application can query the supported values from the block devices max_write_streams sysfs attribute. Unsupported values are ignored by file operations that do not support write streams or rejected with an error by those that support them. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-7-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:46:43 -06:00
Jens Axboe	b53e523261	io_uring: always arm linked timeouts prior to issue There are a few spots where linked timeouts are armed, and not all of them adhere to the pre-arm, attempt issue, post-arm pattern. This can be problematic if the linked request returns that it will trigger a callback later, and does so before the linked timeout is fully armed. Consolidate all the linked timeout handling into __io_issue_sqe(), rather than have it spread throughout the various issue entry points. Cc: stable@vger.kernel.org Link: https://github.com/axboe/liburing/issues/1390 Reported-by: Chase Hiltz <chase@path.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-04 09:15:58 -06:00
Peter Zijlstra	93f1b6d79a	futex: Move futex_queue() into futex_wait_setup() futex_wait_setup() has a weird calling convention in order to return hb to use as an argument to futex_queue(). Mostly such that requeue can have an extra test in between. Reorder code a little to get rid of this and keep the hb usage inside futex_wait_setup(). [bigeasy: fixes] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-4-bigeasy@linutronix.de	2025-05-03 12:02:05 +02:00
Pavel Begunkov	8a62804248	io_uring/zcrx: split common area map/unmap parts Extract area type depedent parts of io_zcrx_[un]map_area from the generic path. It'll be helpful once there are more area memory types and not only user pages. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/50f6e893e2d20f937e628196cbf528d15f81c289.1746097431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:24:42 -06:00
Pavel Begunkov	782dfa329a	io_uring/zcrx: split out memory holders from area In the data path users of struct io_zcrx_area don't need to know what kind of memory it's backed by. Only keep there generic bits in there and and split out memory type dependent fields into a new structure. It also logically separates the step that actually imports the memory, e.g. pinning user pages, from the generic area initialisation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b60fc09c76921bf69e77eb17e07eb4decedb3bf4.1746097431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:24:42 -06:00
Pavel Begunkov	6c9589aa08	io_uring/zcrx: resolve netdev before area creation Some area types will require a valid struct device to be created, so resolve netdev and struct device before creating an area. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ac8c1482be22acfe9ca788d2c3ce31b7451ce488.1746097431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:24:42 -06:00
Pavel Begunkov	d760d3f59f	io_uring/zcrx: improve area validation dmabuf backed area will be taking an offset instead of addresses, and io_buffer_validate() is not flexible enough to facilitate it. It also takes an iovec, which may truncate the u64 length zcrx takes. Add a new helper function for validation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0b3b735391a0a8f8971bf0121c19765131fddd3b.1746097431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:24:42 -06:00
Jens Axboe	f024d3a8de	io_uring/fdinfo: annotate racy sq/cq head/tail reads syzbot complains about the cached sq head read, and it's totally right. But we don't need to care, it's just reading fdinfo, and reading the CQ or SQ tail/head entries are known racy in that they are just a view into that very instant and may of course be outdated by the time they are reported. Annotate both the SQ head and CQ tail read with data_race() to avoid this syzbot complaint. Link: https://lore.kernel.org/io-uring/6811f6dc.050a0220.39e3a1.0d0e.GAE@google.com/ Reported-by: syzbot+3e77fd302e99f5af9394@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-30 07:17:17 -06:00
Pavel Begunkov	91db6edc57	io_uring/cmd: move net cmd into a separate file We keep socket io_uring command implementation in io_uring/uring_cmd.c. Separate it from generic command code into a separate file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/747d0519a2255bd055ae76b691d38d2b4c311001.1745843119.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-28 11:51:31 -06:00
Pavel Begunkov	27d2fed790	io_uring: delete misleading comment in io_fill_cqe_aux() io_fill_cqe_aux() doesn't overflow completions, however it might fail them and lets the caller handle it. Remove the comment, which doesn't make any sense. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/021aa8c1d8f20ef2b66da6aeabb6b511938fd2c5.1745843119.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-28 11:51:31 -06:00
Jens Axboe	edd43f4d6f	io_uring: fix 'sync' handling of io_fallback_tw() A previous commit added a 'sync' parameter to io_fallback_tw(), which if true, means the caller wants to wait on the fallback thread handling it. But the logic is somewhat messed up, ensure that ctxs are swapped and flushed appropriately. Cc: stable@vger.kernel.org Fixes: `dfbe5561ae` ("io_uring: flush offloaded and delayed task_work on exit") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 10:32:43 -06:00
Pavel Begunkov	f6da4fee69	io_uring/eventfd: open code io_eventfd_grab() io_eventfd_grab() doesn't help wit understanding the path, it'll be simpler to keep the helper open coded. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5cb53ce3876c2819db9e8055cf41dca4398521db.1745493845.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 08:33:54 -06:00
Pavel Begunkov	da01f60f8a	io_uring/eventfd: clean up rcu locking Conditional locking is never welcome if there are better options. Move rcu locking into io_eventfd_signal(), make it unconditional and use guards. It also helps with sparse warnings. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/91a925e708ca8a5aa7fee61f96d29b24ea9adeaf.1745493845.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 08:33:54 -06:00
Pavel Begunkov	62f666df76	io_uring/eventfd: dedup signalling helpers Consolidate io_eventfd_flush_signal() and io_eventfd_signal(). Not much of a difference for now, but it prepares it for following changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5beecd4da65d8d2d83df499196f84b329387f6a2.1745493845.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 08:33:54 -06:00
Pavel Begunkov	5e16f1a68d	io_uring: don't duplicate flushing in io_req_post_cqe io_req_post_cqe() sets submit_state.cq_flush so that *flush_completions() can take care of batch commiting CQEs. Don't commit it twice by using __io_cq_unlock_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/41c416660c509cee676b6cad96081274bcb459f3.1745493861.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 06:28:43 -06:00
Pavel Begunkov	76f1cc98b2	io_uring/zcrx: add support for multiple ifqs Allow the user to register multiple ifqs / zcrx contexts. With that we can use multiple interfaces / interface queues in a single io_uring instance. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/668b03bee03b5216564482edcfefbc2ee337dd30.1745141261.git.asml.silence@gmail.com [axboe: fold in fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-23 07:13:14 -06:00
Pavel Begunkov	632b318672	io_uring/zcrx: move zcrx region to struct io_zcrx_ifq Refill queue region is a part of zcrx and should stay in struct io_zcrx_ifq. We can't have multiple queues without it, so move it there. As a result there is no context global zcrx region anymore, and the region is looked up together with its ifq. To protect a concurrent mmap from seeing an inconsistent region we were protecting changes to ->zcrx_region with mmap_lock, but now it protect the publishing of the ifq. Reviewed-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/24f1a728fc03d0166f16d099575457e10d9d90f2.1745141261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:12:31 -06:00
Pavel Begunkov	77231d4e46	io_uring/zcrx: let zcrx choose region for mmaping In preparation for adding multiple ifqs, add a helper returning a region for mmaping zcrx refill queue. For now it's trivial and returns the same ctx global ->zcrx_region. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d9006e2ef8cd5e5b337c2ba2228ede8409eb60f2.1745141261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:11:40 -06:00
Pavel Begunkov	59bc1ab922	io_uring/zcrx: remove sqe->file_index check sqe->file_index and sqe->zcrx_ifq_idx are aliased, so even though io_recvzc_prep() has all necessary handling and checking for ->zcrx_ifq_idx, it's already rejected a couple of lines above. It's not a real problem, however, as we can only have one ifq at the moment, which index is always zero, and it returns correct error codes to the user. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b51acedd326d1a87bf53e33303e6fc87661a3e3f.1745141261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:11:40 -06:00
Pavel Begunkov	a79154ae5d	io_uring/zcrx: move io_zcrx_iov_page We'll need io_zcrx_iov_page at the top to keep offset calculations closer together, move it there. Reviewed-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/575617033a8b84a5985c7eb760b7121efdbe7e56.1745141261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:11:40 -06:00
Pavel Begunkov	37d26edd6b	io_uring/zcrx: remove duplicated freelist init Several lines below we already initialise the freelist, don't do it twice. Reviewed-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/71b07c4e00d1db65899d1ed8d603d961fe7d1c28.1745141261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:11:40 -06:00
Pavel Begunkov	be6bad57b2	io_uring/rsrc: remove null check on import WARN_ON_ONCE() checking imu for NULL in io_import_fixed() is in the hot path and too protective, it's time to get rid of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3782c01e311dbd474bca45aefaf79a2f2822fafb.1745083025.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:10:04 -06:00
Pavel Begunkov	9cebcf7b0c	io_uring/rsrc: clean up io_coalesce_buffer() We don't need special handling for the first page in io_coalesce_buffer(), move it inside the loop. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/ad698cddc1eadb3d92a7515e95bb13f79420323d.1745083025.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:10:04 -06:00
Pavel Begunkov	ea76925614	io_uring/rsrc: use unpin_user_folio We want to have a full folio to be left pinned but with only one reference, for that we "unpin" all but the first page with unpin_user_pages(), which can be confusing. There is a new helper to achieve that called unpin_user_folio(), so use that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/e0b2be8f9ea68f6b351ec3bb046f04f437f68491.1745083025.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:10:04 -06:00
Jens Axboe	8a2dacd49f	io_uring/rsrc: remove node assignment helpers There are two helpers here, one assigns and increments the node ref count, and the other is simply a wrapper around that for the buffer node handling. The buffer node assignment benefits from checking and setting REQ_F_BUF_NODE together, otherwise stalls have been observed on setting that flag later in the process. Hence re-do it so that it's set when checked, and cleared in case of (unlikely) failure. With that, the buffer node helper can go, and then drop the generic io_req_assign_rsrc_node() helper as well as there's only a single user of it left at that point. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:06:58 -06:00
Jens Axboe	53db8a71ec	io_uring: add support for IORING_OP_PIPE This works just like pipe2(2), except it also supports fixed file descriptors. Used in a similar fashion as for other fd instantiating opcodes (like accept, socket, open, etc), where sqe->file_slot is set appropriately if two direct descriptors are desired rather than a set of normal file descriptors. sqe->addr must be set to a pointer to an array of 2 integers, which is where the fixed/normal file descriptors are copied to. sqe->pipe_flags contains flags, same as what is allowed for pipe2(2). Future expansion of per-op private flags can go in sqe->ioprio, like we do for other opcodes that take both a "syscall" flag set and an io_uring opcode specific flag set. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:06:58 -06:00
Pavel Begunkov	bd32923e5f	io_uring: don't store bgid in req->buf_index Pass buffer group id into the rest of helpers via struct buf_sel_arg and remove all reassignments of req->buf_index back to bgid. Now, it only stores buffer indexes, and the group is provided by callers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3ea9fa08113ecb4d9224b943e7806e80a324bdf9.1743437358.git.asml.silence@gmail.com Link: https://lore.kernel.org/io-uring/0c01d76ff12986c2f48614db8610caff8f78c869.1743500909.git.asml.silence@gmail.com/ [axboe: fold in patch from second link] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:06:58 -06:00
Pavel Begunkov	c0e9650521	io_uring/kbuf: pass bgid to io_buffer_select() The current situation with buffer group id juggling is not ideal. req->buf_index first stores the bgid, then it's overwritten by a buffer id, and then it can get restored back no recycling / etc. It's not so easy to control, and it's not handled consistently across request types with receive requests saving and restoring the bgid it by hand. It's a prep patch that adds a buffer group id argument to io_buffer_select(). The caller will be responsible for stashing a copy somewhere and passing it into the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a210d6427cc3f4f42271a6853274cd5a50e56820.1743437358.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:06:58 -06:00
Pavel Begunkov	e6f74fd67d	io_uring: set IMPORT_BUFFER in generic send setup Move REQ_F_IMPORT_BUFFER to the common send setup. Currently, the only user is send zc, but we'll want for normal sends to support that in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/18b74edfc61d8a1b6c9fed3b78a9276fe80f8ced.1743437358.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:06:58 -06:00
Pavel Begunkov	e9ff9ae103	io_uring/net: don't use io_do_buffer_select at prep Prep code is interested whether it's a selected buffer request, not whether a buffer has already been selected like what io_do_buffer_select() returns. Check for REQ_F_BUFFER_SELECT directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4488a029ac698554bebf732263fe19d7734affa6.1743437358.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:06:58 -06:00
Caleb Sander Mateos	9fe99eed91	io_uring/wq: avoid indirect do_work/free_work calls struct io_wq stores do_work and free_work function pointers which are called on each work item. But these function pointers are always set to io_wq_submit_work and io_wq_free_work, respectively. So remove these function pointers and just call the functions directly. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250329161527.3281314-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:06:58 -06:00
Pavel Begunkov	f12ecf5e1c	io_uring/zcrx: fix late dma unmap for a dead dev There is a problem with page pools not dma-unmapping immediately when the device is going down, and delaying it until the page pool is destroyed, which is not allowed (see links). That just got fixed for normal page pools, and we need to address memory providers as well. Unmap pages in the memory provider uninstall callback, and protect it with a new lock. There is also a gap between when a dma mapping is created and the mp is installed, so if the device is killed in between, io_uring would be holding on to dma mappings to a dead device with no one to call ->uninstall. Move it to page pool init and rely on ->is_mapped to make sure it's only done once. Link: https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/ Link: https://lore.kernel.org/all/20250409-page-pool-track-dma-v9-0-6a9ef2e0cba8@redhat.com/ Fixes: `34a3e60821` ("io_uring/zcrx: implement zerocopy receive pp memory provider") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ef9b7db249b14f6e0b570a1bb77ff177389f881c.1744965853.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-18 06:12:10 -06:00
Jens Axboe	b419bed4f0	io_uring/rsrc: ensure segments counts are correct on kbuf buffers kbuf imports have the front offset adjusted and segments removed, but the tail segments are still included in the segment count that gets passed in the iov_iter. As the segments aren't necessarily all the same size, move importing to a separate helper and iterate the mapped length to get an exact count. Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-17 11:59:12 -06:00
Nitesh Shetty	80c7378f94	io_uring/rsrc: send exact nr_segs for fixed buffer Sending exact nr_segs, avoids bio split check and processing in block layer, which takes around 5%[1] of overall CPU utilization. In our setup, we see overall improvement of IOPS from 7.15M to 7.65M [2] and 5% less CPU utilization. [1] 3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at 1.42% io_uring [kernel.kallsyms] [k] bio_split_rw 0.62% io_uring [kernel.kallsyms] [k] bio_submit_split [2] sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1 Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> [Pavel: fixed for kbuf, rebased and reworked on top of cleanups] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7a1a49a8d053bd617c244291d63dbfbc07afde36.1744882081.git.asml.silence@gmail.com [axboe: fold in fix factoring in buf reg offset] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-17 07:42:14 -06:00
Pavel Begunkov	59852ebad9	io_uring/rsrc: refactor io_import_fixed io_import_fixed is a mess. Even though we know the final len of the iterator, we still assign offset + len and do some magic after to correct for that. Do offset calculation first and finalise it with iov_iter_bvec at the end. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2d5107fed24f8b23245ef2ede9a5a7f7c426df61.1744882081.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-17 06:21:13 -06:00
Pavel Begunkov	50169d0754	io_uring/rsrc: separate kbuf offset adjustments Kernel registered buffers are special because segments are not uniform in size, and we have a bunch of optimisations based on that uniformity for normal buffers. Handle kbuf separately, it'll be cleaner this way. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4e9e5990b0ab5aee723c0be5cd9b5bcf810375f9.1744882081.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-17 06:21:13 -06:00
Pavel Begunkov	1ac5712888	io_uring/rsrc: don't skip offset calculation Don't optimise for requests with offset=0. Large registered buffers are the preference and hence the user is likely to pass an offset, and the adjustments are not expensive and will be made even cheaper in following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1c2beb20470ee3c886a363d4d8340d3790db19f3.1744882081.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-17 06:21:13 -06:00
Pavel Begunkov	70e4f9bfc1	io_uring/zcrx: add pp to ifq conversion helper It'll likely change how page pools store memory providers, so in preparation for that, keep accesses in one place in io_uring by introducing a helper. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3522eb8fa9b4e21bcf32e7e9ae656c616b282210.1744722526.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-15 07:38:05 -06:00
Pavel Begunkov	25744f8495	io_uring/zcrx: return ifq id to the user IORING_OP_RECV_ZC requests take a zcrx object id via sqe::zcrx_ifq_idx, which binds it to the corresponding if / queue. However, we don't return that id back to the user. It's fine as currently there can be only one zcrx and the user assumes that its id should be 0, but as we'll need multiple zcrx objects in the future let's explicitly pass it back on registration. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8714667d370651962f7d1a169032e5f02682a73e.1744722517.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-15 07:37:49 -06:00
Jens Axboe	cf960726eb	io_uring/kbuf: reject zero sized provided buffers This isn't fixing a real issue, but there's also zero point in going through group and buffer setup, when the buffers are going to be rejected once attempted to get used. Cc: stable@vger.kernel.org Reported-by: syzbot+58928048fd1416f1457c@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-07 07:51:23 -06:00
Pavel Begunkov	5a17131a5d	io_uring/zcrx: separate niov number from pages A preparation patch that separates the number of pages / folios from the number of niovs. They will not match in the future to support huge pages, improved dma mapping and/or larger chunk sizes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0780ac966ee84200385737f45bb0f2ada052392b.1743848231.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-07 07:36:52 -06:00
Pavel Begunkov	9b58440a5b	io_uring/zcrx: put refill data into separate cache line Refill queue lock and other bits are only used from the allocation path on the rx softirq side, but it shares the cache line with other fields like ctx that are used also in the "syscall" path, which causes cache bouncing when softirq runs on a different CPU. Separate them into different cache lines. The first one now contains constant fields used by both contextx, followed by a line responsible for refill queue data. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6d1f598e27d623c07fc49d6baee13089a9b1216c.1743848241.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-07 07:36:52 -06:00
Pavel Begunkov	ab6005f391	io_uring: don't post tag CQEs on file/buffer registration failure Buffer / file table registration is all or nothing, if it fails all resources we might have partially registered are dropped and the table is killed. If that happens, it doesn't make sense to post any rsrc tag CQEs. That would be confusing to the application, which should not need to handle that case. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: `7029acd8a9` ("io_uring/rsrc: get rid of per-ring io_rsrc_node list") Link: https://lore.kernel.org/r/c514446a8dcb0197cddd5d4ba8f6511da081cf1f.1743777957.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-04 09:40:44 -06:00
Pavel Begunkov	390513642e	io_uring: always do atomic put from iowq io_uring always switches requests to atomic refcounting for iowq execution before there is any parallilism by setting REQ_F_REFCOUNT, and the flag is not cleared until the request completes. That should be fine as long as the compiler doesn't make up a non existing value for the flags, however KCSAN still complains when the request owner changes oter flag bits: BUG: KCSAN: data-race in io_req_task_cancel / io_wq_free_work ... read to 0xffff888117207448 of 8 bytes by task 3871 on cpu 0: req_ref_put_and_test io_uring/refs.h:22 [inline] Skip REQ_F_REFCOUNT checks for iowq, we know it's set. Reported-by: syzbot+903a2ad71fb3f1e47cf5@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d880bc27fb8c3209b54641be4ff6ac02b0e5789a.1743679736.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-03 08:31:57 -06:00
Ming Lei	1045afae4b	io_uring: support vectored kernel fixed buffer io_uring has supported fixed kernel buffer via io_buffer_register_bvec() and io_buffer_unregister_bvec(). The vectored fixed buffer has been ready, so it is natural to support fixed kernel buffer, one use case is ublk. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250325135155.935398-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-02 07:06:59 -06:00
Ming Lei	8622b20f23	io_uring: add validate_fixed_range() for validate fixed buffer Add helper of validate_fixed_range() for validating fixed buffer range. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250325135155.935398-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-02 07:06:43 -06:00
David Wei	fcfd94d696	io_uring/zcrx: return early from io_zcrx_recv_skb if readlen is 0 When readlen is set for a recvzc request, tcp_read_sock() will call io_zcrx_recv_skb() one final time with len == desc->count == 0. This is caused by the !desc->count check happening too late. The offset + 1 != skb->len happens earlier and causes the while loop to continue. Fix this in io_zcrx_recv_skb() instead of tcp_read_sock(). Return early if len is 0 i.e. the read is done. Fixes: `6699ec9a23` ("io_uring/zcrx: add a read limit to recvzc requests") Signed-off-by: David Wei <dw@davidwei.uk> Link: https://lore.kernel.org/r/20250401195355.1613813-1-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-01 14:00:46 -06:00
Pavel Begunkov	81ed18015d	io_uring/net: avoid import_ubuf for regvec send With registered buffers we set up iterators in helpers like io_import_fixed(), and there is no need for a import_ubuf() before that. It was fine as we used real pointers for offset calculation, but that's not the case anymore since introduction of ublk kernel buffers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9b2de1a50844f848f62c8de609b494971033a6b9.1743437358.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-31 12:41:49 -06:00
Pavel Begunkov	a1fbe0a121	io_uring/rsrc: check size when importing reg buffer We're relying on callers to verify the IO size, do it inside of io_import_fixed() instead. It's safer, easier to deal with, and more consistent as now it's done close to the iter init site. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f9c2c75ec4d356a0c61289073f68d98e8a9db190.1743446271.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-31 12:41:49 -06:00
Pavel Begunkov	ed344511c5	io_uring: cleanup {g,s]etsockopt sqe reading Add a local variable for the sqe pointer to avoid repetition. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8dbac0f9acda2d3842534eeb7ce10d9276b021ae.1743357108.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-31 07:08:46 -06:00
Pavel Begunkov	296e169618	io_uring: hide caches sqes from drivers There is now an io_uring private part of cmd async_data, move saved sqe into it. Drivers are accessing it via struct io_uring_cmd::cmd. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ecbe078dd57acefdbc4366d083327086c0879378.1743357121.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-31 07:08:34 -06:00
Pavel Begunkov	487a0710f8	io_uring: make zcrx depend on CONFIG_IO_URING Reflect in the kconfig that zcrx requires io_uring compiled. Fixes: `6f377873cb` ("io_uring/zcrx: add interface queue and refill queue") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8047135a344e79dbd04ee36a7a69cc260aabc2ca.1743356260.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-31 07:07:44 -06:00
Pavel Begunkov	697b2876ac	io_uring: add req flag invariant build assertion We're caching some of file related request flags in a tricky way, put a build check to make sure flags don't get reshuffled. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9877577b83c25dd78224a8274f799187e7ec7639.1743407551.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-31 07:07:34 -06:00
Pavel Begunkov	ea9106786e	io_uring: don't pass ctx to tw add remote helper Unlike earlier versions, io_msg_remote_post() creates a valid request with a proper context, so don't pass a context to io_req_task_work_add_remote() explicitly but derive it from the request. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/721f51cf34996d98b48f0bfd24ad40aa2730167e.1743190078.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-28 17:14:01 -06:00
Pavel Begunkov	9cc0bbdaba	io_uring/msg: initialise msg request opcode It's risky to have msg request opcode set to garbage, so at least initialise it to nop. Later we might want to add a user inaccessible opcode for such cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9afe650fcb348414a4529d89f52eb8969ba06efd.1743190078.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-28 17:13:09 -06:00
Pavel Begunkov	b0e9570a6b	io_uring/msg: rename io_double_lock_ctx() io_double_lock_ctx() doesn't lock both rings. Rename it to prevent any future confusion. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9e5defa000efd9b0f5e169cbb6bad4994d46ec5c.1743190078.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-28 17:13:09 -06:00
Pavel Begunkov	fbe1a30c5d	io_uring/net: import zc ubuf earlier io_send_setup() already sets up the iterator for IORING_OP_SEND_ZC, we don't need repeating that at issue time. Move it all together with mem accounting at prep time, which is more consistent with how the non-zc version does that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/eb54f007c493ad9f4ca89aa8e715baf30d83fb88.1743202294.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-28 17:11:20 -06:00
Pavel Begunkov	ad3f6cc400	io_uring/net: set sg_from_iter in advance In preparation to the next patch, set ->sg_from_iter callback at request prep time. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5fe2972701df3bacdb3d760bce195fa640bee201.1743202294.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-28 17:11:20 -06:00
Pavel Begunkov	49dbce5602	io_uring/net: clusterise send vs msghdr branches We have multiple branches at prep for send vs sendmsg handling, put them together so that the variant handling is more localised. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/33abf666d9ded74cba4da2f0d9fe58e88520dffe.1743202294.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-28 17:11:20 -06:00
Pavel Begunkov	63b16e4f0b	io_uring/net: unify sendmsg setup with zc io_sendmsg_zc_setup() duplicates parts of io_sendmsg_setup(), and the only difference between them is that the former support vectored registered buffers with nothing zerocopy specific. Merge them together, we want regular sendmsg to eventually support fixed buffers either way. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7e5ec40f9dc93355399dc6fa0cbc8b31f0b20ac5.1743202294.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-28 17:11:20 -06:00
Pavel Begunkov	c55e2845df	io_uring/net: combine sendzc flags writes Save an instruction / trip to the cache and assign some of sendzc flags together. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c519d6f406776c3be3ef988a8339a88e45d1ffd9.1743202294.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-28 17:11:20 -06:00
Pavel Begunkov	5f364117db	io_uring/net: open code io_net_vec_assign() Get rid of io_net_vec_assign() by open coding it into its only caller. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/19191c34b5cfe1161f7eeefa6e785418ea9ad56d.1743202294.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-28 17:11:20 -06:00
Pavel Begunkov	a20b8631c8	io_uring/net: open code io_sendmsg_copy_hdr() io_sendmsg_setup() is trivial and io_sendmsg_copy_hdr() doesn't add any good abstraction, open code one into another. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/565318ce585665e88053663eeee5178d2c15692f.1743202294.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-28 17:11:20 -06:00
Pavel Begunkov	04491732fc	io_uring/net: account memory for zc sendmsg Account pinned pages for IORING_OP_SENDMSG_ZC, just as we for IORING_OP_SEND_ZC and net/ does for MSG_ZEROCOPY. Fixes: `493108d95f` ("io_uring/net: zerocopy sendmsg") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4f00f67ca6ac8e8ed62343ae92b5816b1e0c9c4b.1743086313.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-28 16:15:42 -06:00
Linus Torvalds	eff5f16bfd	for-6.15/io_uring-reg-vec-20250327 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmflYcAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmvJD/4tKQlr0yRhln/JzPiONS41mUAuNRI4MdqJ ykpQkMx3NcQANbNyOxI0PV5I7y1Jdlg/UP9gy11BrIaBk4Kqoluc6iAzgr5q9pBC 8pXhPIe80R/q/LOKEz9n5gqOMPNyUtd7IaBayJPBJre/YZXQu+49IL2Uyy3hss8d neqAbWErd2FoUfTY14XB3ImLM6a76Z6CjE3pJYvVDM5uRBuH0IGqehJJuNpsViBf M9XAW/HZt8ISsVt1tJbCQVWx4b63L/omHI8u5K2M0isTPV+QPk1O2Vgkn7dBrDeT JvThWrM1uE++DYGcQ3DXHfb3gBIFEjTrNb2nddstyEU2ZaEXUkuOV2O0b7WPuphj zp0oFaLl/ivHT8NoJzzZzK24zt99Qz43GWUaFCQeR0R8oTix/M1q0unguER45Iv7 Po/b3h6+RAi+87KOlM5WWo05ScswS8AwcSUsP5xMR5BjjD+GQYO5PmVVyo8w0rid 8F9U9DpN2CTA5YVjI+ax1cxWMOfmAXPK5ONjzZpyJoWb0THgj97esEwc2un7SBi7 TJJz7Gc9/xOqfRKaPDoH9t8+b6ruWHMqCYDw6exSAUKeDxQ+7z0zNMudHkuR5VrX x+Taaj95ONLVNZYz0mbFcvmJC0UBOqkE94omXk7TU2Cn7SBzAW//XDep6CPpX/sa LcmOK4UXdg== =vOm1 -----END PGP SIGNATURE----- Merge tag 'for-6.15/io_uring-reg-vec-20250327' of git://git.kernel.dk/linux Pull more io_uring updates from Jens Axboe: "Final separate updates for io_uring. This started out as a series of cleanups improvements and improvements for registered buffers, but as the last series of the io_uring changes for 6.15, it also collected a few fixes for the other branches on top: - Add support for vectored fixed/registered buffers. Previously only single segments have been supported for commands, now vectored variants are supported as well. This series includes networking and file read/write support. - Small series unifying return codes across multi and single shot. - Small series cleaning up registerd buffer importing. - Adding support for vectored registered buffers for uring_cmd. - Fix for io-wq handling of command reissue. - Various little fixes and tweaks" * tag 'for-6.15/io_uring-reg-vec-20250327' of git://git.kernel.dk/linux: (25 commits) io_uring/net: fix io_req_post_cqe abuse by send bundle io_uring/net: use REQ_F_IMPORT_BUFFER for send_zc io_uring: move min_events sanitisation io_uring: rename "min" arg in io_iopoll_check() io_uring: open code __io_post_aux_cqe() io_uring: defer iowq cqe overflow via task_work io_uring: fix retry handling off iowq io_uring/net: only import send_zc buffer once io_uring/cmd: introduce io_uring_cmd_import_fixed_vec io_uring/cmd: add iovec cache for commands io_uring/cmd: don't expose entire cmd async data io_uring: rename the data cmd cache io_uring: rely on io_prep_reg_vec for iovec placement io_uring: introduce io_prep_reg_iovec() io_uring: unify STOP_MULTISHOT with IOU_OK io_uring: return -EAGAIN to continue multishot io_uring: cap cached iovec/bvec size io_uring/net: implement vectored reg bufs for zctx io_uring/net: convert to struct iou_vec io_uring/net: pull vec alloc out of msghdr import ...	2025-03-28 15:07:04 -07:00
Linus Torvalds	6df9d086ff	for-6.15/io_uring-epoll-wait-20250325 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfjTRYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvfMD/4596ZTo0bzeimMMzFloabhaqcMC9eO21iW oNb6KdSKQzAOdQl3+YfhSrP/ZbeoJl8y0YoPGIA2RBd4ELl/4YPVTNeahNhz70P2 iawXcYmMyrM+6W6RXof9uj+k8mE17G7N+M7DuMYjfMnC/OhDKg/3DbbydwesMW47 /1naMsk/rJx6CbeVKAl3V2es3NPUrZlh/dsurm5muGqXHAUmv459w46togKNvUhM tAGSq54zyPLnvnuA4Cm1QBIYtN5dHmjDjMmEXFiEF1qg7c5WIfVpkopc/SjvvlQs fVwN9Mn9Mh1VXuLysSqa5RsXINtXbJdpyXnh8pG/QqhntftNRKkwMxlN0gEYfBzB 4iuvzUpGmPAwPhYJ60Ylci4KhWY31mwbU+ns9+/G4bIdDxdEMJGvy4D/oz/fUr9p 1gwAIkJgfixb6fsUrL6REFoSEmRQQ7hQkyKRFyvXSXsTtqXz2BR7O75ujK381J9d Sa3zexDBRaJWlDWspRe7Olcwf/hqLmY3cWwQ8Z+a2vozU2dAtYpKr5Lhr7SjuftI Uw4pCkrXUG5EhdMSNuQnzVZv3xWPGAAUaAIT0MhDcWakhNT1dyyjbuXWuFzDmKQI 0Q71+UPBW300eJx8bFupYQ1mKp10LbjewCxiXOkBFZxaG7n5UwGsv4FoE3CJ8QEP V0ou6ZpBFA== =AZkA -----END PGP SIGNATURE----- Merge tag 'for-6.15/io_uring-epoll-wait-20250325' of git://git.kernel.dk/linux Pull io_uring epoll support from Jens Axboe: "This adds support for reading epoll events via io_uring. While this may seem counter-intuitive (and/or productive), the reasoning here is that quite a few existing epoll event loops can easily do a partial conversion to a completion based model, but are still stuck with one (or few) event types that remain readiness based. For that case, they then need to add the io_uring fd to the epoll context, and continue to rely on epoll_wait(2) for waiting on events. This misses out on the finer grained waiting that io_uring can do, to reduce context switches and wait for multiple events in one batch reliably. With adding support for reaping epoll events via io_uring, the whole legacy readiness based event types can still be reaped via epoll, with the overall waiting in the loop be driven by io_uring" * tag 'for-6.15/io_uring-epoll-wait-20250325' of git://git.kernel.dk/linux: io_uring/epoll: add support for IORING_OP_EPOLL_WAIT io_uring/epoll: remove CONFIG_EPOLL guards	2025-03-28 14:55:32 -07:00
Linus Torvalds	ca0b04ba0b	for-6.15/io_uring-rx-zc-20250325 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfjTP8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpm6oEACnpGL52FAKTVj14GDqFo6Pq0Jmnh07x8qj mpHFPwxfWAzRiuNyji2iS9ecS2cnlkixNyMWZipXRi4KJAUjJH6YDd7IofUI3Glf 6v7b6srFSvsWJIJ8LdkJHLHAJuzYnJvFZ8apwgQczEDqgHq7BAunM1sVQ+mydjYk EXT4kN6DSBOPzwr9GAay52f8nXhbqdHfT+YTGHPHg+QToojL6gD7vvW57w/QqD/x 91hJef1z01cSIsDZOxA0EUeD+9bBAHpoamr/e3IOOCVYCN6hy0dGa9g0QGbbpVyE AeU4FGZLV9J8OOfvHVraDt5Wn3IXxYaMu22dSn1S6tVinwjXhJR2LAA+t4fGHAkt i38LjOsIbopSQn/cNhzwC8UZcHLqnVsdDolHlHzsVFVfcpck2/4JFpUeP8QhWgrk f9tY12QUf/oEaWm0/sUCHZNFxpIGeFA5FFXf0Z92clnzBuiuWoesBNvxqY/2DeZn IDNXiv+Trxr6kFEjTpzPeuxbWrn4PJ7afQSAFcEmOCguk5riM+zJZNIKg0TxUHSS tt6sfxmTP1DhgDKad5kT3MLyzOcx47Kbjf4dj6KmRnD+3DGwwN2F7X7R1GJylPSp RLOzJ+Ouuy9UmBN6JMsT4BmR9+FJTVirADU926d/ZqCTtRV8Tnq/6HHmKmmr4CR0 THJ0PJqQjg== =MOve -----END PGP SIGNATURE----- Merge tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux Pull io_uring zero-copy receive support from Jens Axboe: "This adds support for zero-copy receive with io_uring, enabling fast bulk receive of data directly into application memory, rather than needing to copy the data out of kernel memory. While this version only supports host memory as that was the initial target, other memory types are planned as well, with notably GPU memory coming next. This work depends on some networking components which were queued up on the networking side, but have now landed in your tree. This is the work of Pavel Begunkov and David Wei. From the v14 posting: 'We configure a page pool that a driver uses to fill a hw rx queue to hand out user pages instead of kernel pages. Any data that ends up hitting this hw rx queue will thus be dma'd into userspace memory directly, without needing to be bounced through kernel memory. 'Reading' data out of a socket instead becomes a _notification_ mechanism, where the kernel tells userspace where the data is. The overall approach is similar to the devmem TCP proposal This relies on hw header/data split, flow steering and RSS to ensure packet headers remain in kernel memory and only desired flows hit a hw rx queue configured for zero copy. Configuring this is outside of the scope of this patchset. We share netdev core infra with devmem TCP. The main difference is that io_uring is used for the uAPI and the lifetime of all objects are bound to an io_uring instance. Data is 'read' using a new io_uring request type. When done, data is returned via a new shared refill queue. A zero copy page pool refills a hw rx queue from this refill queue directly. Of course, the lifetime of these data buffers are managed by io_uring rather than the networking stack, with different refcounting rules. This patchset is the first step adding basic zero copy support. We will extend this iteratively with new features e.g. dynamically allocated zero copy areas, THP support, dmabuf support, improved copy fallback, general optimisations and more' In a local setup, I was able to saturate a 200G link with a single CPU core, and at netdev conf 0x19 earlier this month, Jamal reported 188Gbit of bandwidth using a single core (no HT, including soft-irq). Safe to say the efficiency is there, as bigger links would be needed to find the per-core limit, and it's considerably more efficient and faster than the existing devmem solution" * tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux: io_uring/zcrx: add selftest case for recvzc with read limit io_uring/zcrx: add a read limit to recvzc requests io_uring: add missing IORING_MAP_OFF_ZCRX_REGION in io_uring_mmap io_uring: Rename KConfig to Kconfig io_uring/zcrx: fix leaks on failed registration io_uring/zcrx: recheck ifq on shutdown io_uring/zcrx: add selftest net: add documentation for io_uring zcrx io_uring/zcrx: add copy fallback io_uring/zcrx: throttle receive requests io_uring/zcrx: set pp memory provider for an rx queue io_uring/zcrx: add io_recvzc request io_uring/zcrx: dma-map area for the device io_uring/zcrx: implement zerocopy receive pp memory provider io_uring/zcrx: grab a net device io_uring/zcrx: add io_zcrx_area io_uring/zcrx: add interface queue and refill queue	2025-03-28 13:45:52 -07:00
Pavel Begunkov	6889ae1b4d	io_uring/net: fix io_req_post_cqe abuse by send bundle [ 114.987980][ T5313] WARNING: CPU: 6 PID: 5313 at io_uring/io_uring.c:872 io_req_post_cqe+0x12e/0x4f0 [ 114.991597][ T5313] RIP: 0010:io_req_post_cqe+0x12e/0x4f0 [ 115.001880][ T5313] Call Trace: [ 115.002222][ T5313] <TASK> [ 115.007813][ T5313] io_send+0x4fe/0x10f0 [ 115.009317][ T5313] io_issue_sqe+0x1a6/0x1740 [ 115.012094][ T5313] io_wq_submit_work+0x38b/0xed0 [ 115.013223][ T5313] io_worker_handle_work+0x62a/0x1600 [ 115.013876][ T5313] io_wq_worker+0x34f/0xdf0 As the comment states, io_req_post_cqe() should only be used by multishot requests, i.e. REQ_F_APOLL_MULTISHOT, which bundled sends are not. Add a flag signifying whether a request wants to post multiple CQEs. Eventually REQ_F_APOLL_MULTISHOT should imply the new flag, but that's left out for simplicity. Cc: stable@vger.kernel.org Fixes: `a05d1f625c` ("io_uring/net: support bundles for send") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8b611dbb54d1cd47a88681f5d38c84d0c02bc563.1743067183.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-27 05:48:32 -06:00
Linus Torvalds	1a9239bb42	Networking changes for 6.15. Core & protocols ---------------- - Continue Netlink conversions to per-namespace RTNL lock (IPv4 routing, routing rules, routing next hops, ARP ioctls). - Continue extending the use of netdev instance locks. As a driver opt-in protect queue operations and (in due course) ethtool operations with the instance lock and not RTNL lock. - Support collecting TCP timestamps (data submitted, sent, acked) in BPF, allowing for transparent (to the application) and lower overhead tracking of TCP RPC performance. - Tweak existing networking Rx zero-copy infra to support zero-copy Rx via io_uring. - Optimize MPTCP performance in single subflow mode by 29%. - Enable GRO on packets which went thru XDP CPU redirect (were queued for processing on a different CPU). Improving TCP stream performance up to 2x. - Improve performance of contended connect() by 200% by searching for an available 4-tuple under RCU rather than a spin lock. Bring an additional 229% improvement by tweaking hash distribution. - Avoid unconditionally touching sk_tsflags on RX, improving performance under UDP flood by as much as 10%. - Avoid skb_clone() dance in ping_rcv() to improve performance under ping flood. - Avoid FIB lookup in netfilter if socket is available, 20% perf win. - Rework network device creation (in-kernel) API to more clearly identify network namespaces and their roles. There are up to 4 namespace roles but we used to have just 2 netns pointer arguments, interpreted differently based on context. - Use sysfs_break_active_protection() instead of trylock to avoid deadlocks between unregistering objects and sysfs access. - Add a new sysctl and sockopt for capping max retransmit timeout in TCP. - Support masking port and DSCP in routing rule matches. - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST. - Support specifying at what time packet should be sent on AF_XDP sockets. - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin users. - Add Netlink YAML spec for WiFi (nl80211) and conntrack. - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols which only need to be exported when IPv6 support is built as a module. - Age FDB entries based on Rx not Tx traffic in VxLAN, similar to normal bridging. - Allow users to specify source port range for GENEVE tunnels. - netconsole: allow attaching kernel release, CPU ID and task name to messages as metadata Driver API ---------- - Continue rework / fixing of Energy Efficient Ethernet (EEE) across the SW layers. Delegate the responsibilities to phylink where possible. Improve its handling in phylib. - Support symmetric OR-XOR RSS hashing algorithm. - Support tracking and preserving IRQ affinity by NAPI itself. - Support loopback mode speed selection for interface selftests. Device drivers -------------- - Remove the IBM LCS driver for s390. - Remove the sb1000 cable modem driver. - Add support for SFP module access over SMBus. - Add MCTP transport driver for MCTP-over-USB. - Enable XDP metadata support in multiple drivers. - Ethernet high-speed NICs: - Broadcom (bnxt): - add PCIe TLP Processing Hints (TPH) support for new AMD platforms - support dumping RoCE queue state for debug - opt into instance locking - Intel (100G, ice, idpf): - ice: rework MSI-X IRQ management and distribution - ice: support for E830 devices - iavf: add support for Rx timestamping - iavf: opt into instance locking - nVidia/Mellanox: - mlx4: use page pool memory allocator for Rx - mlx5: support for one PTP device per hardware clock - mlx5: support for 200Gbps per-lane link modes - mlx5: move IPSec policy check after decryption - AMD/Solarflare: - support FW flashing via devlink - Cisco (enic): - use page pool memory allocator for Rx - enable 32, 64 byte CQEs - get max rx/tx ring size from the device - Meta (fbnic): - support flow steering and RSS configuration - report queue stats - support TCP segmentation - support IRQ coalescing - support ring size configuration - Marvell/Cavium: - support AF_XDP - Wangxun: - support for PTP clock and timestamping - Huawei (hibmcge): - checksum offload - add more statistics - Ethernet virtual: - VirtIO net: - aggressively suppress Tx completions, improve perf by 96% with 1 CPU and 55% with 2 CPUs - expose NAPI to IRQ mapping and persist NAPI settings - Google (gve): - support XDP in DQO RDA Queue Format - opt into instance locking - Microsoft vNIC: - support BIG TCP - Ethernet NICs consumer, and embedded: - Synopsys (stmmac): - cleanup Tx and Tx clock setting and other link-focused cleanups - enable SGMII and 2500BASEX mode switching for Intel platforms - support Sophgo SG2044 - Broadcom switches (b53): - support for BCM53101 - TI: - iep: add perout configuration support - icssg: support XDP - Cadence (macb): - implement BQL - Xilinx (axinet): - support dynamic IRQ moderation and changing coalescing at runtime - implement BQL - report standard stats - MediaTek: - support phylink managed EEE - Intel: - igc: don't restart the interface on every XDP program change - RealTek (r8169): - support reading registers of internal PHYs directly - increase max jumbo packet size on RTL8125/RTL8126 - Airoha: - support for RISC-V NPU packet processing unit - enable scatter-gather and support MTU up to 9kB - Tehuti (tn40xx): - support cards with TN4010 MAC and an Aquantia AQR105 PHY - Ethernet PHYs: - support for TJA1102S, TJA1121 - dp83tg720: add randomized polling intervals for link detection - dp83822: support changing the transmit amplitude voltage - support for LEDs on 88q2xxx - CAN: - canxl: support Remote Request Substitution bit access - flexcan: add S32G2/S32G3 SoC - WiFi: - remove cooked monitor support - strict mode for better AP testing - basic EPCS support - OMI RX bandwidth reduction support - batman-adv: add support for jumbo frames - WiFi drivers: - RealTek (rtw88): - support RTL8814AE and RTL8814AU - RealTek (rtw89): - switch using wiphy_lock and wiphy_work - add BB context to manipulate two PHY as preparation of MLO - improve BT-coexistence mechanism to play A2DP smoothly - Intel (iwlwifi): - add new iwlmld sub-driver for latest HW/FW combinations - MediaTek (mt76): - preparation for mt7996 Multi-Link Operation (MLO) support - Qualcomm/Atheros (ath12k): - continued work on MLO - Silabs (wfx): - Wake-on-WLAN support - Bluetooth: - add support for skb TX SND/COMPLETION timestamping - hci_core: enable buffer flow control for SCO/eSCO - coredump: log devcd dumps into the monitor - Bluetooth drivers: - intel: add support to configure TX power - nxp: handle bootloader error during cmd5 and cmd7 Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmfkLC8ACgkQMUZtbf5S Irsb5g/+L7oKOf0ALbaV9kxFsoz8AymZfAW9i/27F07omGJGpks8oX6j6rQLgIRO OQOFcp7XEdDh1+jh82gHVuPrw2/6lchLtW8ARtzdiQKFr5DRjrsbtua6GRc8iBqA DIRCBFoV2HuMkF39Vr09HMa9AZAT7QR2RLsRGpSq8E8Z8xxKz0X7oujs10PFpMTE IVKhTrVrk+NDot/IU2hzVpnpup+0ld+T2/ZaBklJGcU8uDffImsqNepHRyCG5UC3 xz74Ju23MAj24Gct+og0yFUooF+lUltKyVm0FYCDCY3bASTwgY01NR3kEH/0NQvM cywLzd/ngHm/SMD2ggVAHkjZUieiIVHdaZ53dgjDeBOQoVP6p0dgUK7EumXX8Mx4 8ReR2UiGoYRPaq9c4o+IjG4K027MwVK2p+mF1a6MLa+20XcyMbev8FIRbbHtC/V4 z5/FsOAxcuICWkA1hU9bODrrGzIqemmdRgKG8sGuTJCt/kYGAn72/TCATGNSaCJ0 00n2jN1aepa7wtywHJ5MhVzxN9iQX7+geUHXz0BI+lK4e1Pmk+vjGksymb9ai2fk eQAUV9ekub6q68/J16scD7XeOUM37bTLiMBQeIF8UtZBOJscKiS71zn9QP9Twwxv P2pm01RDZUI+z5ZX3hc12Pm1vjRHaAh9S1JpAw/pTOVlQ+mAJEM= =XY0S -----END PGP SIGNATURE----- Merge tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Continue Netlink conversions to per-namespace RTNL lock (IPv4 routing, routing rules, routing next hops, ARP ioctls) - Continue extending the use of netdev instance locks. As a driver opt-in protect queue operations and (in due course) ethtool operations with the instance lock and not RTNL lock. - Support collecting TCP timestamps (data submitted, sent, acked) in BPF, allowing for transparent (to the application) and lower overhead tracking of TCP RPC performance. - Tweak existing networking Rx zero-copy infra to support zero-copy Rx via io_uring. - Optimize MPTCP performance in single subflow mode by 29%. - Enable GRO on packets which went thru XDP CPU redirect (were queued for processing on a different CPU). Improving TCP stream performance up to 2x. - Improve performance of contended connect() by 200% by searching for an available 4-tuple under RCU rather than a spin lock. Bring an additional 229% improvement by tweaking hash distribution. - Avoid unconditionally touching sk_tsflags on RX, improving performance under UDP flood by as much as 10%. - Avoid skb_clone() dance in ping_rcv() to improve performance under ping flood. - Avoid FIB lookup in netfilter if socket is available, 20% perf win. - Rework network device creation (in-kernel) API to more clearly identify network namespaces and their roles. There are up to 4 namespace roles but we used to have just 2 netns pointer arguments, interpreted differently based on context. - Use sysfs_break_active_protection() instead of trylock to avoid deadlocks between unregistering objects and sysfs access. - Add a new sysctl and sockopt for capping max retransmit timeout in TCP. - Support masking port and DSCP in routing rule matches. - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST. - Support specifying at what time packet should be sent on AF_XDP sockets. - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin users. - Add Netlink YAML spec for WiFi (nl80211) and conntrack. - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols which only need to be exported when IPv6 support is built as a module. - Age FDB entries based on Rx not Tx traffic in VxLAN, similar to normal bridging. - Allow users to specify source port range for GENEVE tunnels. - netconsole: allow attaching kernel release, CPU ID and task name to messages as metadata Driver API: - Continue rework / fixing of Energy Efficient Ethernet (EEE) across the SW layers. Delegate the responsibilities to phylink where possible. Improve its handling in phylib. - Support symmetric OR-XOR RSS hashing algorithm. - Support tracking and preserving IRQ affinity by NAPI itself. - Support loopback mode speed selection for interface selftests. Device drivers: - Remove the IBM LCS driver for s390 - Remove the sb1000 cable modem driver - Add support for SFP module access over SMBus - Add MCTP transport driver for MCTP-over-USB - Enable XDP metadata support in multiple drivers - Ethernet high-speed NICs: - Broadcom (bnxt): - add PCIe TLP Processing Hints (TPH) support for new AMD platforms - support dumping RoCE queue state for debug - opt into instance locking - Intel (100G, ice, idpf): - ice: rework MSI-X IRQ management and distribution - ice: support for E830 devices - iavf: add support for Rx timestamping - iavf: opt into instance locking - nVidia/Mellanox: - mlx4: use page pool memory allocator for Rx - mlx5: support for one PTP device per hardware clock - mlx5: support for 200Gbps per-lane link modes - mlx5: move IPSec policy check after decryption - AMD/Solarflare: - support FW flashing via devlink - Cisco (enic): - use page pool memory allocator for Rx - enable 32, 64 byte CQEs - get max rx/tx ring size from the device - Meta (fbnic): - support flow steering and RSS configuration - report queue stats - support TCP segmentation - support IRQ coalescing - support ring size configuration - Marvell/Cavium: - support AF_XDP - Wangxun: - support for PTP clock and timestamping - Huawei (hibmcge): - checksum offload - add more statistics - Ethernet virtual: - VirtIO net: - aggressively suppress Tx completions, improve perf by 96% with 1 CPU and 55% with 2 CPUs - expose NAPI to IRQ mapping and persist NAPI settings - Google (gve): - support XDP in DQO RDA Queue Format - opt into instance locking - Microsoft vNIC: - support BIG TCP - Ethernet NICs consumer, and embedded: - Synopsys (stmmac): - cleanup Tx and Tx clock setting and other link-focused cleanups - enable SGMII and 2500BASEX mode switching for Intel platforms - support Sophgo SG2044 - Broadcom switches (b53): - support for BCM53101 - TI: - iep: add perout configuration support - icssg: support XDP - Cadence (macb): - implement BQL - Xilinx (axinet): - support dynamic IRQ moderation and changing coalescing at runtime - implement BQL - report standard stats - MediaTek: - support phylink managed EEE - Intel: - igc: don't restart the interface on every XDP program change - RealTek (r8169): - support reading registers of internal PHYs directly - increase max jumbo packet size on RTL8125/RTL8126 - Airoha: - support for RISC-V NPU packet processing unit - enable scatter-gather and support MTU up to 9kB - Tehuti (tn40xx): - support cards with TN4010 MAC and an Aquantia AQR105 PHY - Ethernet PHYs: - support for TJA1102S, TJA1121 - dp83tg720: add randomized polling intervals for link detection - dp83822: support changing the transmit amplitude voltage - support for LEDs on 88q2xxx - CAN: - canxl: support Remote Request Substitution bit access - flexcan: add S32G2/S32G3 SoC - WiFi: - remove cooked monitor support - strict mode for better AP testing - basic EPCS support - OMI RX bandwidth reduction support - batman-adv: add support for jumbo frames - WiFi drivers: - RealTek (rtw88): - support RTL8814AE and RTL8814AU - RealTek (rtw89): - switch using wiphy_lock and wiphy_work - add BB context to manipulate two PHY as preparation of MLO - improve BT-coexistence mechanism to play A2DP smoothly - Intel (iwlwifi): - add new iwlmld sub-driver for latest HW/FW combinations - MediaTek (mt76): - preparation for mt7996 Multi-Link Operation (MLO) support - Qualcomm/Atheros (ath12k): - continued work on MLO - Silabs (wfx): - Wake-on-WLAN support - Bluetooth: - add support for skb TX SND/COMPLETION timestamping - hci_core: enable buffer flow control for SCO/eSCO - coredump: log devcd dumps into the monitor - Bluetooth drivers: - intel: add support to configure TX power - nxp: handle bootloader error during cmd5 and cmd7" * tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1681 commits) unix: fix up for "apparmor: add fine grained af_unix mediation" mctp: Fix incorrect tx flow invalidation condition in mctp-i2c net: usb: asix: ax88772: Increase phy_name size net: phy: Introduce PHY_ID_SIZE — minimum size for PHY ID string net: libwx: fix Tx L4 checksum net: libwx: fix Tx descriptor content for some tunnel packets atm: Fix NULL pointer dereference net: tn40xx: add pci-id of the aqr105-based Tehuti TN4010 cards net: tn40xx: prepare tn40xx driver to find phy of the TN9510 card net: tn40xx: create swnode for mdio and aqr105 phy and add to mdiobus net: phy: aquantia: add essential functions to aqr105 driver net: phy: aquantia: search for firmware-name in fwnode net: phy: aquantia: add probe function to aqr105 for firmware loading net: phy: Add swnode support to mdiobus_scan gve: add XDP DROP and PASS support for DQ gve: update XDP allocation path support RX buffer posting gve: merge packet buffer size fields gve: update GQ RX to use buf_size gve: introduce config-based allocation for XDP gve: remove xdp_xsk_done and xdp_xsk_wakeup statistics ...	2025-03-26 21:48:21 -07:00
Linus Torvalds	91928e0d3c	for-6.15/io_uring-20250322 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfe8DcQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgprzNEADPOOfi05F0f4KOMLYOiRlB+D9yGtnbcu8k foPATU++L5fFDUpjAGobi38AXqHtF/Bx0CjCkPCvy4GjygcszYDY81FQmRlSFdeH CnYV/bm5myyupc8xANcWGJYpxf/ZAw7ugOAgcR3Xxk/qXdD5hYJ1jG4477o8v4IJ jyzE4vQCnWBogIoqpWTQyY3FSP5fdjeQlYdFGIFCx68rFTO9zEpuHhFr2OMfJtSu y1dLkeqbu1BZ0fhgRog9k9ZWk0uRQJI7d7EUXf9iwXOfkUyzCaV+nHvNGRdhJ/XO zj/qePw4lFyCgc+kuIejjIaPf/g84paCzshz/PIxOZTsvo6X78zj72RPM0Makhga amgtEqegWHtKSya1zew57oDwqwaIe6b+56PtNSP7K8IrwcJLheszoPKJIwiiynIL ZqvWseqYQXs58S9qZzSZlyEpwLVFcvyNt7Z/q7b56vhvXHabxOccPkUHt+UscVoQ qpG0R9+zmIiDfzY6Vfu7JyH2Uspdq3WjGYiaoVUko1rJXY+HFPd3be9pIbckyaA8 ZQHIxB+h8IlxhpQ8HJKQpZhtYiluxEu6wDdvtpwC0KiGp5g4Q5eg9SkVyvETIEN8 yW2tHqMAJJtJjDAPwJOV+GIyWQMIexzLmMP7Fq8R5VIoAox+Xn3flgNNEnMZdl/r kTc/hRjHvQ== =VEQr -----END PGP SIGNATURE----- Merge tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: "This is the first of the io_uring pull requests for the 6.15 merge window, there will be others once the net tree has gone in. This contains: - Cleanup and unification of cancelation handling across various request types. - Improvement for bundles, supporting them both for incrementally consumed buffers, and for non-multishot requests. - Enable toggling of using iowait while waiting on io_uring events or not. Unfortunately this is still tied with CPU frequency boosting on short waits, as the scheduler side has not been very receptive to splitting the (useless) iowait stat from the cpufreq implied boost. - Add support for kbuf nodes, enabling zero-copy support for the ublk block driver. - Various cleanups for resource node handling. - Series greatly cleaning up the legacy provided (non-ring based) buffers. For years, we've been pushing the ring provided buffers as the way to go, and that is what people have been using. Reduce the complexity and code associated with legacy provided buffers. - Series cleaning up the compat handling. - Series improving and cleaning up the recvmsg/sendmsg iovec and msg handling. - Series of cleanups for io-wq. - Start adding a bunch of selftests. The liburing repository generally carries feature and regression tests for everything, but at least for ublk initially, we'll try and go the route of having it in selftests as well. We'll see how this goes, might decide to migrate more tests this way in the future. - Various little cleanups and fixes" * tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux: (108 commits) selftests: ublk: add stripe target selftests: ublk: simplify loop io completion selftests: ublk: enable zero copy for null target selftests: ublk: prepare for supporting stripe target selftests: ublk: move common code into common.c selftests: ublk: increase max buffer size to 1MB selftests: ublk: add single sqe allocator helper selftests: ublk: add generic_01 for verifying sequential IO order selftests: ublk: fix starting ublk device io_uring: enable toggle of iowait usage when waiting on CQEs selftests: ublk: fix write cache implementation selftests: ublk: add variable for user to not show test result selftests: ublk: don't show `modprobe` failure selftests: ublk: add one dependency header io_uring/kbuf: enable bundles for incrementally consumed buffers Revert "io_uring/rsrc: simplify the bvec iter count calculation" selftests: ublk: improve test usability selftests: ublk: add stress test for covering IO vs. killing ublk server selftests: ublk: add one stress test for covering IO vs. removing device selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests ...	2025-03-26 17:56:00 -07:00
Caleb Sander Mateos	73b6dacb1c	io_uring/net: use REQ_F_IMPORT_BUFFER for send_zc Instead of a bool field in struct io_sr_msg, use REQ_F_IMPORT_BUFFER to track whether io_send_zc() has already imported the buffer. This flag already serves a similar purpose for sendmsg_zc and {read,write}v_fixed. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20250325143943.1226467-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-26 04:26:45 -06:00
Linus Torvalds	054570267d	lsm/stable-6.15 PR 20250323 -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmfgWgMUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNW5RAAvCDq5gBtY0aTNlULe637EVLSh+t8 PkSzHzu/NlzU6BfjtwSm2fuML8welTGxSwUPxUzMCI91gPdkGeFktefavT3xa+QI BHWROn7fEJ/KmRZvngPeIkgLr5xhF5nBJmc/Jw71qem20zRzNgJnpzMX16d10Phx dxd2xOO1qM3bv6Z9RcIssZRGaN+PHngpWWg+0B69XuaBUso87S6NDyKNn1XPmvoz as96k+Wk/xAZGVEeCbs/+H5rBx6DLg+FfTRa06Oh4BFsqedpkDPxLrTgCJGJkA0H dsK6O/993zvjx0Jn4ZPoJ9n35S82BmkCsz4bGq1xVl6FYUiMcm3/8yO41wllS+w4 j+RlTU/RIdB7n8EKyMMl1hj1stTvt3Bi9F5Cbf7ZEv0snfR00K4KVpi17jnFjUHv kpOiEtXZb/NGQip7UAuUq0PisfqbiO4jJurYHRetDgv1WCy6+C8ufM5t6I+cnvmG VG+dlxcW+rDIn6bLRVuGi9TJRsQ6eox9ipa+qEKNNiOXgftELcgT7m74nAS5m0uv n5rDa221nPXecEB0X7d6YUFk711lly90dbelNeLrmv1w6jl8L1PpS1oBaW+UzGu9 46eGBd6pzu9otvK9WVyDEdotDOCrgH0sd7pTetqDhLJZ7KrGwyyqO2gD/JroUKcC lnxBQwPnat86iI8= =oxfV -----END PGP SIGNATURE----- Merge tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull lsm updates from Paul Moore: - Various minor updates to the LSM Rust bindings Changes include marking trivial Rust bindings as inlines and comment tweaks to better reflect the LSM hooks. - Add LSM/SELinux access controls to io_uring_allowed() Similar to the io_uring_disabled sysctl, add a LSM hook to io_uring_allowed() to enable LSMs a simple way to enforce security policy on the use of io_uring. This pull request includes SELinux support for this new control using the io_uring/allowed permission. - Remove an unused parameter from the security_perf_event_open() hook The perf_event_attr struct parameter was not used by any currently supported LSMs, remove it from the hook. - Add an explicit MAINTAINERS entry for the credentials code We've seen problems in the past where patches to the credentials code sent by non-maintainers would often languish on the lists for multiple months as there was no one explicitly tasked with the responsibility of reviewing and/or merging credentials related code. Considering that most of the code under security/ has a vested interest in ensuring that the credentials code is well maintained, I'm volunteering to look after the credentials code and Serge Hallyn has also volunteered to step up as an official reviewer. I posted the MAINTAINERS update as a RFC to LKML in hopes that someone else would jump up with an "I'll do it!", but beyond Serge it was all crickets. - Update Stephen Smalley's old email address to prevent confusion This includes a corresponding update to the mailmap file. * tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: mailmap: map Stephen Smalley's old email addresses lsm: remove old email address for Stephen Smalley MAINTAINERS: add Serge Hallyn as a credentials reviewer MAINTAINERS: add an explicit credentials entry cred,rust: mark Credential methods inline lsm,rust: reword "destroy" -> "release" in SecurityCtx lsm,rust: mark SecurityCtx methods inline perf: Remove unnecessary parameter of security check lsm: fix a missing security_uring_allowed() prototype io_uring,lsm,selinux: add LSM hooks for io_uring_setup() io_uring: refactor io_uring_allowed()	2025-03-25 15:44:19 -07:00
Pavel Begunkov	816619782b	io_uring: move min_events sanitisation iopoll and normal waiting already duplicate min_completion truncation, so move them inside the corresponding routines. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/254adb289cc04638f25d746a7499260fa89a179e.1742829388.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-25 12:38:01 -06:00
Pavel Begunkov	d73acd7af3	io_uring: rename "min" arg in io_iopoll_check() Don't name arguments "min", it shadows the namesake function. min_events is also more consistent. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f52ce9d88d3bca5732a218b0da14924aa6968909.1742829388.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-25 12:37:57 -06:00
Pavel Begunkov	4c76de42cb	io_uring: open code __io_post_aux_cqe() There is no reason to keep __io_post_aux_cqe() separately from io_post_aux_cqe(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2c4c1f68d694deea25a212fc09bbb11f330cd82e.1742829388.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-25 12:37:52 -06:00
Pavel Begunkov	3afcb3b2e3	io_uring: defer iowq cqe overflow via task_work Don't handle CQE overflows in io_req_complete_post() and defer it to flush_completions. It cuts some duplication, and I also want to limit the number of places directly overflowing completions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9046410ac27e18f2baa6f7cdb363ec921cbc3b79.1742829388.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-25 12:37:47 -06:00
Pavel Begunkov	3f0cb8de56	io_uring: fix retry handling off iowq io_req_complete_post() doesn't handle reissue and if called with a REQ_F_REISSUE request it might post extra unexpected completions. Fix it by pushing into flush_completion via task work. Fixes: `d803d12394` ("io_uring/rw: handle -EAGAIN retry at IO completion time") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/badb3d7e462881e7edbfcc2be6301090b07dbe53.1742829388.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-25 12:37:43 -06:00
Linus Torvalds	a50b4fe095	A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0 iJbvLJeisXr3GA== =bYAX -----END PGP SIGNATURE----- Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member" * tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits) wifi: rt2x00: Switch to use hrtimer_update_function() io_uring: Use helper function hrtimer_update_function() serial: xilinx_uartps: Use helper function hrtimer_update_function() ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup() RDMA: Switch to use hrtimer_setup() virtio: mem: Switch to use hrtimer_setup() drm/vmwgfx: Switch to use hrtimer_setup() drm/xe/oa: Switch to use hrtimer_setup() drm/vkms: Switch to use hrtimer_setup() drm/msm: Switch to use hrtimer_setup() drm/i915/request: Switch to use hrtimer_setup() drm/i915/uncore: Switch to use hrtimer_setup() drm/i915/pmu: Switch to use hrtimer_setup() drm/i915/perf: Switch to use hrtimer_setup() drm/i915/gvt: Switch to use hrtimer_setup() drm/i915/huc: Switch to use hrtimer_setup() drm/amdgpu: Switch to use hrtimer_setup() stm class: heartbeat: Switch to use hrtimer_setup() i2c: Switch to use hrtimer_setup() iio: Switch to use hrtimer_setup() ...	2025-03-25 10:54:15 -07:00
Linus Torvalds	bb18645ac1	io_uring-6.14-20250322 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfe7xsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnDHD/4rKlhIMGsWjBMaO1+ogNMsh9pJGRQdQZWB dCcIz1zRZbzY9WE1CQmkAWpM6tEbkUPMw3RrW1wduNO9W1OCQ4wW/wv9gShzi5HU O+yOF9s+wd2733n3Ghxr6zTFCyHeaHb5mFVQsct1eOsbdprxysQLQ6QwB4FqOcKa vZJyDg9Zxsd9wNblzgd7+QSMnPgsCrBhgC86mayDEsQ2e68CXFkxv8W6zBl28rx1 auGCIF33Zic0FUWKqc2N2e2xZ0RUZKeKqf09ZpzoEVZ2Zti2IUd1RAZFaRo5SBTm ZhU71Eip4OqWruNSHmE8KtEgLscV5rwNZ2IH29Ywif71JTYMqEGE3Jr9O1w6REti bNH/ELSXEC/rpscmG904g8UQ3Nv8LcdETq0B3lxl1dnKVVy5jFe3mSvvvWIckJAJ t2KZf1HwX4Q0MXdi8HUnN4uswvu4xPYfjTNaEjqaK0H9U7BTY1CBkfTRjtWatPNH WCfeDK6Bj0T51ke/kmu3EpP6l19H1iivhC+Wz6bCgJ7mrsSlsm44ibjrsNgHeqSP x3jhZTM7RqIdC4UIv+LgHz/IZJ07iMVkpXGTAKBZnW+SrzTsVRfK6/RawqDmxL5T h63Z8rly8kTS9GmWaw82wieHqQpzJrYu2BXik6r5L5ON5DwvYeegcnMTczgldQKL pxQxiBO7yQ== =+zW+ -----END PGP SIGNATURE----- Merge tag 'io_uring-6.14-20250322' of git://git.kernel.dk/linux Pull io_uring fix from Jens Axboe: "Just a single fix for the commit that went into your tree yesterday, which exposed an issue with not always clearing notifications. That could cause them to be used more than once" * tag 'io_uring-6.14-20250322' of git://git.kernel.dk/linux: io_uring/net: fix sendzc double notif flush	2025-03-22 10:45:44 -07:00
Pavel Begunkov	67c007d6c1	io_uring/net: fix sendzc double notif flush refcount_t: underflow; use-after-free. WARNING: CPU: 0 PID: 5823 at lib/refcount.c:28 refcount_warn_saturate+0x15a/0x1d0 lib/refcount.c:28 RIP: 0010:refcount_warn_saturate+0x15a/0x1d0 lib/refcount.c:28 Call Trace: <TASK> io_notif_flush io_uring/notif.h:40 [inline] io_send_zc_cleanup+0x121/0x170 io_uring/net.c:1222 io_clean_op+0x58c/0x9a0 io_uring/io_uring.c:406 io_free_batch_list io_uring/io_uring.c:1429 [inline] __io_submit_flush_completions+0xc16/0xd20 io_uring/io_uring.c:1470 io_submit_flush_completions io_uring/io_uring.h:159 [inline] Before the blamed commit, sendzc relied on io_req_msg_cleanup() to clear REQ_F_NEED_CLEANUP, so after the following snippet the request will never hit the core io_uring cleanup path. io_notif_flush(); io_req_msg_cleanup(); The easiest fix is to null the notification. io_send_zc_cleanup() can still be called after, but it's tolerated. Reported-by: syzbot+cf285a028ffba71b2ef5@syzkaller.appspotmail.com Tested-by: syzbot+cf285a028ffba71b2ef5@syzkaller.appspotmail.com Fixes: `cc34d8330e` ("io_uring/net: don't clear REQ_F_NEED_CLEANUP unconditionally") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e1306007458b8891c88c4f20c966a17595f766b0.1742643795.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-22 08:14:36 -06:00
Caleb Sander Mateos	8e3100fcc5	io_uring/net: only import send_zc buffer once io_send_zc() guards its call to io_send_zc_import() with if (!done_io) in an attempt to avoid calling it redundantly on the same req. However, if the initial non-blocking issue returns -EAGAIN, done_io will stay 0. This causes the subsequent issue to unnecessarily re-import the buffer. Add an explicit flag "imported" to io_sr_msg to track if its buffer has already been imported. Clear the flag in io_send_zc_prep(). Call io_send_zc_import() and set the flag in io_send_zc() if it is unset. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: `54cdcca05a` ("io_uring/net: switch io_send() and io_send_zc() to using io_async_msghdr") Link: https://lore.kernel.org/r/20250321184819.3847386-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-21 12:53:23 -06:00
Pavel Begunkov	ef49027529	io_uring/cmd: introduce io_uring_cmd_import_fixed_vec io_uring_cmd_import_fixed_vec() is a cmd helper around vectored registered buffer import functions, which caches the memory under the hood. The lifetime of the vectore and hence the iterator is bound to the request. Furthermore, the user is not allowed to call it multiple times for a single request. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/97487a80dec3fb8cf8aeedf1f9026ef6d503fe4b.1742579999.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-21 12:52:15 -06:00
Pavel Begunkov	3a4689ac10	io_uring/cmd: add iovec cache for commands Add iou_vec to commands and wire caching for it, but don't expose it to users just yet. We need the vec cleared on initial alloc, but since we can't place it at the beginning at the moment, zero the entire async_data. It's cached, and the performance effects only the initial allocation, and it might be not a bad idea since we're exposing those bits to outside drivers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c0f2145b75791bc6106eb4e72add2cf6a2c72a7a.1742579999.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-21 12:52:15 -06:00
Linus Torvalds	d07de43e3f	io_uring-6.14-20250321 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfdXP0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsB1EACD4LL93u/GUfCDI3gHMYdGv1Yy79j+qcB9 +ikQYy+pM9iPHvFjXDZ8ZNzveaUic5OAXvrIQcfexcpbbJS3LxkmppBWZ7/gQxhA +nNXXcxYvhmzstnGTjsSlkUOosqUKWyo1M3JsFhZHL3H+Te+FGiSg82wwDdbNS69 eezOzha4njxpkeIgcWI4Qzryg5//bYZKZBdf078m+y/Qa4xL9276VY3Y8bg+R51l wVYuDv5EjJT/zItXYr0uU+NRNFraMP1B23Ew+6EkdV/t4pzaH45H6+WfHv4tV5hS JbGZbvt/Zbvv9citAWsxqUkrGLDuveugKBEH/dKvffBW/tk0RmoiQrs3ZpO+g0/+ Kdlfv/ggALWYC+QYUVyTzmb/xGk5YipSzN06M6t1+ELbedpaZyFWj5vhNSEsqc8l 4edOoyA4ZPafGaipTNfsE4kSNk7UL1AiWqPxJWq3O9WYkVj2Zt9sD3dbhuddIIwm bEAbOvOEZXSQh8EBp0x2epkKc5y8ma75OdZui3cOlQEAKq+JMkPSYxPJLqkgAmrr SAFUoRYtM6ShKX6AGdwzk6w4htp1L3G8tlGssM4eqC2e9wQJ9B3nCep1Xvs48rjW MjpZfxfFRrfV14WHGun+63F9lbdGW4GJqnMbzpvdTrTmGfdOE88frCdcw0EEHRxB EmAoCbJ+aw== =TEDX -----END PGP SIGNATURE----- Merge tag 'io_uring-6.14-20250321' of git://git.kernel.dk/linux Pull io_uring fix from Jens Axboe: "Single fix heading to stable, fixing an issue with io_req_msg_cleanup() sometimes too eagerly clearing cleanup flags" * tag 'io_uring-6.14-20250321' of git://git.kernel.dk/linux: io_uring/net: don't clear REQ_F_NEED_CLEANUP unconditionally	2025-03-21 10:30:15 -07:00
Jens Axboe	07754bfd9a	io_uring: enable toggle of iowait usage when waiting on CQEs By default, io_uring marks a waiting task as being in iowait, if it's sleeping waiting on events and there are pending requests. This isn't necessarily always useful, and may be confusing on non-storage setups where iowait isn't expected. It can also cause extra power usage, by preventing the CPU from entering lower sleep states. This adds a new enter flag, IORING_ENTER_NO_IOWAIT. If set, then io_uring will not account the sleeping task as being in iowait. If the kernel supports this feature, then it will be marked by having the IORING_FEAT_NO_IOWAIT feature flag set. As the kernel currently does not support separating the iowait accounting and CPU frequency boosting, the IORING_ENTER_NO_IOWAIT controls both of these at the same time. In the future, if those do end up being split, then it'd be possible to control them separately. However, it seems more likely that the kernel will decouple iowait and CPU frequency boosting anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-20 20:01:03 -06:00
Jens Axboe	cc34d8330e	io_uring/net: don't clear REQ_F_NEED_CLEANUP unconditionally io_req_msg_cleanup() relies on the fact that io_netmsg_recycle() will always fully recycle, but that may not be the case if the msg cache was already full. To ensure that normal cleanup always gets run, let io_netmsg_recycle() deal with clearing the relevant cleanup flags, as it knows exactly when that should be done. Cc: stable@vger.kernel.org Reported-by: David Wei <dw@davidwei.uk> Fixes: `7519134178` ("io_uring/net: add iovec recycling") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-20 12:27:27 -06:00
Pavel Begunkov	5f14404bfa	io_uring/cmd: don't expose entire cmd async data io_uring needs private bits in cmd's ->async_data, and they should never be exposed to drivers as it'd certainly be abused. Leave struct io_uring_cmd_data for the drivers but wrap it into a structure. It's a prep patch and doesn't do anything useful yet. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20250319061251.21452-3-sidong.yang@furiosa.ai Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-19 09:25:55 -06:00
Pavel Begunkov	575e7b0629	io_uring: rename the data cmd cache Pick a more descriptive name for the cmd async data cache. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20250319061251.21452-2-sidong.yang@furiosa.ai Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-19 09:25:55 -06:00
Paolo Abeni	941defcea7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.14-rc6). Conflicts: tools/testing/selftests/drivers/net/ping.py `75cc19c8ff` ("selftests: drv-net: add xdp cases for ping.py") `de94e86974` ("selftests: drv-net: store addresses in dict indexed by ipver") https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/ net/core/devmem.c `a70f891e0f` ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()") `1d22d3060b` ("net: drop rtnl_lock for queue_mgmt operations") https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/ Adjacent changes: tools/testing/selftests/net/Makefile `6f50175cca` ("selftests: Add IPv6 link-local address generation tests for GRE devices.") `2e5584e0f9` ("selftests/net: expand cmsg_ipv6.sh with ipv4") drivers/net/ethernet/broadcom/bnxt/bnxt.c `661958552e` ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic") `fe96d717d3` ("bnxt_en: Extend queue stop/start for TX rings") Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-03-13 23:08:11 +01:00
Jens Axboe	cf9536e550	io_uring/kbuf: enable bundles for incrementally consumed buffers The original support for incrementally consumed buffers didn't allow it to be used with bundles, with the assumption being that incremental buffers are generally larger, and hence there's less of a nedd to support it. But that assumption may not be correct - it's perfectly viable to use smaller buffers with incremental consumption, and there may be valid reasons for an application or framework to do so. As there's really no need to explicitly disable bundles with incrementally consumed buffers, allow it. This actually makes the peek side cheaper and simpler, with the completion side basically the same, just needing to iterate for the consumed length. Reported-by: Norman Maurer <norman_maurer@apple.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 16:24:43 -06:00
Keith Busch	334f795ff8	Revert "io_uring/rsrc: simplify the bvec iter count calculation" This reverts commit `2a51c327d4`. The kernel registered bvecs do use the iov_iter_advance() API, so we can't rely on this simplification anymore. Fixes: `27cb27b6d5` ("io_uring: add support for kernel registered bvecs") Reported-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250310184825.569371-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 16:24:43 -06:00
Pavel Begunkov	146acfd0f6	io_uring: rely on io_prep_reg_vec for iovec placement All vectored reg buffer users should use io_import_reg_vec() for iovec imports, since iovec placement is the function's responsibility and callers shouldn't know much about it, drop the offset parameter from io_prep_reg_vec() and calculate it inside. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/08ed87ca4bbc06724373b6ce06f36b703fe60c4e.1741457480.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:14:27 -06:00
Pavel Begunkov	d291fb6520	io_uring: introduce io_prep_reg_iovec() iovecs that are turned into registered buffers are imported in a special way with an offset, so that later we can do an in place translation. Add a helper function taking care of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7de2ecb9ed5efc3c5cf320232236966da5ad4ccc.1741457480.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:14:27 -06:00
Pavel Begunkov	5027d02452	io_uring: unify STOP_MULTISHOT with IOU_OK IOU_OK means that the request ownership is now handed back to core io_uring and it has to complete it using the result provided in req->cqe. Same is true for multishot and IOU_STOP_MULTISHOT. Rename it into IOU_COMPLETE to avoid confusion and use for both modes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e6a5b2edb0eb9558acb1c8f1db38ac45fee95491.1741453534.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:14:18 -06:00
Pavel Begunkov	7a9dcb05f5	io_uring: return -EAGAIN to continue multishot Multishot errors can be mapped 1:1 to normal errors, but there are not identical. It leads to a peculiar situation where all multishot requests has to check in what context they're run and return different codes. Unify them starting with EAGAIN / IOU_ISSUE_SKIP_COMPLETE(EIOCBQUEUED) pair, which mean that core io_uring still owns the request and it should be retried. In case of multishot it's naturally just continues to poll, otherwise it might poll, use iowq or do any other kind of allowed blocking. Introduce IOU_RETRY aliased to -EAGAIN for that. Apart from obvious upsides, multishot can now also check for misuse of IOU_ISSUE_SKIP_COMPLETE. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/da117b79ce72ecc3ab488c744e29fae9ba54e23b.1741453534.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:14:18 -06:00
Linus Torvalds	d53276d292	io_uring-6.14-20250306 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfKQpsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgphWWD/0Qi6/RyM71So/U3kB6HkwgDd3w7JdeWcmB EZN9qaALS3r/3Q/qNQTT6/2dLNvjw/MWAlLV7R511x7f+9lMJp6KxTyTkmxXbkd6 yTMAm5E19vE1zeQZvHP9SzRKU5o2mh7yVnkNxHqqVbAzePtRGFGPLG/gQ4zCDq7I iqThGN4Hf5hZYM6QH6R25b/iljYYyr6kF8l96TLiY/QKp4ZTPR+dKjV4xfXFnVeF BjC/SzvQQqyHFDtpoXWeUemcxAE2tzPhAZ6w2p/56S2rkbfpxsWnwF+Ut4tteb0R pSBehvxfzkD9hBlADDRja0KK+WM95RBNJ0/Pf7dNBhqC3jhxNFH3FcBf66ot43fH 8Jjshq9Wi5Zfdr176VzZCgpSSVihluctNRTpgDCR81289WsdBnpkhPXE0NSfw5su iulG3kjKJwQmIO/k+S5dBvT9/X+FzjQCuV4abuVyeDVJo2IXKbzGbWqp3ueUNuic kprQwEhcyPYtiVXpwsssnEMYjVx0pUc9FIM72ed+5hoaH+BejuGFHNAkKUOhCSvi S3SM4gVHZE4soDU8wevRuU4Kgag+pW/z1EYltA7DKcTJMGx9gO+uFlUJTSXoOFzO 3kD4oXwnMc/CfHxAGWOWeCiANGaWnEkJYmwGG/0R80oG9Usji43ucErvs3p6zRTE iNFs/IfwPw== =v5sq -----END PGP SIGNATURE----- Merge tag 'io_uring-6.14-20250306' of git://git.kernel.dk/linux Pull io_uring fix from Jens Axboe: "A single fix for a regression introduced in the 6.14 merge window, causing stalls/hangs with IOPOLL reads or writes" * tag 'io_uring-6.14-20250306' of git://git.kernel.dk/linux: io_uring/rw: ensure reissue path is correctly handled for IOPOLL	2025-03-07 11:09:33 -10:00
Yue Haibing	30c970354c	io_uring: Remove unused declaration io_alloc_async_data() Commit `ef623a647f` ("io_uring: Move old async data allocation helper to header") leave behind this unused declaration. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20250305013454.3635021-1-yuehaibing@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-07 14:09:16 -07:00
Pavel Begunkov	0396ad3766	io_uring: cap cached iovec/bvec size Bvecs can be large, put an arbitrary limit on the max vector size it can cache. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/823055fa6628daa24bbc9cd77c2da87e9a1e1e32.1741362889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-07 13:41:08 -07:00
Pavel Begunkov	23371eac7d	io_uring/net: implement vectored reg bufs for zctx Add support for vectored registered buffers for send zc. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e484052875f862d2dca99f0f8c04407c1d51a1c1.1741362889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-07 13:41:08 -07:00
Pavel Begunkov	be7052a4b5	io_uring/net: convert to struct iou_vec Convert net.c to use struct iou_vec. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6437b57dabed44eca708c02e390529c7ed211c78.1741362889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-07 13:41:08 -07:00
Pavel Begunkov	9fcb349f5a	io_uring/net: pull vec alloc out of msghdr import I'll need more control over iovec management, move io_net_import_vec() out of io_msg_copy_hdr(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9600ea6300f620e65d39da481c22605ddc898850.1741362889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-07 13:41:08 -07:00
Pavel Begunkov	17523a821d	io_uring/net: combine msghdr copy Call the compat version from inside of io_msg_copy_hdr() and don't duplicate it in callers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/25795660f7b31f9273911c99f495d9c2b169ecda.1741362889.git.asml.silence@gmail.com [axboe: fixup msg pointer vs variable braino in io_msg_copy_hdr()] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-07 13:40:39 -07:00
Pavel Begunkov	835c4bdf95	io_uring/rw: defer reg buf vec import Import registered buffers for vectored reads and writes later at issue time as we now do for other fixed ops. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e8491c976e4ab83a4e3dc428e9fe7555e59583b8.1741362889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-07 09:07:29 -07:00
Pavel Begunkov	bdabba04bb	io_uring/rw: implement vectored registered rw Implement registered buffer vectored reads with new opcodes IORING_OP_WRITEV_FIXED and IORING_OP_READV_FIXED. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d7c89eb481e870f598edc91cc66ff4d1e4ae3788.1741362889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-07 09:07:29 -07:00
Pavel Begunkov	9ef4cbbcb4	io_uring: add infra for importing vectored reg buffers Add io_import_reg_vec(), which will be responsible for importing vectored registered buffers. The function might reallocate the vector, but it'd try to do the conversion in place first, which is why it's required of the user to pad the iovec to the right border of the cache. Overlapping also depends on struct iovec being larger than bvec, which is not the case on e.g. 32 bit architectures. Don't try to complicate this case and make sure vectors never overlap, it'll be improved later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/60bd246b1249476a6996407c1dbc38ef6febad14.1741362889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-07 09:07:29 -07:00
Pavel Begunkov	e1d4995909	io_uring: introduce struct iou_vec I need a convenient way to pass around and work with iovec+size pair, put them into a structure and makes use of it in rw.c Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d39fadafc9e9047b0a292e5be6db3cf2f48bb1f7.1741362889.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-07 09:07:29 -07:00
Jens Axboe	6e3da40ed6	Merge branch 'for-6.15/io_uring-epoll-wait' into for-6.15/io_uring-reg-vec * for-6.15/io_uring-epoll-wait: io_uring/epoll: add support for IORING_OP_EPOLL_WAIT io_uring/epoll: remove CONFIG_EPOLL guards eventpoll: add epoll_sendevents() helper eventpoll: abstract out ep_try_send_events() helper eventpoll: abstract out parameter sanity checking	2025-03-07 09:07:19 -07:00
Jens Axboe	78b6f6e9bf	Merge branch 'for-6.15/io_uring-rx-zc' into for-6.15/io_uring-reg-vec * for-6.15/io_uring-rx-zc: (80 commits) io_uring/zcrx: add selftest case for recvzc with read limit io_uring/zcrx: add a read limit to recvzc requests io_uring: add missing IORING_MAP_OFF_ZCRX_REGION in io_uring_mmap io_uring: Rename KConfig to Kconfig io_uring/zcrx: fix leaks on failed registration io_uring/zcrx: recheck ifq on shutdown io_uring/zcrx: add selftest net: add documentation for io_uring zcrx io_uring/zcrx: add copy fallback io_uring/zcrx: throttle receive requests io_uring/zcrx: set pp memory provider for an rx queue io_uring/zcrx: add io_recvzc request io_uring/zcrx: dma-map area for the device io_uring/zcrx: implement zerocopy receive pp memory provider io_uring/zcrx: grab a net device io_uring/zcrx: add io_zcrx_area io_uring/zcrx: add interface queue and refill queue net: add helpers for setting a memory provider on an rx queue net: page_pool: add memory provider helpers net: prepare for non devmem TCP memory providers ...	2025-03-07 09:07:11 -07:00
Jakub Kicinski	2525e16a2b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.14-rc6). Conflicts: net/ethtool/cabletest.c `2bcf4772e4` ("net: ethtool: try to protect all callback with netdev instance lock") `637399bf7e` ("net: ethtool: netlink: Allow NULL nlattrs when getting a phy_device") No Adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-03-06 13:03:35 -08:00
Jens Axboe	bcb0fda3c2	io_uring/rw: ensure reissue path is correctly handled for IOPOLL The IOPOLL path posts CQEs when the io_kiocb is marked as completed, so it cannot rely on the usual retry that non-IOPOLL requests do for read/write requests. If -EAGAIN is received and the request should be retried, go through the normal completion path and let the normal flush logic catch it and reissue it, like what is done for !IOPOLL reads or writes. Fixes: `d803d12394` ("io_uring/rw: handle -EAGAIN retry at IO completion time") Reported-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/io-uring/2b43ccfa-644d-4a09-8f8f-39ad71810f41@oracle.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-05 14:03:34 -07:00
Caleb Sander Mateos	0d83b8a9f1	io_uring: introduce io_cache_free() helper Add a helper function io_cache_free() that returns an allocation to a io_alloc_cache, falling back on kfree() if the io_alloc_cache is full. This is the inverse of io_cache_alloc(), which takes an allocation from an io_alloc_cache and falls back on kmalloc() if the cache is empty. Convert 4 callers to use the helper. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Suggested-by: Li Zetao <lizetao1@huawei.com> Link: https://lore.kernel.org/r/20250304194814.2346705-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-05 07:38:55 -07:00
Caleb Sander Mateos	fe21a4532e	io_uring/rsrc: skip NULL file/buffer checks in io_free_rsrc_node() io_rsrc_node's of type IORING_RSRC_FILE always have a file attached immediately after they are allocated. IORING_RSRC_BUFFER nodes won't be returned from io_sqe_buffer_register()/io_buffer_register_bvec() until they have a io_mapped_ubuf attached. So remove the checks for a NULL file/buffer in io_free_rsrc_node(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250228235916.670437-5-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-04 07:17:15 -07:00
Caleb Sander Mateos	6e5d321a08	io_uring/rsrc: avoid NULL node check on io_sqe_buffer_register() failure The done: label is only reachable if node is non-NULL. So don't bother checking, just call io_free_node(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250228235916.670437-4-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-04 07:17:15 -07:00
Caleb Sander Mateos	13f7f9686e	io_uring/rsrc: call io_free_node() on io_sqe_buffer_register() failure io_sqe_buffer_register() currently calls io_put_rsrc_node() if it fails to fully set up the io_rsrc_node. io_put_rsrc_node() is more involved than necessary, since we already know the reference count will reach 0 and no io_mapped_ubuf has been attached to the node yet. So just call io_free_node() to release the node's memory. This also avoids the need to temporarily set the node's buf pointer to NULL. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250228235916.670437-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-04 07:17:15 -07:00
Caleb Sander Mateos	a387b96d2a	io_uring/rsrc: free io_rsrc_node using kfree() io_rsrc_node_alloc() calls io_cache_alloc(), which uses kmalloc() to allocate the node. So it can be freed with kfree() instead of kvfree(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250228235916.670437-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-04 07:17:15 -07:00
Caleb Sander Mateos	6a53541829	io_uring/rsrc: split out io_free_node() helper Split the freeing of the io_rsrc_node from io_free_rsrc_node(), for use with nodes that haven't been fully initialized. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250228235916.670437-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-04 07:17:15 -07:00
Caleb Sander Mateos	a1967280a1	io_uring/rsrc: include io_uring_types.h in rsrc.h io_uring/rsrc.h uses several types from include/linux/io_uring_types.h. Include io_uring_types.h explicitly in rsrc.h to avoid depending on users of rsrc.h including io_uring_types.h first. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Li Zetao <lizetao1@huawei.com> Link: https://lore.kernel.org/r/20250301183612.937529-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-04 07:16:07 -07:00

1 2 3 4 5 ...

1722 Commits