Commit Graph

165 Commits

Author SHA1 Message Date
Ming Lei
039c878db7 io_uring/rsrc: add & apply io_req_assign_buf_node()
The following pattern becomes more and more:

+       io_req_assign_rsrc_node(&req->buf_node, node);
+       req->flags |= REQ_F_BUF_NODE;

so make it a helper, which is less fragile to use than above code, for
example, the BUF_NODE flag is even missed in current io_uring_cmd_prep().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241107110149.890530-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 15:24:33 -07:00
Jens Axboe
b6f58a3f4a io_uring: move struct io_kiocb from task_struct to io_uring_task
Rather than store the task_struct itself in struct io_kiocb, store
the io_uring specific task_struct. The life times are the same in terms
of io_uring, and this avoids doing some dereferences through the
task_struct. For the hot path of putting local task references, we can
deref req->tctx instead, which we'll need anyway in that function
regardless of whether it's local or remote references.

This is mostly straight forward, except the original task PF_EXITING
check needs a bit of tweaking. task_work is _always_ run from the
originating task, except in the fallback case, where it's run from a
kernel thread. Replace the potentially racy (in case of fallback work)
checks for req->task->flags with current->flags. It's either the still
the original task, in which case PF_EXITING will be sane, or it has
PF_KTHREAD set, in which case it's fallback work. Both cases should
prevent moving forward with the given request.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06 13:55:38 -07:00
Jens Axboe
6f94cbc29a io_uring/rsrc: split io_kiocb node type assignments
Currently the io_rsrc_node assignment in io_kiocb is an array of two
pointers, as two nodes may be assigned to a request - one file node,
and one buffer node. However, the buffer node can co-exist with the
provided buffers, as currently it's not supported to use both provided
and registered buffers at the same time.

This crucially brings struct io_kiocb down to 4 cache lines again, as
before it spilled into the 5th cacheline.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06 13:55:36 -07:00
hexue
01ee194d1a io_uring: add support for hybrid IOPOLL
A new hybrid poll is implemented on the io_uring layer. Once an IO is
issued, it will not poll immediately, but rather block first and re-run
before IO complete, then poll to reap IO. While this poll method could
be a suboptimal solution when running on a single thread, it offers
performance lower than regular polling but higher than IRQ, and CPU
utilization is also lower than polling.

To use hybrid polling, the ring must be setup with both the
IORING_SETUP_IOPOLL and IORING_SETUP_HYBRID)IOPOLL flags set. Hybrid
polling has the same restrictions as IOPOLL, in that commands must
explicitly support it.

Signed-off-by: hexue <xue01.he@samsung.com>
Link: https://lore.kernel.org/r/20241101091957.564220-2-xue01.he@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02 15:45:30 -06:00
Jens Axboe
b54a14041e io_uring/rsrc: add io_rsrc_node_lookup() helper
There are lots of spots open-coding this functionality, add a generic
helper that does the node lookup in a speculation safe way.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02 15:45:30 -06:00
Jens Axboe
3597f2786b io_uring/rsrc: unify file and buffer resource tables
For files, there's nr_user_files/file_table/file_data, and buffers have
nr_user_bufs/user_bufs/buf_data. There's no reason why file_table and
file_data can't be the same thing, and ditto for the buffer side. That
gets rid of more io_ring_ctx state that's in two spots rather than just
being in one spot, as it should be. Put all the registered file data in
one locations, and ditto on the buffer front.

This also avoids having both io_rsrc_data->nodes being an allocated
array, and ->user_bufs[] or ->file_table.nodes. There's no reason to
have this information duplicated. Keep it in one spot, io_rsrc_data,
along with how many resources are available.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02 15:45:23 -06:00
Jens Axboe
7029acd8a9 io_uring/rsrc: get rid of per-ring io_rsrc_node list
Work in progress, but get rid of the per-ring serialization of resource
nodes, like registered buffers and files. Main issue here is that one
node can otherwise hold up a bunch of other nodes from getting freed,
which is especially a problem for file resource nodes and networked
workloads where some descriptors may not see activity in a long time.

As an example, instantiate an io_uring ring fd and create a sparse
registered file table. Even 2 will do. Then create a socket and register
it as fixed file 0, F0. The number of open files in the app is now 5,
with 0/1/2 being the usual stdin/out/err, 3 being the ring fd, and 4
being the socket. Register this socket (eg "the listener") in slot 0 of
the registered file table. Now add an operation on the socket that uses
slot 0. Finally, loop N times, where each loop creates a new socket,
registers said socket as a file, then unregisters the socket, and
finally closes the socket. This is roughly similar to what a basic
accept loop would look like.

At the end of this loop, it's not unreasonable to expect that there
would still be 5 open files. Each socket created and registered in the
loop is also unregistered and closed. But since the listener socket
registered first still has references to its resource node due to still
being active, each subsequent socket unregistration is stuck behind it
for reclaim. Hence 5 + N files are still open at that point, where N is
awaiting the final put held up by the listener socket.

Rewrite the io_rsrc_node handling to NOT rely on serialization. Struct
io_kiocb now gets explicit resource nodes assigned, with each holding a
reference to the parent node. A parent node is either of type FILE or
BUFFER, which are the two types of nodes that exist. A request can have
two nodes assigned, if it's using both registered files and buffers.
Since request issue and task_work completion is both under the ring
private lock, no atomics are needed to handle these references. It's a
simple unlocked inc/dec. As before, the registered buffer or file table
each hold a reference as well to the registered nodes. Final put of the
node will remove the node and free the underlying resource, eg unmap the
buffer or put the file.

Outside of removing the stall in resource reclaim described above, it
has the following advantages:

1) It's a lot simpler than the previous scheme, and easier to follow.
   No need to specific quiesce handling anymore.

2) There are no resource node allocations in the fast path, all of that
   happens at resource registration time.

3) The structs related to resource handling can all get simplified
   quite a bit, like io_rsrc_node and io_rsrc_data. io_rsrc_put can
   go away completely.

4) Handling of resource tags is much simpler, and doesn't require
   persistent storage as it can simply get assigned up front at
   registration time. Just copy them in one-by-one at registration time
   and assign to the resource node.

The only real downside is that a request is now explicitly limited to
pinning 2 resources, one file and one buffer, where before just
assigning a resource node to a request would pin all of them. The upside
is that it's easier to follow now, as an individual resource is
explicitly referenced and assigned to the request.

With this in place, the above mentioned example will be using exactly 5
files at the end of the loop, not N.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02 15:44:18 -06:00
Jens Axboe
1d60d74e85 io_uring/rw: fix missing NOWAIT check for O_DIRECT start write
When io_uring starts a write, it'll call kiocb_start_write() to bump the
super block rwsem, preventing any freezes from happening while that
write is in-flight. The freeze side will grab that rwsem for writing,
excluding any new writers from happening and waiting for existing writes
to finish. But io_uring unconditionally uses kiocb_start_write(), which
will block if someone is currently attempting to freeze the mount point.
This causes a deadlock where freeze is waiting for previous writes to
complete, but the previous writes cannot complete, as the task that is
supposed to complete them is blocked waiting on starting a new write.
This results in the following stuck trace showing that dependency with
the write blocked starting a new write:

task:fio             state:D stack:0     pid:886   tgid:886   ppid:876
Call trace:
 __switch_to+0x1d8/0x348
 __schedule+0x8e8/0x2248
 schedule+0x110/0x3f0
 percpu_rwsem_wait+0x1e8/0x3f8
 __percpu_down_read+0xe8/0x500
 io_write+0xbb8/0xff8
 io_issue_sqe+0x10c/0x1020
 io_submit_sqes+0x614/0x2110
 __arm64_sys_io_uring_enter+0x524/0x1038
 invoke_syscall+0x74/0x268
 el0_svc_common.constprop.0+0x160/0x238
 do_el0_svc+0x44/0x60
 el0_svc+0x44/0xb0
 el0t_64_sync_handler+0x118/0x128
 el0t_64_sync+0x168/0x170
INFO: task fsfreeze:7364 blocked for more than 15 seconds.
      Not tainted 6.12.0-rc5-00063-g76aaf945701c #7963

with the attempting freezer stuck trying to grab the rwsem:

task:fsfreeze        state:D stack:0     pid:7364  tgid:7364  ppid:995
Call trace:
 __switch_to+0x1d8/0x348
 __schedule+0x8e8/0x2248
 schedule+0x110/0x3f0
 percpu_down_write+0x2b0/0x680
 freeze_super+0x248/0x8a8
 do_vfs_ioctl+0x149c/0x1b18
 __arm64_sys_ioctl+0xd0/0x1a0
 invoke_syscall+0x74/0x268
 el0_svc_common.constprop.0+0x160/0x238
 do_el0_svc+0x44/0x60
 el0_svc+0x44/0xb0
 el0t_64_sync_handler+0x118/0x128
 el0t_64_sync+0x168/0x170

Fix this by having the io_uring side honor IOCB_NOWAIT, and only attempt a
blocking grab of the super block rwsem if it isn't set. For normal issue
where IOCB_NOWAIT would always be set, this returns -EAGAIN which will
have io_uring core issue a blocking attempt of the write. That will in
turn also get completions run, ensuring forward progress.

Since freezing requires CAP_SYS_ADMIN in the first place, this isn't
something that can be triggered by a regular user.

Cc: stable@vger.kernel.org # 5.10+
Reported-by: Peter Mann <peter.mann@sh.cz>
Link: https://lore.kernel.org/io-uring/38c94aec-81c9-4f62-b44e-1d87f5597644@sh.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-31 08:21:02 -06:00
Jens Axboe
1caa00d6b6 io_uring: remove 'issue_flags' argument for io_req_set_rsrc_node()
All callers already hold the ring lock and hence are passing '0',
remove the argument and the conditional locking that it controlled.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:27 -06:00
Jens Axboe
003f82b58c io_uring/rw: get rid of using req->imu
It's assigned in the same function that it's being used, get rid of
it. A local variable will do just fine.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 13:43:27 -06:00
Jens Axboe
ae6a888a43 io_uring/rw: fix wrong NOWAIT check in io_rw_init_file()
A previous commit improved how !FMODE_NOWAIT is dealt with, but
inadvertently negated a check whilst doing so. This caused -EAGAIN to be
returned from reading files with O_NONBLOCK set. Fix up the check for
REQ_F_SUPPORT_NOWAIT.

Reported-by: Julian Orth <ju.orth@gmail.com>
Link: https://github.com/axboe/liburing/issues/1270
Fixes: f7c9134385 ("io_uring/rw: allow pollable non-blocking attempts for !FMODE_NOWAIT")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-19 09:25:45 -06:00
Jens Axboe
f7c9134385 io_uring/rw: allow pollable non-blocking attempts for !FMODE_NOWAIT
The checking for whether or not io_uring can do a non-blocking read or
write attempt is gated on FMODE_NOWAIT. However, if the file is
pollable, it's feasible to just check if it's currently in a state in
which it can sanely receive or send _some_ data.

This avoids unnecessary io-wq punts, and repeated worthless retries
before doing that punt, by assuming that some data can get delivered
or received if poll tells us that is true. It also allows multishot
reads to properly work with these types of files, enabling a bit of
a cleanup of the logic that:

c9d952b910 ("io_uring/rw: fix cflags posting for single issue multishot read")

had to put in place.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-06 20:58:53 -06:00
Jens Axboe
c9d952b910 io_uring/rw: fix cflags posting for single issue multishot read
If multishot gets disabled, and hence the request will get terminated
rather than persist for more iterations, then posting the CQE with the
right cflags is still important. Most notably, the buffer reference
needs to be included.

Refactor the return of __io_read() a bit, so that the provided buffer
is always put correctly, and hence returned to the application.

Reported-by: Sharon Rosner <Sharon Rosner>
Link: https://github.com/axboe/liburing/issues/1257
Cc: stable@vger.kernel.org
Fixes: 2a975d426c ("io_uring/rw: don't allow multishot reads without NOWAIT support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-06 08:05:47 -06:00
Jens Axboe
90bfb28d5f io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common()
A recent change ensured that the necessary -EOPNOTSUPP -> -EAGAIN
transformation happens inline on both the reader and writer side,
and hence there's no need to check for both of these anymore on
the completion handler side.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10 09:34:44 -06:00
Jens Axboe
c0a9d496e0 io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN
Some file systems, ocfs2 in this case, will return -EOPNOTSUPP for
an IOCB_NOWAIT read/write attempt. While this can be argued to be
correct, the usual return value for something that requires blocking
issue is -EAGAIN.

A refactoring io_uring commit dropped calling kiocb_done() for
negative return values, which is otherwise where we already do that
transformation. To ensure we catch it in both spots, check it in
__io_read() itself as well.

Reported-by: Robert Sander <r.sander@heinlein-support.de>
Link: https://fosstodon.org/@gurubert@mastodon.gurubert.de/113112431889638440
Cc: stable@vger.kernel.org
Fixes: a08d195b58 ("io_uring/rw: split io_read() into a helper")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10 09:34:41 -06:00
Jens Axboe
6733e678ba io_uring/kbuf: pass in 'len' argument for buffer commit
In preparation for needing the consumed length, pass in the length being
completed. Unused right now, but will be used when it is possible to
partially consume a buffer.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29 08:44:51 -06:00
Prasad Singamsetty
c34fc6f26a fs: Initial atomic write support
An atomic write is a write issued with torn-write protection, meaning
that for a power failure or any other hardware failure, all or none of the
data from the write will be stored, but never a mix of old and new data.

Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
write is to be issued with torn-write prevention, according to special
alignment and length rules.

For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for
iocb->ki_flags field to indicate the same.

A call to statx will give the relevant atomic write info for a file:
- atomic_write_unit_min
- atomic_write_unit_max
- atomic_write_segments_max

Both min and max values must be a power-of-2.

Applications can avail of atomic write feature by ensuring that the total
length of a write is a power-of-2 in size and also sized between
atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
must ensure that the write is at a naturally-aligned offset in the file
wrt the total write length. The value in atomic_write_segments_max
indicates the upper limit for IOV_ITER iovcnt.

Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
flag set will have RWF_ATOMIC rejected and not just ignored.

Add a type argument to kiocb_set_rw_flags() to allows reads which have
RWF_ATOMIC set to be rejected.

Helper function generic_atomic_write_valid() can be used by FSes to verify
compliant writes. There we check for iov_iter type is for ubuf, which
implies iovcnt==1 for pwritev2(), which is an initial restriction for
atomic_write_segments_max. Initially the only user will be bdev file
operations write handler. We will rely on the block BIO submission path to
ensure write sizes are compliant for the bdev, so we don't need to check
atomic writes sizes yet.

Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
jpg: merge into single patch and much rewrite
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 15:19:17 -06:00
Linus Torvalds
b6394d6f71 Assorted commits that had missed the last merge window...
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkzp/gAKCRBZ7Krx/gZQ
 63KFAQCsKv3XdcF+2BO+QuwPvR6eAvDxFjrFEcQFyyOXgFVLaAD/UMM0HcEFWxBb
 PCPvyKVP22wF9PbodkrKJn8DRdtRZwM=
 =jvWv
 -----END PGP SIGNATURE-----

Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc vfs updates from Al Viro:
 "Assorted commits that had missed the last merge window..."

* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  remove call_{read,write}_iter() functions
  do_dentry_open(): kill inode argument
  kernel_file_open(): get rid of inode argument
  get_file_rcu(): no need to check for NULL separately
  fd_is_open(): move to fs/file.c
  close_on_exec(): pass files_struct instead of fdtable
2024-05-21 13:11:44 -07:00
Linus Torvalds
9961a78594 for-6.10/io_uring-20240511
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmY/YdYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnmVEADBq8QT9Oa3HTIONHwxjmGMOalr7PSrBP89
 S6Inv/l+3xDlyolyLh1HIXUC84iS9Ihi2pNC3dZct4fNcpA99H0CFaHDGwZ5rVri
 MrFaubZAps1qSzeypqEq3zWGKVUoaYWaOKhuOjye5Ei2tKymbguhDKl1WiKibD21
 E9qOYbhSUFdub/xtx9Rv4BS05QW5bHZ2Y/tTFqB8MY4JUsdb9g/deVZkyGUQYRSd
 40mDallRldjQQTQ8iU4H6/ORdGIN/90aLPbmzMdFtQcymnmRyid3rOEwhwWYe4NO
 ljnI8m1SJQilZz1d5oHBXBB5QubVptY1JWxbk8GQCSmOU5wrCq+ARCJXUtBXwniJ
 K4VFsGm9MkZcc5vsIwIzvsrk8DODla6EVo/jyDy8iFceZcNWfVxdwa5NS67V/6QT
 macbF785XDsmA5E4UjslbZqU047w+A5N1yazcZWzMk0coJDeB8AtsA1/C2WZOm8p
 HVoiAzsqt81hvPItnjCyZluL/YW+BKeOTnq04QbpQKcJpZBzszO4ZLtuD+IXkE69
 8ZZPGFPnPS4ZMQojKkwsBr+Yo65S18oBDkib36mr2lsdnoWTpGq47C7ScUDBbqGm
 iI7U8tYMnVVkQQHVVmGI4KOr5/4lxxp8398kqCaxfW3D5BQhbtUOF/OBjBHj1ZSV
 9aZx87CyhA==
 =DwAV
 -----END PGP SIGNATURE-----

Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Greatly improve send zerocopy performance, by enabling coalescing of
   sent buffers.

   MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the
   io_uring side did not. In local testing, the crossover point for send
   zerocopy being faster is now around 3000 byte packets, and it
   performs better than the sync syscall variants as well.

   This feature relies on a shared branch with net-next, which was
   pulled into both branches.

 - Unification of how async preparation is done across opcodes.

   Previously, opcodes that required extra memory for async retry would
   allocate that as needed, using on-stack state until that was the
   case. If async retry was needed, the on-stack state was adjusted
   appropriately for a retry and then copied to the allocated memory.

   This led to some fragile and ugly code, particularly for read/write
   handling, and made storage retries more difficult than they needed to
   be. Allocate the memory upfront, as it's cheap from our pools, and
   use that state consistently both initially and also from the retry
   side.

 - Move away from using remap_pfn_range() for mapping the rings.

   This is really not the right interface to use and can cause lifetime
   issues or leaks. Additionally, it means the ring sq/cq arrays need to
   be physically contigious, which can cause problems in production with
   larger rings when services are restarted, as memory can be very
   fragmented at that point.

   Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply
   the same treatment to mapped ring provided buffers. This also helps
   unify the code we have dealing with allocating and mapping memory.

   Hard to see in the diffstat as we're adding a few features as well,
   but this kills about ~400 lines of code from the codebase as well.

 - Add support for bundles for send/recv.

   When used with provided buffers, bundles support sending or receiving
   more than one buffer at the time, improving the efficiency by only
   needing to call into the networking stack once for multiple sends or
   receives.

 - Tweaks for our accept operations, supporting both a DONTWAIT flag for
   skipping poll arm and retry if we can, and a POLLFIRST flag that the
   application can use to skip the initial accept attempt and rely
   purely on poll for triggering the operation. Both of these have
   identical flags on the receive side already.

 - Make the task_work ctx locking unconditional.

   We had various code paths here that would do a mix of lock/trylock
   and set the task_work state to whether or not it was locked. All of
   that goes away, we lock it unconditionally and get rid of the state
   flag indicating whether it's locked or not.

   The state struct still exists as an empty type, can go away in the
   future.

 - Add support for specifying NOP completion values, allowing it to be
   used for error handling testing.

 - Use set/test bit for io-wq worker flags. Not strictly needed, but
   also doesn't hurt and helps silence a KCSAN warning.

 - Cleanups for io-wq locking and work assignments, closing a tiny race
   where cancelations would not be able to find the work item reliably.

 - Misc fixes, cleanups, and improvements

* tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits)
  io_uring: support to inject result for NOP
  io_uring: fail NOP if non-zero op flags is passed in
  io_uring/net: add IORING_ACCEPT_POLL_FIRST flag
  io_uring/net: add IORING_ACCEPT_DONTWAIT flag
  io_uring/filetable: don't unnecessarily clear/reset bitmap
  io_uring/io-wq: Use set_bit() and test_bit() at worker->flags
  io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring
  io_uring: Require zeroed sqe->len on provided-buffers send
  io_uring/notif: disable LAZY_WAKE for linked notifs
  io_uring/net: fix sendzc lazy wake polling
  io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it
  io_uring/rw: reinstate thread check for retries
  io_uring/notif: implement notification stacking
  io_uring/notif: simplify io_notif_flush()
  net: add callback for setting a ubuf_info to skb
  net: extend ubuf_info callback to ops structure
  io_uring/net: support bundles for recv
  io_uring/net: support bundles for send
  io_uring/kbuf: add helpers for getting/peeking multiple buffers
  io_uring/net: add provided buffer support for IORING_OP_SEND
  ...
2024-05-13 12:48:06 -07:00
Linus Torvalds
1b0aabcc9a vfs-6.10.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3HuwAKCRCRxhvAZXjc
 orYvAQCZOr68uJaEaXAArYTdnMdQ6HIzG+FVlwrqtrhz0BV07wEAqgmtSR9XKh+L
 0+DNepg4R8PZOHH371eSSsLNRCUCkAs=
 =SVsU
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That
     means we now have six free FMODE_* bits in total (but bit #6
     already got used for FMODE_WRITE_RESTRICTED)

   - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)

   - Add fd_raw cleanup class so we can make use of automatic cleanup
     provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well

   - Optimize seq_puts()

   - Simplify __seq_puts()

   - Add new anon_inode_getfile_fmode() api to allow specifying f_mode
     instead of open-coding it in multiple places

   - Annotate struct file_handle with __counted_by() and use
     struct_size()

   - Warn in get_file() whether f_count resurrection from zero is
     attempted (epoll/drm discussion)

   - Folio-sophize aio

   - Export the subvolume id in statx() for both btrfs and bcachefs

   - Relax linkat(AT_EMPTY_PATH) requirements

   - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors
     for dup*() equality replacing kcmp()

  Cleanups:

   - Compile out swapfile inode checks when swap isn't enabled

   - Use (1 << n) notation for FMODE_* bitshifts for clarity

   - Remove redundant variable assignment in fs/direct-io

   - Cleanup uses of strncpy in orangefs

   - Speed up and cleanup writeback

   - Move fsparam_string_empty() helper into header since it's currently
     open-coded in multiple places

   - Add kernel-doc comments to proc_create_net_data_write()

   - Don't needlessly read dentry->d_flags twice

  Fixes:

   - Fix out-of-range warning in nilfs2

   - Fix ecryptfs overflow due to wrong encryption packet size
     calculation

   - Fix overly long line in xfs file_operations (follow-up to FMODE_*
     cleanup)

   - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up
     to FMODE_* cleanup)

   - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_*
     cleanup)

   - Fix stable offset api to prevent endless loops

   - Fix afs file server rotations

   - Prevent xattr node from overflowing the eraseblock in jffs2

   - Move fdinfo PTRACE_MODE_READ procfs check into the .permission()
     operation instead of .open() operation since this caused userspace
     regressions"

* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
  afs: Fix fileserver rotation getting stuck
  selftests: add F_DUPDFD_QUERY selftests
  fcntl: add F_DUPFD_QUERY fcntl()
  file: add fd_raw cleanup class
  fs: WARN when f_count resurrection is attempted
  seq_file: Simplify __seq_puts()
  seq_file: Optimize seq_puts()
  proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
  fs: Create anon_inode_getfile_fmode()
  xfs: don't call xfs_file_open from xfs_dir_open
  xfs: drop fop_flags for directories
  xfs: fix overly long line in the file_operations
  shmem: Fix shmem_rename2()
  libfs: Add simple_offset_rename() API
  libfs: Fix simple_offset_rename_exchange()
  jffs2: prevent xattr node from overflowing the eraseblock
  vfs, swap: compile out IS_SWAPFILE() on swapless configs
  vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
  fs/direct-io: remove redundant assignment to variable retval
  fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
  ...
2024-05-13 11:40:06 -07:00
Jens Axboe
039a2e800b io_uring/rw: reinstate thread check for retries
Allowing retries for everything is arguably the right thing to do, now
that every command type is async read from the start. But it's exposed a
few issues around missing check for a retry (which cca6571381 exposed),
and the fixup commit for that isn't necessarily 100% sound in terms of
iov_iter state.

For now, just revert these two commits. This unfortunately then re-opens
the fact that -EAGAIN can get bubbled to userspace for some cases where
the kernel very well could just sanely retry them. But until we have all
the conditions covered around that, we cannot safely enable that.

This reverts commit df604d2ad4.
This reverts commit cca6571381.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-25 09:04:32 -06:00
Jens Axboe
df604d2ad4 io_uring/rw: ensure retry condition isn't lost
A previous commit removed the checking on whether or not it was possible
to retry a request, since it's now possible to retry any of them. This
would previously have caused the request to have been ended with an error,
but now the retry condition can simply get lost instead.

Cleanup the retry handling and always just punt it to task_work, which
will queue it with io-wq appropriately.

Reported-by: Changhui Zhong <czhong@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Fixes: cca6571381 ("io_uring/rw: cleanup retry path")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 09:23:55 -06:00
Miklos Szeredi
7c98f7cb8f remove call_{read,write}_iter() functions
These have no clear purpose.  This is effectively a revert of commit
bb7462b6fd ("vfs: use helpers for calling f_op->{read,write}_iter()").

The patch was created with the help of a coccinelle script.

Fixes: bb7462b6fd ("vfs: use helpers for calling f_op->{read,write}_iter()")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-04-15 16:03:25 -04:00
Jens Axboe
414d0f45c3 io_uring/alloc_cache: switch to array based caching
Currently lists are being used to manage this, but best practice is
usually to have these in an array instead as that it cheaper to manage.

Outside of that detail, games are also played with KASAN as the list
is inside the cached entry itself.

Finally, all users of this need a struct io_cache_entry embedded in
their struct, which is union'ized with something else in there that
isn't used across the free -> realloc cycle.

Get rid of all of that, and simply have it be an array. This will not
change the memory used, as we're just trading an 8-byte member entry
for the per-elem array size.

This reduces the overhead of the recycled allocations, and it reduces
the amount of code code needed to support recycling to about half of
what it currently is.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
d6f911a6b2 io_uring/rw: add iovec recycling
Let the io_async_rw hold on to the iovec and reuse it, rather than always
allocate and free them.

Also enables KASAN for the iovec entries, so that reuse can be detected
even while they are in the cache.

While doing so, shrink io_async_rw by getting rid of the bigger embedded
fast iovec. Since iovecs are being recycled now, shrink it from 8 to 1.
This reduces the io_async_rw size from 264 to 160 bytes, a 40% reduction.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
cca6571381 io_uring/rw: cleanup retry path
We no longer need to gate a potential retry on whether or not the
context matches our original task, as all read/write operations have
been fully prepared upfront. This means there's never any re-import
needed, and hence we can always retry requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
0d10bd77a1 io_uring: get rid of struct io_rw_state
A separate state struct is not needed anymore, just fold it in with
io_async_rw.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Jens Axboe
a9165b83c1 io_uring/rw: always setup io_async_rw for read/write requests
read/write requests try to put everything on the stack, and then alloc
and copy if a retry is needed. This necessitates a bunch of nasty code
that deals with intermediate state.

Get rid of this, and have the prep side setup everything that is needed
upfront, which greatly simplifies the opcode handlers.

This includes adding an alloc cache for io_async_rw, to make it cheap
to handle.

In terms of cost, this should be basically free and transparent. For
the worst case of {READ,WRITE}_FIXED which didn't need it before,
performance is unaffected in the normal peak workload that is being
used to test that. Still runs at 122M IOPS.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:25 -06:00
Pavel Begunkov
e5c12945be io_uring: refactor io_fill_cqe_req_aux
The restriction on multishot execution context disallowing io-wq is
driven by rules of io_fill_cqe_req_aux(), it should only be called in
the master task context, either from the syscall path or in task_work.
Since task_work now always takes the ctx lock implying
IO_URING_F_COMPLETE_DEFER, we can just assume that the function is
always called with its defer argument set to true.

Kill the argument. Also rename the function for more consistency as
"fill" in CQE related functions was usually meant for raw interfaces
only copying data into the CQ without any locking, waking the user
and other accounting "post" functions take care of.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/93423d106c33116c7d06bf277f651aa68b427328.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
8e5b3b89ec io_uring: remove struct io_tw_state::locked
ctx is always locked for task_work now, so get rid of struct
io_tw_state::locked. Note I'm stopping one step before removing
io_tw_state altogether, which is not empty, because it still serves the
purpose of indicating which function is a tw callback and forcing users
not to invoke them carelessly out of a wrong context. The removal can
always be done later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/e95e1ea116d0bfa54b656076e6a977bc221392a4.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Pavel Begunkov
6e6b8c6212 io_uring/rw: avoid punting to io-wq directly
kiocb_done() should care to specifically redirecting requests to io-wq.
Remove the hopping to tw to then queue an io-wq, return -EAGAIN and let
the core code io_uring handle offloading.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/413564e550fe23744a970e1783dfa566291b0e6f.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Christian Brauner
210a03c9d5
fs: claw back a few FMODE_* bits
There's a bunch of flags that are purely based on what the file
operations support while also never being conditionally set or unset.
IOW, they're not subject to change for individual files. Imho, such
flags don't need to live in f_mode they might as well live in the fops
structs itself. And the fops struct already has that lonely
mmap_supported_flags member. We might as well turn that into a generic
fop_flags member and move a few flags from FMODE_* space into FOP_*
space. That gets us four FMODE_* bits back and the ability for new
static flags that are about file ops to not have to live in FMODE_*
space but in their own FOP_* space. It's not the most beautiful thing
ever but it gets the job done. Yes, there'll be an additional pointer
chase but hopefully that won't matter for these flags.

I suspect there's a few more we can move into there and that we can also
redirect a bunch of new flag suggestions that follow this pattern into
the fop_flags field instead of f_mode.

Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-07 13:49:02 +02:00
Jens Axboe
2a975d426c io_uring/rw: don't allow multishot reads without NOWAIT support
Supporting multishot reads requires support for NOWAIT, as the
alternative would be always having io-wq execute the work item whenever
the poll readiness triggered. Any fast file type will have NOWAIT
support (eg it understands both O_NONBLOCK and IOCB_NOWAIT). If the
given file type does not, then simply resort to single shot execution.

Cc: stable@vger.kernel.org
Fixes: fc68fcda04 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-01 11:41:50 -06:00
Jens Axboe
0a3737db84 io_uring/rw: return IOU_ISSUE_SKIP_COMPLETE for multishot retry
If read multishot is being invoked from the poll retry handler, then we
should return IOU_ISSUE_SKIP_COMPLETE rather than -EAGAIN. If not, then
a CQE will be posted with -EAGAIN rather than triggering the retry when
the file is flagged as readable again.

Cc: stable@vger.kernel.org
Reported-by: Sargun Dhillon <sargun@meta.com>
Fixes: fc68fcda04 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-12 08:29:47 -06:00
Pavel Begunkov
e0e4ab52d1 io_uring: refactor DEFER_TASKRUN multishot checks
We disallow DEFER_TASKRUN multishots from running by io-wq, which is
checked by individual opcodes in the issue path. We can consolidate all
it in io_wq_submit_work() at the same time moving the checks out of the
hot path.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e492f0f11588bb5aa11d7d24e6f53b7c7628afdb.1709905727.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:58:23 -07:00
Jens Axboe
186daf2385 io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE
We only use the flag for this purpose, so rename it accordingly. This
further prevents various other use cases of it, keeping it clean and
consistent. Then we can also check it in one spot, when it's being
attempted recycled, and remove some dead code in io_kbuf_recycle_ring().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 07:56:27 -07:00
Pavel Begunkov
70581dcd06 io_uring: fix mshot read defer taskrun cqe posting
We can't post CQEs from io-wq with DEFER_TASKRUN set, normal completions
are handled but aux should be explicitly disallowed by opcode handlers.

Cc: stable@vger.kernel.org
Fixes: fc68fcda04 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6fb7cba6f5366da25f4d3eb95273f062309d97fa.1709740837.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07 06:30:33 -07:00
Kuniyuki Iwashima
3fb1764c6b io_uring: Don't include af_unix.h.
Changes to AF_UNIX trigger rebuild of io_uring, but io_uring does
not use AF_UNIX anymore.

Let's not include af_unix.h and instead include necessary headers.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240212234236.63714-1-kuniyu@amazon.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-12 19:02:11 -07:00
Jens Axboe
949249e25f io_uring/rw: remove dead file == NULL check
Any read/write opcode has needs_file == true, which means that we
would've failed the request long before reaching the issue stage if we
didn't successfully assign a file. This check has been dead forever,
and is really a leftover from generic code.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08 13:27:06 -07:00
Jens Axboe
95041b93e9 io_uring: add io_file_can_poll() helper
This adds a flag to avoid dipping dereferencing file and then f_op to
figure out if the file has a poll handler defined or not. We generally
call this at least twice for networked workloads, and if using ring
provided buffers, we do it on every buffer selection. Particularly the
latter is troublesome, as it's otherwise a very fast operation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08 13:27:06 -07:00
Jens Axboe
c79f52f065 io_uring/rw: ensure poll based multishot read retries appropriately
io_read_mshot() always relies on poll triggering retries, and this works
fine as long as we do a retry per size of the buffer being read. The
buffer size is given by the size of the buffer(s) in the given buffer
group ID.

But if we're reading less than what is available, then we don't always
get to read everything that is available. For example, if the buffers
available are 32 bytes and we have 64 bytes to read, then we'll
correctly read the first 32 bytes and then wait for another poll trigger
before we attempt the next read. This next poll trigger may never
happen, in which case we just sit forever and never make progress, or it
may trigger at some point in the future, and now we're just delivering
the available data much later than we should have.

io_read_mshot() could do retries itself, but that is wasteful as we'll
be going through all of __io_read() again, and most likely in vain.
Rather than do that, bump our poll reference count and have
io_poll_check_events() do one more loop and check with vfs_poll() if we
have more data to read. If we do, io_read_mshot() will get invoked again
directly and we'll read the next chunk.

io_poll_multishot_retry() must only get called from inside
io_poll_issue(), which is our multishot retry handler, as we know we
already "own" the request at this point.

Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/1041
Fixes: fc68fcda04 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-28 20:37:11 -07:00
Jens Axboe
fe80eb15de io_uring/rw: cleanup io_rw_done()
This originally came from the aio side, and it's laid out rather oddly.
The common case here is that we either get -EIOCBQUEUED from submitting
an async request, or that we complete the request correctly with the
given number of bytes. Handling the odd internal restart error codes
is not a common operation.

Lay it out a bit more optimally that better explains the normal flow,
and switch to avoiding the indirect call completely as this is our
kiocb and we know the completion handler can only be one of two
possible variants. While at it, move it to where it belongs in the
file, with fellow end IO helpers.

Outside of being easier to read, this also reduces the text size of the
function by 24 bytes for me on arm64.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-10 11:46:48 -07:00
Jens Axboe
0a535eddbe io_uring/rw: ensure io->bytes_done is always initialized
If IOSQE_ASYNC is set and we fail importing an iovec for a readv or
writev request, then we leave ->bytes_done uninitialized and hence the
eventual failure CQE posted can potentially have a random res value
rather than the expected -EINVAL.

Setup ->bytes_done before potentially failing, so we have a consistent
value if we fail the request early.

Cc: stable@vger.kernel.org
Reported-by: xingwei lee <xrivendell7@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-21 08:49:18 -07:00
Pavel Begunkov
b66509b849 io_uring: split out cmd api into a separate header
linux/io_uring.h is slowly becoming a rubbish bin where we put
anything exposed to other subsystems. For instance, the task exit
hooks and io_uring cmd infra are completely orthogonal and don't need
each other's definitions. Start cleaning it up by splitting out all
command bits into a new header file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7ec50bae6e21f371d3850796e716917fc141225a.1701391955.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-12 07:42:52 -07:00
Dylan Yudaken
e53759298a io_uring: do not clamp read length for multishot read
When doing a multishot read, the code path reuses the old read
paths. However this breaks an assumption built into those paths,
namely that struct io_rw::len is available for reuse by __io_import_iovec.

For multishot this results in len being set for the first receive
call, and then subsequent calls are clamped to that buffer length
incorrectly.

Instead keep len as zero after recycling buffers, to reuse the full
buffer size of the next selected buffer.

Fixes: fc68fcda04 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Dylan Yudaken <dyudaken@gmail.com>
Link: https://lore.kernel.org/r/20231106203909.197089-4-dyudaken@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-06 13:41:58 -07:00
Dylan Yudaken
49fbe99486 io_uring: do not allow multishot read to set addr or len
For addr: this field is not used, since buffer select is forced.
But by forcing it to be zero it leaves open future uses of the field.

len is actually usable, you could imagine that you want to receive
multishot up to a certain length.
However right now this is not how it is implemented, and it seems
safer to force this to be zero.

Fixes: fc68fcda04 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Dylan Yudaken <dyudaken@gmail.com>
Link: https://lore.kernel.org/r/20231106203909.197089-3-dyudaken@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-06 13:41:58 -07:00
Jens Axboe
f688944cfb io_uring/rw: add separate prep handler for fixed read/write
Rather than sprinkle opcode checks in the generic read/write prep handler,
have a separate prep handler for the vectored readv/writev operation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-06 07:43:16 -07:00
Jens Axboe
0e984ec88d io_uring/rw: add separate prep handler for readv/writev
Rather than sprinkle opcode checks in the generic read/write prep handler,
have a separate prep handler for the vectored readv/writev operation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-06 07:41:17 -07:00
Jens Axboe
0df96fb71a io_uring/rw: don't attempt to allocate async data if opcode doesn't need it
The new read multishot method doesn't need to allocate async data ever,
as it doesn't do vectored IO and it must only be used with provided
buffers. While it doesn't have ->prep_async() set, it also sets
->async_size to 0, which is different from any other read/write type we
otherwise support.

If it's used on a file type that isn't pollable, we do try and allocate
this async data, and then try and use that data. But since we passed in
a size of 0 for the data, we get a NULL back on data allocation. We then
proceed to dereference that to copy state, and that obviously won't end
well.

Add a check in io_setup_async_rw() for this condition, and avoid copying
state. Also add a check for whether or not buffer selection is specified
in prep while at it.

Fixes: fc68fcda04 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218101
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-03 09:31:21 -06:00
Linus Torvalds
ffa059b262 for-6.7/io_uring-2023-10-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmU/vcMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmnaD/4spcYjSSdeHVh3J60QuWjMYOM//E/BNb6e
 3I2L6Is2RLuDGhVhHKfRfkJQy1UPKYKu5TZewUnwC3bz12kWGc8CZBF4WgM0159T
 0uBm2ZtsstSCONA16tQdmE7gt5MJ6KFO0rsubm/AxNWxTnpyrbrX512TkkJTBrfC
 ZluAKxGviZOcrl9ROoVMc/FeMmaKVcT79mDuLp0y+Pmb2KO3y9bWTs/wpmEPNVro
 P7n/j9B4dBQC3Saij/wCdcsodkHUaCfCnRK3g34JKeACb+Kclg7QSzinb3TZjeEw
 o98l1XMiejkPJDIxYmWPTmdzqu6AUnT3Geq6eL463/PUOjgkzet6idYfk6XQgRyz
 AhFzA6KruMJ+IhOs974KtmDJj+7LbGkMUpW0kEqKWpXFEO2t+yG6Ue4cdC2FtsqV
 m/ojTTeejVqJ1RLng9IqVMT/X6sqpTtBOikNIJeWyDZQGpOOBxkG9qyoYxNQTOAr
 280UwcFMgsRDQMpi9uIsc7uE7QvN/RYL9nqm49bxJTRm/sRsABPb71yWcbrHSAjh
 y2tprYqG0V4qK7ogCiqDt8qdq/nZS6d1mN/th33yGAHtWEStTyFKNuYmPOrzLtWb
 tvnmYGA7YxcpSMEPHQbYG5TlmoWoTlzUlwJ1OWGzqdlPw7USCwjFfTZVJuKm6wkR
 u0uTkYhn4A==
 =okQ8
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "This contains the core io_uring updates, of which there are not many,
  and adds support for using WAITID through io_uring and hence not
  needing to block on these kinds of events.

  Outside of that, tweaks to the legacy provided buffer handling and
  some cleanups related to cancelations for uring_cmd support"

* tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linux:
  io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups
  io_uring/kbuf: Use slab for struct io_buffer objects
  io_uring/kbuf: Allow the full buffer id space for provided buffers
  io_uring/kbuf: Fix check of BID wrapping in provided buffers
  io_uring/rsrc: cleanup io_pin_pages()
  io_uring: cancelable uring_cmd
  io_uring: retain top 8bits of uring_cmd flags for kernel internal use
  io_uring: add IORING_OP_WAITID support
  exit: add internal include file with helpers
  exit: add kernel_waitid_prepare() helper
  exit: move core of do_wait() into helper
  exit: abstract out should_wake helper for child_wait_callback()
  io_uring/rw: add support for IORING_OP_READ_MULTISHOT
  io_uring/rw: mark readv/writev as vectored in the opcode definition
  io_uring/rw: split io_read() into a helper
2023-11-01 11:09:19 -10:00
Linus Torvalds
d1b0949f23 assorted fixes all over the place
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZTxSwQAKCRBZ7Krx/gZQ
 6zadAP9o/724KPDCY3ybgwKyEQ1UNjHTriFRBeoF3o2q0WgidwEA+/xS0Xk3i25w
 xnSZO/8My1edE1IcK/JDwewH/J+4Kw0=
 =N/Lv
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc filesystem fixes from Al Viro:
 "Assorted fixes all over the place: literally nothing in common, could
  have been three separate pull requests.

  All are simple regression fixes, but not for anything from this cycle"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock
  io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed
  sparc32: fix a braino in fault handling in csum_and_copy_..._user()
2023-10-27 16:44:58 -10:00
Al Viro
1939316bf9 io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed
->ki_pos value is unreliable in such cases.  For an obvious example,
consider O_DSYNC write - we feed the data to page cache and start IO,
then we make sure it's completed.  Update of ->ki_pos is dealt with
by the first part; failure in the second ends up with negative value
returned _and_ ->ki_pos left advanced as if sync had been successful.
In the same situation write(2) does not advance the file position
at all.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-10-27 20:14:11 -04:00
Jens Axboe
838b35bb6a io_uring/rw: disable IOCB_DIO_CALLER_COMP
If an application does O_DIRECT writes with io_uring and the file system
supports IOCB_DIO_CALLER_COMP, then completions of the dio write side is
done from the task_work that will post the completion event for said
write as well.

Whenever a dio write is done against a file, the inode i_dio_count is
elevated. This enables other callers to use inode_dio_wait() to wait for
previous writes to complete. If we defer the full dio completion to
task_work, we are dependent on that task_work being run before the
inode i_dio_count can be decremented.

If the same task that issues io_uring dio writes with
IOCB_DIO_CALLER_COMP performs a synchronous system call that calls
inode_dio_wait(), then we can deadlock as we're blocked sleeping on
the event to become true, but not processing the completions that will
result in the inode i_dio_count being decremented.

Until we can guarantee that this is the case, then disable the deferred
caller completions.

Fixes: 099ada2c87 ("io_uring/rw: add write support for IOCB_DIO_CALLER_COMP")
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-25 08:02:29 -06:00
Jens Axboe
fc68fcda04 io_uring/rw: add support for IORING_OP_READ_MULTISHOT
This behaves like IORING_OP_READ, except:

1) It only supports pollable files (eg pipes, sockets, etc). Note that
   for sockets, you probably want to use recv/recvmsg with multishot
   instead.

2) It supports multishot mode, meaning it will repeatedly trigger a
   read and fill a buffer when data is available. This allows similar
   use to recv/recvmsg but on non-sockets, where a single request will
   repeatedly post a CQE whenever data is read from it.

3) Because of #2, it must be used with provided buffers. This is
   uniformly true across any request type that supports multishot and
   transfers data, with the reason being that it's obviously not
   possible to pass in a single buffer for the data, as multiple reads
   may very well trigger before an application has a chance to process
   previous CQEs and the data passed from them.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-21 12:02:30 -06:00
Jens Axboe
d2d778fbf9 io_uring/rw: mark readv/writev as vectored in the opcode definition
This is cleaner than gating on the opcode type, particularly as more
read/write type opcodes may be added.

Then we can use that for the data import, and for __io_read() on
whether or not we need to copy state.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-21 12:00:46 -06:00
Jens Axboe
a08d195b58 io_uring/rw: split io_read() into a helper
Add __io_read() which does the grunt of the work, leaving the completion
side to the new io_read(). No functional changes in this patch.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-21 12:00:41 -06:00
Linus Torvalds
c1b7fcf3f6 for-6.6/io_uring-2023-08-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTs06gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq2jEACE+A8T1/teQmMA9PcIhG2gWSltlSmAKkVe
 6Yy6FaXBc7M/1yINGWU+dam5xhTshEuGulgris5Yt8VcX7eas9KQvT1NGDa2KzP4
 D44i4UbMD4z9E7arx67Eyvql0sd9LywQTa2nJB+o9yXmQUhTbkWPmMaIWmZi5QwP
 KcOqzdve+xhWSwdzNsPltB3qnma/WDtguaauh41GksKpUVe+aDi8EDEnFyRTM2Be
 HTTJEH2ZyxYwDzemR8xxr82eVz7KdTVDvn+ZoA/UM6I3SGj6SBmtj5mDLF1JKetc
 I+IazsSCAE1kF7vmDbmxYiIvmE6d6I/3zwgGrKfH4dmysqClZvJQeG3/53XzSO5N
 k0uJebB/S610EqQhAIZ1/KgjRVmrd75+2af+AxsC2pFeTaQRE4plCJgWhMbJleAE
 XBnnC7TbPOWCKaZ6a5UKu8CE1FJnEROmOASPP6tRM30ah5JhlyLU4/4ce+uUPUQo
 bhzv0uAQr5Ezs9JPawj+BTa3A4iLCpLzE+aSB46Czl166Eg57GR5tr5pQLtcwjus
 pN5BO7o4y5BeWu1XeQQObQ5OiOBXqmbcl8aQepBbU4W8qRSCMPXYWe2+8jbxk3VV
 3mZZ9iXxs71ntVDL8IeZYXH7NZK4MLIrqdeM5YwgpSioYAUjqxTm7a8I5I9NCvUx
 DZIBNk2sAA==
 =XS+G
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6/io_uring-2023-08-28' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "Fairly quiet round in terms of features, mostly just improvements all
  over the map for existing code. In detail:

   - Initial support for socket operations through io_uring. Latter half
     of this will likely land with the 6.7 kernel, then allowing things
     like get/setsockopt (Breno)

   - Cleanup of the cancel code, and then adding support for canceling
     requests with the opcode as the key (me)

   - Improvements for the io-wq locking (me)

   - Fix affinity setting for SQPOLL based io-wq (me)

   - Remove the io_uring userspace code. These were added initially as
     copies from liburing, but all of them have since bitrotted and are
     way out of date at this point. Rather than attempt to keep them in
     sync, just get rid of them. People will have liburing available
     anyway for these examples. (Pavel)

   - Series improving the CQ/SQ ring caching (Pavel)

   - Misc fixes and cleanups (Pavel, Yue, me)"

* tag 'for-6.6/io_uring-2023-08-28' of git://git.kernel.dk/linux: (47 commits)
  io_uring: move iopoll ctx fields around
  io_uring: move multishot cqe cache in ctx
  io_uring: separate task_work/waiting cache line
  io_uring: banish non-hot data to end of io_ring_ctx
  io_uring: move non aligned field to the end
  io_uring: add option to remove SQ indirection
  io_uring: compact SQ/CQ heads/tails
  io_uring: force inline io_fill_cqe_req
  io_uring: merge iopoll and normal completion paths
  io_uring: reorder cqring_flush and wakeups
  io_uring: optimise extra io_get_cqe null check
  io_uring: refactor __io_get_cqe()
  io_uring: simplify big_cqe handling
  io_uring: cqe init hardening
  io_uring: improve cqe !tracing hot path
  io_uring/rsrc: Annotate struct io_mapped_ubuf with __counted_by
  io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used
  io_uring: simplify io_run_task_work_sig return
  io_uring/rsrc: keep one global dummy_ubuf
  io_uring: never overflow io_aux_cqe
  ...
2023-08-29 20:11:33 -07:00
Linus Torvalds
6016fc9162 New code for 6.6:
* Make large writes to the page cache fill sparse parts of the cache
    with large folios, then use large memcpy calls for the large folio.
  * Track the per-block dirty state of each large folio so that a
    buffered write to a single byte on a large folio does not result in a
    (potentially) multi-megabyte writeback IO.
  * Allow some directio completions to be performed in the initiating
    task's context instead of punting through a workqueue.  This will
    reduce latency for some io_uring requests.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZM0Z1AAKCRBKO3ySh0YR
 pp7BAQCzkKejCM0185tNIH/faHjzidSisNQkJ5HoB4Opq9U66AEA6IPuAdlPlM/J
 FPW1oPq33Yn7AV4wXjUNFfDLzVb/Fgg=
 =dFBU
 -----END PGP SIGNATURE-----

Merge tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
 "We've got some big changes for this release -- I'm very happy to be
  landing willy's work to enable large folios for the page cache for
  general read and write IOs when the fs can make contiguous space
  allocations, and Ritesh's work to track sub-folio dirty state to
  eliminate the write amplification problems inherent in using large
  folios.

  As a bonus, io_uring can now process write completions in the caller's
  context instead of bouncing through a workqueue, which should reduce
  io latency dramatically. IOWs, XFS should see a nice performance bump
  for both IO paths.

  Summary:

   - Make large writes to the page cache fill sparse parts of the cache
     with large folios, then use large memcpy calls for the large folio.

   - Track the per-block dirty state of each large folio so that a
     buffered write to a single byte on a large folio does not result in
     a (potentially) multi-megabyte writeback IO.

   - Allow some directio completions to be performed in the initiating
     task's context instead of punting through a workqueue. This will
     reduce latency for some io_uring requests"

* tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
  iomap: support IOCB_DIO_CALLER_COMP
  io_uring/rw: add write support for IOCB_DIO_CALLER_COMP
  fs: add IOCB flags related to passing back dio completions
  iomap: add IOMAP_DIO_INLINE_COMP
  iomap: only set iocb->private for polled bio
  iomap: treat a write through cache the same as FUA
  iomap: use an unsigned type for IOMAP_DIO_* defines
  iomap: cleanup up iomap_dio_bio_end_io()
  iomap: Add per-block dirty state tracking to improve performance
  iomap: Allocate ifs in ->write_begin() early
  iomap: Refactor iomap_write_delalloc_punch() function out
  iomap: Use iomap_punch_t typedef
  iomap: Fix possible overflow condition in iomap_write_delalloc_scan
  iomap: Add some uptodate state handling helpers for ifs state bitmap
  iomap: Drop ifs argument from iomap_set_range_uptodate()
  iomap: Rename iomap_page to iomap_folio_state and others
  iomap: Copy larger chunks from userspace
  iomap: Create large folios in the buffered write path
  filemap: Allow __filemap_get_folio to allocate large folios
  filemap: Add fgf_t typedef
  ...
2023-08-28 11:59:52 -07:00
Pavel Begunkov
ec26c225f0 io_uring: merge iopoll and normal completion paths
io_do_iopoll() and io_submit_flush_completions() are pretty similar,
both filling CQEs and then free a list of requests. Don't duplicate it
and make iopoll use __io_submit_flush_completions(), which also helps
with inlining and other optimisations.

For that, we need to first find all completed iopoll requests and splice
them from the iopoll list and then pass it down. This adds one extra
list traversal, which should be fine as requests will stay hot in cache.

CQ locking is already conditional, introduce ->lockless_cq and skip
locking for IOPOLL as it's protected by ->uring_lock.

We also add a wakeup optimisation for IOPOLL to __io_cq_unlock_post(),
so it works just like io_cqring_ev_posted_iopoll().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3840473f5e8a960de35b77292026691880f6bdbc.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24 17:16:19 -06:00
Pavel Begunkov
54927baf6c io_uring: reorder cqring_flush and wakeups
Unlike in the past, io_commit_cqring_flush() doesn't do anything that
may need io_cqring_wake() to be issued after, all requests it completes
will go via task_work. Do io_commit_cqring_flush() after
io_cqring_wake() to clean up __io_cq_unlock_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ed32dcfeec47e6c97bd6b18c152ddce5b218403f.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24 17:16:19 -06:00
Amir Goldstein
e484fd73f4 io_uring: use kiocb_{start,end}_write() helpers
Use helpers instead of the open coded dance to silence lockdep warnings.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Message-Id: <20230817141337.1025891-5-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21 17:27:26 +02:00
Amir Goldstein
a370167fe5 io_uring: rename kiocb_end_write() local helper
This helper does not take a kiocb as input and we want to create a
common helper by that name that takes a kiocb as input.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Message-Id: <20230817141337.1025891-2-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21 17:27:25 +02:00
Pavel Begunkov
00b0db5624 io_uring: open code io_fill_cqe_req()
io_fill_cqe_req() is only called from one place, open code it, and
rename __io_fill_cqe_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f432ce75bb1c94cadf0bd2add4d6aa510bd1fb36.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 10:42:57 -06:00
Jens Axboe
099ada2c87 io_uring/rw: add write support for IOCB_DIO_CALLER_COMP
If the filesystem dio handler understands IOCB_DIO_CALLER_COMP, we'll
get a kiocb->ki_complete() callback with kiocb->dio_complete set. In
that case, rather than complete the IO directly through task_work, queue
up an intermediate task_work handler that first processes this callback
and then immediately completes the request.

For XFS, this avoids a punt through a workqueue, which is a lot less
efficient and adds latency to lower queue depth (or sync) O_DIRECT
writes.

Only do this for non-polled IO, as polled IO doesn't need this kind
of deferral as it always completes within the task itself. This then
avoids a check for deferral in the polled IO completion handler.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-01 17:32:45 -06:00
Christoph Hellwig
8487f083c6 io_uring: return REQ_F_ flags from io_file_get_flags
Two of the three callers want them, so return the more usual format,
and shift into the FFS_ form only for the fixed file table.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620113235.920399-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 09:36:22 -06:00
Christoph Hellwig
3beed235d1 io_uring: remove io_req_ffs_set
Just checking the flag directly makes it a lot more obvious what is
going on here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620113235.920399-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 09:36:22 -06:00
Jens Axboe
c92fcfc2ba io_uring: avoid indirect function calls for the hottest task_work
We use task_work for a variety of reasons, but doing completions or
triggering rety after poll are by far the hottest two. Use the indirect
funtion call wrappers to avoid the indirect function call if
CONFIG_RETPOLINE is set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-02 08:55:37 -06:00
Linus Torvalds
9dd6956b38 for-6.4/block-2023-04-21
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvcIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpk+JEACj01t7Xen2+Razagu3aTx9tmRGFnTNR3MY
 raFG6B1TADk1TgCWWa2C4Dj67SOispPLm8hbIcOxqB1UscDWCCwjmnr/debADFzW
 Ap6shv/IRwVGmDp+F7ocYas0ynwooOJg4WJTwkSKz2o4m4p3vzlwAKi4fLiSjbXp
 gJTrA7WEvDOVjzajlTFUtjr8rc6PdunbGm25cPIufAxUEhvttYex2VbVqjDmfNsE
 8tyyk9RWbe4AY/ZYaGXVn4yQ/CgL/sXFkVc5noRXNfAQ/K3CVLQrFLJ3JlwUHpiA
 xXBor21TUWCZEo33Y2G5NConAYqE7etoPTkaTDO3/aZ+dAMFyhC/WAYLz1KZGMh1
 +g1fDX1QKEd40H2lfDXvqF1ob7Ut8EzUx+gvBXcc3/AiRpJ5rjfOcj6LPUMUqQJk
 nucLLFTiMKecnDMBERbvixqbaTyrjvkFEj2wYJvgj1LKXAd+x/bj8SGajs9r88Nb
 9YT9ai/+Yl7Ppfb67rCgXJU7oNZQSAQ2H+X/l2jbiqImOgq1u/45AmINnbanS7HH
 Y1I8pbH45AcnCgkJRoQwrNX3BnTOTBJ+D/4Fl4b8jsihq0D3UtwCwPCObHP4LW9S
 MUNPhP3tUuYsAgXqX80+Sao6SYvXDwnbWOM+LOaaZXgjb1ndwDUZXpto8Ra8WB1u
 8kM6s6ZR7g==
 =W1Zb
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - drbd patches, bringing us closer to unifying the out-of-tree version
   and the in tree one (Andreas, Christoph)

 - support for auto-quiesce for the s390 dasd driver (Stefan)

 - MD pull request via Song:
      - md/bitmap: Optimal last page size (Jon Derrick)
      - Various raid10 fixes (Yu Kuai, Li Nan)
      - md: add error_handlers for raid0 and linear (Mariusz Tkaczyk)

 - NVMe pull request via Christoph:
      - Drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas)
      - Validate nvmet module parameters (Chaitanya Kulkarni)
      - Fence TCP socket on receive error (Chris Leech)
      - Fix async event trace event (Keith Busch)
      - Minor cleanups (Chaitanya Kulkarni, zhenwei pi)
      - Fix and cleanup nvmet Identify handling (Damien Le Moal,
        Christoph Hellwig)
      - Fix double blk_mq_complete_request race in the timeout handler
        (Lei Yin)
      - Fix irq locking in nvme-fcloop (Ming Lei)
      - Remove queue mapping helper for rdma devices (Sagi Grimberg)

 - use structured request attribute checks for nbd (Jakub)

 - fix blk-crypto race conditions between keyslot management (Eric)

 - add sed-opal support for reading read locking range attributes
   (Ondrej)

 - make fault injection configurable for null_blk (Akinobu)

 - clean up the request insertion API (Christoph)

 - clean up the queue running API (Christoph)

 - blkg config helper cleanups (Tejun)

 - lazy init support for blk-iolatency (Tejun)

 - various fixes and tweaks to ublk (Ming)

 - remove hybrid polling. It hasn't really been useful since we got
   async polled IO support, and these days we don't support sync polled
   IO at all (Keith)

 - misc fixes, cleanups, improvements (Zhong, Ondrej, Colin, Chengming,
   Chaitanya, me)

* tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux: (118 commits)
  nbd: fix incomplete validation of ioctl arg
  ublk: don't return 0 in case of any failure
  sed-opal: geometry feature reporting command
  null_blk: Always check queue mode setting from configfs
  block: ublk: switch to ioctl command encoding
  blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush
  block, bfq: Fix division by zero error on zero wsum
  fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m
  block: store bdev->bd_disk->fops->submit_bio state in bdev
  block: re-arrange the struct block_device fields for better layout
  md/raid5: remove unused working_disks variable
  md/raid10: don't call bio_start_io_acct twice for bio which experienced read error
  md/raid10: fix memleak of md thread
  md/raid10: fix memleak for 'conf->bio_split'
  md/raid10: fix leak of 'r10bio->remaining' for recovery
  md/raid10: don't BUG_ON() in raise_barrier()
  md: fix soft lockup in status_resync
  md: add error_handlers for raid0 and linear
  md: Use optimal I/O size for last bitmap page
  md: Fix types in sb writer
  ...
2023-04-26 12:52:58 -07:00
Linus Torvalds
5b9a7bb72f for-6.4/io_uring-2023-04-21
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvawQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiKTEACvp0jm3Lyhxb8RMsx5T6Ko0pFH3DIymiL4
 xpoZmAUflOjD0c+99FwHRQqKKXuo3OelhW+YOm0N6OOAt6JMSGmKpZh0UNJx+Fgj
 wMwiQ0X3Y5SaAsr5ZpXM+G1BV7ajihsMpu8/a718ERB3U3cLDz2qJfnzJh+E5Ip5
 pYB4vS3+/FAER2MYQ7IPeovch2wWYtxDPztOxNX6SORu3OvpWiz1GR/+8u0tqj50
 ROq97Jwjh5Tl4zP356EUSj/Vkfdr2yb+NlLbun8My5x8tYftZjnrNQ/+qeJNLwB8
 tWTrg166ox/VX3aYZruAgUPv0IyGPZg7qZV5R72ChBK3VhIbOOLOCm/V6dhvl/XH
 vu2FG7J8WylWHmc+OU8u7TeSJdrwxTLs4e2IFUBK9ymAYFp0Q9S924fgvSYsFvVB
 iNn58SPRIbuA4SPtRfCd7pENtZW/QKfBC5CYK+pjsZVX40c9dbe40foVu4t2/EAo
 gi9+gSWEUVRRW2osxjaHXh78cW63g0j9bNfS6n1Vy32Oo5Mwm7n+bVWqCU5bCBXI
 MpPOk6AgME3UPwFzGzSmx+PVw8VacPxYP1NF8RFTCwj7OowFnrolJtruDmKJgXWY
 BN41EDo41k/C5mEu16Jr9rAkHeVhHaNZ+JhyDrzv8llJ/rv+4zEJw9SrhnpufmOX
 +YERd/ndAw==
 =Erfk
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Cleanup of the io-wq per-node mapping, notably getting rid of it so
   we just have a single io_wq entry per ring (Breno)

 - Followup to the above, move accounting to io_wq as well and
   completely drop struct io_wqe (Gabriel)

 - Enable KASAN for the internal io_uring caches (Breno)

 - Add support for multishot timeouts. Some applications use timeouts to
   wake someone waiting on completion entries, and this makes it a bit
   easier to just have a recurring timer rather than needing to rearm it
   every time (David)

 - Support archs that have shared cache coloring between userspace and
   the kernel, and hence have strict address requirements for mmap'ing
   the ring into userspace. This should only be parisc/hppa. (Helge, me)

 - XFS has supported O_DIRECT writes without needing to lock the inode
   exclusively for a long time, and ext4 now supports it as well. This
   is true for the common cases of not extending the file size. Flag the
   fs as having that feature, and utilize that to avoid serializing
   those writes in io_uring (me)

 - Enable completion batching for uring commands (me)

 - Revert patch adding io_uring restriction to what can be GUP mapped or
   not. This does not belong in io_uring, as io_uring isn't really
   special in this regard. Since this is also getting in the way of
   cleanups and improvements to the GUP code, get rid of if (me)

 - A few series greatly reducing the complexity of registered resources,
   like buffers or files. Not only does this clean up the code a lot,
   the simplified code is also a LOT more efficient (Pavel)

 - Series optimizing how we wait for events and run task_work related to
   it (Pavel)

 - Fixes for file/buffer unregistration with DEFER_TASKRUN (Pavel)

 - Misc cleanups and improvements (Pavel, me)

* tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux: (71 commits)
  Revert "io_uring/rsrc: disallow multi-source reg buffers"
  io_uring: add support for multishot timeouts
  io_uring/rsrc: disassociate nodes and rsrc_data
  io_uring/rsrc: devirtualise rsrc put callbacks
  io_uring/rsrc: pass node to io_rsrc_put_work()
  io_uring/rsrc: inline io_rsrc_put_work()
  io_uring/rsrc: add empty flag in rsrc_node
  io_uring/rsrc: merge nodes and io_rsrc_put
  io_uring/rsrc: infer node from ctx on io_queue_rsrc_removal
  io_uring/rsrc: remove unused io_rsrc_node::llist
  io_uring/rsrc: refactor io_queue_rsrc_removal
  io_uring/rsrc: simplify single file node switching
  io_uring/rsrc: clean up __io_sqe_buffers_update()
  io_uring/rsrc: inline switch_start fast path
  io_uring/rsrc: remove rsrc_data refs
  io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce
  io_uring/rsrc: use wq for quiescing
  io_uring/rsrc: refactor io_rsrc_ref_quiesce
  io_uring/rsrc: remove io_rsrc_node::done
  io_uring/rsrc: use nospec'ed indexes
  ...
2023-04-26 12:40:31 -07:00
Pavel Begunkov
8751d15426 io_uring: reduce scheduling due to tw
Every task_work will try to wake the task to be executed, which causes
excessive scheduling and additional overhead. For some tw it's
justified, but others won't do much but post a single CQE.

When a task waits for multiple cqes, every such task_work will wake it
up. Instead, the task may give a hint about how many cqes it waits for,
io_req_local_work_add() will compare against it and skip wake ups
if #cqes + #tw is not enough to satisfy the waiting condition. Task_work
that uses the optimisation should be simple enough and never post more
than one CQE. It's also ignored for non DEFER_TASKRUN rings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d2b77e99d1e86624d8a69f7037d764b739dcd225.1680782017.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-06 16:23:28 -06:00
Pavel Begunkov
a282967c84 io_uring: encapsulate task_work state
For task works we're passing around a bool pointer for whether the
current ring is locked or not, let's wrap it in a structure, that
will make it more opaque preventing abuse and will also help us
to pass more info in the future if needed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ecec9483d58696e248d1bfd52cf62b04442df1d.1679931367.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:15 -06:00
Jens Axboe
95e49cf837 iov_iter: add iter_iov_addr() and iter_iov_len() helpers
These just return the address and length of the current iovec segment
in the iterator. Convert existing iov_iter_iovec() users to use them
instead of getting a copy of the current vec.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-30 08:12:29 -06:00
Jens Axboe
de4f5fed3f iov_iter: add iter_iovec() helper
This returns a pointer to the current iovec entry in the iterator. Only
useful with ITER_IOVEC right now, but it prepares us to treat ITER_UBUF
and ITER_IOVEC identically for the first segment.

Rename struct iov_iter->iov to iov_iter->__iov to find any potentially
troublesome spots, and also to prevent anyone from adding new code that
accesses iter->iov directly.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-30 08:12:29 -06:00
Keith Busch
54bdd67d0f blk-mq: remove hybrid polling
io_uring provides the only way user space can poll completions, and that
always sets BLK_POLL_NOSLEEP. This effectively makes hybrid polling dead
code, so remove it and everything supporting it.

Hybrid polling was effectively killed off with 9650b453a3, "block:
ignore RWF_HIPRI hint for sync dio", but still potentially reachable
through io_uring until d729cf9acb, "io_uring: don't sleep when
polling for I/O", but hybrid polling probably should not have been
reachable through that async interface from the beginning.

Fixes: 9650b453a3 ("block: ignore RWF_HIPRI hint for sync dio")
Fixes: d729cf9acb ("io_uring: don't sleep when polling for I/O")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20230320194926.3353144-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-20 15:30:03 -06:00
Linus Torvalds
c1ef500307 for-6.3/iter-ubuf-2023-02-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPueOUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkEWD/9hOagNSeXfCd1eAJ44E5IemgHKqfU0RXRs
 kdW1o35eBXwPVAyhhDmcz60hkijm47Pw3IJUdSNaGqdm9uYpLwiatuYY5EOVC4qg
 BFkVPGCA8ERXStFM/mnWj0gkYDmb/8bzk9bdBU1FQvQOIQgYpomlHdMVfQJ+0tDT
 7VTffRaWfcxWd1u+NBMDxmfz47teplxiHJDg38wGlgT6G1kMdEUK+y6hd0SoASPM
 ocMW8LL2v3wLQhQAOWYd6sw2kFnxx4VOzhSepPAY0U78CR6CYm6zthRd+k+Ro/nt
 RFKL6Ijt2LRaOZqY3HRnCpUwmhBNft0ZFH4OHh21vPaukB4sjWbQ5SJniucNcoCN
 rb9jAJDJdS6oy+Uimeig99aQ/yGSLJXG8MQKrC36NdGSwydUfaCLaoLKwfC8zYDC
 Zr3G7tfOhSJQzQtNSH1H0SqHFvMfc7C2Ra8mYXdHbcREswKOTT73aJUHq5RFfwO+
 m10V5rQgCB9rJz0NLbo68GhxDrbTQuueDj+yDWCSoulUdNg3s2BZ3/iBjODJyJNO
 P3aG4bMYxC5te2JWCBnmR6du//8vnvDHnwWh9yKcUk+l/9OTtAPouAdUCv+r1wkz
 Ib0aEX3SiJ65LIePQO2kbdvgnweyFCJYduvMW9zjsH9GMgRP0eA6EKZh3mbKhOw4
 yw9BcZoNYQ==
 =+ImB
 -----END PGP SIGNATURE-----

Merge tag 'for-6.3/iter-ubuf-2023-02-16' of git://git.kernel.dk/linux

Pull io_uring ITER_UBUF conversion from Jens Axboe:
 "Since we now have ITER_UBUF available, switch to using it for single
  ranges as it's more efficient than ITER_IOVEC for that"

* tag 'for-6.3/iter-ubuf-2023-02-16' of git://git.kernel.dk/linux:
  block: use iter_ubuf for single range
  iov_iter: move iter_ubuf check inside restore WARN
  io_uring: use iter_ubuf for single range imports
  io_uring: switch network send/recv to ITER_UBUF
  iov: add import_ubuf()
2023-02-20 14:03:57 -08:00
Breno Leitao
f30bd4d038 io_uring: Split io_issue_def struct
This patch removes some "cold" fields from `struct io_issue_def`.

The plan is to keep only highly used fields into `struct io_issue_def`, so,
it may be hot in the cache. The hot fields are basically all the bitfields
and the callback functions for .issue and .prep.

The other less frequently used fields are now located in a secondary and
cold struct, called `io_cold_def`.

This is the size for the structs:

Before: io_issue_def = 56 bytes
After: io_issue_def = 24 bytes; io_cold_def = 40 bytes

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20230112144411.2624698-2-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Breno Leitao
a7dd27828b io_uring: Rename struct io_op_def
The current io_op_def struct is becoming huge and the name is a bit
generic.

The goal of this patch is to rename this struct to `io_issue_def`. This
struct will contain the hot functions associated with the issue code
path.

For now, this patch only renames the structure, and an upcoming patch
will break up the structure in two, moving the non-issue fields to a
secondary struct.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20230112144411.2624698-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Dmitrii Bundin
81594e7e7a io_uring: remove excessive unlikely on IS_ERR
The IS_ERR function uses the IS_ERR_VALUE macro under the hood which
already wraps the condition into unlikely.

Signed-off-by: Dmitrii Bundin <dmitrii.bundin.a@gmail.com>
Link: https://lore.kernel.org/r/20230109185854.25698-1-dmitrii.bundin.a@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
544d163d65 io_uring: lock overflowing for IOPOLL
syzbot reports an issue with overflow filling for IOPOLL:

WARNING: CPU: 0 PID: 28 at io_uring/io_uring.c:734 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
CPU: 0 PID: 28 Comm: kworker/u4:1 Not tainted 6.2.0-rc3-syzkaller-16369-g358a161a6a9e #0
Workqueue: events_unbound io_ring_exit_work
Call trace:
 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
 io_req_cqe_overflow+0x5c/0x70 io_uring/io_uring.c:773
 io_fill_cqe_req io_uring/io_uring.h:168 [inline]
 io_do_iopoll+0x474/0x62c io_uring/rw.c:1065
 io_iopoll_try_reap_events+0x6c/0x108 io_uring/io_uring.c:1513
 io_uring_try_cancel_requests+0x13c/0x258 io_uring/io_uring.c:3056
 io_ring_exit_work+0xec/0x390 io_uring/io_uring.c:2869
 process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
 worker_thread+0x340/0x610 kernel/workqueue.c:2436
 kthread+0x12c/0x158 kernel/kthread.c:376
 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:863

There is no real problem for normal IOPOLL as flush is also called with
uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL,
for which __io_cqring_overflow_flush() happens from the CQ waiting path.

Reported-and-tested-by: syzbot+6805087452d72929404e@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-13 07:32:46 -07:00
Jens Axboe
1e23db450c io_uring: use iter_ubuf for single range imports
This is more efficient than iter_iov.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
[merge to 6.2, minor fixes]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-01-08 20:59:17 -07:00
Pavel Begunkov
a8cf95f936 io_uring: fix overflow handling regression
Because the single task locking series got reordered ahead of the
timeout and completion lock changes, two hunks inadvertently ended up
using __io_fill_cqe_req() rather than io_fill_cqe_req(). This meant
that we dropped overflow handling in those two spots. Reinstate the
correct CQE filling helper.

Fixes: f66f73421f ("io_uring: skip spinlocking for ->task_complete")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-15 08:20:10 -07:00
Linus Torvalds
ce8a79d560 for-6.2/block-2022-12-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOScsgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpi5ID/9pLXFYOq1+uDjU0KO/MdjMjK8Ukr34lCnk
 WkajRLheE8JBKOFDE54XJk56sQSZHX9bTWqziar0h1fioh7FlQR/tVvzsERCm2M9
 2y9THJNJygC68wgybStyiKlshFjl7TD7Kv5N9Y3xP3mkQygT+D6o8fXZk5xQbYyH
 YdFSoq4rJVHxRL03yzQiReGGIYdOUEQQh8l1FiLwLlKa3lXAey1KuxWIzksVN0KK
 aZB4QhiBpOiPgDHUVisq2XtyQjpZ2byoCImPzgrcqk9Jo4esvm/e6esrg4xlsvII
 LKFFkTmbVqjUZtFjqakFHmfuzVor4nU5f+xb90ZHExuuODYckkxWp5rWhf9QwqqI
 0ik6WYgI1/5vnHnX8f2DYzOFQf9qa/rLgg0CshyUODlD6RfHa9vntqYvlIFkmOBd
 Q7KblIoK8YTzUS1M+v7X8JQ7gDR2KwygH37Da2KJS+vgvfIb8kJGr1ZORuhJuJJ7
 Bl69gaNkHTHrqufp7UI64YXfueeuNu2J9z3zwzGoxeaFaofF/phDn0/2gCQE1fQI
 XBhsMw+ETqI6B2SPHMnzYDu2DM1S8ZTOYQlaD4G3uqgWnAM1tG707395uAy5yu4n
 D5azU1fVG4UocoNIyPujpaoSRs2zWZycEFEeUQkhyDDww/j4hlHi6H33eOnk0zsr
 wxzFGfvHfw==
 =k/vv
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan
        Joshi)
      - Refactor PCIe probing and reset (Christoph Hellwig)
      - Various fabrics authentication fixes and improvements (Sagi
        Grimberg)
      - Avoid fallback to sequential scan due to transient issues (Uday
        Shankar)
      - Implement support for the DEAC bit in Write Zeroes (Christoph
        Hellwig)
      - Allow overriding the IEEE OUI and firmware revision in configfs
        for nvmet (Aleksandr Miloserdov)
      - Force reconnect when number of queue changes in nvmet (Daniel
        Wagner)
      - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi
        Grimberg, Christoph Hellwig, Christophe JAILLET)
      - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni)
      - Use the common tagset helpers in nvme-pci driver (Christoph
        Hellwig)
      - Cleanup the nvme-pci removal path (Christoph Hellwig)
      - Use kstrtobool() instead of strtobool (Christophe JAILLET)
      - Allow unprivileged passthrough of Identify Controller (Joel
        Granados)
      - Support io stats on the mpath device (Sagi Grimberg)
      - Minor nvmet cleanup (Sagi Grimberg)

 - MD pull requests via Song:
      - Code cleanups (Christoph)
      - Various fixes

 - Floppy pull request from Denis:
      - Fix a memory leak in the init error path (Yuan)

 - Series fixing some batch wakeup issues with sbitmap (Gabriel)

 - Removal of the pktcdvd driver that was deprecated more than 5 years
   ago, and subsequent removal of the devnode callback in struct
   block_device_operations as no users are now left (Greg)

 - Fix for partition read on an exclusively opened bdev (Jan)

 - Series of elevator API cleanups (Jinlong, Christoph)

 - Series of fixes and cleanups for blk-iocost (Kemeng)

 - Series of fixes and cleanups for blk-throttle (Kemeng)

 - Series adding concurrent support for sync queues in BFQ (Yu)

 - Series bringing drbd a bit closer to the out-of-tree maintained
   version (Christian, Joel, Lars, Philipp)

 - Misc drbd fixes (Wang)

 - blk-wbt fixes and tweaks for enable/disable (Yu)

 - Fixes for mq-deadline for zoned devices (Damien)

 - Add support for read-only and offline zones for null_blk
   (Shin'ichiro)

 - Series fixing the delayed holder tracking, as used by DM (Yu,
   Christoph)

 - Series enabling bio alloc caching for IRQ based IO (Pavel)

 - Series enabling userspace peer-to-peer DMA (Logan)

 - BFQ waker fixes (Khazhismel)

 - Series fixing elevator refcount issues (Christoph, Jinlong)

 - Series cleaning up references around queue destruction (Christoph)

 - Series doing quiesce by tagset, enabling cleanups in drivers
   (Christoph, Chao)

 - Series untangling the queue kobject and queue references (Christoph)

 - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye,
   Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph)

* tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits)
  blktrace: Fix output non-blktrace event when blk_classic option enabled
  block: sed-opal: Don't include <linux/kernel.h>
  sed-opal: allow using IOC_OPAL_SAVE for locking too
  blk-cgroup: Fix typo in comment
  block: remove bio_set_op_attrs
  nvmet: don't open-code NVME_NS_ATTR_RO enumeration
  nvme-pci: use the tagset alloc/free helpers
  nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
  nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
  nvme: consolidate setting the tagset flags
  nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
  block: bio_copy_data_iter
  nvme-pci: split out a nvme_pci_ctrl_is_dead helper
  nvme-pci: return early on ctrl state mismatch in nvme_reset_work
  nvme-pci: rename nvme_disable_io_queues
  nvme-pci: cleanup nvme_suspend_queue
  nvme-pci: remove nvme_pci_disable
  nvme-pci: remove nvme_disable_admin_queue
  nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
  nvme: use nvme_wait_ready in nvme_shutdown_ctrl
  ...
2022-12-13 10:43:59 -08:00
Linus Torvalds
54e60e505d for-6.2/io_uring-2022-12-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOSYp4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgWNEADTcHejQ5y8Ctc1BEsCiEnCbR5DeIwESBKN
 SfK7uK5N5GjKZgKeMtkdxNeckH2wSZ79VFpz2/b4fszM11H+P81r/PP+OwQojcra
 1YyPbp2jXlQZGEYijiXYDohnKr8NnJYj/nuwm4T73OPOD48ekbrY36/t3Hd75jKk
 M5L/HOpbKhA+fypPHYlm3XlwEM9/4wupsDeiabTuAeFpLh66V/h85ZLY91c3Bf7j
 yllzT2CCGQsh+fnqdovpsqUxM4Sh6sHfa2QhwkfJFfU9OLwDz0TLlcvSVwWWICD1
 DGeDE/MPBKZ4Z5zp8t94vTnZDAiExwkZbmCOnOkdYoPdMDqCDhp8a9WywV3IiJJi
 +hRN6Au6BWw8Rj3b3pbobfdXFJAvoHqXHM1SDAsWmEIMpuMhNalIpYqtna9JwRDe
 KV2ERMpGOUpmvQiRaJa5Wtz6hlFetmuav6Ij9DWb5f8+NXikiFAXk4g6TtD1Z7Cb
 15klwM8qW50zX7YUlzBLcjNdj7+HuIswLx0obZ3Uv1ogDwMQKXEvPaAHUwVN02q6
 cWP0P2Bc0EKvdwTSNpgNflhNsiV2DFJZpZfxkdPauN4t2EkJ38iEpHewffy0ecM7
 uyaXGQVQW7WFDuGno4cGbThQIG95MqkRPBhEyB4cOmcyS/aZ92ZFtV1iI8dVDA+v
 uuEIMc3OCA==
 =EgDc
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Always ensure proper ordering in case of CQ ring overflow, which then
   means we can remove some work-arounds for that (Dylan)

 - Support completion batching for multishot, greatly increasing the
   efficiency for those (Dylan)

 - Flag epoll/eventfd wakeups done from io_uring, so that we can easily
   tell if we're recursing into io_uring again.

   Previously, this would have resulted in repeated multishot
   notifications if we had a dependency there. That could happen if an
   eventfd was registered as the ring eventfd, and we multishot polled
   for events on it. Or if an io_uring fd was added to epoll, and
   io_uring had a multishot request for the epoll fd.

   Test cases here:
	https://git.kernel.dk/cgit/liburing/commit/?id=919755a7d0096fda08fb6d65ac54ad8d0fe027cd

   Previously these got terminated when the CQ ring eventually
   overflowed, now it's handled gracefully (me).

 - Tightening of the IOPOLL based completions (Pavel)

 - Optimizations of the networking zero-copy paths (Pavel)

 - Various tweaks and fixes (Dylan, Pavel)

* tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linux: (41 commits)
  io_uring: keep unlock_post inlined in hot path
  io_uring: don't use complete_post in kbuf
  io_uring: spelling fix
  io_uring: remove io_req_complete_post_tw
  io_uring: allow multishot polled reqs to defer completion
  io_uring: remove overflow param from io_post_aux_cqe
  io_uring: add lockdep assertion in io_fill_cqe_aux
  io_uring: make io_fill_cqe_aux static
  io_uring: add io_aux_cqe which allows deferred completion
  io_uring: allow defer completion for aux posted cqes
  io_uring: defer all io_req_complete_failed
  io_uring: always lock in io_apoll_task_func
  io_uring: remove iopoll spinlock
  io_uring: iopoll protect complete_post
  io_uring: inline __io_req_complete_put()
  io_uring: remove io_req_tw_post_queue
  io_uring: use io_req_task_complete() in timeout
  io_uring: hold locks for io_req_complete_failed
  io_uring: add completion locking for iopoll
  io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()
  ...
2022-12-13 10:33:08 -08:00
Al Viro
de4eda9de2 use less confusing names for iov_iter direction initializers
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.

Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25 13:01:55 -05:00
Pavel Begunkov
2dac1a1592 io_uring: remove iopoll spinlock
This reverts commit 2ccc92f4ef

io_req_complete_post() should now behave well even in case of IOPOLL, we
can remove completion_lock locking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e171c8b530656b14a671c59100ca260e46e7f2a.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 10:47:07 -07:00
Pavel Begunkov
2ccc92f4ef io_uring: add completion locking for iopoll
There are pieces of code that may allow iopoll to race filling cqes,
temporarily add spinlocking around posting events.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/84d86b5c117feda075471c5c9e65208e0dccf5d0.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 10:44:00 -07:00
Pavel Begunkov
3671163beb io_uring: move kbuf put out of generic tw complete
There are multiple users of io_req_task_complete() including zc
notifications, but only read requests use selected buffers. As we
already have an rw specific tw function, move io_put_kbuf() in there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/94374c7649aaefc3a17808dc4701f25ccd457e25.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:38:31 -07:00
Pavel Begunkov
12e4e8c7ab io_uring/rw: enable bio caches for IRQ rw
Now we can use IOCB_ALLOC_CACHE not only for iopoll'ed reads/write but
also for normal IRQ driven I/O.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fb8bd092ed5a4a3b037e84e4777074d07aa5639a.1667384020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16 09:44:27 -07:00
Jens Axboe
5c61795ea9 io_uring/rw: remove leftover debug statement
This debug statement was never meant to go into the upstream release,
kill it off before it ends up in a release. It was just part of the
testing for the initial version of the patch.

Fixes: 2ec33a6c3c ("io_uring/rw: ensure kiocb_end_write() is always called")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-16 17:24:10 -06:00
Jens Axboe
2ec33a6c3c io_uring/rw: ensure kiocb_end_write() is always called
A previous commit moved the notifications and end-write handling, but
it is now missing a few spots where we also want to call both of those.
Without that, we can potentially be missing file notifications, and
more importantly, have an imbalance in the super_block writers sem
accounting.

Fixes: b000145e99 ("io_uring/rw: defer fsnotify calls to task context")
Reported-by: Dave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/all/20221010050319.GC2703033@dread.disaster.area/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Stefan Roesch
c86416c6ff io_uring: local variable rw shadows outer variable in io_write
This fixes the shadowing of the outer variable rw in the function
io_write(). No issue is caused by this, but let's silence the shadowing
warning anyway.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Link: https://lore.kernel.org/r/20221010234330.244244-1-shr@devkernel.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Jens Axboe
b000145e99 io_uring/rw: defer fsnotify calls to task context
We can't call these off the kiocb completion as that might be off
soft/hard irq context. Defer the calls to when we process the
task_work for this request. That avoids valid complaints like:

stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.0.0-rc6-syzkaller-00321-g105a36f3694e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_usage_bug kernel/locking/lockdep.c:3961 [inline]
 valid_state kernel/locking/lockdep.c:3973 [inline]
 mark_lock_irq kernel/locking/lockdep.c:4176 [inline]
 mark_lock.part.0.cold+0x18/0xd8 kernel/locking/lockdep.c:4632
 mark_lock kernel/locking/lockdep.c:4596 [inline]
 mark_usage kernel/locking/lockdep.c:4527 [inline]
 __lock_acquire+0x11d9/0x56d0 kernel/locking/lockdep.c:5007
 lock_acquire kernel/locking/lockdep.c:5666 [inline]
 lock_acquire+0x1ab/0x570 kernel/locking/lockdep.c:5631
 __fs_reclaim_acquire mm/page_alloc.c:4674 [inline]
 fs_reclaim_acquire+0x115/0x160 mm/page_alloc.c:4688
 might_alloc include/linux/sched/mm.h:271 [inline]
 slab_pre_alloc_hook mm/slab.h:700 [inline]
 slab_alloc mm/slab.c:3278 [inline]
 __kmem_cache_alloc_lru mm/slab.c:3471 [inline]
 kmem_cache_alloc+0x39/0x520 mm/slab.c:3491
 fanotify_alloc_fid_event fs/notify/fanotify/fanotify.c:580 [inline]
 fanotify_alloc_event fs/notify/fanotify/fanotify.c:813 [inline]
 fanotify_handle_event+0x1130/0x3f40 fs/notify/fanotify/fanotify.c:948
 send_to_group fs/notify/fsnotify.c:360 [inline]
 fsnotify+0xafb/0x1680 fs/notify/fsnotify.c:570
 __fsnotify_parent+0x62f/0xa60 fs/notify/fsnotify.c:230
 fsnotify_parent include/linux/fsnotify.h:77 [inline]
 fsnotify_file include/linux/fsnotify.h:99 [inline]
 fsnotify_access include/linux/fsnotify.h:309 [inline]
 __io_complete_rw_common+0x485/0x720 io_uring/rw.c:195
 io_complete_rw+0x1a/0x1f0 io_uring/rw.c:228
 iomap_dio_complete_work fs/iomap/direct-io.c:144 [inline]
 iomap_dio_bio_end_io+0x438/0x5e0 fs/iomap/direct-io.c:178
 bio_endio+0x5f9/0x780 block/bio.c:1564
 req_bio_endio block/blk-mq.c:695 [inline]
 blk_update_request+0x3fc/0x1300 block/blk-mq.c:825
 scsi_end_request+0x7a/0x9a0 drivers/scsi/scsi_lib.c:541
 scsi_io_completion+0x173/0x1f70 drivers/scsi/scsi_lib.c:971
 scsi_complete+0x122/0x3b0 drivers/scsi/scsi_lib.c:1438
 blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1022
 __do_softirq+0x1d3/0x9c6 kernel/softirq.c:571
 invoke_softirq kernel/softirq.c:445 [inline]
 __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
 irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
 common_interrupt+0xa9/0xc0 arch/x86/kernel/irq.c:240

Fixes: f63cf5192f ("io_uring: ensure that fsnotify is always called")
Link: https://lore.kernel.org/all/20220929135627.ykivmdks2w5vzrwg@quack3/
Reported-by: syzbot+dfcc5f4da15868df7d4d@syzkaller.appspotmail.com
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29 11:00:41 -06:00
Pavel Begunkov
c278d9f8ac io_uring/rw: don't lose short results on io_setup_async_rw()
If a retry io_setup_async_rw() fails we lose result from the first
io_iter_do_read(), which is a problem mostly for streams/sockets.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0e8d20cebe5fc9c96ed268463c394237daabc384.1664235732.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 18:44:15 -06:00
Pavel Begunkov
bf68b5b343 io_uring/rw: fix unexpected link breakage
req->cqe.res is set in io_read() to the amount of bytes left to be done,
which is used to figure out whether to fail a read or not. However,
io_read() may do another without returning, and we stash the previous
value into ->bytes_done but forget to update cqe.res. Then we ask a read
to do strictly less than cqe.res but expect the return to be exactly
cqe.res.

Fix the bug by updating cqe.res for retries.

Cc: stable@vger.kernel.org
Reported-and-Tested-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3a1088440c7be98e5800267af922a67da0ef9f13.1664235732.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 18:44:15 -06:00
Pavel Begunkov
47b4c68660 io_uring/rw: don't lose partial IO result on fail
A partially done read/write may end up in io_req_complete_failed() and
loose the result, make sure we return the number of bytes processed.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/05e0879c226bcd53b441bf92868eadd4bf04e2fc.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Dylan Yudaken
4ab9d46507 io_uring: allow buffer recycling in READV
In commit 934447a603 ("io_uring: do not recycle buffer in READV") a
temporary fix was put in io_kbuf_recycle to simply never recycle READV
buffers.

Instead of that, rather treat READV with REQ_F_BUFFER_SELECTED the same as
a READ with REQ_F_BUFFER_SELECTED. Since READV requires iov_len of 1 they
are essentially the same.
In order to do this inside io_prep_rw() add some validation to check that
it is in fact only length 1, and also extract the length of the buffer at
prep time.

This allows removal of the io_iov_buffer_select codepaths as they are only
used from the READV op.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220907165152.994979-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Jens Axboe
de97fcb303 fs: add batch and poll flags to the uring_cmd_iopoll() handler
We need the poll_flags to know how to poll for the IO, and we should
have the batch structure in preparation for supporting batched
completions with iopoll.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Jens Axboe
a1119fb071 io_uring: cleanly separate request types for iopoll
After the addition of iopoll support for passthrough, there's a bit of
a mixup here. Clean it up and get rid of the casting for the passthrough
command type.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Kanchan Joshi
5756a3a7e7 io_uring: add iopoll infrastructure for io_uring_cmd
Put this up in the same way as iopoll is done for regular read/write IO.
Make place for storing a cookie into struct io_uring_cmd on submission.
Perform the completion using the ->uring_cmd_iopoll handler.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Pavel Begunkov
62bb0647b1 io_uring/rw: fix error'ed retry return values
Kernel test robot reports that we test negativity of an unsigned in
io_fixup_rw_res() after a recent change, which masks error codes and
messes up the return value in case I/O is re-retried and failed with
an error.

Fixes: 4d9cb92ca4 ("io_uring/rw: fix short rw error handling")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9754a0970af1861e7865f9014f735c70dc60bf79.1663071587.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-13 07:47:11 -06:00