Save a few pointer dereferences by obtaining struct ublk_queue *ubq from
the ublk_uring_cmd_pdu instead of the request's mq_hctx.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250328180411.2696494-4-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk_cmd_list_tw_cb() is always performed on a non-empty request list.
So don't check whether rq is NULL on the first iteration of the loop,
just on subsequent iterations.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250328180411.2696494-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk_dispatch_req() never uses its struct io_uring_cmd *cmd argument.
Drop it so callers don't have to pass a value.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250328180411.2696494-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Implement ->queue_rqs() for improving perf in case of MQ.
In this way, we just need to call io_uring_cmd_complete_in_task() once for
whole IO batch, then both io_uring and ublk server can get exact batch from
ublk frontend.
Follows IOPS improvement:
- tests
tools/testing/selftests/ublk/kublk add -t null -q 2 [-z]
fio/t/io_uring -p0 /dev/ublkb0
- results:
more than 10% IOPS boost observed
Pass all ublk selftests, especially the io dispatch order test.
Cc: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250327095123.179113-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IO split is usually bad in io_uring world, since -EAGAIN is caused and
IO handling may have to fallback to io-wq, this way does hurt performance.
ublk starts to support zero copy recently, for avoiding unnecessary IO
split, ublk driver's segment limit should be aligned with backend
device's segment limit.
Another reason is that io_buffer_register_bvec() needs to allocate bvecs,
which number is aligned with ublk request segment number, so that big
memory allocation can be avoided by setting reasonable max_segments limit.
So add segment parameter for providing ublk server chance to align
segment limit with backend, and keep it reasonable from implementation
viewpoint.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250327095123.179113-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Call io_uring_cmd_to_pdu() to get uring_cmd pdu, and one big benefit
is the automatic pdu size build check.
Suggested-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250327095123.179113-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In ublk_queue_rq(), ubq->canceling has to be handled after ->fail_io and
->force_abort are dealt with, otherwise the request may not be failed
when deleting disk.
Add comment on this usage.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250327095123.179113-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now ublk driver depends on `ubq->canceling` for deciding if the request
can be dispatched via uring_cmd & io_uring_cmd_complete_in_task().
Once ubq->canceling is set, the uring_cmd can be done via ublk_cancel_cmd()
and io_uring_cmd_done().
So set ubq->canceling when queue is frozen, this way makes sure that the
flag can be observed from ublk_queue_rq() reliably, and avoids
use-after-free on uring_cmd.
Fixes: 216c8f5ef0 ("ublk: replace monitor with cancelable uring_cmd")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250327095123.179113-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfe8DcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprzNEADPOOfi05F0f4KOMLYOiRlB+D9yGtnbcu8k
foPATU++L5fFDUpjAGobi38AXqHtF/Bx0CjCkPCvy4GjygcszYDY81FQmRlSFdeH
CnYV/bm5myyupc8xANcWGJYpxf/ZAw7ugOAgcR3Xxk/qXdD5hYJ1jG4477o8v4IJ
jyzE4vQCnWBogIoqpWTQyY3FSP5fdjeQlYdFGIFCx68rFTO9zEpuHhFr2OMfJtSu
y1dLkeqbu1BZ0fhgRog9k9ZWk0uRQJI7d7EUXf9iwXOfkUyzCaV+nHvNGRdhJ/XO
zj/qePw4lFyCgc+kuIejjIaPf/g84paCzshz/PIxOZTsvo6X78zj72RPM0Makhga
amgtEqegWHtKSya1zew57oDwqwaIe6b+56PtNSP7K8IrwcJLheszoPKJIwiiynIL
ZqvWseqYQXs58S9qZzSZlyEpwLVFcvyNt7Z/q7b56vhvXHabxOccPkUHt+UscVoQ
qpG0R9+zmIiDfzY6Vfu7JyH2Uspdq3WjGYiaoVUko1rJXY+HFPd3be9pIbckyaA8
ZQHIxB+h8IlxhpQ8HJKQpZhtYiluxEu6wDdvtpwC0KiGp5g4Q5eg9SkVyvETIEN8
yW2tHqMAJJtJjDAPwJOV+GIyWQMIexzLmMP7Fq8R5VIoAox+Xn3flgNNEnMZdl/r
kTc/hRjHvQ==
=VEQr
-----END PGP SIGNATURE-----
Merge tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
"This is the first of the io_uring pull requests for the 6.15 merge
window, there will be others once the net tree has gone in. This
contains:
- Cleanup and unification of cancelation handling across various
request types.
- Improvement for bundles, supporting them both for incrementally
consumed buffers, and for non-multishot requests.
- Enable toggling of using iowait while waiting on io_uring events or
not. Unfortunately this is still tied with CPU frequency boosting
on short waits, as the scheduler side has not been very receptive
to splitting the (useless) iowait stat from the cpufreq implied
boost.
- Add support for kbuf nodes, enabling zero-copy support for the ublk
block driver.
- Various cleanups for resource node handling.
- Series greatly cleaning up the legacy provided (non-ring based)
buffers. For years, we've been pushing the ring provided buffers as
the way to go, and that is what people have been using. Reduce the
complexity and code associated with legacy provided buffers.
- Series cleaning up the compat handling.
- Series improving and cleaning up the recvmsg/sendmsg iovec and msg
handling.
- Series of cleanups for io-wq.
- Start adding a bunch of selftests. The liburing repository
generally carries feature and regression tests for everything, but
at least for ublk initially, we'll try and go the route of having
it in selftests as well. We'll see how this goes, might decide to
migrate more tests this way in the future.
- Various little cleanups and fixes"
* tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux: (108 commits)
selftests: ublk: add stripe target
selftests: ublk: simplify loop io completion
selftests: ublk: enable zero copy for null target
selftests: ublk: prepare for supporting stripe target
selftests: ublk: move common code into common.c
selftests: ublk: increase max buffer size to 1MB
selftests: ublk: add single sqe allocator helper
selftests: ublk: add generic_01 for verifying sequential IO order
selftests: ublk: fix starting ublk device
io_uring: enable toggle of iowait usage when waiting on CQEs
selftests: ublk: fix write cache implementation
selftests: ublk: add variable for user to not show test result
selftests: ublk: don't show `modprobe` failure
selftests: ublk: add one dependency header
io_uring/kbuf: enable bundles for incrementally consumed buffers
Revert "io_uring/rsrc: simplify the bvec iter count calculation"
selftests: ublk: improve test usability
selftests: ublk: add stress test for covering IO vs. killing ublk server
selftests: ublk: add one stress test for covering IO vs. removing device
selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests
...
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the upcoming
Rust integration and is obviously a silly implementation to begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and the hrtimer::function becomes a private member.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
iJbvLJeisXr3GA==
=bYAX
-----END PGP SIGNATURE-----
Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the
upcoming Rust integration and is obviously a silly implementation to
begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and the hrtimer::function becomes a private member"
* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
wifi: rt2x00: Switch to use hrtimer_update_function()
io_uring: Use helper function hrtimer_update_function()
serial: xilinx_uartps: Use helper function hrtimer_update_function()
ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
RDMA: Switch to use hrtimer_setup()
virtio: mem: Switch to use hrtimer_setup()
drm/vmwgfx: Switch to use hrtimer_setup()
drm/xe/oa: Switch to use hrtimer_setup()
drm/vkms: Switch to use hrtimer_setup()
drm/msm: Switch to use hrtimer_setup()
drm/i915/request: Switch to use hrtimer_setup()
drm/i915/uncore: Switch to use hrtimer_setup()
drm/i915/pmu: Switch to use hrtimer_setup()
drm/i915/perf: Switch to use hrtimer_setup()
drm/i915/gvt: Switch to use hrtimer_setup()
drm/i915/huc: Switch to use hrtimer_setup()
drm/amdgpu: Switch to use hrtimer_setup()
stm class: heartbeat: Switch to use hrtimer_setup()
i2c: Switch to use hrtimer_setup()
iio: Switch to use hrtimer_setup()
...
The current I/O dispatch mechanism - queueing I/O by adding it to the
io_cmds list (and poking task_work as needed), then dispatching it in
ublk server task context by reversing io_cmds and completing the
io_uring command associated to each one - was introduced by commit
7d4a93176e ("ublk_drv: don't forward io commands in reserve order")
to ensure that the ublk server received I/O in the same order that the
block layer submitted it to ublk_drv. This mechanism was only needed for
the "raw" task_work submission mechanism, since the io_uring task work
wrapper maintains FIFO ordering (using quite a similar mechanism in
fact). The "raw" task_work submission mechanism is no longer supported
in ublk_drv as of commit 29dc5d0661 ("ublk: kill queuing request by
task_work_add"), so the explicit llist/reversal is no longer needed - it
just duplicates logic already present in the underlying io_uring APIs.
Remove it.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250318-ublk_io_cmds-v1-1-c1bb74798fef@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If vfs_flush() is called with queue frozen, the queue freeze lock may be
connected with FS internal lock, and lockdep warning can be triggered
because the queue freeze lock is connected with too many global or
sub-system locks.
Fix the warning by moving vfs_fsync() out of loop_update_dio():
- vfs_fsync() is only needed when switching to dio
- only loop_change_fd() and loop_configure() may switch from buffered
IO to direct IO, so call vfs_fsync() directly here. This way is safe
because either loop is in unbound, or new file isn't attached
- for the other two cases of set_status and set_block_size, direct IO
can only become off, so no need to call vfs_fsync()
Cc: Christoph Hellwig <hch@infradead.org>
Reported-by: Kun Hu <huk23@m.fudan.edu.cn>
Reported-by: Jiaji Qin <jjtan24@m.fudan.edu.cn>
Closes: https://lore.kernel.org/linux-block/359BC288-B0B1-4815-9F01-3A349B12E816@m.fudan.edu.cn/T/#u
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250318072955.3893805-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Explicitly state that zcomp compress/decompress must be called from
non-atomic context.
Link: https://lkml.kernel.org/r/20250303022425.285971-20-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ensure the page used for local object data is freed on error out path.
Link: https://lkml.kernel.org/r/20250303022425.285971-19-senozhatsky@chromium.org
Fixes: 330edc2bc0 (zram: rework writeback target selection strategy)
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ensure the page used for local object data is freed on error out path.
Link: https://lkml.kernel.org/r/20250303022425.285971-18-senozhatsky@chromium.org
Fixes: 3f909a60ce ("zram: rework recompress target selection strategy")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When configured with pre-trained compression/decompression dictionary
support, zstd requires custom memory allocator, which it calls internally
from compression()/decompression() routines. That means allocation from
atomic context (either under entry spin-lock, or per-CPU local-lock or
both). Now, with non-atomic zram read()/write(), those limitations are
relaxed and we can allow direct and indirect reclaim.
Link: https://lkml.kernel.org/r/20250303022425.285971-17-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use new read/write zsmalloc object API. For cases when RO mapped object
spans two physical pages (requires temp buffer) compression streams now
carry around one extra physical page.
Link: https://lkml.kernel.org/r/20250303022425.285971-16-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Allocate post-processing target in place_pp_slot(). This simplifies
scan_slots_for_writeback() and scan_slots_for_recompress() loops because
we don't need to track pps pointer state anymore. Previously we have to
explicitly NULL the point if it has been added to a post-processing bucket
or re-use previously allocated pointer otherwise and make sure we don't
leak the memory in the end.
We are also fine doing GFP_NOIO allocation, as post-processing can be
called under memory pressure so we better pick as many slots as we can as
soon as we can and start post-processing them, possibly saving the memory.
Allocation failure there is not fatal, we will post-process whatever we
put into the buckets on previous iterations.
Link: https://lkml.kernel.org/r/20250303022425.285971-12-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reworks recompression loop handling:
- set a rule that stream-put NULLs the stream pointer If the loop
returns with a non-NULL stream then it's a successful recompression,
otherwise the stream should always be NULL.
- do not count the number of recompressions Mark object as
incompressible as soon as the algorithm with the highest priority failed
to compress that object.
- count compression errors as resource usage Even if compression has
failed, we still need to bump num_recomp_pages counter.
Link: https://lkml.kernel.org/r/20250303022425.285971-11-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Do no select for post processing slots that are already compressed with
same or higher priority compression algorithm.
This should save some memory, as previously we would still put those
entries into corresponding post-processing buckets and filter them out
later in recompress_slot().
Link: https://lkml.kernel.org/r/20250303022425.285971-10-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the actual number of algorithms zram was configure with instead of
theoretical limit of ZRAM_MAX_COMPS.
Also make sure that min prio is not above max prio.
Link: https://lkml.kernel.org/r/20250303022425.285971-9-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is no zsmalloc handle allocation slow path now and writestall is not
possible any longer. Remove it from zram_stats.
Link: https://lkml.kernel.org/r/20250303022425.285971-8-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We normally use __GFP_NOWARN for zsmalloc handle allocations, add it to
write_incompressible_page() allocation too.
Link: https://lkml.kernel.org/r/20250303022425.285971-7-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Previously zram write() was atomic which required us to pass
__GFP_KSWAPD_RECLAIM to zsmalloc handle allocation on a fast path and
attempt a slow path allocation (with recompression) if the fast path
failed.
Since we are not in atomic context anymore we can permit direct reclaim
during handle allocation, and hence can have a single allocation path.
There is no slow path anymore so we don't unlock per-CPU stream (and don't
lose compressed data) which means that there is no need to do
recompression now (which should reduce CPU and battery usage).
Link: https://lkml.kernel.org/r/20250303022425.285971-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
max_comp_streams device attribute has been defunct since May 2016 when
zram switched to per-CPU compression streams, remove it.
Link: https://lkml.kernel.org/r/20250303022425.285971-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We stopped using crypto API (for the time being), so remove its include
and replace CRYPTO_MAX_ALG_NAME with a local define.
Link: https://lkml.kernel.org/r/20250303022425.285971-4-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, per-CPU stream access is done from a non-preemptible (atomic)
section, which imposes the same atomicity requirements on compression
backends as entry spin-lock, and makes it impossible to use algorithms
that can schedule/wait/sleep during compression and decompression.
Switch to preemptible per-CPU model, similar to the one used in zswap.
Instead of a per-CPU local lock, each stream carries a mutex which is
locked throughout entire time zram uses it for compression or
decompression, so that cpu-dead event waits for zram to stop using a
particular per-CPU stream and release it.
Link: https://lkml.kernel.org/r/20250303022425.285971-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "zsmalloc/zram: there be preemption", v10.
Currently zram runs compression and decompression in non-preemptible
sections, e.g.
zcomp_stream_get() // grabs CPU local lock
zcomp_compress()
or
zram_slot_lock() // grabs entry spin-lock
zcomp_stream_get() // grabs CPU local lock
zs_map_object() // grabs rwlock and CPU local lock
zcomp_decompress()
Potentially a little troublesome for a number of reasons.
For instance, this makes it impossible to use async compression algorithms
or/and H/W compression algorithms, which can wait for OP completion or
resource availability. This also restricts what compression algorithms
can do internally, for example, zstd can allocate internal state memory
for C/D dictionaries:
do_fsync()
do_writepages()
zram_bio_write()
zram_write_page() // become non-preemptible
zcomp_compress()
zstd_compress()
ZSTD_compress_usingCDict()
ZSTD_compressBegin_usingCDict_internal()
ZSTD_resetCCtx_usingCDict()
ZSTD_resetCCtx_internal()
zstd_custom_alloc() // memory allocation
Not to mention that the system can be configured to maximize compression
ratio at a cost of CPU/HW time (e.g. lz4hc or deflate with very high
compression level) so zram can stay in non-preemptible section (even under
spin-lock or/and rwlock) for an extended period of time. Aside from
compression algorithms, this also restricts what zram can do. One
particular example is zram_write_page() zsmalloc handle allocation, which
has an optimistic allocation (disallowing direct reclaim) and a
pessimistic fallback path, which then forces zram to compress the page one
more time.
This series changes zram to not directly impose atomicity restrictions on
compression algorithms (and on itself), which makes zram write() fully
preemptible; zram read(), sadly, is not always preemptible yet. There are
still indirect atomicity restrictions imposed by zsmalloc(). One notable
example is object mapping API, which returns with: a) local CPU lock held
b) zspage rwlock held
First, zsmalloc's zspage lock is converted from rwlock to a special type
of RW-lookalike look with some extra guarantees/features. Second, a new
handle mapping is introduced which doesn't use per-CPU buffers (and hence
no local CPU lock), does fewer memcpy() calls, but requires users to
provide a pointer to temp buffer for object copy-in (when needed). Third,
zram is converted to the new zsmalloc mapping API and thus zram read()
becomes preemptible.
This patch (of 19):
Concurrent modifications of meta table entries is now handled by per-entry
spin-lock. This has a number of shortcomings.
First, this imposes atomic requirements on compression backends. zram can
call both zcomp_compress() and zcomp_decompress() under entry spin-lock,
which implies that we can use only compression algorithms that don't
schedule/sleep/wait during compression and decompression. This, for
instance, makes it impossible to use some of the ASYNC compression
algorithms (H/W compression, etc.) implementations.
Second, this can potentially trigger watchdogs. For example, entry
re-compression with secondary algorithms is performed under entry
spin-lock. Given that we chain secondary compression algorithms and that
some of them can be configured for best compression ratio (and worst
compression speed) zram can stay under spin-lock for quite some time.
Having a per-entry mutex (or, for instance, a rw-semaphore) significantly
increases sizeof() of each entry and hence the meta table. Therefore
entry locking returns back to bit locking, as before, however, this time
also preempt-rt friendly, because if waits-on-bit instead of
spinning-on-bit. Lock owners are also now permitted to schedule, which is
a first step on the path of making zram non-atomic.
Link: https://lkml.kernel.org/r/20250303022425.285971-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20250303022425.285971-2-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfTYxsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjVKD/9qZjztmVq6Rk9RBjZwMYxcO9Nzj7qQQ6m9
S15eXslAA1eLec3p1Mx4oVaWoFranY03BClqCgywBUAgpYstnT9cEqkz0P+n6xIE
bNGjfxx4NInvrQYRETskc4wQqOnAdiRMd9i96EpHqW9Pi/pl8dSQxmxlaeo0BBIM
XDvodLhr38aLJwcNBQ9NKCfhJ7RruACuSiXRAsPH3D641ZpccW4ADuhYxhJDehKa
fMzuEFaa9/mBvmIrhE3QCbvWr7VYzSkadMLJWnxLsN1PU4FXZbh4Oy5Kp7DrA4Zq
YkwezSivjNWNqNsiyvVa63mKbxfe9MSh5odqWuLrkWr4cOzEOcpHRbNV2El5RK/x
BtGt/eCT2cRQAG4MzveoiE1yG9AAmUvUZL/RvxbERedqWO69IsgrsIsdnoiaLgw/
267eCeGQlpHGhVUKga7ouShlTowTaCLCi+XgJwUTsVP/VPuzEFwgkzX0J45bSPGd
h0laUzuHcThe8cRY2t5JWu+JJTqHj6ubsPeqiMAQzCns1C+IWYsjPXEohfqt7av+
2yoIwG9DCBfJfh0ml0t3yHHMSJzjcwQcQAw1P7loLI+TIDvrpVP7AYOVt4SYeXl4
RTEvNKQRmQGNZ8B3lGrqVKnbJ5ExBzvE6muQTOhockCTQsNK7WNaT2dMWRLyW6rW
HcdUkADDVg==
=MG/C
-----END PGP SIGNATURE-----
Merge tag 'block-6.14-20250313' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Concurrent pci error and hotplug handling fix (Keith)
- Endpoint function fixes (Damien)
- Fix for a regression introduced in this cycle with error checking for
batched request completions (Shin'ichiro)
* tag 'block-6.14-20250313' of git://git.kernel.dk/linux:
block: change blk_mq_add_to_batch() third argument type to bool
nvme: move error logging from nvme_end_req() to __nvme_end_req()
nvmet: pci-epf: Do not add an IRQ vector if not needed
nvmet: pci-epf: Set NVMET_PCI_EPF_Q_LIVE when a queue is fully created
nvme-pci: fix stuck reset on concurrent DPC and HP
request_queue param is no longer used by blk_rq_map_sg and
__blk_rq_map_sg. Remove it.
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250313035322.243239-1-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 1f47ed294a ("block: cleanup and fix batch completion adding
conditions") modified the evaluation criteria for the third argument,
'ioerror', in the blk_mq_add_to_batch() function. Initially, the
function had checked if 'ioerror' equals zero. Following the commit, it
started checking for negative error values, with the presumption that
such values, for instance -EIO, would be passed in.
However, blk_mq_add_to_batch() callers do not pass negative error
values. Instead, they pass status codes defined in various ways:
- NVMe PCI and Apple drivers pass NVMe status code
- virtio_blk driver passes the virtblk request header status byte
- null_blk driver passes blk_status_t
These codes are either zero or positive, therefore the revised check
fails to function as intended. Specifically, with the NVMe PCI driver,
this modification led to the failure of the blktests test case nvme/039.
In this test scenario, errors are artificially injected to the NVMe
driver, resulting in positive NVMe status codes passed to
blk_mq_add_to_batch(), which unexpectedly processes the failed I/O in a
batch. Hence the failure.
To correct the ioerror check within blk_mq_add_to_batch(), make all
callers to uniformly pass the argument as boolean. Modify the callers to
check their specific status codes and pass the boolean value 'is_error'.
Also describe the arguments of blK_mq_add_to_batch as kerneldoc.
Fixes: 1f47ed294a ("block: cleanup and fix batch completion adding conditions")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20250311104359.1767728-3-shinichiro.kawasaki@wdc.com
[axboe: fold in documentation update]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the `module!` macro, the `author` field is currently of type `String`.
Since modules can have multiple authors, this limitation prevents
specifying more than one.
Add an `authors` field as `Option<Vec<String>>` to allow creating
modules with multiple authors, and change the documentation and all
current users to use it. Eventually, the single `author` field may
be removed.
[ The `modinfo` key needs to still be `author`; otherwise, tooling
may not work properly, e.g.:
$ modinfo --author samples/rust/rust_print.ko
Rust for Linux Contributors
I have also kept the original `author` field (undocumented), so
that we can drop it more easily in a kernel cycle or two.
- Miguel ]
Suggested-by: Miguel Ojeda <ojeda@kernel.org>
Link: https://github.com/Rust-for-Linux/linux/issues/244
Reviewed-by: Charalampos Mitrodimas <charmitro@posteo.net>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Signed-off-by: Guilherme Giacomo Simoes <trintaeoitogc@gmail.com>
Link: https://lore.kernel.org/r/20250309175712.845622-2-trintaeoitogc@gmail.com
[ Fixed `modinfo` key. Kept `author` field. Reworded message
accordingly. Updated my email. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfKQvsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnBCD/9bVSGHNnXakVwdpQmtU5zy54cyWd7VaYsz
qeM+Vrl1m5nf8q5ZdEXcM11Ruib3YJiW0GN9d9sWpTwt8C5n8g+8F63koS7GordZ
jcv77nO6FlnwWpm3YlNxAeLuxkl15e4MQIKj/jb540iFygzT8H2lygE816K4kpCX
XuMxNxdSMksntovZufzxo3Sfkm6e6GChCkkqvBxuXiEWFhvbFQ/ZLEsEMtoH4hkI
3Nj1VB3B3pLVCZhWr2uVvcZCiYUDyBslu+SA3RRoX0W6beK1cVI4OQdS8GtnkJf3
qFnLQz0Ib3EVDtugqg7ZGSAAov6Z8waA2MrFeZkG8uIfl4WT3kBfoan7jRX3Mknl
VnFkThyJOzB83OKqlZKjCzYmEzBhKJrRJVtneIrxT+gvEpevFvAQil6SQfyPDwno
4YcUD+IfU/daTdVR58QQ/iLzkQ7stQWYCtZSrICKfcAGy6zswKM5P5uoWltMBwQh
aHsyz9xbmsMrxch1DPRb0T2GD2h9BsiL6rT8JCrOgucMuOYeZL9pNRgz16D/hael
wBCxPcanSdap0N9kiMX8fLYYdmRxpJHzTbeNRsPhZe8HKUPu1sYTbisOou1XSdAW
Dv7zeQWVlw+1cn/S1Y6Oc4mdlPzPTA9szuBXVpbe9Gd7ZqO7sbbKEkGu5w6MGSZ1
oubnZKCNvA==
=jKDe
-----END PGP SIGNATURE-----
Merge tag 'block-6.14-20250306' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- TCP use after free fix on polling (Sagi)
- Controller memory buffer cleanup fixes (Icenowy)
- Free leaking requests on bad user passthrough commands (Keith)
- TCP error message fix (Maurizio)
- TCP corruption fix on partial PDU (Maurizio)
- TCP memory ordering fix for weakly ordered archs (Meir)
- Type coercion fix on message error for TCP (Dan)
- Name the RQF flags enum, fixing issues with anon enums and BPF import
of it
- ublk parameter setting fix
- GPT partition 7-bit conversion fix
* tag 'block-6.14-20250306' of git://git.kernel.dk/linux:
block: Name the RQF flags enum
nvme-tcp: fix signedness bug in nvme_tcp_init_connection()
block: fix conversion of GPT partition name to 7-bit
ublk: set_params: properly check if parameters can be applied
nvmet-tcp: Fix a possible sporadic response drops in weakly ordered arch
nvme-tcp: fix potential memory corruption in nvme_tcp_recv_pdu()
nvme-tcp: Fix a C2HTermReq error message
nvmet: remove old function prototype
nvme-ioctl: fix leaked requests on mapping error
nvme-pci: skip CMB blocks incompatible with PCI P2P DMA
nvme-pci: clean up CMBMSC when registering CMB fails
nvme-tcp: fix possible UAF in nvme_tcp_poll
There is a truncation of badblocks length issue when set badblocks as
follow:
echo "2055 4294967299" > bad_blocks
cat bad_blocks
2055 3
Change 'sectors' argument type from 'int' to 'sector_t'.
This change avoids truncation of badblocks length for large sectors by
replacing 'int' with 'sector_t' (u64), enabling proper handling of larger
disk sizes and ensuring compatibility with 64-bit sector addressing.
Fixes: 9e0e252a04 ("badblocks: Add core badblock management code")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-13-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change the return type of badblocks_set() and badblocks_clear()
from int to bool, indicating success or failure. Specifically:
- _badblocks_set() and _badblocks_clear() functions now return
true for success and false for failure.
- All calls to these functions are updated to handle the new
boolean return type.
- This change improves code clarity and ensures a more consistent
handling of success and failure states.
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Acked-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20250227075507.151331-11-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The parameters set by the set_params call are only applied to the block
device in the start_dev call. So if a device has already been started, a
subsequently issued set_params on that device will not have the desired
effect, and should return an error. There is an existing check for this
- set_params fails on devices in the LIVE state. But this check is not
sufficient to cover the recovery case. In this case, the device will be
in the QUIESCED or FAIL_IO states, so set_params will succeed. But this
success is misleading, because the parameters will not be applied, since
the device has already been started (by a previous ublk server). The bit
UB_STATE_USED is set on completion of the start_dev; use it to detect
and fail set_params commands which arrive too late to be applied (after
start_dev).
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Fixes: 0aa73170eb ("ublk_drv: add SET_PARAMS/GET_PARAMS control command")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250304-set_params-v1-1-17b5e0887606@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 403ebc8778 ("ublk_drv: add module parameter of ublks_max for
limiting max allowed ublk dev"), claimed ublks_max was added to prevent
a DoS situation with an untrusted user creating too many ublk devices.
If that's the case, ublks_max should only restrict the number of
unprivileged ublk devices in the system. Enforce the limit only for
unprivileged ublk devices, and rename variables accordingly. Leave the
external-facing parameter name unchanged, since changing it may break
systems which use it (but still update its documentation to reflect its
new meaning).
As a result of this change, in a system where there are only normal
(non-unprivileged) devices, the maximum number of such devices is
increased to 1 << MINORBITS, or 1048576. That ought to be enough for
anyone, right?
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250228-ublks_max-v1-1-04b7379190c0@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The struct is introduced in the commit 754d96798f
("loop: remove loop.h"), but it is not used now.
So remove it.
No functional changes.
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250227163343.55952-1-yanjun.zhu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_buffer_register_bvec() takes index as an unsigned int argument, but
ublk_register_io_buf() casts ub_cmd->addr (a u64) to int. Remove the
misleading cast and instead pass index as an unsigned value to
ublk_register_io_buf() and ublk_unregister_io_buf().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250301190317.950208-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The in-tree ublk driver doesn't need DMA alignment limit because there
is one data copy between request pages and the userspace buffer.
However, ublk is going to support zero copy, then DMA alignment limit
is required, because same IO buffer is forwarded to backend which may
have specific buffer DMA alignment limit, so the limit has to be exposed
from the frontend driver to client application.
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250227103707.2640014-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current null_blk implementation checks if any bad blocks exist in
the target blocks of each IO. If so, the IO fails and data is not
transferred for all of the IO target blocks. However, when real storage
devices have bad blocks, the devices may transfer data partially up to
the first bad blocks (e.g., SAS drives). Especially, when the IO is a
write operation, such partial IO leaves partially written data on the
device.
To simulate such partial IO using null_blk, introduce the new parameter
'badblocks_partial_io'. When this parameter is set,
null_handle_badblocks() returns the number of the sectors for the
partial IO as its third pointer argument. Pass the returned number of
sectors to the following calls to null_handle_memory_backend() in
null_process_cmd() and null_zone_write().
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20250226100613.1622564-6-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As preparation to support partial data transfer, add a new argument to
null_handle_rq() to pass the number of sectors to transfer. While at it,
rename the function from null_handle_rq to null_handle_data_transfer.
This commit does not change the behavior.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20250226100613.1622564-5-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As a preparation to support partial data transfer due to badblocks,
replace the null_process_cmd() call in null_zone_write() with equivalent
calls to null_handle_badblocks() and null_handle_memory_backed(). This
commit does not change behavior. It will enable null_handle_badblocks()
to return the size of partial data transfer in the following commit,
allowing null_zone_write() to move write pointers appropriately.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20250226100613.1622564-4-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When IO errors happen on real storage devices, the IOs repeated to the
same target range can success by virtue of recovery features by devices,
such as reserved block assignment. To simulate such IO errors and
recoveries, introduce the new parameter badblocks_once parameter. When
this parameter is set to 1, the specified badblocks are cleared after
the first IO error, so that the next IO to the blocks succeed.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20250226100613.1622564-3-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The null_blk configfs file 'features' provides a string that lists
available null_blk features for userspace programs to reference.
The string is defined as a long constant in the code, which tends to be
forgotten for updates. It also causes checkpatch.pl to report
"WARNING: quoted string split across lines".
To avoid these drawbacks, generate the feature string on the fly. Refer
to the ca_name field of each element in the nullb_device_attrs table and
concatenate them in the given buffer. Also, sorted nullb_device_attrs
table elements in alphabetical order.
Of note is that the feature "index" was missing before this commit.
This commit adds it to the generated string.
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20250226100613.1622564-2-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of an error, ublk's ->uring_cmd() functions currently return
-EIOCBQUEUED and immediately call io_uring_cmd_done(). -EIOCBQUEUED and
io_uring_cmd_done() are intended for asynchronous completions. For
synchronous completions, the ->uring_cmd() function can just return the
negative return code directly. This skips io_uring_cmd_del_cancelable(),
and deferring the completion to task work. So return the error code
directly from __ublk_ch_uring_cmd() and ublk_ctrl_uring_cmd().
Update ublk_ch_uring_cmd_cb(), which currently ignores the return value
from __ublk_ch_uring_cmd(), to call io_uring_cmd_done() for synchronous
completions.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250225212456.2902549-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Indicate to userspace applications if a UBLK_IO_UNREGISTER_IO_BUF
command specifies an invalid buffer index by returning an error code.
Return -EINVAL if no buffer is registered with the given index, and
-EBUSY if the registered buffer is not a kernel bvec.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228231432.642417-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Provide new operations for the user to request mapping an active request
to an io uring instance's buf_table. The user has to provide the index
it wants to install the buffer.
A reference count is taken on the request to ensure it can't be
completed while it is active in a ring's buf_table.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-6-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The loop driver currently uses the logical block size of the underlying
bdev as the lower bound of the loop device block size. While this works
for many cases, it fails for file systems made up of multiple devices
with different logical block sizes (e.g. XFS with a RT device that has a
larger logical block size), or when the file systems doesn't support
direct I/O writes at the sector size granularity (e.g. because it does
out of place writes with a file system block size larger than the sector
size).
Fix this by querying the minimum direct I/O alignment from statx when
available.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250131120120.1315125-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can't go below the minimum direct I/O size no matter if direct I/O is
enabled by passing in an O_DIRECT file descriptor or due to the explicit
flag. Now that LO_FLAGS_DIRECT_IO is set earlier after assigning a
backing file, loop_default_blocksize can check it instead of the
O_DIRECT flag to handle both conditions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250131120120.1315125-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Assigning LO_FLAGS_DIRECT_IO from the O_DIRECT flag is related to
assigning a new backing file. Move the assignment in preparation
of using the flag more and earlier.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250131120120.1315125-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split the code for setting up a backing file into a helper in preparation
of adding more code to this path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250131120120.1315125-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit ad934fc1784802fd1408224474b25ee5289fadfc.
loop_queue_work should be strictly serialized to loop_process_work since
the lo_worker could be freed without noticing new work has been queued
again.
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Link: https://lore.kernel.org/r/20250218065835.19503-1-zhaoyang.huang@unisoc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Patch was created by using Coccinelle.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/076cd30acb4b2d6b699f3c80463a8983530d4f06.1738746821.git.namcao@linutronix.de
queue_work could spin on wq->cpu_pwq->pool->lock which could lead to
concurrent loop_process_work failed on lo_work_lock contention and
increase the request latency. Remove this combination by moving the
lock release ahead of queue_work.
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Link: https://lore.kernel.org/r/20250207091942.3966756-1-zhaoyang.huang@unisoc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmemOHsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmE5D/0fcxzxCS9PpFIjcBfEu1jAy8q5MAxE/u63
PQkkeKHGDUP2/8Ely7Wg1H5tiNagkDj9DHLHJEQikDpPsTCq5bv59VKSCdbsKmjC
tyGHIxdQoc57jS17/UPCALOBQAro+eIX5YLf1av01lNJb8s1bScxseU0ETGSYkUr
gXMQjSQuo3lggjvcZRf3LpmNKyO8h+Eh/d/3tBsijTh5nmqSvdbF+CcD560KjLTv
3UPnT45OXK2fDd+JMxcn5CK4+tMA5Fv5VAf12GlHDgBjUMk1Q2gYEqYoxI9Kjr2E
8WC/TGsKcyVGBhjQMNsN+zWV2Sc483n4KzcI9lksEyqgA8JbFp77sQ+JS7hypWbX
meI+UV49grY/FzrrvZ8A3dlK2A9ktRX0G9UpOQts/i+dFaPsErZ2fihOBlEPAMBp
fxMnYIdJiicLT6tEJno7AfRWhpVvqIPzWxPBcAip1Rgad2vLtz567rvlcOrm/pJE
WMdVhQAs1Lt2viyJPy2JkCqUAvBt9UwGJ2Jo4mlZxpRg6Ns+mfe0m1o5uivtiu3/
o7lYHRq6wm+HITXLXZvBe/T6/HV/b2xsxIatt47AJPMLVMf2B0JGvMTT9Ed+8LJQ
6HLrZHPPYPLOEV2tLTIkYSJXctfpK5fkgKr3OS0M/1q4gV/fB9g7XuUALquYxM9Q
qxZbmtho9g==
=nffI
-----END PGP SIGNATURE-----
Merge tag 'block-6.14-20250207' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- MD pull request via Song:
- fix an error handling path for md-linear
- NVMe pull request via Keith:
- Connection fixes for fibre channel transport (Daniel)
- Endian fixes (Keith, Christoph)
- Cleanup fix for host memory buffer (Francis)
- Platform specific power quirks (Georg)
- Target memory leak (Sagi)
- Use appropriate controller state accessor (Daniel)
- Fixup for a regression introduced last week, where sunvdc wasn't
updated for an API change, causing compilation failures on sparc64.
* tag 'block-6.14-20250207' of git://git.kernel.dk/linux:
drivers/block/sunvdc.c: update the correct AIP call
md: Fix linear_set_limits()
nvme-fc: use ctrl state getter
nvme: make nvme_tls_attrs_group static
nvmet: add a missing endianess conversion in nvmet_execute_admin_connect
nvmet: the result field in nvmet_alloc_ctrl_args is little endian
nvmet: fix a memory leak in controller identify
nvme-fc: do not ignore connectivity loss during connecting
nvme: handle connectivity loss in nvme_set_queue_count
nvme-fc: go straight to connecting state when initializing
nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk
nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk
nvme-pci: remove redundant dma frees in hmb
nvmet: fix rw control endian access
My sparc64 defconfig build failed like this:
drivers/block/sunvdc.c: In function 'vdc_queue_drain':
drivers/block/sunvdc.c:1130:9: error: too many arguments to function 'blk_mq_unquiesce_queue'
1130 | blk_mq_unquiesce_queue(q, memflags);
| ^~~~~~~~~~~~~~~~~~~~~~
In file included from drivers/block/sunvdc.c:10:
include/linux/blk-mq.h:895:6: note: declared here
895 | void blk_mq_unquiesce_queue(struct request_queue *q);
| ^~~~~~~~~~~~~~~~~~~~~~
drivers/block/sunvdc.c:1131:9: error: too few arguments to function 'blk_mq_unfreeze_queue'
1131 | blk_mq_unfreeze_queue(q);
| ^~~~~~~~~~~~~~~~~~~~~
In file included from drivers/block/sunvdc.c:10:
include/linux/blk-mq.h:914:1: note: declared here
914 | blk_mq_unfreeze_queue(struct request_queue *q, unsigned int memflags)
| ^~~~~~~~~~~~~~~~~~~~~
Fixes: 1e1a9cecfa ("block: force noio scope in blk_mq_freeze_queue")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmec72kQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpm0pD/sENbY3LKqN8JipH8fWjLmdcimB7STB8MpB
LcuWiiuhDQuU1Yjr7c1h0Nkb6qwrb+gP+NQJrhlyUO7JVlvzFivJqsRscibNj+MF
hX/vUM2zKo0La2YxRv7aygTTCs8JfRmLKkTtfo0H4geljbX8/yMKSEJA+7x4j60X
TeJjmAHEKBGm2cNA7QMGqhzoUfjW1WQyn9x6/YOlLsYLYB8Cd7LFJqxM1tiDzUay
ys/QJ833qctdqBK5iLIh5voqZA58N3koidw536NLNgyvJn77CSAmC5jF1Vr0kFS6
Zb+dE9qRXDsND83dq3qvpSG5gtE42DygEG0g7s49zztd1rPOWE45RCxQtcZqHSvu
N9hN+/4SNX+02AfrgZFEk8+yGbtIlsb0mHRxqIKh8261tb1RTYVmtI4tqC3GVMjj
Qo41rDLDhaPOoKayo/KZO55n9U9M8BIkoF73hIIr5l9wHf5iOwi+EjzULlYhGigc
xypMvwWFLu3Of9YCJo38fTxGO6fL4eOJihoqqxNfeRACyQqbVeLWAT4wgUr2G8la
MIp2l5Vb7MmFXsES39fARLoN/8jjbgaXdxnSWhZlo4SPQaAlVA5ScXC4tKI3Gte3
O7fPVAX76PrZ+y6x4enfooI3mrpkeK676BxsokOX0m8WrVyVmIc4cfSaayc1gC5m
Uy298X93IA==
=molg
-----END PGP SIGNATURE-----
Merge tag 'block-6.14-20250131' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- MD pull request via Song:
- Fix a md-cluster regression introduced
- More sysfs race fixes
- Mark anything inside queue freezing as not being able to do IO for
memory allocations
- Fix for a regression introduced in loop in this merge window
- Fix for a regression in queue mapping setups introduced in this merge
window
- Fix for the block dio fops attempting an iov_iter revert upton
getting -EIOCBQUEUED on the read side. This one is going to stable as
well
* tag 'block-6.14-20250131' of git://git.kernel.dk/linux:
block: force noio scope in blk_mq_freeze_queue
block: fix nr_hw_queue update racing with disk addition/removal
block: get rid of request queue ->sysfs_dir_lock
loop: don't clear LO_FLAGS_PARTSCAN on LOOP_SET_STATUS{,64}
md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime
blk-mq: create correct map for fallback case
block: don't revert iter for -EIOCBQUEUED
When block drivers or the core block code perform allocations with a
frozen queue, this could try to recurse into the block device to
reclaim memory and deadlock. Thus all allocations done by a process
that froze a queue need to be done without __GFP_IO and __GFP_FS.
Instead of tying to track all of them down, force a noio scope as
part of freezing the queue.
Note that nvme is a bit of a mess here due to the non-owner freezes,
and they will be addressed separately.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250131120352.1315351-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Here is the big set of driver core and debugfs updates for 6.14-rc1.
It's coming late in the merge cycle as there are a number of merge
conflicts with your tree now, and I wanted to make sure they were
working properly. To resolve them, look in linux-next, and I will send
the "fixup" patch as a response to the pull request.
Included in here is a bunch of driver core, PCI, OF, and platform rust
bindings (all acked by the different subsystem maintainers), hence the
merge conflict with the rust tree, and some driver core api updates to
mark things as const, which will also require some fixups due to new
stuff coming in through other trees in this merge window.
There are also a bunch of debugfs updates from Al, and there is at least
one user that does have a regression with these, but Al is working on
tracking down the fix for it. In my use (and everyone else's linux-next
use), it does not seem like a big issue at the moment.
Here's a short list of the things in here:
- driver core bindings for PCI, platform, OF, and some i/o functions.
We are almost at the "write a real driver in rust" stage now,
depending on what you want to do.
- misc device rust bindings and a sample driver to show how to use
them
- debugfs cleanups in the fs as well as the users of the fs api for
places where drivers got it wrong or were unnecessarily doing things
in complex ways.
- driver core const work, making more of the api take const * for
different parameters to make the rust bindings easier overall.
- other small fixes and updates
All of these have been in linux-next with all of the aforementioned
merge conflicts, and the one debugfs issue, which looks to be resolved
"soon".
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ5koPA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymFHACfT5acDKf2Bov2Lc/5u3vBW/R6ChsAnj+LmgVI
hcDSPodj4szR40RRnzBd
=u5Ey
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and debugfs updates from Greg KH:
"Here is the big set of driver core and debugfs updates for 6.14-rc1.
Included in here is a bunch of driver core, PCI, OF, and platform rust
bindings (all acked by the different subsystem maintainers), hence the
merge conflict with the rust tree, and some driver core api updates to
mark things as const, which will also require some fixups due to new
stuff coming in through other trees in this merge window.
There are also a bunch of debugfs updates from Al, and there is at
least one user that does have a regression with these, but Al is
working on tracking down the fix for it. In my use (and everyone
else's linux-next use), it does not seem like a big issue at the
moment.
Here's a short list of the things in here:
- driver core rust bindings for PCI, platform, OF, and some i/o
functions.
We are almost at the "write a real driver in rust" stage now,
depending on what you want to do.
- misc device rust bindings and a sample driver to show how to use
them
- debugfs cleanups in the fs as well as the users of the fs api for
places where drivers got it wrong or were unnecessarily doing
things in complex ways.
- driver core const work, making more of the api take const * for
different parameters to make the rust bindings easier overall.
- other small fixes and updates
All of these have been in linux-next with all of the aforementioned
merge conflicts, and the one debugfs issue, which looks to be resolved
"soon""
* tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits)
rust: device: Use as_char_ptr() to avoid explicit cast
rust: device: Replace CString with CStr in property_present()
devcoredump: Constify 'struct bin_attribute'
devcoredump: Define 'struct bin_attribute' through macro
rust: device: Add property_present()
saner replacement for debugfs_rename()
orangefs-debugfs: don't mess with ->d_name
octeontx2: don't mess with ->d_parent or ->d_parent->d_name
arm_scmi: don't mess with ->d_parent->d_name
slub: don't mess with ->d_name
sof-client-ipc-flood-test: don't mess with ->d_name
qat: don't mess with ->d_name
xhci: don't mess with ->d_iname
mtu3: don't mess wiht ->d_iname
greybus/camera - stop messing with ->d_iname
mediatek: stop messing with ->d_iname
netdevsim: don't embed file_operations into your structs
b43legacy: make use of debugfs_get_aux()
b43: stop embedding struct file_operations into their objects
carl9170: stop embedding file_operations into their objects
...
A small number of improvements all over the place:
vdpa/octeon gained support for multiple interrupts
virtio-pci gained support for error recovery
vp_vdpa gained support for notification with data
vhost/net has been fixed to set num_buffers for spec compliance
virtio-mem now works with kdump on s390
Small cleanups all over the place.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmeXnAsPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpbsYH/0gfvGFBrILN3O06cWtm/ZEny6U86o3imvxm
5tBYOu/gh7yFqPHb3ywwz0Xy8Sty8zdIGVcod6+ioiS5JxV4m75/8eODZZHK/O+g
W+2ozgRFm07RIQX8qQxfN6MURTEw9GHWLPqHfLopbQtoKJbD0NpWnm272xlJkox2
SzuHJ2D1Sg3ItcRr0x1TVsjefQKUHFduS/nt2WfQWjCnEXEbCx3S+Jp6oFCoub6L
zgI6RLim9HdScgo5lXzbWEyJ4fEjWOypO3Z5IEXls8ZP/OEueCHZX3eZmfgbbfhP
/uCPhoIxHe4PJBFDRKogdNyV40Iq8LvF7RzhOtJjS7GFlf1bipM=
=PM05
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"A small number of improvements all over the place:
- vdpa/octeon support for multiple interrupts
- virtio-pci support for error recovery
- vp_vdpa support for notification with data
- vhost/net fix to set num_buffers for spec compliance
- virtio-mem now works with kdump on s390
And small cleanups all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (23 commits)
virtio_blk: Add support for transport error recovery
virtio_pci: Add support for PCIe Function Level Reset
vhost/net: Set num_buffers for virtio 1.0
vdpa/octeon_ep: read vendor-specific PCI capability
virtio-pci: define type and header for PCI vendor data
vdpa/octeon_ep: handle device config change events
vdpa/octeon_ep: enable support for multiple interrupts per device
vdpa: solidrun: Replace deprecated PCI functions
s390/kdump: virtio-mem kdump support (CONFIG_PROC_VMCORE_DEVICE_RAM)
virtio-mem: support CONFIG_PROC_VMCORE_DEVICE_RAM
virtio-mem: remember usable region size
virtio-mem: mark device ready before registering callbacks in kdump mode
fs/proc/vmcore: introduce PROC_VMCORE_DEVICE_RAM to detect device RAM ranges in 2nd kernel
fs/proc/vmcore: factor out freeing a list of vmcore ranges
fs/proc/vmcore: factor out allocating a vmcore range and adding it to a list
fs/proc/vmcore: move vmcore definitions out of kcore.h
fs/proc/vmcore: prefix all pr_* with "vmcore:"
fs/proc/vmcore: disallow vmcore modifications while the vmcore is open
fs/proc/vmcore: replace vmcoredd_mutex by vmcore_mutex
fs/proc/vmcore: convert vmcore_cb_lock into vmcore_mutex
...
LOOP_SET_STATUS{,64} can set a lot more flags than it is supposed to
clear (the LOOP_SET_STATUS_CLEARABLE_FLAGS vs
LOOP_SET_STATUS_SETTABLE_FLAGS defines should have been a hint..).
Fix this by only clearing the bits in LOOP_SET_STATUS_CLEARABLE_FLAGS.
Fixes: ae074d07a0 ("loop: move updating lo_flag s out of loop_set_status_from_info")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250127143045.538279-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add support for proper cleanup and re-initialization of virtio-blk devices
during transport reset error recovery flow.
This enhancement includes:
- Pre-reset handler (reset_prepare) to perform device-specific cleanup
- Post-reset handler (reset_done) to re-initialize the device
These changes allow the device to recover from various reset scenarios,
ensuring proper functionality after a reset event occurs.
Without this implementation, the device cannot properly recover from
resets, potentially leading to undefined behavior or device malfunction.
This feature has been tested using PCI transport with Function Level
Reset (FLR) as an example reset mechanism. The reset can be triggered
manually via sysfs (echo 1 > /sys/bus/pci/devices/$PCI_ADDR/reset).
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Message-Id: <1732690652-3065-3-git-send-email-israelr@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes the
page allocator so we end up with the ability to allocate and free
zero-refcount pages. So that callers (ie, slab) can avoid a refcount
inc & dec.
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use
large folios other than PMD-sized ones.
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and
fixes for this small built-in kernel selftest.
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part of
the mapletree code.
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups.
- "simplify split calculation" from Wei Yang provides a few fixes and a
test for the mapletree code.
- "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes
continues the work of moving vma-related code into the (relatively) new
mm/vma.c.
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the page
allocator.
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue. It
should reduce the amount of unnecessary reading.
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated
(https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/).
Qi's series addresses this windup by synchronously freeing PTE memory
within the context of madvise(MADV_DONTNEED).
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests code
when optional compiler warnings are enabled.
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from David
Hildenbrand tightens the allocator's observance of __GFP_HARDWALL.
- "pkeys kselftests improvements" from Kevin Brodsky implements various
fixes and cleanups in the MM selftests code, mainly pertaining to the
pkeys tests.
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size.
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic.
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based
kernel build was demonstrated.
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of zram_write_page().
A watchdog softlockup warning was eliminated.
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky
cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed.
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging logic.
- "Account page tables at all levels" from Kevin Brodsky cleans up and
regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy.
- "mm/damon: replace most damon_callback usages in sysfs with new core
functions" from SeongJae Park cleans up and generalizes DAMON's sysfs
file interface logic.
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is presented in
response to DAMOS actions.
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes
DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs
is completed.
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter
Xu cleans up and generalizes the hugetlb reservation accounting.
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface.
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting), but
also inclusion (allowing) behavior.
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
"introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to reduce
the size of struct page and to enable dynamic allocation of memory
descriptors."
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes and
simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel build
time with swap-on-zram.
- "mm: update mips to use do_mmap(), make mmap_region() internal" from
Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal.
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU
regressions and otherwise improves MGLRU performance.
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park
updates DAMON documentation.
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing.
- "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand
provides various cleanups in the areas of hugetlb folios, THP folios and
migration.
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache
reading and writing. To permite userspace to address issues with
massive buildup of useless pagecache when reading/writing fast devices.
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA
jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz
uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE=
=Ib2s
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes
the page allocator so we end up with the ability to allocate and
free zero-refcount pages. So that callers (ie, slab) can avoid a
refcount inc & dec
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
use large folios other than PMD-sized ones
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
and fixes for this small built-in kernel selftest
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part
of the mapletree code
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups
- "simplify split calculation" from Wei Yang provides a few fixes and
a test for the mapletree code
- "mm/vma: make more mmap logic userland testable" from Lorenzo
Stoakes continues the work of moving vma-related code into the
(relatively) new mm/vma.c
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the
page allocator
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue.
It should reduce the amount of unnecessary reading
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated:
https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
Qi's series addresses this windup by synchronously freeing PTE
memory within the context of madvise(MADV_DONTNEED)
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests
code when optional compiler warnings are enabled
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from
David Hildenbrand tightens the allocator's observance of
__GFP_HARDWALL
- "pkeys kselftests improvements" from Kevin Brodsky implements
various fixes and cleanups in the MM selftests code, mainly
pertaining to the pkeys tests
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a
tmpfs-based kernel build was demonstrated
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of
zram_write_page(). A watchdog softlockup warning was eliminated
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
Brodsky cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging
logic
- "Account page tables at all levels" from Kevin Brodsky cleans up
and regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy
- "mm/damon: replace most damon_callback usages in sysfs with new
core functions" from SeongJae Park cleans up and generalizes
DAMON's sysfs file interface logic
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is
presented in response to DAMOS actions
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park
removes DAMON's long-deprecated debugfs interfaces. Thus the
migration to sysfs is completed
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
Peter Xu cleans up and generalizes the hugetlb reservation
accounting
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting),
but also inclusion (allowing) behavior
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to
reduce the size of struct page and to enable dynamic allocation of
memory descriptors
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes
and simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel
build time with swap-on-zram
- "mm: update mips to use do_mmap(), make mmap_region() internal"
from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few
MGLRU regressions and otherwise improves MGLRU performance
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
Park updates DAMON documentation
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing
- "mm: hugetlb+THP folio and migration cleanups" from David
Hildenbrand provides various cleanups in the areas of hugetlb
folios, THP folios and migration
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for
pagecache reading and writing. To permite userspace to address
issues with massive buildup of useless pagecache when
reading/writing fast devices
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests"
* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm/compaction: fix UBSAN shift-out-of-bounds warning
s390/mm: add missing ctor/dtor on page table upgrade
kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
tools: add VM_WARN_ON_VMG definition
mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
seqlock: add missing parameter documentation for raw_seqcount_try_begin()
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
mm/page_alloc: remove the incorrect and misleading comment
zram: remove zcomp_stream_put() from write_incompressible_page()
mm: separate move/undo parts from migrate_pages_batch()
mm/kfence: use str_write_read() helper in get_access_type()
selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
selftests/mm: vm_util: split up /proc/self/smaps parsing
selftests/mm: virtual_address_range: unmap chunks after validation
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
selftests/memfd/memfd_test: fix possible NULL pointer dereference
mm: add FGP_DONTCACHE folio creation flag
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
...
this pull are:
- "lib min_heap: Improve min_heap safety, testing, and documentation"
from Kuan-Wei Chiu provides various tightenings to the min_heap library
code.
- "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some
cleanup and Rust preparation in the xarray library code.
- "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes
pathnames in some code comments.
- "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the
new secs_to_jiffies() in various places where that is appropriate.
- "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
switches two filesystems to the new mount API.
- "Convert ocfs2 to use folios" from Matthew Wilcox does that.
- "Remove get_task_comm() and print task comm directly" from Yafang Shao
removes now-unneeded calls to get_task_comm() in various places.
- "squashfs: reduce memory usage and update docs" from Phillip Lougher
implements some memory savings in squashfs and performs some
maintainability work.
- "lib: clarify comparison function requirements" from Kuan-Wei Chiu
tightens the sort code's behaviour and adds some maintenance work.
- "nilfs2: protect busy buffer heads from being force-cleared" from
Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a
corrupted image.
- "nilfs2: fix kernel-doc comments for function return values" from
Ryusuke Konishi fixes some nilfs kerneldoc.
- "nilfs2: fix issues with rename operations" from Ryusuke Konishi
addresses some nilfs BUG_ONs which syzbot was able to trigger.
- "minmax.h: Cleanups and minor optimisations" from David Laight
does some maintenance work on the min/max library code.
- "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work
on the xarray library code.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5SP5QAKCRDdBJ7gKXxA
jqN7AQChvwXGG43n4d5SDiA/rH7ddvowQcDqhC9cAMJ1ReR7qwEA8/LIWDE4PdMX
mJnaZ1/ibpEpearrChCViApQtcyEGQI=
=ti4E
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Mainly individually changelogged singleton patches. The patch series
in this pull are:
- "lib min_heap: Improve min_heap safety, testing, and documentation"
from Kuan-Wei Chiu provides various tightenings to the min_heap
library code
- "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms
some cleanup and Rust preparation in the xarray library code
- "Update reference to include/asm-<arch>" from Geert Uytterhoeven
fixes pathnames in some code comments
- "Converge on using secs_to_jiffies()" from Easwar Hariharan uses
the new secs_to_jiffies() in various places where that is
appropriate
- "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
switches two filesystems to the new mount API
- "Convert ocfs2 to use folios" from Matthew Wilcox does that
- "Remove get_task_comm() and print task comm directly" from Yafang
Shao removes now-unneeded calls to get_task_comm() in various
places
- "squashfs: reduce memory usage and update docs" from Phillip
Lougher implements some memory savings in squashfs and performs
some maintainability work
- "lib: clarify comparison function requirements" from Kuan-Wei Chiu
tightens the sort code's behaviour and adds some maintenance work
- "nilfs2: protect busy buffer heads from being force-cleared" from
Ryusuke Konishi fixes an issues in nlifs when the fs is presented
with a corrupted image
- "nilfs2: fix kernel-doc comments for function return values" from
Ryusuke Konishi fixes some nilfs kerneldoc
- "nilfs2: fix issues with rename operations" from Ryusuke Konishi
addresses some nilfs BUG_ONs which syzbot was able to trigger
- "minmax.h: Cleanups and minor optimisations" from David Laight does
some maintenance work on the min/max library code
- "Fixes and cleanups to xarray" from Kemeng Shi does maintenance
work on the xarray library code"
* tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits)
ocfs2: use str_yes_no() and str_no_yes() helper functions
include/linux/lz4.h: add some missing macros
Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent
Xarray: remove repeat check in xas_squash_marks()
Xarray: distinguish large entries correctly in xas_split_alloc()
Xarray: move forward index correctly in xas_pause()
Xarray: do not return sibling entries from xas_find_marked()
ipc/util.c: complete the kernel-doc function descriptions
gcov: clang: use correct function param names
latencytop: use correct kernel-doc format for func params
minmax.h: remove some #defines that are only expanded once
minmax.h: simplify the variants of clamp()
minmax.h: move all the clamp() definitions after the min/max() ones
minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
minmax.h: reduce the #define expansion of min(), max() and clamp()
minmax.h: update some comments
minmax.h: add whitespace around operators and after commas
nilfs2: do not update mtime of renamed directory that is not moved
nilfs2: handle errors that nilfs_prepare_chunk() may return
CREDITS: fix spelling mistake
...
We cannot and should not put per-CPU compression stream in
write_incompressible_page() because that function never gets any
per-CPU streams in the first place. It's zram_write_page() that
puts the stream before it calls write_incompressible_page().
Link: https://lkml.kernel.org/r/20250115072003.380567-1-senozhatsky@chromium.org
Fixes: 485d11509d6d ("zram: factor out ZRAM_HUGE write")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zram writeback is a costly operation, because every target slot (unless
ZRAM_HUGE) is decompressed before it gets written to a backing device.
The writeback to a backing device uses submit_bio_wait() which may look
like a rescheduling point. However, if the backing device has
BD_HAS_SUBMIT_BIO bit set __submit_bio() calls directly
disk->fops->submit_bio(bio) on the backing device and so when
submit_bio_wait() calls blk_wait_io() the I/O is already done. On such
systems we effective end up in a loop
for_each (target slot) {
decompress(slot)
__submit_bio()
disk->fops->submit_bio(bio)
}
Which on PREEMPT_NONE systems triggers watchdogs (since there are no
explicit rescheduling points). Add cond_resched() to the zram writeback
loop.
Link: https://lkml.kernel.org/r/20241218063513.297475-8-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We only can read pages from zspool in writeback, zram_read_page() is not
really right in that context not only because it's a more generic function
that handles ZRAM_WB pages, but also because it requires us to unlock slot
between slot flag check and actual page read. Use zram_read_from_zspool()
instead and do slot flags check and page read under the same slot lock.
Link: https://lkml.kernel.org/r/20241218063513.297475-7-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Similarly to write, split the page read code into ZRAM_HUGE read,
ZRAM_SAME read and compressed page read to simplify the code.
Link: https://lkml.kernel.org/r/20241218063513.297475-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zram_write_page() handles: ZRAM_SAME pages (which was already factored
out) stores, regular page stores and ZRAM_HUGE pages stores.
ZRAM_HUGE handling adds a significant amount of complexity. Instead, we
can handle ZRAM_HUGE in a separate function. This allows us to simplify
zs_handle allocations slow-path, as it now does not handle ZRAM_HUGE case.
ZRAM_HUGE zs_handle allocation, on the other hand, can now drop
__GFP_KSWAPD_RECLAIM because we handle ZRAM_HUGE in preemptible context
(outside of local-lock scope).
Link: https://lkml.kernel.org/r/20241218063513.297475-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Handling of ZRAM_SAME now uses a goto to the final stages of
zram_write_page() plus it introduces a branch and flags variable, which is
not making the code any simpler. In reality, we can handle ZRAM_SAME
immediately when we detect such pages and remove a goto and a branch.
Factor out ZRAM_SAME handling into a separate routine to simplify
zram_write_page().
Link: https://lkml.kernel.org/r/20241218063513.297475-4-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Element is in the same anon union as handle and hence holds the same
value, which makes code below sort of confusing
handle = zram_get_handle()
if (!handle)
element = zram_get_element()
Element doesn't really simplify the code, let's just remove it. We
already re-purpose handle to store the block id a written back page.
Link: https://lkml.kernel.org/r/20241218063513.297475-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "zram: split page type read/write handling", v2.
This is a subset of [1] series which contains only fixes and improvements
(no new features, as ZRAM_HUGE split is still under consideration).
The motivation for factoring out is that zram_write_page() gets more and
more complex all the time, because it tries to handle too many scenarios:
ZRAM_SAME store, ZRAM_HUGE store, compress page store with zs_malloc
allocation slowpath and conditional recompression, etc. Factor those out
and make things easier to handle.
Addition of cond_resched() is simply a fix, I can trigger watchdog from
zram writeback(). And early slot free is just a reasonable thing to do.
[1] https://lore.kernel.org/linux-kernel/20241119072057.3440039-1-senozhatsky@chromium.org
This patch (of 7):
In the current implementation entry's previously allocated memory is
released in the very last moment, when we already have allocated a new
memory for new data. This, basically, temporarily increases memory usage
for no good reason. For example, consider the case when both old (stale)
and new entry data are incompressible so such entry will temporarily use
two physical pages - one for stale (old) data and one for new data. We
can release old memory as soon as we get a write request for entry.
Link: https://lkml.kernel.org/r/20241218063513.297475-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20241218063513.297475-2-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeL6hoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppw2EADQV8nDgLRggZR+il4U03yKHXcQEdAX1GrB
Erowx+dasIJuh6kp3n6qRe9QD/pRqt1DKyLvXoWF8Qfuwq85j7oDnDDYxutNYT27
hDgrLJriJ3VeKYtTu+andHWt8P29b5h57UayInDOUJurEPA6rXyFZ5YVIti8n21K
uDOrQXiACG3qRWS2+p2f3UNhX0MkFNFdN/lxi13WMIJtRWF5bXAP+JOgIWCID4Ze
QuSY6rQD4dp4Q6M2erpX6tn0YZb7Hvw3rPjsd91n6jvYfTUVLH375zg8jCBpi6Wi
Syufbb8xcTtriVPTDRNu0ekjebkc8wD8ax/h86g0z9v3Ua4DlNmsx9eXrtv6r5nu
YXqDODOad6stI0+owFquW2vas0gHmfNSfyfGdlk2g24PMtP5Yx0V6FIEvwIeqnje
ghgxQvBuKUsdhqakByfNnc+XvXi3+RUJek8kvMeUSUQWT1IyMQqPOOk0yp9WdyWD
bY1f2ECP5BR1b37zYOyawewsI5xTupHUswn5a4r4qtGn3O15rGDkX98Nab5aLCnR
rW/DvX7+wT6gW9EwrRHiwjwfNDZbsJ9Ggu3lMhtUl5GUWdk58yTiVgKaHJLnlX9/
CKFKfyyIR1Vl8+gYIpemyFhhcoN+dCSf06ISkrg0jeS0/tYwydaAaCBPL5J4kxZA
h3Rtbh+Pgg==
=EXYs
-----END PGP SIGNATURE-----
Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull requests via Keith:
- Target support for PCI-Endpoint transport (Damien)
- TCP IO queue spreading fixes (Sagi, Chaitanya)
- Target handling for "limited retry" flags (Guixen)
- Poll type fix (Yongsoo)
- Xarray storage error handling (Keisuke)
- Host memory buffer free size fix on error (Francis)
- MD pull requests via Song:
- Reintroduce md-linear (Yu Kuai)
- md-bitmap refactor and fix (Yu Kuai)
- Replace kmap_atomic with kmap_local_page (David Reaver)
- Quite a few queue freeze and debugfs deadlock fixes
Ming introduced lockdep support for this in the 6.13 kernel, and it
has (unsurprisingly) uncovered quite a few issues
- Use const attributes for IO schedulers
- Remove bio ioprio wrappers
- Fixes for stacked device atomic write support
- Refactor queue affinity helpers, in preparation for better supporting
isolated CPUs
- Cleanups of loop O_DIRECT handling
- Cleanup of BLK_MQ_F_* flags
- Add rotational support for null_blk
- Various fixes and cleanups
* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
block: Don't trim an atomic write
block: Add common atomic writes enable flag
md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
block: limit disk max sectors to (LLONG_MAX >> 9)
block: Change blk_stack_atomic_writes_limits() unit_min check
block: Ensure start sector is aligned for stacking atomic writes
blk-mq: Move more error handling into blk_mq_submit_bio()
block: Reorder the request allocation code in blk_mq_submit_bio()
nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
md/md-bitmap: move bitmap_{start, end}write to md upper layer
md/raid5: implement pers->bitmap_sector()
md: add a new callback pers->bitmap_sector()
md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
md: Replace deprecated kmap_atomic() with kmap_local_page()
md: reintroduce md-linear
partitions: ldm: remove the initial kernel-doc notation
blk-cgroup: rwstat: fix kernel-doc warnings in header file
blk-cgroup: fix kernel-doc warnings in header file
nbd: fix partial sending
...
nbd driver sends request header and payload with multiple call of
sock_sendmsg, and partial sending can't be avoided. However, nbd driver
returns BLK_STS_RESOURCE to block core in this situation. This way causes
one issue: request->tag may change in the next run of nbd_queue_rq(), but
the original old tag has been sent as part of header cookie, this way
confuses nbd driver reply handling, since the real request can't be
retrieved any more with the obsolete old tag.
Fix it by retrying sending directly in per-socket work function,
meantime return BLK_STS_OK to block layer core.
Cc: vincent.chen@sifive.com
Cc: Leon Schuermann <leon@is.currently.online>
Cc: Bart Van Assche <bvanassche@acm.org>
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Link: https://lore.kernel.org/r/20241029011941.153037-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need the debugfs / driver-core fixes in here as well for testing and
to build on top of.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies(). As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication.
This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:
@@ constant C; @@
- msecs_to_jiffies(C * 1000)
+ secs_to_jiffies(C)
@@ constant C; @@
- msecs_to_jiffies(C * MSEC_PER_SEC)
+ secs_to_jiffies(C)
Link: https://lkml.kernel.org/r/20241210-converge-secs-to-jiffies-v3-12-ddfefd7e9f2a@linux.microsoft.com
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Daniel Mack <daniel@zonque.org>
Cc: David Airlie <airlied@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jack Wang <jinpu.wang@cloud.ionos.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Jeff Johnson <jjohnson@kernel.org>
Cc: Jeff Johnson <quic_jjohnson@quicinc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jeroen de Borst <jeroendb@google.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: Louis Peens <louis.peens@corigine.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Ofir Bitton <obitton@habana.ai>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Praveen Kaligineedi <pkaligineedi@google.com>
Cc: Ray Jui <rjui@broadcom.com>
Cc: Robert Jarzmik <robert.jarzmik@free.fr>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Shailend Chand <shailend@google.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Simon Horman <horms@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If zram_meta_alloc failed early, it frees allocated zram->table without
setting it NULL. Which will potentially cause zram_meta_free to access
the table if user reset an failed and uninitialized device.
Link: https://lkml.kernel.org/r/20250107065446.86928-1-ryncsn@gmail.com
Fixes: 74363ec674 ("zram: fix uninitialized ZRAM not releasing backing device")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This field duplicate the LO_FLAGS_DIRECT_IO flag in lo_flags. Remove it
to have a single source of truth about using direct I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250110073750.1582447-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All callers of loop_update_dio except for loop_configure already have the
queue frozen, and loop_configure works on an unbound device. Remove the
superfluous recursive freezing in loop_update_dio and add asserts for the
locking and freezing state instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250110073750.1582447-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Unlike all other calls of (__)loop_update_dio, loop_set_status never
looks at the O_DIRECT flag of the backing file, and thus doesn't
re-enable direct I/O on an O_DIRECT backing file if e.g. the new block
size would allow it. Fix that and remove the need for the separate
__loop_update_dio flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250110073750.1582447-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
loop_set_dio is different from the other (__)loop_update_dio callers in
that it doesn't take any implicit conditions into account and wants to
update the direct I/O flag to the user passed in value and fail if that
can't be done.
Open code the logic here to prepare for simplifying the other direct I/O
flag updates and to make the error handling less convoluted.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250110073750.1582447-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no point in doing an fdatasync to write out pages when switching
away from direct I/O, as there won't be any. The writeback is only
needed when switching to direct I/O, which would have to invalidate the
pagecache less efficiently from the I/O path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250110073750.1582447-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While loop_configure simplify assigns the flags passed in by userspace,
loop_set_status only looks at the two changeable flags, and currently
has to do a complicate dance to implement that.
Move assign lo->lo_flags out of loop_set_status_from_info into the
callers and thus drastically simplify the lo_flags handling in
loop_set_status.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250110073750.1582447-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Match the locking order used by the core block code by only freezing
the queue after taking the limits lock using the
queue_limits_commit_update_frozen helper and document the callers that
do not freeze the queue at all.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace loop_reconfigure_limits with a slightly less encompassing
loop_update_limits that expects the caller to acquire and commit the
queue limits to prepare for sorting out the freeze vs limits lock
ordering.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Match the locking order used by the core block code by only freezing
the queue after taking the limits lock using the
queue_limits_commit_update_frozen helper.
This also allows removes the need for the separate __nbd_set_size helper,
so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper that freezes the queue, updates the queue limits and
unfreezes the queue and convert all open coded versions of that to the
new helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Following process can cause nbd_config UAF:
1) grab nbd_config temporarily;
2) nbd_genl_disconnect() flush all recv_work() and release the
initial reference:
nbd_genl_disconnect
nbd_disconnect_and_put
nbd_disconnect
flush_workqueue(nbd->recv_workq)
if (test_and_clear_bit(NBD_RT_HAS_CONFIG_REF, ...))
nbd_config_put
-> due to step 1), reference is still not zero
3) nbd_genl_reconfigure() queue recv_work() again;
nbd_genl_reconfigure
config = nbd_get_config_unlocked(nbd)
if (!config)
-> succeed
if (!test_bit(NBD_RT_BOUND, ...))
-> succeed
nbd_reconnect_socket
queue_work(nbd->recv_workq, &args->work)
4) step 1) release the reference;
5) Finially, recv_work() will trigger UAF:
recv_work
nbd_config_put(nbd)
-> nbd_config is freed
atomic_dec(&config->recv_threads)
-> UAF
Fix the problem by clearing NBD_RT_BOUND in nbd_genl_disconnect(), so
that nbd_genl_reconfigure() will fail.
Fixes: b7aa3d3938 ("nbd: add a reconfigure netlink command")
Reported-by: syzbot+6b0df248918b92c33e6a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/675bfb65.050a0220.1a2d0d.0006.GAE@google.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250103092859.3574648-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only queues that really can't support a scheduler are those that
do not have a gendisk associated with them, and thus can't be used for
non-passthrough commands. In addition to those null_blk can optionally
set the flag, which is a bad odd. Replace the null_blk usage with
BLK_MQ_F_NO_SCHED_BY_DEFAULT to keep the expected semantics and then
remove BLK_MQ_F_NO_SCHED as the non-disk queues never call into
elevator_init_mq or blk_register_queue which adds the sysfs attributes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250106083531.799976-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Constify the following API:
struct device *device_find_child(struct device *dev, void *data,
int (*match)(struct device *dev, void *data));
To :
struct device *device_find_child(struct device *dev, const void *data,
device_match_t match);
typedef int (*device_match_t)(struct device *dev, const void *data);
with the following reasons:
- Protect caller's match data @*data which is for comparison and lookup
and the API does not actually need to modify @*data.
- Make the API's parameters (@match)() and @data have the same type as
all of other device finding APIs (bus|class|driver)_find_device().
- All kinds of existing device match functions can be directly taken
as the API's argument, they were exported by driver core.
Constify the API and adapt for various existing usages.
BTW, various subsystem changes are squashed into this commit to meet
'git bisect' requirement, and this commit has the minimal and simplest
changes to complement squashing shortcoming, and that may bring extra
code improvement.
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Acked-by: Uwe Kleine-König <ukleinek@kernel.org> # for drivers/pwm
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Link: https://lore.kernel.org/r/20241224-const_dfc_done-v5-4-6623037414d4@quicinc.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Inside ublk_abort_requests(), gendisk is grabbed for aborting all
inflight requests. And ublk_abort_requests() is called when exiting
the uring context or handling timeout.
If add_disk() fails, the gendisk may have been freed when calling
ublk_abort_requests(), so use-after-free can be caused when getting
disk's reference in ublk_abort_requests().
Fixes the bug by detaching gendisk from ublk device if add_disk() fails.
Fixes: bd23f6c2c2 ("ublk: quiesce request queue when aborting queue")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241225110640.351531-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BLK_MQ_F_SHOULD_MERGE is set for all tag_sets except those that purely
process passthrough commands (bsg-lib, ufs tmf, various nvme admin
queues) and thus don't even check the flag. Remove it to simplify the
driver interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241219060214.1928848-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace all users of blk_mq_virtio_map_queues with the more generic
blk_mq_map_hw_queues. This in preparation to retire
blk_mq_virtio_map_queues.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Link: https://lore.kernel.org/r/20241202-refactor-blk-affinity-helpers-v6-7-27211e9c2cd5@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use page->private to store the index instead of page->index.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241216160849.31739-1-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit 43b62ce3ff ("block: move bio io prio to a new field"), macro
bio_set_prio() does nothing but set bio->bi_ioprio. All other places just
set bio->bi_ioprio directly, so replace bio_set_prio() remaining
callsites with setting bio->bi_ioprio directly and delete that macro.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241202111957.2311683-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To facilitate testing of kernel functions related to the rotational
feature (BLK_FEAT_ROTATIONAL) of a block device (e.g. NVMe rotational
bit support), add the rotational boolean configfs attribute and module
parameter to the null_blk driver. If set, a null block device will
report being a rotational device through it queue limits features with
the BLK_FEAT_ROTATIONAL flag.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20241126000956.95983-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Setting backing device is done before ZRAM initialization. If we set the
backing device, then remove the ZRAM module without initializing the
device, the backing device reference will be leaked and the device will be
hold forever.
Fix this by always reset the ZRAM fully on rmmod or reset store.
Link: https://lkml.kernel.org/r/20241209165717.94215-3-ryncsn@gmail.com
Fixes: 013bf95a83 ("zram: add interface to specif backing device")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reported-by: Desheng Wu <deshengwu@tencent.com>
Suggested-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "zram: fix backing device setup issue", v2.
This series fixes two bugs of backing device setting:
- ZRAM should reject using a zero sized (or the uninitialized ZRAM
device itself) as the backing device.
- Fix backing device leaking when removing a uninitialized ZRAM
device.
This patch (of 2):
Setting a zero sized block device as backing device is pointless, and one
can easily create a recursive loop by setting the uninitialized ZRAM
device itself as its own backing device by (zram0 is uninitialized):
echo /dev/zram0 > /sys/block/zram0/backing_dev
It's definitely a wrong config, and the module will pin itself, kernel
should refuse doing so in the first place.
By refusing to use zero sized device we avoided misuse cases including
this one above.
Link: https://lkml.kernel.org/r/20241209165717.94215-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20241209165717.94215-2-ryncsn@gmail.com
Fixes: 013bf95a83 ("zram: add interface to specif backing device")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reported-by: Desheng Wu <deshengwu@tencent.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 4ce6e2db00 ("virtio-blk: Ensure no requests in virtqueues before
deleting vqs.") replaces queue quiesce with queue freeze in virtio-blk's
PM callbacks. And the motivation is to drain inflight IOs before suspending.
block layer's queue freeze looks very handy, but it is also easy to cause
deadlock, such as, any attempt to call into bio_queue_enter() may run into
deadlock if the queue is frozen in current context. There are all kinds
of ->suspend() called in suspend context, so keeping queue frozen in the
whole suspend context isn't one good idea. And Marek reported lockdep
warning[1] caused by virtio-blk's freeze queue in virtblk_freeze().
[1] https://lore.kernel.org/linux-block/ca16370e-d646-4eee-b9cc-87277c89c43c@samsung.com/
Given the motivation is to drain in-flight IOs, it can be done by calling
freeze & unfreeze, meantime restore to previous behavior by keeping queue
quiesced during suspend.
Cc: Yi Sun <yi.sun@unisoc.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: virtualization@lists.linux.dev
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20241112125821.1475793-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the missing description to fix the following warning:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/rnull_mod.o
Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Acked-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20241130094521.193924-1-fujita.tomonori@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The continual trickle of small conversion patches is grating on me, and
is really not helping. Just get rid of the 'remove_new' member
function, which is just an alias for the plain 'remove', and had a
comment to that effect:
/*
* .remove_new() is a relic from a prototype conversion of .remove().
* New drivers are supposed to implement .remove(). Once all drivers are
* converted to not use .remove_new any more, it will be dropped.
*/
This was just a tree-wide 'sed' script that replaced '.remove_new' with
'.remove', with some care taken to turn a subsequent tab into two tabs
to make things line up.
I did do some minimal manual whitespace adjustment for places that used
spaces to line things up.
Then I just removed the old (sic) .remove_new member function, and this
is the end result. No more unnecessary conversion noise.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmdJ6jwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvgeEADaPL+qmsRyp070K1OI/yTA9Jhf6liEyJ31
0GPVar5Vt6ObH6/POObrJqbBtAo5asQanFvwyVyLztKYxPHU7sdQaRcD+vvj7q3+
EhmQZSKM7Spp77awWhRWbeQfUBvdhTeGHjQH/0e60eIrF9KtEL9sM9hVqc8hBD9F
YtDNWPCk7Rz1PPNYlGEkQ2JmYmaxh3Gn29c/k1cSSo3InEQOFj6x+0Cgz6RjbTx3
9HfpLhVG3WV5MlZCCwp7KG36aJzlc0nq53x/sC9cg+F17RvL2EwNAOUfLl75/Kp/
t7PCQSd2ODciiDN9qZW71KGtVtlJ07W048Rk0nB+ogneC0uh4fuIYTidP9D7io7D
bBMrhDuUpnlPzlOqg0aeedXePQL7TRfT3CTentol6xldqg14n7C4QTQFQMSJCgJf
gr4YCTwl0RTknXo0A3ja16XwsUq5+2xsSoCTU25TY+wgKiAcc5lN9fhbvPRzbCQC
u9EQ9I9IFAMqEdnE51sw0x16fLtN2w4/zOkvTF+gD/KooEjSn9lcfeNue7jt1O0/
gFvFJCdXK/2GgxwHihvsEVdcNeaS8JowNafKUsfOM2G0qWQbY+l2vl/b5PfwecWi
0knOaqNWlGMwrQ+z+fgsEeFG7X98ninC7tqVZpzoZ7j0x65anH+Jq4q1Egongj0H
90zclclxjg==
=6cbB
-----END PGP SIGNATURE-----
Merge tag 'block-6.13-20242901' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- NVMe pull request via Keith:
- Use correct srcu list traversal (Breno)
- Scatter-gather support for metadata (Keith)
- Fabrics shutdown race condition fix (Nilay)
- Persistent reservations updates (Guixin)
- Add the required bits for MD atomic write support for raid0/1/10
- Correct return value for unknown opcode in ublk
- Fix deadlock with zone revalidation
- Fix for the io priority request vs bio cleanups
- Use the correct unsigned int type for various limit helpers
- Fix for a race in loop
- Cleanup blk_rq_prep_clone() to prevent uninit-value warning and make
it easier for actual humans to read
- Fix potential UAF when iterating tags
- A few fixes for bfq-iosched UAF issues
- Fix for brd discard not decrementing the allocated page count
- Various little fixes and cleanups
* tag 'block-6.13-20242901' of git://git.kernel.dk/linux: (36 commits)
brd: decrease the number of allocated pages which discarded
block, bfq: fix bfqq uaf in bfq_limit_depth()
block: Don't allow an atomic write be truncated in blkdev_write_iter()
mq-deadline: don't call req_get_ioprio from the I/O completion handler
block: Prevent potential deadlock in blk_revalidate_disk_zones()
block: Remove extra part pointer NULLify in blk_rq_init()
nvme: tuning pr code by using defined structs and macros
nvme: introduce change ptpl and iekey definition
block: return bool from get_disk_ro and bdev_read_only
block: remove a duplicate definition for bdev_read_only
block: return bool from blk_rq_aligned
block: return unsigned int from blk_lim_dma_alignment_and_pad
block: return unsigned int from queue_dma_alignment
block: return unsigned int from bdev_io_opt
block: req->bio is always set in the merge code
block: don't bother checking the data direction for merges
block: blk-mq: fix uninit-value in blk_rq_prep_clone and refactor
Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()"
md/raid10: Atomic write support
md/raid1: Atomic write support
...
The number of allocated pages which discarded will not decrease.
Fix it.
Fixes: 9ead7efc6f ("brd: implement discard support")
Signed-off-by: Zhang Xianwei <zhang.xianwei8@zte.com.cn>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241128170056565nPKSz2vsP8K8X2uk2iaDG@zte.com.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Toolchain and infrastructure:
- Enable a series of lints, including safety-related ones, e.g. the
compiler will now warn about missing safety comments, as well as
unnecessary ones. How safety documentation is organized is a frequent
source of review comments, thus having the compiler guide new
developers on where they are expected (and where not) is very nice.
- Start using '#[expect]': an interesting feature in Rust (stabilized
in 1.81.0) that makes the compiler warn if an expected warning was
_not_ emitted. This is useful to avoid forgetting cleaning up locally
ignored diagnostics ('#[allow]'s).
- Introduce '.clippy.toml' configuration file for Clippy, the Rust
linter, which will allow us to tweak its behaviour. For instance, our
first use cases are declaring a disallowed macro and, more
importantly, enabling the checking of private items.
- Lints-related fixes and cleanups related to the items above.
- Migrate from 'receiver_trait' to 'arbitrary_self_types': to get the
kernel into stable Rust, one of the major pieces of the puzzle is the
support to write custom types that can be used as 'self', i.e. as
receivers, since the kernel needs to write types such as 'Arc' that
common userspace Rust would not. 'arbitrary_self_types' has been
accepted to become stable, and this is one of the steps required to
get there.
- Remove usage of the 'new_uninit' unstable feature.
- Use custom C FFI types. Includes a new 'ffi' crate to contain our
custom mapping, instead of using the standard library 'core::ffi'
one. The actual remapping will be introduced in a later cycle.
- Map '__kernel_{size_t,ssize_t,ptrdiff_t}' to 'usize'/'isize' instead
of 32/64-bit integers.
- Fix 'size_t' in bindgen generated prototypes of C builtins.
- Warn on bindgen < 0.69.5 and libclang >= 19.1 due to a double issue
in the projects, which we managed to trigger with the upcoming
tracepoint support. It includes a build test since some distributions
backported the fix (e.g. Debian -- thanks!). All major distributions
we list should be now OK except Ubuntu non-LTS.
'macros' crate:
- Adapt the build system to be able run the doctests there too; and
clean up and enable the corresponding doctests.
'kernel' crate:
- Add 'alloc' module with generic kernel allocator support and remove
the dependency on the Rust standard library 'alloc' and the extension
traits we used to provide fallible methods with flags.
Add the 'Allocator' trait and its implementations '{K,V,KV}malloc'.
Add the 'Box' type (a heap allocation for a single value of type 'T'
that is also generic over an allocator and considers the kernel's GFP
flags) and its shorthand aliases '{K,V,KV}Box'. Add 'ArrayLayout'
type. Add 'Vec' (a contiguous growable array type) and its shorthand
aliases '{K,V,KV}Vec', including iterator support.
For instance, now we may write code such as:
let mut v = KVec::new();
v.push(1, GFP_KERNEL)?;
assert_eq!(&v, &[1]);
Treewide, move as well old users to these new types.
- 'sync' module: add global lock support, including the
'GlobalLockBackend' trait; the 'Global{Lock,Guard,LockedBy}' types
and the 'global_lock!' macro. Add the 'Lock::try_lock' method.
- 'error' module: optimize 'Error' type to use 'NonZeroI32' and make
conversion functions public.
- 'page' module: add 'page_align' function.
- Add 'transmute' module with the existing 'FromBytes' and 'AsBytes'
traits.
- 'block::mq::request' module: improve rendered documentation.
- 'types' module: extend 'Opaque' type documentation and add simple
examples for the 'Either' types.
drm/panic:
- Clean up a series of Clippy warnings.
Documentation:
- Add coding guidelines for lints and the '#[expect]' feature.
- Add Ubuntu to the list of distributions in the Quick Start guide.
MAINTAINERS:
- Add Danilo Krummrich as maintainer of the new 'alloc' module.
And a few other small cleanups and fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAmdFMIgACgkQGXyLc2ht
IW16hQ/+KX/jmdGoXtNXx7T6yG6SJ/txPOieGAWfhBf6C3bqkGrU9Gnw/O3VWrxf
eyj1QLQaIVUkmumWCefeiy9u3xRXx5fpS0tWJOjUtxC5NcS7vCs0AHQs1skIa6H+
YD6UKDPOy7CB5fVYqo13B5xnFAlciU0dLo6IGB6bB/lSpCudGLE9+nukfn5H3/R1
DTc3/fbSoYQU6Ij/FKscB+D/A7ojdYaReodhbzNw1lChg1MrJlCpqoQvHPE8ijg+
UDljHFFvgKdhSQL9GTa3LC7X4DsnihMWzXt14m6mMOqBa6TqF47WUhhgC77pHEI2
v/Yy8MLq0pdIzT1wFjsqs6opuvXc7K5Yk5Y60HDsDyIyjk2xgOjh6ZlD0EV161gS
7w1NtaKd/Cn7hnL7Ua51yJDxJTMllne3fTWemhs3Zd63j7ham98yOoiw+6L2QaM4
C9nW48vfUuTwDuYJ5HU0uSugubuHW3Ng5JEvMcvd4QjmaI1bQNkgVzefR5j3dLw8
9kEOTzJoxHpu5B7PZVTEd/L95hlmk1csSQObxi7JYCCimMkusF1S+heBzV/SqWD5
5ioEhCnSKE05fhQs0Uxns1HkcFle8Bn6r3aSAWV6yaR8oF94yHcuaZRUKxKMHw+1
cmBE2X8Yvtldw+CYDwEGWjKDtwOStbqk+b/ZzP7f7/p56QH9lSg=
=Kn7b
-----END PGP SIGNATURE-----
Merge tag 'rust-6.13' of https://github.com/Rust-for-Linux/linux
Pull rust updates from Miguel Ojeda:
"Toolchain and infrastructure:
- Enable a series of lints, including safety-related ones, e.g. the
compiler will now warn about missing safety comments, as well as
unnecessary ones. How safety documentation is organized is a
frequent source of review comments, thus having the compiler guide
new developers on where they are expected (and where not) is very
nice.
- Start using '#[expect]': an interesting feature in Rust (stabilized
in 1.81.0) that makes the compiler warn if an expected warning was
_not_ emitted. This is useful to avoid forgetting cleaning up
locally ignored diagnostics ('#[allow]'s).
- Introduce '.clippy.toml' configuration file for Clippy, the Rust
linter, which will allow us to tweak its behaviour. For instance,
our first use cases are declaring a disallowed macro and, more
importantly, enabling the checking of private items.
- Lints-related fixes and cleanups related to the items above.
- Migrate from 'receiver_trait' to 'arbitrary_self_types': to get the
kernel into stable Rust, one of the major pieces of the puzzle is
the support to write custom types that can be used as 'self', i.e.
as receivers, since the kernel needs to write types such as 'Arc'
that common userspace Rust would not. 'arbitrary_self_types' has
been accepted to become stable, and this is one of the steps
required to get there.
- Remove usage of the 'new_uninit' unstable feature.
- Use custom C FFI types. Includes a new 'ffi' crate to contain our
custom mapping, instead of using the standard library 'core::ffi'
one. The actual remapping will be introduced in a later cycle.
- Map '__kernel_{size_t,ssize_t,ptrdiff_t}' to 'usize'/'isize'
instead of 32/64-bit integers.
- Fix 'size_t' in bindgen generated prototypes of C builtins.
- Warn on bindgen < 0.69.5 and libclang >= 19.1 due to a double issue
in the projects, which we managed to trigger with the upcoming
tracepoint support. It includes a build test since some
distributions backported the fix (e.g. Debian -- thanks!). All
major distributions we list should be now OK except Ubuntu non-LTS.
'macros' crate:
- Adapt the build system to be able run the doctests there too; and
clean up and enable the corresponding doctests.
'kernel' crate:
- Add 'alloc' module with generic kernel allocator support and remove
the dependency on the Rust standard library 'alloc' and the
extension traits we used to provide fallible methods with flags.
Add the 'Allocator' trait and its implementations '{K,V,KV}malloc'.
Add the 'Box' type (a heap allocation for a single value of type
'T' that is also generic over an allocator and considers the
kernel's GFP flags) and its shorthand aliases '{K,V,KV}Box'. Add
'ArrayLayout' type. Add 'Vec' (a contiguous growable array type)
and its shorthand aliases '{K,V,KV}Vec', including iterator
support.
For instance, now we may write code such as:
let mut v = KVec::new();
v.push(1, GFP_KERNEL)?;
assert_eq!(&v, &[1]);
Treewide, move as well old users to these new types.
- 'sync' module: add global lock support, including the
'GlobalLockBackend' trait; the 'Global{Lock,Guard,LockedBy}' types
and the 'global_lock!' macro. Add the 'Lock::try_lock' method.
- 'error' module: optimize 'Error' type to use 'NonZeroI32' and make
conversion functions public.
- 'page' module: add 'page_align' function.
- Add 'transmute' module with the existing 'FromBytes' and 'AsBytes'
traits.
- 'block::mq::request' module: improve rendered documentation.
- 'types' module: extend 'Opaque' type documentation and add simple
examples for the 'Either' types.
drm/panic:
- Clean up a series of Clippy warnings.
Documentation:
- Add coding guidelines for lints and the '#[expect]' feature.
- Add Ubuntu to the list of distributions in the Quick Start guide.
MAINTAINERS:
- Add Danilo Krummrich as maintainer of the new 'alloc' module.
And a few other small cleanups and fixes"
* tag 'rust-6.13' of https://github.com/Rust-for-Linux/linux: (82 commits)
rust: alloc: Fix `ArrayLayout` allocations
docs: rust: remove spurious item in `expect` list
rust: allow `clippy::needless_lifetimes`
rust: warn on bindgen < 0.69.5 and libclang >= 19.1
rust: use custom FFI integer types
rust: map `__kernel_size_t` and friends also to usize/isize
rust: fix size_t in bindgen prototypes of C builtins
rust: sync: add global lock support
rust: macros: enable the rest of the tests
rust: macros: enable paste! use from macro_rules!
rust: enable macros::module! tests
rust: kbuild: expand rusttest target for macros
rust: types: extend `Opaque` documentation
rust: block: fix formatting of `kernel::block::mq::request` module
rust: macros: fix documentation of the paste! macro
rust: kernel: fix THIS_MODULE header path in ThisModule doc comment
rust: page: add Rust version of PAGE_ALIGN
rust: helpers: remove unnecessary header includes
rust: exports: improve grammar in commentary
drm/panic: allow verbose version check
...
Sergey Senozhatsky improves zram's post-processing selection algorithm.
This leads to improved memory savings.
- Wei Yang has gone to town on the mapletree code, contributing several
series which clean up the implementation:
- "refine mas_mab_cp()"
- "Reduce the space to be cleared for maple_big_node"
- "maple_tree: simplify mas_push_node()"
- "Following cleanup after introduce mas_wr_store_type()"
- "refine storing null"
- The series "selftests/mm: hugetlb_fault_after_madv improvements" from
David Hildenbrand fixes this selftest for s390.
- The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
implements some rationaizations and cleanups in the page mapping code.
- The series "mm: optimize shadow entries removal" from Shakeel Butt
optimizes the file truncation code by speeding up the handling of shadow
entries.
- The series "Remove PageKsm()" from Matthew Wilcox completes the
migration of this flag over to being a folio-based flag.
- The series "Unify hugetlb into arch_get_unmapped_area functions" from
Oscar Salvador implements a bunch of consolidations and cleanups in the
hugetlb code.
- The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
takes away the wp-fault time practice of turning a huge zero page into
small pages. Instead we replace the whole thing with a THP. More
consistent cleaner and potentiall saves a large number of pagefaults.
- The series "percpu: Add a test case and fix for clang" from Andy
Shevchenko enhances and fixes the kernel's built in percpu test code.
- The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
optimizes mremap() by avoiding doing things which we didn't need to do.
- The series "Improve the tmpfs large folio read performance" from
Baolin Wang teaches tmpfs to copy data into userspace at the folio size
rather than as individual pages. A 20% speedup was observed.
- The series "mm/damon/vaddr: Fix issue in
damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting.
- The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt
removes the long-deprecated memcgv2 charge moving feature.
- The series "fix error handling in mmap_region() and refactor" from
Lorenzo Stoakes cleanup up some of the mmap() error handling and
addresses some potential performance issues.
- The series "x86/module: use large ROX pages for text allocations" from
Mike Rapoport teaches x86 to use large pages for read-only-execute
module text.
- The series "page allocation tag compression" from Suren Baghdasaryan
is followon maintenance work for the new page allocation profiling
feature.
- The series "page->index removals in mm" from Matthew Wilcox remove
most references to page->index in mm/. A slow march towards shrinking
struct page.
- The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
interface tests" from Andrew Paniakin performs maintenance work for
DAMON's self testing code.
- The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
improves zswap's batching of compression and decompression. It is a
step along the way towards using Intel IAA hardware acceleration for
this zswap operation.
- The series "kasan: migrate the last module test to kunit" from
Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests
over to the KUnit framework.
- The series "implement lightweight guard pages" from Lorenzo Stoakes
permits userapace to place fault-generating guard pages within a single
VMA, rather than requiring that multiple VMAs be created for this.
Improved efficiencies for userspace memory allocators are expected.
- The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
tracepoints to provide increased visibility into memcg stats flushing
activity.
- The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
fixes a zram buglet which potentially affected performance.
- The series "mm: add more kernel parameters to control mTHP" from
Maíra Canal enhances our ability to control/configuremultisize THP from
the kernel boot command line.
- The series "kasan: few improvements on kunit tests" from Sabyrzhan
Tasbolatov has a couple of fixups for the KASAN KUnit tests.
- The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
from Kairui Song optimizes list_lru memory utilization when lockdep is
enabled.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzwFqgAKCRDdBJ7gKXxA
jkeuAQCkl+BmeYHE6uG0hi3pRxkupseR6DEOAYIiTv0/l8/GggD/Z3jmEeqnZaNq
xyyenpibWgUoShU2wZ/Ha8FE5WDINwg=
=JfWR
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "zram: optimal post-processing target selection" from
Sergey Senozhatsky improves zram's post-processing selection
algorithm. This leads to improved memory savings.
- Wei Yang has gone to town on the mapletree code, contributing several
series which clean up the implementation:
- "refine mas_mab_cp()"
- "Reduce the space to be cleared for maple_big_node"
- "maple_tree: simplify mas_push_node()"
- "Following cleanup after introduce mas_wr_store_type()"
- "refine storing null"
- The series "selftests/mm: hugetlb_fault_after_madv improvements" from
David Hildenbrand fixes this selftest for s390.
- The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
implements some rationaizations and cleanups in the page mapping
code.
- The series "mm: optimize shadow entries removal" from Shakeel Butt
optimizes the file truncation code by speeding up the handling of
shadow entries.
- The series "Remove PageKsm()" from Matthew Wilcox completes the
migration of this flag over to being a folio-based flag.
- The series "Unify hugetlb into arch_get_unmapped_area functions" from
Oscar Salvador implements a bunch of consolidations and cleanups in
the hugetlb code.
- The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
takes away the wp-fault time practice of turning a huge zero page
into small pages. Instead we replace the whole thing with a THP. More
consistent cleaner and potentiall saves a large number of pagefaults.
- The series "percpu: Add a test case and fix for clang" from Andy
Shevchenko enhances and fixes the kernel's built in percpu test code.
- The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
optimizes mremap() by avoiding doing things which we didn't need to
do.
- The series "Improve the tmpfs large folio read performance" from
Baolin Wang teaches tmpfs to copy data into userspace at the folio
size rather than as individual pages. A 20% speedup was observed.
- The series "mm/damon/vaddr: Fix issue in
damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
splitting.
- The series "memcg-v1: fully deprecate charge moving" from Shakeel
Butt removes the long-deprecated memcgv2 charge moving feature.
- The series "fix error handling in mmap_region() and refactor" from
Lorenzo Stoakes cleanup up some of the mmap() error handling and
addresses some potential performance issues.
- The series "x86/module: use large ROX pages for text allocations"
from Mike Rapoport teaches x86 to use large pages for
read-only-execute module text.
- The series "page allocation tag compression" from Suren Baghdasaryan
is followon maintenance work for the new page allocation profiling
feature.
- The series "page->index removals in mm" from Matthew Wilcox remove
most references to page->index in mm/. A slow march towards shrinking
struct page.
- The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
interface tests" from Andrew Paniakin performs maintenance work for
DAMON's self testing code.
- The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
improves zswap's batching of compression and decompression. It is a
step along the way towards using Intel IAA hardware acceleration for
this zswap operation.
- The series "kasan: migrate the last module test to kunit" from
Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
tests over to the KUnit framework.
- The series "implement lightweight guard pages" from Lorenzo Stoakes
permits userapace to place fault-generating guard pages within a
single VMA, rather than requiring that multiple VMAs be created for
this. Improved efficiencies for userspace memory allocators are
expected.
- The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
tracepoints to provide increased visibility into memcg stats flushing
activity.
- The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
fixes a zram buglet which potentially affected performance.
- The series "mm: add more kernel parameters to control mTHP" from
Maíra Canal enhances our ability to control/configuremultisize THP
from the kernel boot command line.
- The series "kasan: few improvements on kunit tests" from Sabyrzhan
Tasbolatov has a couple of fixups for the KASAN KUnit tests.
- The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
from Kairui Song optimizes list_lru memory utilization when lockdep
is enabled.
* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
mm/kfence: add a new kunit test test_use_after_free_read_nofault()
zram: fix NULL pointer in comp_algorithm_show()
memcg/hugetlb: add hugeTLB counters to memcg
vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
zram: ZRAM_DEF_COMP should depend on ZRAM
MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
Docs/mm/damon: recommend academic papers to read and/or cite
mm: define general function pXd_init()
kmemleak: iommu/iova: fix transient kmemleak false positive
mm/list_lru: simplify the list_lru walk callback function
mm/list_lru: split the lock to per-cgroup scope
mm/list_lru: simplify reparenting and initial allocation
mm/list_lru: code clean up for reparenting
mm/list_lru: don't export list_lru_add
mm/list_lru: don't pass unnecessary key parameters
kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmc/WXkACgkQnJ2qBz9k
QNnwjAf/c8K3Vhw9RuKMtPF0K+gC//0mLsq+WmgrtXfMLvbSymrACnwHFJzpNGeS
iEqCYlCC7vlqzPXpsVRlFeHpM52oVnE/wFF0Hp1h/Y1oqbRSzur6iSl4epmmBN+K
AsPoWEXco7ABqtrhoZb0b1n7io9VorHN4nLhO6KWD83nZAawJDWgSw0sNCqcT6to
vVxR3baP/EhONxNquxXe2lxq26dMilehmTk4AOyYslNYb0iG4r18TPyNb7fmuuKG
M+nFfMnM9EPH8lnmgx6Mg/X77d/eZoq4pMRmeqSsroB5k/AQJnNrGweNL1+yr7OY
adWNOMGWdNNQXPFgGbL5yZwNZ64kRA==
=Eq1B
-----END PGP SIGNATURE-----
Merge tag 'reiserfs_delete' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull reiserfs removal from Jan Kara:
"The deprecation period of reiserfs is ending at the end of this year
so it is time to remove it"
* tag 'reiserfs_delete' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: The last commit
Current loop calls vfs_statfs() while holding the q->limits_lock. If
FS takes some locking in vfs_statfs callback, this may lead to ABBA
locking bug (at least, FAT fs has this issue actually).
So this patch calls vfs_statfs() outside q->limits_locks instead,
because looks like no reason to hold q->limits_locks while getting
discord configs.
Chain exists of:
&sbi->fat_lock --> &q->q_usage_counter(io)#17 --> &q->limits_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->limits_lock);
lock(&q->q_usage_counter(io)#17);
lock(&q->limits_lock);
lock(&sbi->fat_lock);
*** DEADLOCK ***
Reported-by: syzbot+a5d8c609c02f508672cc@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a5d8c609c02f508672cc
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add requests to the tail of the list instead of the front so that they
are queued up in submission order.
Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs
and nvme_queue_rqs now that the list is ordered as expected.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the semi-open coded request list helpers with a proper rq_list
type that mirrors the bio_list and has head and tail pointers. Besides
better type safety this actually allows to insert at the tail of the
list, which will be useful soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_flush_plug_list submits requests in the reverse order that they
were submitted, which leads to a rather suboptimal I/O pattern
especially in rotational devices. Fix this by rewriting virtio_queue_rqs
so that it always pops the requests from the passed in request list, and
then adds them to the head of a local submit list. This actually
simplifies the code a bit as it removes the complicated list splicing,
at the cost of extra updates of the rq_next pointer. As that should be
cache hot anyway it should be an easy price to pay.
Fixes: 0e9911fa76 ("virtio-blk: support mq_ops->queue_rqs()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When Compressed RAM block device support is disabled, the
CONFIG_ZRAM_DEF_COMP symbol still ends up in the generated config file:
CONFIG_ZRAM_DEF_COMP="unset-value"
While this causes no real harm, avoid polluting the config file by
adding a dependency on ZRAM.
Link: https://lkml.kernel.org/r/64e05bad68a9bd5cc322efd114a04d25de525940.1730807319.git.geert@linux-m68k.org
Fixes: 917a59e81c ("zram: introduce custom comp backends API")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If entry does not fulfill current mark_idle() parameters, e.g. cutoff
time, then we should clear its ZRAM_IDLE from previous mark_idle()
invocations.
Consider the following case:
- mark_idle() cutoff time 8h
- mark_idle() cutoff time 4h
- writeback() idle - will writeback entries with cutoff time 8h,
while it should only pick entries with cutoff time 4h
The bug was reported by Shin Kawamura.
Link: https://lkml.kernel.org/r/20241028153629.1479791-3-senozhatsky@chromium.org
Fixes: 755804d169 ("zram: introduce an aged idle interface")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Shin Kawamura <kawasin@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Patch series "zram: IDLE flag handling fixes", v2.
zram can wrongly preserve ZRAM_IDLE flag on its entries which can result
in premature post-processing (writeback and recompression) of such
entries.
This patch (of 2)
Recompression should clear ZRAM_IDLE flag on the entries it has accessed,
because otherwise some entries, specifically those for which recompression
has failed, become immediate candidate entries for another post-processing
(e.g. writeback).
Consider the following case:
- recompression marks entries IDLE every 4 hours and attempts
to recompress them
- some entries are incompressible, so we keep them intact and
hence preserve IDLE flag
- writeback marks entries IDLE every 8 hours and writebacks
IDLE entries, however we have IDLE entries left from
recompression, so writeback prematurely writebacks those
entries.
The bug was reported by Shin Kawamura.
Link: https://lkml.kernel.org/r/20241028153629.1479791-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20241028153629.1479791-2-senozhatsky@chromium.org
Fixes: 84b33bf788 ("zram: introduce recompress sysfs knob")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Shin Kawamura <kawasin@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper. This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.
Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In ublk_ch_mmap(), queue id is calculated in the following way:
(vma->vm_pgoff << PAGE_SHIFT) / `max_cmd_buf_size`
'max_cmd_buf_size' is equal to
`UBLK_MAX_QUEUE_DEPTH * sizeof(struct ublksrv_io_desc)`
and UBLK_MAX_QUEUE_DEPTH is 4096 and part of UAPI, so 'max_cmd_buf_size'
is always page aligned in 4K page size kernel. However, it isn't true in
64K page size kernel.
Fixes the issue by always rounding up 'max_cmd_buf_size' with PAGE_SIZE.
Cc: stable@vger.kernel.org
Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241111110718.1394001-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
PAGE_SIZE may be 64K, and the max block size can be PAGE_SIZE, so any
variable for holding block size can't be defined as 'unsigned short'.
Unfortunately commit 473516b361 ("loop: use the atomic queue limits
update API") passes 'bsize' with type of 'unsigned short' to
loop_reconfigure_limits(), and causes LTP/ioctl_loop06 test failure:
12 ioctl_loop06.c:76: TINFO: Using LOOP_SET_BLOCK_SIZE with arg > PAGE_SIZE
13 ioctl_loop06.c:59: TFAIL: Set block size succeed unexpectedly
...
18 ioctl_loop06.c:76: TINFO: Using LOOP_CONFIGURE with block_size > PAGE_SIZE
19 ioctl_loop06.c:59: TFAIL: Set block size succeed unexpectedly
Fixes the issue by defining 'block size' variable with 'unsigned int', which is
aligned with block layer's definition.
(improve commit log & add fixes tag)
Fixes: 473516b361 ("loop: use the atomic queue limits update API")
Cc: John Garry <john.g.garry@oracle.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Li Wang <liwang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241109022744.1126003-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Unfreeze queue after returning from blk_mark_disk_dead(), this way at
least allows us to verify queue freeze correctly with lockdep.
Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A cosmetic change: do not open-code compression priority 0, use
ZRAM_PRIMARY_COMP instead.
Link: https://lkml.kernel.org/r/20241009042908.750260-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
pcim_iomap_table() and pcim_request_regions() have been deprecated in
commit e354bb84a4 ("PCI: Deprecate pcim_iomap_table(),
pcim_iomap_regions_request_all()") and commit d140f80f60 ("PCI:
Deprecate pcim_iomap_regions() in favor of pcim_iomap_region()"),
respectively.
Replace these functions with pcim_iomap_region().
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://lore.kernel.org/r/20241106145249.108996-2-pstanner@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We now have only one active post-processing at any time, so we don't have
same race conditions that we had before. If slot selected for
post-processing gets freed or freed and reallocated it loses its PP_SLOT
flag and there is no way for such a slot to gain PP_SLOT flag again until
current post-processing terminates.
Link: https://lkml.kernel.org/r/20240917021020.883356-8-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Drop some redundant zram_test_flag() calls and re-order zram_clear_flag()
calls. Plus two small trivial coding style fixes. No functional changes.
Link: https://lkml.kernel.org/r/20240917021020.883356-7-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ZRAM_SAME slots cannot be post-processed (writeback or recompress) so do
not mark them ZRAM_IDLE. Same with ZRAM_WB slots, they cannot be
ZRAM_IDLE because they are not in zsmalloc pool anymore.
Link: https://lkml.kernel.org/r/20240917021020.883356-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Writeback suffers from the same problem as recompression did before -
target slot selection for writeback is just a simple iteration over
zram->table entries (stored pages) which selects suboptimal targets for
writeback. This is especially problematic for writeback, because we
uncompress objects before writeback so each of them takes 4K out of
limited writeback storage. For example, when we take a 48 bytes slot and
store it as a 4K object to writeback device we only save 48 bytes of
memory (release from zsmalloc pool). We naturally want to pick the
largest objects for writeback, because then each writeback will release
the largest amount of memory.
This patch applies the same solution and strategy as for recompression
target selection: pp control (post-process) with 16 buckets of candidate
pp slots. Slots are assigned to pp buckets based on sizes - the larger
the slot the higher the group index. This gives us sorted by size lists
of candidate slots (in linear time), so that among post-processing
candidate slots we always select the largest ones first and maximize the
memory saving.
TEST
====
A very simple demonstration: zram is configured with a writeback device.
A limited writeback (wb_limit 2500 pages) is performed then, with a log of
sizes of slots that were written back. You can see that patched zram
selects slots for recompression in significantly different manner, which
leads to higher memory savings (see column #2 of mm_stat output).
BASE
----
*** initial state of zram device
/sys/block/zram0/mm_stat
1750327296 619765836 631902208 0 631902208 1 0 34278 34278
*** writeback idle wb_limit 2500
/sys/block/zram0/mm_stat
1750327296 617622333 631578624 0 631902208 1 0 34278 34278
Sizes of selected objects for writeback:
... 193 349 46 46 46 46 852 1002 543 162 107 49 34 34 34 ...
PATCHED
-------
*** initial state of zram device
/sys/block/zram0/mm_stat
1750319104 619760957 631992320 0 631992320 1 0 34278 34278
*** writeback idle wb_limit 2500
/sys/block/zram0/mm_stat
1750319104 612672056 626135040 0 631992320 1 0 34278 34278
Sizes of selected objects for writeback:
... 3667 3580 3581 3580 3581 3581 3581 3231 3211 3203 3231 3246 ...
Note, pp-slots are not strictly sorted, there is a PP_BUCKET_SIZE_RANGE
variation of sizes within particular bucket.
Link: https://lkml.kernel.org/r/20240917021020.883356-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Target slot selection for recompression is just a simple iteration over
zram->table entries (stored pages) from slot 0 to max slot. Given that
zram->table slots are written in random order and are not sorted by size,
a simple iteration over slots selects suboptimal targets for
recompression. This is not a problem if we recompress every single
zram->table slot, but we never do that in reality. In reality we limit
the number of slots we can recompress (via max_pages parameter) and hence
proper slot selection becomes very important. The strategy is quite
simple, suppose we have two candidate slots for recompression, one of size
48 bytes and one of size 2800 bytes, and we can recompress only one, then
it certainly makes more sense to pick 2800 entry for recompression.
Because even if we manage to compress 48 bytes objects even further the
savings are going to be very small. Potential savings after good
re-compression of 2800 bytes objects are much higher.
This patch reworks slot selection and introduces the strategy described
above: among candidate slots always select the biggest ones first.
For that the patch introduces zram_pp_ctl (post-processing) structure
which holds NUM_PP_BUCKETS pp buckets of slots. Slots are assigned to a
particular group based on their sizes - the larger the size of the slot
the higher the group index. This, basically, sorts slots by size in liner
time (we still perform just one iteration over zram->table slots). When
we select slot for recompression we always first lookup in higher pp
buckets (those that hold the largest slots). Which achieves the desired
behavior.
TEST
====
A very simple demonstration: zram is configured with zstd, and zstd with
dict as a recompression stream. A limited (max 4096 pages) recompression
is performed then, with a log of sizes of slots that were recompressed.
You can see that patched zram selects slots for recompression in
significantly different manner, which leads to higher memory savings (see
column #2 of mm_stat output).
BASE
----
*** initial state of zram device
/sys/block/zram0/mm_stat
1750994944 504491413 514203648 0 514203648 1 0 34204 34204
*** recompress idle max_pages=4096
/sys/block/zram0/mm_stat
1750994944 504262229 514953216 0 514203648 1 0 34204 34204
Sizes of selected objects for recompression:
... 45 58 24 226 91 40 24 24 24 424 2104 93 2078 2078 2078 959 154 ...
PATCHED
-------
*** initial state of zram device
/sys/block/zram0/mm_stat
1750982656 504492801 514170880 0 514170880 1 0 34204 34204
*** recompress idle max_pages=4096
/sys/block/zram0/mm_stat
1750982656 503716710 517586944 0 514170880 1 0 34204 34204
Sizes of selected objects for recompression:
... 3680 3694 3667 3590 3614 3553 3537 3548 3550 3542 3543 3537 ...
Note, pp-slots are not strictly sorted, there is a PP_BUCKET_SIZE_RANGE
variation of sizes within particular bucket.
[senozhatsky@chromium.org: do not skip the first bucket]
Link: https://lkml.kernel.org/r/20241001085634.1948384-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20240917021020.883356-4-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Both recompress and writeback soon will unlock slots during processing,
which makes things too complex wrt possible race-conditions. We still
want to clear PP_SLOT in slot_free, because this is how we figure out that
slot that was selected for post-processing has been released under us and
when we start post-processing we check if slot still has PP_SLOT set. At
the same time, theoretically, we can have something like this:
CPU0 CPU1
recompress
scan slots
set PP_SLOT
unlock slot
slot_free
clear PP_SLOT
allocate PP_SLOT
writeback
scan slots
set PP_SLOT
unlock slot
select PP-slot
test PP_SLOT
So recompress will not detect that slot has been re-used and re-selected
for concurrent writeback post-processing.
Make sure that we only permit on post-processing operation at a time. So
now recompress and writeback post-processing don't race against each
other, we only need to handle slot re-use (slot_free and write), which is
handled individually by each pp operation.
Having recompress and writeback competing for the same slots is not
exactly good anyway (can't imagine anyone doing that).
Link: https://lkml.kernel.org/r/20240917021020.883356-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "zram: optimal post-processing target selection", v5.
Problem:
--------
Both recompression and writeback perform a very simple linear scan of all
zram slots in search for post-processing (writeback or recompress)
candidate slots. This often means that we pick the worst candidate for pp
(post-processing), e.g. a 48 bytes object for writeback, which is nearly
useless, because it only releases 48 bytes from zsmalloc pool, but
consumes an entire 4K slot in the backing device. Similarly,
recompression of an 48 bytes objects is unlikely to save more memory that
recompression of a 3000 bytes object. Both recompression and writeback
consume constrained resources (CPU time, batter, backing device storage
space) and quite often have a (daily) limit on the number of items they
post-process, so we should utilize those constrained resources in the most
optimal way.
Solution:
---------
This patch reworks the way we select pp targets. We, quite clearly, want
to sort all the candidates and always pick the largest, be it
recompression or writeback. Especially for writeback, because the larger
object we writeback the more memory we release. This series introduces
concept of pp buckets and pp scan/selection.
The scan step is a simple iteration over all zram->table entries, just
like what we currently do, but we don't post-process a candidate slot
immediately. Instead we assign it to a PP (post-processing) bucket. PP
bucket is, basically, a list which holds pp candidate slots that belong to
the same size class. PP buckets are 64 bytes apart, slots are not
strictly sorted within a bucket there is a 64 bytes variance.
The select step simply iterates over pp buckets from highest to lowest and
picks all candidate slots a particular buckets contains. So this gives us
sorted candidates (in linear time) and allows us to select most optimal
(largest) candidates for post-processing first.
This patch (of 7):
This flag indicates that the slot was selected as a candidate slot for
post-processing (pp) and was assigned to a pp bucket. It does not
necessarily mean that the slot is currently under post-processing, but may
mean so. The slot can loose its PP_SLOT flag, while still being in the
pp-bucket, if it's accessed or slot_free-ed.
Link: https://lkml.kernel.org/r/20240917021020.883356-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20240917021020.883356-2-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper. This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.
Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A bdev discard granularity is always at least SECTOR_SIZE, so don't check
for a zero value.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241101092215.422428-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
My colleague Wupeng found the following problems during fault injection:
BUG: unable to handle page fault for address: fffffbfff809d073
PGD 6e648067 P4D 123ec8067 PUD 123ec4067 PMD 100e38067 PTE 0
Oops: Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 5 UID: 0 PID: 755 Comm: modprobe Not tainted 6.12.0-rc3+ #17
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.1-2.fc37 04/01/2014
RIP: 0010:__asan_load8+0x4c/0xa0
...
Call Trace:
<TASK>
blkdev_put_whole+0x41/0x70
bdev_release+0x1a3/0x250
blkdev_release+0x11/0x20
__fput+0x1d7/0x4a0
task_work_run+0xfc/0x180
syscall_exit_to_user_mode+0x1de/0x1f0
do_syscall_64+0x6b/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
loop_init() is calling loop_add() after __register_blkdev() succeeds and
is ignoring disk_add() failure from loop_add(), for loop_add() failure
is not fatal and successfully created disks are already visible to
bdev_open().
brd_init() is currently calling brd_alloc() before __register_blkdev()
succeeds and is releasing successfully created disks when brd_init()
returns an error. This can cause UAF for the latter two case:
case 1:
T1:
modprobe brd
brd_init
brd_alloc(0) // success
add_disk
disk_scan_partitions
bdev_file_open_by_dev // alloc file
fput // won't free until back to userspace
brd_alloc(1) // failed since mem alloc error inject
// error path for modprobe will release code segment
// back to userspace
__fput
blkdev_release
bdev_release
blkdev_put_whole
bdev->bd_disk->fops->release // fops is freed now, UAF!
case 2:
T1: T2:
modprobe brd
brd_init
brd_alloc(0) // success
open(/dev/ram0)
brd_alloc(1) // fail
// error path for modprobe
close(/dev/ram0)
...
/* UAF! */
bdev->bd_disk->fops->release
Fix this problem by following what loop_init() does. Besides,
reintroduce brd_devices_mutex to help serialize modifications to
brd_list.
Fixes: 7f9b348cb5 ("brd: convert to blk_alloc_disk/blk_cleanup_disk")
Reported-by: Wupeng Ma <mawupeng1@huawei.com>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241030034914.907829-1-yangerkun@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk currently supports the following behaviors on ublk server exit:
A: outstanding I/Os get errors, subsequently issued I/Os get errors
B: outstanding I/Os get errors, subsequently issued I/Os queue
C: outstanding I/Os get reissued, subsequently issued I/Os queue
and the following behaviors for recovery of preexisting block devices by
a future incarnation of the ublk server:
1: ublk devices stopped on ublk server exit (no recovery possible)
2: ublk devices are recoverable using start/end_recovery commands
The userspace interface allows selection of combinations of these
behaviors using flags specified at device creation time, namely:
default behavior: A + 1
UBLK_F_USER_RECOVERY: B + 2
UBLK_F_USER_RECOVERY|UBLK_F_USER_RECOVERY_REISSUE: C + 2
The behavior A + 2 is currently unsupported. Add support for this
behavior under the new flag combination
UBLK_F_USER_RECOVERY|UBLK_F_USER_RECOVERY_FAIL_IO.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241007182419.3263186-5-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Save some lines by merging stop_work and quiesce_work into nosrv_work,
which looks at the recovery flags and does the right thing when the "no
ublk server" condition is detected.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241007182419.3263186-4-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk currently supports the following behaviors on ublk server exit:
A: outstanding I/Os get errors, subsequently issued I/Os get errors
B: outstanding I/Os get errors, subsequently issued I/Os queue
C: outstanding I/Os get reissued, subsequently issued I/Os queue
and the following behaviors for recovery of preexisting block devices by
a future incarnation of the ublk server:
1: ublk devices stopped on ublk server exit (no recovery possible)
2: ublk devices are recoverable using start/end_recovery commands
The userspace interface allows selection of combinations of these
behaviors using flags specified at device creation time, namely:
default behavior: A + 1
UBLK_F_USER_RECOVERY: B + 2
UBLK_F_USER_RECOVERY|UBLK_F_USER_RECOVERY_REISSUE: C + 2
We can't easily change the userspace interface to allow independent
selection of one of {A, B, C} and one of {1, 2}, but we can refactor the
internal helpers which test for the flags. Replace the existing helpers
with the following set:
ublk_nosrv_should_reissue_outstanding: tests for behavior C
ublk_nosrv_[dev_]should_queue_io: tests for behavior B
ublk_nosrv_should_stop_dev: tests for behavior 1
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241007182419.3263186-3-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Setting UBLK_F_USER_RECOVERY_REISSUE without also setting
UBLK_F_USER_RECOVERY is currently silently equivalent to not setting any
recovery flags at all, even though that's obviously not intended. Check
for this case and fail add_dev (with a paranoid warning to aid debugging
any program which might rely on the old behavior) with EINVAL if it is
detected.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241007182419.3263186-2-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Deprecation period of reiserfs ends with the end of this year so it is
time to remove it from the kernel.
Acked-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmcSk4AQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuuXD/0UERdP+djJNoXBW5Mv7U5a4rJ7ZgfPL7ku
z3ZfdnNYGitZhYkVjNQ60TLzXRQyUaIIxMVBWzkb59I6ixmuQbzm/lC55B6s/FIR
bfT3afe1WRgLCaFbStu91qRs/44Mq4yK6wXcIU7LwutRT/5cqZwelqRZLK7DMFln
zlGX4zNrCMRUDTr6PLa6CvyY4dmQSL17Ib1ypcKXjGs5YjDntzSrIsKVT1Wayans
WroGGPG6W7r2c2kn8pe4uPIjZVfMUF2vrdIs0KEYaAQOC7ppEucCgDZMEWRs7kdH
63hheudJjVSwLF/qYnXNHe/Bz12QCZohPp6UsqRpC8o96Ralgo6Q+FxkXsVelMXW
JKhtDqYGBDHOQrjrEWN1rnYw/DauEQAgvOtdVfEx2IBzPsG07cB8yv8MNA90H9QH
KStI7h9qnBEMMNcXX8prOymCHNWAeuF4mbitVrRfSfEVm/0BbQ19qoyGrvwNFgEf
6T+4Xj/P+FsiLVe8vsgBZDaxEEU5Ifd/rki/QFVk/2z72BBZxmdf2nm51SOM28V7
HGMHwJI3H8rdmPXvt5Q/ve6GWNOYLO5PSAJgSSe96UStvtsAHGB4eM+LykdnE7cI
SoytU5KfAM8DD6wnyHIgYuvJyZWrmLoVDrRjym8emc2KrJOe7qg+Ah4ERcNTCnhl
nw50f27G4w==
=waNY
-----END PGP SIGNATURE-----
Merge tag 'block-6.12-20241018' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Fix target passthrough identifier (Nilay)
- Fix tcp locking (Hannes)
- Replace list with sbitmap for tracking RDMA rsp tags (Guixen)
- Remove unnecessary fallthrough statements (Tokunori)
- Remove ready-without-media support (Greg)
- Fix multipath partition scan deadlock (Keith)
- Fix concurrent PCI reset and remove queue mapping (Maurizio)
- Fabrics shutdown fixes (Nilay)
- Fix for a kerneldoc warning (Keith)
- Fix a race with blk-rq-qos and wakeups (Omar)
- Cleanup of checking for always-set tag_set (SurajSonawane2415)
- Fix for a crash with CPU hotplug notifiers (Ming)
- Don't allow zero-copy ublk on unprivileged device (Ming)
- Use array_index_nospec() for CDROM (Josh)
- Remove dead code in drbd (David)
- Tweaks to elevator loading (Breno)
* tag 'block-6.12-20241018' of git://git.kernel.dk/linux:
cdrom: Avoid barrier_nospec() in cdrom_ioctl_media_changed()
nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function
nvme: make keep-alive synchronous operation
nvme-loop: flush off pending I/O while shutting down loop controller
nvme-pci: fix race condition between reset and nvme_dev_disable()
ublk: don't allow user copy for unprivileged device
blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race
nvme-multipath: defer partition scanning
blk-mq: setup queue ->tag_set before initializing hctx
elevator: Remove argument from elevator_find_get
elevator: do not request_module if elevator exists
drbd: Remove unused conn_lowest_minor
nvme: disable CC.CRIME (NVME_CC_CRIME)
nvme: delete unnecessary fallthru comment
nvmet-rdma: use sbitmap to replace rsp free list
block: Fix elevator_get_default() checking for NULL q->tag_set
nvme: tcp: avoid race between queue_lock lock and destroy
nvmet-passthru: clear EUID/NGUID/UUID while using loop target
block: fix blk_rq_map_integrity_sg kernel-doc
UBLK_F_USER_COPY requires userspace to call write() on ublk char
device for filling request buffer, and unprivileged device can't
be trusted.
So don't allow user copy for unprivileged device.
Cc: stable@vger.kernel.org
Fixes: 1172d5b8be ("ublk: support user copy")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241016134847.2911721-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that we got the kernel `Box` type in place, convert all existing
`Box` users to make use of it.
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Benno Lossin <benno.lossin@proton.me>
Reviewed-by: Gary Guo <gary@garyguo.net>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://lore.kernel.org/r/20241004154149.93856-13-dakr@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
conn_lowest_minor() last use was removed by 2011 commit
69a227731a ("drbd: Pass a peer device to a number of fuctions")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://lore.kernel.org/r/20241010204426.277535-1-linux@treblig.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.
auto-generated by the following:
for i in `git grep -l -w asm/unaligned.h`; do
sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
For fixing CVE-2023-6270, f98364e926 ("aoe: fix the potential
use-after-free problem in aoecmd_cfg_pkts") makes tx() calling dev_put()
instead of doing in aoecmd_cfg_pkts(). It avoids that the tx() runs
into use-after-free.
Then Nicolai Stange found more places in aoe have potential use-after-free
problem with tx(). e.g. revalidate(), aoecmd_ata_rw(), resend(), probe()
and aoecmd_cfg_rsp(). Those functions also use aoenet_xmit() to push
packet to tx queue. So they should also use dev_hold() to increase the
refcnt of skb->dev.
On the other hand, moving dev_put() to tx() causes that the refcnt of
skb->dev be reduced to a negative value, because corresponding
dev_hold() are not called in revalidate(), aoecmd_ata_rw(), resend(),
probe(), and aoecmd_cfg_rsp(). This patch fixed this issue.
Cc: stable@vger.kernel.org
Link: https://nvd.nist.gov/vuln/detail/CVE-2023-6270
Fixes: f98364e926 ("aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts")
Reported-by: Nicolai Stange <nstange@suse.com>
Signed-off-by: Chun-Yi Lee <jlee@suse.com>
Link: https://lore.kernel.org/stable/20240624064418.27043-1-jlee%40suse.com
Link: https://lore.kernel.org/r/20241002035458.24401-1-jlee@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's a focus on fixes for the memfd_pin_folios() work which was added
into 6.11. Apart from that, the usual shower of singleton fixes.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZvbhSAAKCRDdBJ7gKXxA
jp8CAP47txk2c+tBLggog2MkQamADY5l5MT6E3fYq3ghSiKtVQEAnqX3LiQJ02tB
o9LcPcVrM90QntpKrLP1CpWCVdR+zA8=
=e0QC
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-09-27-09-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"19 hotfixes. 13 are cc:stable.
There's a focus on fixes for the memfd_pin_folios() work which was
added into 6.11. Apart from that, the usual shower of singleton fixes"
* tag 'mm-hotfixes-stable-2024-09-27-09-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
ocfs2: fix uninit-value in ocfs2_get_block()
zram: don't free statically defined names
memory tiers: use default_dram_perf_ref_source in log message
Revert "list: test: fix tests for list_cut_position()"
kselftests: mm: fix wrong __NR_userfaultfd value
compiler.h: specify correct attribute for .rodata..c_jump_table
mm/damon/Kconfig: update DAMON doc URL
mm: kfence: fix elapsed time for allocated/freed track
ocfs2: fix deadlock in ocfs2_get_system_file_inode
ocfs2: reserve space for inline xattr before attaching reflink tree
mm: migrate: annotate data-race in migrate_folio_unmap()
mm/hugetlb: simplify refs in memfd_alloc_folio
mm/gup: fix memfd_pin_folios alloc race panic
mm/gup: fix memfd_pin_folios hugetlb page allocation
mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak
mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
mm/filemap: fix filemap_get_folios_contig THP panic
mm: make SPLIT_PTE_PTLOCKS depend on SMP
tools: fix shared radix-tree build
no_llseek had been defined to NULL two years ago, in commit 868941b144
("fs: remove no_llseek")
To quote that commit,
At -rc1 we'll need do a mechanical removal of no_llseek -
git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
done
would do it.
Unfortunately, that hadn't been done. Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
.llseek = no_llseek,
so it's obviously safe.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When CONFIG_ZRAM_MULTI_COMP isn't set ZRAM_SECONDARY_COMP can hold
default_compressor, because it's the same offset as ZRAM_PRIMARY_COMP, so
we need to make sure that we don't attempt to kfree() the statically
defined compressor name.
This is detected by KASAN.
==================================================================
Call trace:
kfree+0x60/0x3a0
zram_destroy_comps+0x98/0x198 [zram]
zram_reset_device+0x22c/0x4a8 [zram]
reset_store+0x1bc/0x2d8 [zram]
dev_attr_store+0x44/0x80
sysfs_kf_write+0xfc/0x188
kernfs_fop_write_iter+0x28c/0x428
vfs_write+0x4dc/0x9b8
ksys_write+0x100/0x1f8
__arm64_sys_write+0x74/0xb8
invoke_syscall+0xd8/0x260
el0_svc_common.constprop.0+0xb4/0x240
do_el0_svc+0x48/0x68
el0_svc+0x40/0xc8
el0t_64_sync_handler+0x120/0x130
el0t_64_sync+0x190/0x198
==================================================================
Link: https://lkml.kernel.org/r/20240923164843.1117010-1-andrej.skvortzov@gmail.com
Fixes: 684826f827 ("zram: free secondary algorithms names")
Signed-off-by: Andrey Skvortsov <andrej.skvortzov@gmail.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Closes: https://lore.kernel.org/lkml/57130e48-dbb6-4047-a8c7-ebf5aaea93f4@linux.vnet.ibm.com/
Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Cc: Chris Li <chrisl@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmb0T5AQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnfHEADCXqmqZC+xr3sHZH9T1lz9KaFp1FjuBhCw
bGpUgXQ9aLcqQUWJxmYVer8N2x2+Ds+xq4fm/rP1BfvNgRupqheHBwuLxSrz14EX
lYmKZ+krMIPTDaLFewmEWflDwmZX0WFgV6nKTMLiO5BMeI4zXCkFGtwYFys2+Cdd
9zYCFPgGDZUR77Ws5PpyqPVz2MoiNtsjrGmHpEmNZ+rIDzlpVOYgYk27X9ZbvNxC
/l0KTc9+ayAeG0Kx5jO+m6Hrj3I6ehvM9JZMgpS/tF/jtccD2oVkJFJDlU+Jciv6
BwVzgyDPGV7sXFT1fnSqDBYYwr/73nzNH0Gk8wn4Jg2LhjmVANVo9eQSOXDTYZI+
O4HfIHGTIrk75TQd4bhq3dqaylS78pKBI/eQJUli2UNoyLWMrMyE88yh2YJam2Fs
vJ/MHGxvFRurYbAlqLr33nb3ajvpg+D7XuAYfqHPMc2ZUe28Kza50Dj+luNjfVCu
3qfR6qBlsdWuABtUS3vneB9jZp5jDnOpVfuBgtcAqIboUjehTXsI7If09Ex/mxLq
O0KqNwBMfunPOKd5kGXlAgY8LRMfOhNaAAFBlXYUZB2eAadQnqVselTFvHMZkXo7
wH/l6trd+/Tf+7Rav0YduNIlpVr7IctC+A7ph4zPdIjQxFEySCrC7cvAjel29LyV
zgWW0Mw/sA==
=yiWu
-----END PGP SIGNATURE-----
Merge tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- Improve blk-integrity segment counting and merging (Keith)
- NVMe pull request via Keith:
- Multipath fixes (Hannes)
- Sysfs attribute list NULL terminate fix (Shin'ichiro)
- Remove problematic read-back (Keith)
- Fix for a regression with the IO scheduler switching freezing from
6.11 (Damien)
- Use a raw spinlock for sbitmap, as it may get called from preempt
disabled context (Ming)
- Cleanup for bd_claiming waiting, using var_waitqueue() rather than
the bit waitqueues, as that more accurately describes that it does
(Neil)
- Various cleanups (Kanchan, Qiu-ji, David)
* tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux:
nvme: remove CC register read-back during enabling
nvme: null terminate nvme_tls_attrs
nvme-multipath: avoid hang on inaccessible namespaces
nvme-multipath: system fails to create generic nvme device
lib/sbitmap: define swap_lock as raw_spinlock_t
block: Remove unused blk_limits_io_{min,opt}
drbd: Fix atomicity violation in drbd_uuid_set_bm()
block: Fix elv_iosched_local_module handling of "none" scheduler
block: remove bogus union
block: change wait on bd_claiming to use a var_waitqueue
blk-integrity: improved sg segment mapping
block: unexport blk_rq_count_integrity_sg
nvme-rdma: use request to get integrity segments
scsi: use request to get integrity segments
block: provide a request helper for user integrity segments
blk-integrity: consider entire bio list for merging
blk-integrity: properly account for segments
blk-mq: set the nr_integrity_segments from bio
blk-mq: unconditional nr_integrity_segments
this pull request are:
"Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
consistency to the APIs and behaviour of these two core allocation
functions. This also simplifies/enables Rustification.
"Some cleanups for shmem" from Baolin Wang. No functional changes - mode
code reuse, better function naming, logic simplifications.
"mm: some small page fault cleanups" from Josef Bacik. No functional
changes - code cleanups only.
"Various memory tiering fixes" from Zi Yan. A small fix and a little
cleanup.
"mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
simplifications and .text shrinkage.
"Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This
is a feature, it adds new feilds to /proc/vmstat such as
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0
which tells us that 11391 processes used 4k of stack while none at all
used 16k. Useful for some system tuning things, but partivularly useful
for "the dynamic kernel stack project".
"kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov.
Teaches kmemleak to detect leaksage of percpu memory.
"mm: memcg: page counters optimizations" from Roman Gushchin. "3
independent small optimizations of page counters".
"mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David
Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work
correctly by design rather than by accident.
"mm: remove arch_make_page_accessible()" from David Hildenbrand. Some
folio conversions which make arch_make_page_accessible() unneeded.
"mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel.
Cleans up and fixes our handling of the resetting of the cgroup/process
peak-memory-use detector.
"Make core VMA operations internal and testable" from Lorenzo Stoakes.
Rationalizaion and encapsulation of the VMA manipulation APIs. With a
view to better enable testing of the VMA functions, even from a
userspace-only harness.
"mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in
the zswap global shrinker, resulting in improved performance.
"mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in
some missing info in /proc/zoneinfo.
"mm: replace follow_page() by folio_walk" from David Hildenbrand. Code
cleanups and rationalizations (conversion to folio_walk()) resulting in
the removal of follow_page().
"improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some
tuning to improve zswap's dynamic shrinker. Significant reductions in
swapin and improvements in performance are shown.
"mm: Fix several issues with unaccepted memory" from Kirill Shutemov.
Improvements to the new unaccepted memory feature,
"mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX
PUDs. This was missing, although nobody seems to have notied yet.
"Introduce a store type enum for the Maple tree" from Sidhartha Kumar.
Cleanups and modest performance improvements for the maple tree library
code.
"memcg: further decouple v1 code from v2" from Shakeel Butt. Move more
cgroup v1 remnants away from the v2 memcg code.
"memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds
various warnings telling users that memcg v1 features are deprecated.
"mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li.
Greatly improves the success rate of the mTHP swap allocation.
"mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate
per-arch implementations of numa_memblk code into generic code.
"mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
improves the performance of munmap() of swap-filled ptes.
"support large folio swap-out and swap-in for shmem" from Baolin Wang.
With this series we no longer split shmem large folios into simgle-page
folios when swapping out shmem.
"mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance
improvements and code reductions for gigantic folios.
"support shmem mTHP collapse" from Baolin Wang. Adds support for
khugepaged's collapsing of shmem mTHP folios.
"mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
performance regression due to the addition of mseal().
"Increase the number of bits available in page_type" from Matthew Wilcox.
Increases the number of bits available in page_type!
"Simplify the page flags a little" from Matthew Wilcox. Many legacy page
flags are now folio flags, so the page-based flags and their
accessors/mutators can be removed.
"mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An
optimization which permits us to avoid writing/reading zero-filled zswap
pages to backing store.
"Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window
which occurs when a MAP_FIXED operqtion is occurring during an unrelated
vma tree walk.
"mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the
vma_merge() functionality, making ot cleaner, more testable and better
tested.
"misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor
fixups of DAMON selftests and kunit tests.
"mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code
cleanups and folio conversions.
"Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups
for shmem controls and stats.
"mm: count the number of anonymous THPs per size" from Barry Song. Expose
additional anon THP stats to userspace for improved tuning.
"mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio
conversions and removal of now-unused page-based APIs.
"replace per-quota region priorities histogram buffer with per-context
one" from SeongJae Park. DAMON histogram rationalization.
"Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae
Park. DAMON documentation updates.
"mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve
related doc and warn" from Jason Wang: fixes usage of page allocator
__GFP_NOFAIL and GFP_ATOMIC flags.
"mm: split underused THPs" from Yu Zhao. Improve THP=always policy - this
was overprovisioning THPs in sparsely accessed memory areas.
"zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add
support for zram run-time compression algorithm tuning.
"mm: Care about shadow stack guard gap when getting an unmapped area" from
Mark Brown. Fix up the various arch_get_unmapped_area() implementations
to better respect guard areas.
"Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of
mem_cgroup_iter() and various code cleanups.
"mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
pfnmap support.
"resource: Fix region_intersects() vs add_memory_driver_managed()" from
Huang Ying. Fix a bug in region_intersects() for systems with CXL memory.
"mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a
couple more code paths to correctly recover from the encountering of
poisoned memry.
"mm: enable large folios swap-in support" from Barry Song. Support the
swapin of mTHP memory into appropriately-sized folios, rather than into
single-page folios.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu1BBwAKCRDdBJ7gKXxA
jlWNAQDYlqQLun7bgsAN4sSvi27VUuWv1q70jlMXTfmjJAvQqwD/fBFVR6IOOiw7
AkDbKWP2k0hWPiNJBGwoqxdHHx09Xgo=
=s0T+
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Along with the usual shower of singleton patches, notable patch series
in this pull request are:
- "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
consistency to the APIs and behaviour of these two core allocation
functions. This also simplifies/enables Rustification.
- "Some cleanups for shmem" from Baolin Wang. No functional changes -
mode code reuse, better function naming, logic simplifications.
- "mm: some small page fault cleanups" from Josef Bacik. No
functional changes - code cleanups only.
- "Various memory tiering fixes" from Zi Yan. A small fix and a
little cleanup.
- "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
simplifications and .text shrinkage.
- "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
Butt. This is a feature, it adds new feilds to /proc/vmstat such as
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0
which tells us that 11391 processes used 4k of stack while none at
all used 16k. Useful for some system tuning things, but
partivularly useful for "the dynamic kernel stack project".
- "kmemleak: support for percpu memory leak detect" from Pavel
Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory.
- "mm: memcg: page counters optimizations" from Roman Gushchin. "3
independent small optimizations of page counters".
- "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
David Hildenbrand. Improves PTE/PMD splitlock detection, makes
powerpc/8xx work correctly by design rather than by accident.
- "mm: remove arch_make_page_accessible()" from David Hildenbrand.
Some folio conversions which make arch_make_page_accessible()
unneeded.
- "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
Finkel. Cleans up and fixes our handling of the resetting of the
cgroup/process peak-memory-use detector.
- "Make core VMA operations internal and testable" from Lorenzo
Stoakes. Rationalizaion and encapsulation of the VMA manipulation
APIs. With a view to better enable testing of the VMA functions,
even from a userspace-only harness.
- "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
issues in the zswap global shrinker, resulting in improved
performance.
- "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
in some missing info in /proc/zoneinfo.
- "mm: replace follow_page() by folio_walk" from David Hildenbrand.
Code cleanups and rationalizations (conversion to folio_walk())
resulting in the removal of follow_page().
- "improving dynamic zswap shrinker protection scheme" from Nhat
Pham. Some tuning to improve zswap's dynamic shrinker. Significant
reductions in swapin and improvements in performance are shown.
- "mm: Fix several issues with unaccepted memory" from Kirill
Shutemov. Improvements to the new unaccepted memory feature,
- "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
DAX PUDs. This was missing, although nobody seems to have notied
yet.
- "Introduce a store type enum for the Maple tree" from Sidhartha
Kumar. Cleanups and modest performance improvements for the maple
tree library code.
- "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
more cgroup v1 remnants away from the v2 memcg code.
- "memcg: initiate deprecation of v1 features" from Shakeel Butt.
Adds various warnings telling users that memcg v1 features are
deprecated.
- "mm: swap: mTHP swap allocator base on swap cluster order" from
Chris Li. Greatly improves the success rate of the mTHP swap
allocation.
- "mm: introduce numa_memblks" from Mike Rapoport. Moves various
disparate per-arch implementations of numa_memblk code into generic
code.
- "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
improves the performance of munmap() of swap-filled ptes.
- "support large folio swap-out and swap-in for shmem" from Baolin
Wang. With this series we no longer split shmem large folios into
simgle-page folios when swapping out shmem.
- "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
performance improvements and code reductions for gigantic folios.
- "support shmem mTHP collapse" from Baolin Wang. Adds support for
khugepaged's collapsing of shmem mTHP folios.
- "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
performance regression due to the addition of mseal().
- "Increase the number of bits available in page_type" from Matthew
Wilcox. Increases the number of bits available in page_type!
- "Simplify the page flags a little" from Matthew Wilcox. Many legacy
page flags are now folio flags, so the page-based flags and their
accessors/mutators can be removed.
- "mm: store zero pages to be swapped out in a bitmap" from Usama
Arif. An optimization which permits us to avoid writing/reading
zero-filled zswap pages to backing store.
- "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
window which occurs when a MAP_FIXED operqtion is occurring during
an unrelated vma tree walk.
- "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
the vma_merge() functionality, making ot cleaner, more testable and
better tested.
- "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
Minor fixups of DAMON selftests and kunit tests.
- "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
Code cleanups and folio conversions.
- "Shmem mTHP controls and stats improvements" from Ryan Roberts.
Cleanups for shmem controls and stats.
- "mm: count the number of anonymous THPs per size" from Barry Song.
Expose additional anon THP stats to userspace for improved tuning.
- "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
folio conversions and removal of now-unused page-based APIs.
- "replace per-quota region priorities histogram buffer with
per-context one" from SeongJae Park. DAMON histogram
rationalization.
- "Docs/damon: update GitHub repo URLs and maintainer-profile" from
SeongJae Park. DAMON documentation updates.
- "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
improve related doc and warn" from Jason Wang: fixes usage of page
allocator __GFP_NOFAIL and GFP_ATOMIC flags.
- "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
This was overprovisioning THPs in sparsely accessed memory areas.
- "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
Add support for zram run-time compression algorithm tuning.
- "mm: Care about shadow stack guard gap when getting an unmapped
area" from Mark Brown. Fix up the various arch_get_unmapped_area()
implementations to better respect guard areas.
- "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
of mem_cgroup_iter() and various code cleanups.
- "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
pfnmap support.
- "resource: Fix region_intersects() vs add_memory_driver_managed()"
from Huang Ying. Fix a bug in region_intersects() for systems with
CXL memory.
- "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
a couple more code paths to correctly recover from the encountering
of poisoned memry.
- "mm: enable large folios swap-in support" from Barry Song. Support
the swapin of mTHP memory into appropriately-sized folios, rather
than into single-page folios"
* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
zram: free secondary algorithms names
uprobes: turn xol_area->pages[2] into xol_area->page
uprobes: introduce the global struct vm_special_mapping xol_mapping
Revert "uprobes: use vm_special_mapping close() functionality"
mm: support large folios swap-in for sync io devices
mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
set_memory: add __must_check to generic stubs
mm/vma: return the exact errno in vms_gather_munmap_vmas()
memcg: cleanup with !CONFIG_MEMCG_V1
mm/show_mem.c: report alloc tags in human readable units
mm: support poison recovery from copy_present_page()
mm: support poison recovery from do_cow_fault()
resource, kunit: add test case for region_intersects()
resource: make alloc_free_mem_region() works for iomem_resource
mm: z3fold: deprecate CONFIG_Z3FOLD
vfio/pci: implement huge_fault support
mm/arm64: support large pfn mappings
mm/x86: support large pfn mappings
...
The violation of atomicity occurs when the drbd_uuid_set_bm function is
executed simultaneously with modifying the value of
device->ldev->md.uuid[UI_BITMAP]. Consider a scenario where, while
device->ldev->md.uuid[UI_BITMAP] passes the validity check when its
value is not zero, the value of device->ldev->md.uuid[UI_BITMAP] is
written to zero. In this case, the check in drbd_uuid_set_bm might refer
to the old value of device->ldev->md.uuid[UI_BITMAP] (before locking),
which allows an invalid value to pass the validity check, resulting in
inconsistency.
To address this issue, it is recommended to include the data validity
check within the locked section of the function. This modification
ensures that the value of device->ldev->md.uuid[UI_BITMAP] does not
change during the validation process, thereby maintaining its integrity.
This possible bug is found by an experimental static analysis tool
developed by our team. This tool analyzes the locking APIs to extract
function pairs that can be concurrently executed, and then analyzes the
instructions in the paired functions to identify possible concurrency
bugs including data races and atomicity violations.
Fixes: 9f2247bb9b ("drbd: Protect accesses to the uuid set with a spinlock")
Cc: stable@vger.kernel.org
Signed-off-by: Qiu-ji Chen <chenqiuji666@gmail.com>
Reviewed-by: Philipp Reisner <philipp.reisner@linbit.com>
Link: https://lore.kernel.org/r/20240913083504.10549-1-chenqiuji666@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmbm9fQeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGXcwH/A8+IXnrGv+VzYgD
+mE4hgGGHt4dClcUZ31gQetkkT6xktEVp6pB6JkFO7oEgBiTkJBbYGl6VZtsAIOd
Fi3jic8ik0uhZLFcxDJcHTceh6Pw8bkhWoh0tkF3bkDRwbppJdG7Khyk8DxTl24w
ldqh9om2cC7w9IPVx93xTgKgMMZ63qiJyUdTvxEZI3BG8F70smlgZSPskLp2Iktd
FIJZPcyKM0bhJYwZOpXK0vx5C2cA4oIW4xriHUw4aklv646OBxNKevB2JJAft2uA
6LyvuLgnYn/OpdFGZ8slvdmhm6hLWft5B1/bWKorUkz7p5YGiySFzpkMVAkNJ6mS
cRwHJNc=
=flw3
-----END PGP SIGNATURE-----
Merge tag 'v6.11' into for-6.12/block
Merge in 6.11 final to get the fix for preventing deadlocks on an
elevator switch, as there's a fixup for that patch.
* tag 'v6.11': (1788 commits)
Linux 6.11
Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
pinctrl: pinctrl-cy8c95x0: Fix regcache
cifs: Fix signature miscalculation
mm: avoid leaving partial pfn mappings around in error case
drm/xe/client: add missing bo locking in show_meminfo()
drm/xe/client: fix deadlock in show_meminfo()
drm/xe/oa: Enable Xe2+ PES disaggregation
drm/xe/display: fix compat IS_DISPLAY_STEP() range end
drm/xe: Fix access_ok check in user_fence_create
drm/xe: Fix possible UAF in guc_exec_queue_process_msg
drm/xe: Remove fence check from send_tlb_invalidation
drm/xe/gt: Remove double include
net: netfilter: move nf flowtable bpf initialization in nf_flow_table_module_init()
PCI: Fix potential deadlock in pcim_intx()
workqueue: Clear worker->pool in the worker thread context
net: tighten bad gso csum offset check in virtio_net_hdr
netlink: specs: mptcp: fix port endianness
net: dpaa: Pad packets to ETH_ZLEN
mptcp: pm: Fix uaf in __timer_delete_sync
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkZhQQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjOKD/0fzd4yOcqxSI9W3OLGd04VrOTJIQa4CRbV
GmoTq39pOeIDVGug5ekkTpqqHHnuGk+nQhCzD9vsN/eTmC7yZOIr847O2aWzvYEn
PzFRgmJpoo2E9sr/IsTR5LnJjbaIZhQVkqLH6ZOj9tpKlVwN2SK0nIRVNrAi5zgT
MaDrto/2OUld+vmA99Rgb23jxM6UBdCPIjuiVa+11Vg9Z3D1tWbBmrsG7OMysyIf
FbASBeKHqFSO61/ipFCZv6VV1X8zoWEVyT8n4A1yUbbN5rLzPgoQJVbfSqQRXIdr
cdrKeCbKxl+joSgKS6LKpvnfwRgGF+hgAfpZg4c0vrbZGTQcRhhLFECyh/aVI08F
p5TOMArhVaX59664gHgSPq4KnGTXOO29dot9N3Jya/ZQnxinjY9r+GVOfLuduPPy
1B04vab8oAsk4zK7fZbkDxgYUyifwzK/vQ6OqYq2mYdpdIS/AE7T2ou61Bz5mI7I
/BuucNV0Z96OKlyLEXwXXZjZgNu1TFcq6ARIBJ8L08PY64Fesj5BXabRyXkeNH26
0exyz9heeJs6OwRGfngXmS24tDSS0k74CeZX3KoePNj69u6KCn346KiU1qgntwwD
E5F7AEHqCl5FjUEIWB4M1EPlfA8U0MzOL+tkx2xKJAjsU60wAy7jRSyOIcqodpMs
6UlPcJzgYg==
=uuLl
-----END PGP SIGNATURE-----
Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- MD changes via Song:
- md-bitmap refactoring (Yu Kuai)
- raid5 performance optimization (Artur Paszkiewicz)
- Other small fixes (Yu Kuai, Chen Ni)
- Add a sysfs entry 'new_level' (Xiao Ni)
- Improve information reported in /proc/mdstat (Mateusz Kusiak)
- NVMe changes via Keith:
- Asynchronous namespace scanning (Stuart)
- TCP TLS updates (Hannes)
- RDMA queue controller validation (Niklas)
- Align field names to the spec (Anuj)
- Metadata support validation (Puranjay)
- A syntax cleanup (Shen)
- Fix a Kconfig linking error (Arnd)
- New queue-depth quirk (Keith)
- Add missing unplug trace event (Keith)
- blk-iocost fixes (Colin, Konstantin)
- t10-pi modular removal and fixes (Alexey)
- Fix for potential BLKSECDISCARD overflow (Alexey)
- bio splitting cleanups and fixes (Christoph)
- Deal with folios rather than rather than pages, speeding up how the
block layer handles bigger IOs (Kundan)
- Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike)
- Reduce zoned device overhead in ublk (Ming)
- Add and use sendpages_ok() for drbd and nvme-tcp (Ofir)
- Fix regression in partition error pointer checking (Riyan)
- Add support for write zeroes and rotational status in nbd (Wouter)
- Add Yu Kuai as new BFQ maintainer. The scheduler has been
unmaintained for quite a while.
- Various sets of fixes for BFQ (Yu Kuai)
- Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail,
Yang)
* tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits)
nvme-pci: qdepth 1 quirk
block: fix potential invalid pointer dereference in blk_add_partition
blk_iocost: make read-only static array vrate_adj_pct const
block: unpin user pages belonging to a folio at once
mm: release number of pages of a folio
block: introduce folio awareness and add a bigger size from folio
block: Added folio-ized version of bio_add_hw_page()
block, bfq: factor out a helper to split bfqq in bfq_init_rq()
block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
block, bfq: remove local variable 'split' in bfq_init_rq()
block, bfq: remove bfq_log_bfqg()
block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
block, bfq: fix procress reference leakage for bfqq in merge chain
block, bfq: fix uaf for accessing waker_bfqq after splitting
blk-throttle: support prioritized processing of metadata
blk-throttle: remove last_low_overflow_time
drbd: Add NULL check for net_conf to prevent dereference in state validation
nvme-tcp: fix link failure for TCP auth
blk-mq: add missing unplug trace event
mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
...
If the net_conf pointer is NULL and the code attempts to access its
fields without a check, it will lead to a null pointer dereference.
Add a NULL check before dereferencing the pointer.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 44ed167da7 ("drbd: rcu_read_lock() and rcu_dereference() for tconn->net_conf")
Cc: stable@vger.kernel.org
Signed-off-by: Mikhail Lobanov <m.lobanov@rosalinux.ru>
Link: https://lore.kernel.org/r/20240909133740.84297-1-m.lobanov@rosalinux.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>
recompress device attribute supports alg=NAME parameter so that we can
specify only one particular algorithm we want to perform recompression
with. However, with algo params we now can have several exactly same
secondary algorithms but each with its own params tuning (e.g. priority 1
configured to use more aggressive level, and priority 2 configured to use
a pre-trained dictionary). Support priority=NUM parameter so that we can
correctly determine which secondary algorithm we want to use.
Link: https://lkml.kernel.org/r/20240902105656.1383858-25-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Support pre-trained dictionary param. Just like lz4, lz4hc doesn't
mandate specific format of the dictionary and zstd --train can be used to
train a dictionary for lz4, according to [1].
TEST
====
*** lz4hc
/sys/block/zram0/mm_stat
1750638592 608954620 621031424 0 621031424 1 0 34288 34288
*** lz4hc dict=/etc/lz4-dict-amd64
/sys/block/zram0/mm_stat
1750671360 505068582 514994176 0 514994176 1 0 34278 34278
[1] https://github.com/lz4/lz4/issues/557
Link: https://lkml.kernel.org/r/20240902105656.1383858-22-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Support pre-trained dictionary param. lz4 doesn't mandate specific format
of the dictionary and even zstd --train can be used to train a dictionary
for lz4, according to [1].
TEST
====
*** lz4
/sys/block/zram0/mm_stat
1750654976 664188565 676864000 0 676864000 1 0 34288 34288
*** lz4 dict=/etc/lz4-dict-amd64
/sys/block/zram0/mm_stat
1750638592 619891141 632053760 0 632053760 1 0 34278 34278
*** lz4 level=5 dict=/etc/lz4-dict-amd64
/sys/block/zram0/mm_stat
1750638592 727174243 740810752 0 740810752 1 0 34437 34437
[1] https://github.com/lz4/lz4/issues/557
Link: https://lkml.kernel.org/r/20240902105656.1383858-21-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Immutable params never change once comp has been allocated and setup, so
we don't need to store multiple copies of them in each per-CPU backend
context. Move those to per-comp zcomp_params and pass it to backends
callbacks for requests execution. Basically, this means parameters
sharing between different contexts.
Also introduce two new backends callbacks: setup_params() and
release_params(). First, we need to validate params in a driver-specific
way; second, driver may want to allocate its specific representation of
the params which is needed to execute requests.
Link: https://lkml.kernel.org/r/20240902105656.1383858-20-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keep run-time driver data (scratch buffers, etc.) in zcomp_ctx structure.
This structure is allocated per-CPU because drivers (backends) need to
modify its content during requests execution.
We will split mutable and immutable driver data, this is a preparation
path.
Link: https://lkml.kernel.org/r/20240902105656.1383858-19-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Encapsulate compression/decompression data in zcomp_req structure.
Link: https://lkml.kernel.org/r/20240902105656.1383858-18-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Handle dict=path algorithm param so that we can read a pre-trained
compression algorithm dictionary which we then pass to the backend
configuration.
Link: https://lkml.kernel.org/r/20240902105656.1383858-17-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This attribute is used to setup compression algorithms' parameters, so we
can tweak algorithms' characteristics. At this point only 'level' is
supported (to be extended in the future).
Each call sets up parameters for one particular algorithm, which should be
specified either by the algorithm's priority or algo name. This is
expected to be called after corresponding algorithm is selected via
comp_algorithm or recomp_algorithm.
echo "priority=0 level=1" > /sys/block/zram0/algorithm_params
or
echo "algo=zstd level=1" > /sys/block/zram0/algorithm_params
Link: https://lkml.kernel.org/r/20240902105656.1383858-16-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zstd compression params depends on level, but are constant for a given
instance of zstd compression backend. Calculate once (during ctx
creation).
Link: https://lkml.kernel.org/r/20240902105656.1383858-15-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We will store a per-algorithm parameters there (compression level,
dictionary, dictionary size, etc.).
Link: https://lkml.kernel.org/r/20240902105656.1383858-14-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Make sure that backends array has anything apart from the sentinel NULL
value.
We also select LZO_BACKEND if none backends were selected.
Link: https://lkml.kernel.org/r/20240902105656.1383858-13-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zram works with PAGE_SIZE buffers, so we always know exact size of the
source buffer and hence can pass estimated_src_size to zstd_get_params().
This hint on x86_64, for example, reduces the size of the work memory
buffer from 1303520 bytes down to 90080 bytes. Given that compression
streams are per-CPU that's quite some memory saving.
Link: https://lkml.kernel.org/r/20240902105656.1383858-10-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Moving to custom backends implementation gives us ability to have our own
minimalistic and extendable API, and algorithms tunings becomes possible.
The list of compression backends is empty at this point, we will add
backends in the followup patches.
Link: https://lkml.kernel.org/r/20240902105656.1383858-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since the debugfs_create_dir() never returns a null pointer, checking
the return value for a null pointer is redundant. Since
debugfs_create_file() can deal with a ERR_PTR() style pointer, drop
the check. Since mtip_hw_debugfs_init does not pay attention to the
return value, its return type can be changed to void.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20240907034046.3595268-1-lizetao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbbAicQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqu1D/48quZBeoRXnXoNdAEUIQ36I9f33JzbwqnD
bGW0++IwV7ZGrKSuy/IQuoosZhhM8ZyyQxha9JontLse0XlZKzXeN0hT3aovngmy
GP9Nw+EXfcP7cDYjUxrGasbsAnl5p9Orojc1IQms43tAb6KCI4xVunYzkJ9rqcNq
5Ik8FY46c28ivqlYFjj17NVDZE29AEgBxFa3jGYVusmnUpPkqysl4iQJdMeRAaIQ
+UaDJst2ZVtDqqD20gvVL/p9xR2JnPoiv3evSXStkpDf7kp+WXC+51t9mcz6TX+b
OwmnnrwkZuJ9jTzEsE39fDwWyfUr83UKR5jXKI/4SdAkq+GeTHZAyHuavSm5v4ym
+pITxruaoEL3eVfC6wlXTrsnEXLn5b402Bv8tjT+Q8HBHXRuC4soCGN3sdbFZj5E
ETS5TNkGbVcYv4mye75QDBkG0DBXXEc9zkM5PIEX8A/iLwNq10QzfAFHwyhydlvt
mpOrf4M9IvMtHYejmouAmfd/00gHfPxboGLObtNjdZD65HGJP99PRtURj+Ng/gDd
XbHSX8hSKLhQd8ReQojSasounm0dc+FFaMUFprS96g36G12zq2c3dAaLXTPZeOM3
JQIu6707SqmxEFiwJ86QT/nvEO6JgBgQy48Y5r8i0++w4Qb4UbfLnAQXOJXdI3nm
0uYdUCeYMQ==
=BwIz
-----END PGP SIGNATURE-----
Merge tag 'block-6.11-20240906' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Mostly just some fixlets for NVMe, but also a bug fix for the ublk
driver and an integrity fix"
* tag 'block-6.11-20240906' of git://git.kernel.dk/linux:
bio-integrity: don't restrict the size of integrity metadata
ublk_drv: fix NULL pointer dereference in ublk_ctrl_start_recovery()
nvmet: Identify-Active Namespace ID List command should reject invalid nsid
nvme: set BLK_FEAT_ZONED for ZNS multipath disks
nvme-pci: Add sleep quirk for Samsung 990 Evo
nvme-pci: allocate tagset on reset if necessary
nvmet-tcp: fix kernel crash if commands allocation fails
nvme: use better description for async reset reason
nvmet: Make nvmet_debugfs static
The zram_table_entry::flags member is of type long and uses 8 bytes on a
64bit architecture. With a PAGE_SIZE of 256KiB we have PAGE_SHIFT of 18
which in turn leads to __NR_ZRAM_PAGEFLAGS = 27. This still fits in an
ordinary integer.
By reducing the size of `flags' to four bytes, the size of the struct
goes back to 16 bytes. The padding between the lock and ac_time (if
enabled) is also gone.
Make zram_table_entry::flags an unsigned int and update the build test
to reflect the change.
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-4-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The ZRAM_LOCK was used for locking and after the addition of spinlock_t
the bit set and cleared but there no reader of it.
Remove the ZRAM_LOCK bit.
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-3-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bit spinlock disables preemption. The spinlock_t lock becomes a sleeping
lock on PREEMPT_RT and it can not be acquired in this context. In this locked
section, zs_free() acquires a zs_pool::lock, and there is access to
zram::wb_limit_lock.
Add a spinlock_t for locking. Keep the set/ clear ZRAM_LOCK bit after
the lock has been acquired/ dropped. The size of struct zram_table_entry
increases by 4 bytes due to lock and additional 4 bytes padding with
CONFIG_ZRAM_TRACK_ENTRY_ACTIME enabled.
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-2-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The version of the NBD protocol implemented by the kernel driver
currently has a 32 bit field for length values. As the NBD protocol uses
bytes as a unit of length, length values larger than 2^32 bytes cannot
be expressed.
Update the max_hw_discard_sectors field to match that.
Signed-off-by: Wouter Verhelst <w@uter.be>
Fixes: 268283244c ("nbd: use the atomic queue limits API in nbd_set_size")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Eric Blake <eblake@redhat.Com>
Link: https://lore.kernel.org/r/20240812133032.115134-8-w@uter.be
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The NBD protocol defines a message for zeroing out a region of an export
Add support to the kernel driver for that message.
Signed-off-by: Wouter Verhelst <w@uter.be>
Cc: Eric Blake <eblake@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240812133032.115134-3-w@uter.be
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When two UBLK_CMD_START_USER_RECOVERY commands are submitted, the
first one sets 'ubq->ubq_daemon' to NULL, and the second one triggers
WARN in ublk_queue_reinit() and subsequently a NULL pointer dereference
issue.
Fix it by adding the check in ublk_ctrl_start_recovery() and return
immediately in case of zero 'ub->nr_queues_ready'.
BUG: kernel NULL pointer dereference, address: 0000000000000028
RIP: 0010:ublk_ctrl_start_recovery.constprop.0+0x82/0x180
Call Trace:
<TASK>
? __die+0x20/0x70
? page_fault_oops+0x75/0x170
? exc_page_fault+0x64/0x140
? asm_exc_page_fault+0x22/0x30
? ublk_ctrl_start_recovery.constprop.0+0x82/0x180
ublk_ctrl_uring_cmd+0x4f7/0x6c0
? pick_next_task_idle+0x26/0x40
io_uring_cmd+0x9a/0x1b0
io_issue_sqe+0x193/0x3f0
io_wq_submit_work+0x9b/0x390
io_worker_handle_work+0x165/0x360
io_wq_worker+0xcb/0x2f0
? finish_task_switch.isra.0+0x203/0x290
? finish_task_switch.isra.0+0x203/0x290
? __pfx_io_wq_worker+0x10/0x10
ret_from_fork+0x2d/0x50
? __pfx_io_wq_worker+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Fixes: c732a852b4 ("ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support")
Reported-and-tested-by: Changhui Zhong <czhong@redhat.com>
Closes: https://lore.kernel.org/all/CAGVVp+UvLiS+bhNXV-h2icwX1dyybbYHeQUuH7RYqUvMQf6N3w@mail.gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/r/20240904031348.4139545-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that we have an extra 8 bits, we don't need to limit ourselves to
supporting a 64KiB page size. I'm sure both Hexagon users are grateful,
but it does reduce complexity a little. We can also remove
reset_first_obj_offset() as calling __ClearPageZsmalloc() will now reset
all 32 bits of page_type.
Link: https://lkml.kernel.org/r/20240821173914.2270383-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If request timetout is handled by nbd_requeue_cmd(), normal completion
has to be stopped for avoiding to complete this requeued request, other
use-after-free can be triggered.
Fix the race by clearing NBD_CMD_INFLIGHT in nbd_requeue_cmd(), meantime
make sure that cmd->lock is grabbed for clearing the flag and the
requeue.
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Fixes: 2895f1831e ("nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240830034145.1827742-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bio->bi_iter.bi_size is updated when bio_add_page() is called. So we
do not need to assign msg->bi_size again to it, since its redudant and
can also be harmful. Instead we can use it to add a sanity check, which
checks the locally calculated bi_size, with the one sent in msg.
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://lore.kernel.org/r/20240809135346.978320-1-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the debugfs_create_dir() error check. It's safe to pass in error
pointers to the debugfs API, hence the user isn't supposed to include
error checking of the return values.
Signed-off-by: Yang Ruibin <11162571@vivo.com>
Link: https://lore.kernel.org/r/20240827022741.3410294-1-11162571@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ublk zoned takes 16 bytes in each request pdu just for handling REPORT_ZONE
operation, this way does waste memory since request pdu is allocated
statically.
Store the transient zone report data into one global xarray, and remove
it after the report zone request is completed. This way is reasonable
since report zone is run in slow code path.
Fixes: 29802d7ca3 ("ublk: enable zoned storage support")
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240812013624.587587-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>