After commit 7be835694d ("block: fix that util can be greater than
100%"), it's not used and can be removed.
Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgeCdgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprW3EACoZmPgLRnt19f/u+veGKnJknGMkxgyph44
MryIe6kmJPnZ8CUAq+RloHtObAsMDbNmvcfwvp4Aeg1he7D6yNgCTHyamGRhtwR0
ydGS/eawYUibp5757TIegj26jZOEwWKkVaytX60N0rnRLLkr6IFXgA9JBhxpkF95
YYDW5+6nr7Z95ZNprSpvA5Q59tbv+HAGoKX2mmmj6sWwXCwx9eHEK0//bnOe6HzA
3mYGpsFOIgOZ/DFj33jscSygsuyQRevh/i4TxuNZjDm+AD9TZZwAbspxKPVml+pT
aVQ6oRQoGzcPDaoMTTUAwxrMmMaVI3oDFibTDofldDjRbvPI96wR4Cs1orCBHrsE
5xOynjyp6Yrr+RBCbA0i6vIdhb5P/oDsXQ4/qxLgdz46aWwXl7kxd6JhdjDXrmXm
3k9Kou/GHNGbB2qhYU+qASRAWj+hieEEgS31ppCu44q0hh8pbynlUCnl3Di0hHsJ
CIgjzOtdLb9sCJm68mpZGCkx71B2RX2TTZRhtCUHZNuAr2vqynPcza2+gtTIYJkN
uxtYeguKwKzWTJkB5l4hvscH8GVAKY4hFpILT7CjAMs6Rn/i9mCtnuWlDdGbxQdb
JQjxbKM5C/XNdcMG/3oDWaTrg1beFPSk87N9dCNx2A87zLxtLi71OpIScPFdlP9y
XN6QgJbh2g==
=sdd1
-----END PGP SIGNATURE-----
Merge tag 'block-6.15-20250509' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix for a regression in this series for loop and read/write iterator
handling
- zone append block update tweak
- remove a broken IO priority test
- NVMe pull request via Christoph:
- unblock ctrl state transition for firmware update (Daniel
Wagner)
* tag 'block-6.15-20250509' of git://git.kernel.dk/linux:
block: remove test of incorrect io priority level
nvme: unblock ctrl state transition for firmware update
block: only update request sector if needed
loop: Add sanity check for read/write_iter
Ever since commit eca2040972b4("scsi: block: ioprio: Clean up interface
definition"), the macro IOPRIO_PRIO_LEVEL() will mask the level value to
something between 0 and 7 so necessarily, level will always be lower than
IOPRIO_NR_LEVELS(8).
Remove this obsolete check.
Reported-by: Kexin Wei <ys.weikexin@h3c.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250508083018.GA769554@bytedance
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When blk_unregister_queue() is called from add_disk() failure path,
there is race in registering/unregistering elevator queue kobject
from the two code paths, because commit 559dc11143 ("block: move
elv_register[unregister]_queue out of elevator_lock") moves elevator
queue register/unregister out of elevator lock.
Fix the race by removing elevator after deleting disk->queue_kobj,
because kobject_del(&disk->queue_kobj) drains in-progress sysfs
show()/store() of all attributes.
Fixes: 559dc11143 ("block: move elv_register[unregister]_queue out of elevator_lock")
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250508085807.3175112-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_freeze_queue() can't be called on quiesced queue, otherwise it may
never return if there is any queued requests.
Fix it by removing quiesce queue around elevator_set_none() because
elevator_switch() does quiesce queue in case that we need to switch
to none really.
Fixes: 1e44bedbc9 ("block: unifying elevator change")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250508085807.3175112-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
XFS will be able to support large atomic writes (atomic write > 1x block)
in future. This will be achieved by using different operating methods,
depending on the size of the write.
Specifically a new method of operation based in FS atomic extent remapping
will be supported in addition to the current HW offload-based method.
The FS method will generally be appreciably slower performing than the
HW-offload method. However the FS method will be typically able to
contribute to achieving a larger atomic write unit max limit.
XFS will support a hybrid mode, where HW offload method will be used when
possible, i.e. HW offload is used when the length of the write is
supported, and for other times FS-based atomic writes will be used.
As such, there is an atomic write length at which the user may experience
appreciably slower performance.
Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt.
When zero, it means that there is no such performance boundary.
Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is
ok for older kernels which don't support this new field, as they would
report 0 in this field (from zeroing in cp_statx()) already. Furthermore
those older kernels don't support large atomic writes - apart from block
fops, but there would be consistent performance there for atomic writes
in range [unit min, unit max].
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Rewrite bio_map_kern using the new bio_add_* helpers and drop the
kerneldoc comment that is superfluous for an internal helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
That way the bio can be allocated with the right operation already
set and there is no need to pass the separated 'reading' argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the q argument from blk_rq_map_kern and the internal helpers
called by it as the queue can trivially be derived from the request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to add a vmalloc region to a bio, abstracting away the
vmalloc addresses from the underlying pages and another one wrapping
it for the simple case where all data fits into a single bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to perform synchronous I/O on a kernel direct map range.
Currently this is implemented in various places in usually not very
efficient ways, so provide a generic helper instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to add a directly mapped kernel virtual address to a
bio so that callers don't have to convert to pages or folios.
For now only the _nofail variant is provided as that is what all the
obvious callers want.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Export blk_crypto_derive_sw_secret(), blk_crypto_import_key(),
blk_crypto_generate_key(), and blk_crypto_prepare_key() so that they can
be used by device-mapper when passing through wrapped key support.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Use the per-kiocb write stream if provided, or map temperature hints to
write streams (which is a bit questionable, but this shows how it is
done).
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[kbusch: removed statx reporting]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-6-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Export the granularity that write streams should be discarded with,
as it is essential for making good use of them.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-5-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drivers with hardware that support write streams need a way to export how
many are available so applications can generically query this.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
[hch: renamed hints to streams, removed stacking]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-4-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the ability to pass a write stream for placement control in the bio.
The new field fits in an existing hole, so does not change the size of
the struct.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of a ZONE APPEND write, regardless of native ZONE APPEND or the
emulation layer in the zone write plugging code, the sector the data got
written to by the device needs to be updated in the bio.
At the moment, this is done for every native ZONE APPEND write and every
request that is flagged with 'BIO_ZONE_WRITE_PLUGGING'. But thus
superfluously updates the sector for regular writes to a zoned block
device.
Check if a bio is a native ZONE APPEND write or if the bio is flagged as
'BIO_EMULATES_ZONE_APPEND', meaning the block layer's zone write plugging
code handles the ZONE APPEND and translates it into a regular write and
back. Only if one of these two criterion is met, update the sector in the
bio upon completion.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/dea089581cb6b777c1cd1500b38ac0b61df4b2d1.1746530748.git.jth@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of a ZONE APPEND write, regardless of native ZONE APPEND or the
emulation layer in the zone write plugging code, the sector the data got
written to by the device needs to be updated in the bio.
At the moment, this is done for every native ZONE APPEND write and every
request that is flagged with 'BIO_ZONE_WRITE_PLUGGING'. But thus
superfluously updates the sector for regular writes to a zoned block
device.
Check if a bio is a native ZONE APPEND write or if the bio is flagged as
'BIO_EMULATES_ZONE_APPEND', meaning the block layer's zone write plugging
code handles the ZONE APPEND and translates it into a regular write and
back. Only if one of these two criterion is met, update the sector in the
bio upon completion.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/dea089581cb6b777c1cd1500b38ac0b61df4b2d1.1746530748.git.jth@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
scheduler's ->exit() is called with queue frozen and elevator lock is held, and
wbt_enable_default() can't be called with queue frozen, otherwise the
following lockdep warning is triggered:
#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
#5 (&eq->sysfs_lock){+.+.}-{4:4}:
#4 (&q->elevator_lock){+.+.}-{4:4}:
#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
#2 (fs_reclaim){+.+.}-{0:0}:
#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
#0 (&q->debugfs_mutex){+.+.}-{4:4}:
Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
call it from elevator_change_done().
Meantime add disk->rqos_state_mutex for covering wbt state change, which
matches the purpose more than ->elevator_lock.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-26-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move hctx cpuhp add/del out of queue freezing for not connecting freeze
lock with cpuhp locks, then lockdep warning can be avoided.
This way is safe because both needn't queue to be frozen and scheduler
switch isn't allowed, with same reason for moving hctx debugfs/sysfs
register out of queue freeze.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-25-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are called before
the request queue is added to tagset list, so the two won't run concurrently
with blk_mq_update_nr_hw_queues().
When the two functions are only called from queue initialization or
blk_mq_update_nr_hw_queues(), elevator switch can't happen.
So remove ->elevator_lock uses from the two functions.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-24-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move hctx debugfs/sysfs register out of freezing queue in
__blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
can be killed:
#2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
#1 (fs_reclaim){+.+.}-{0:0}:
#0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
And registering/un-registering hctx debugfs/sysfs does not require queue to
be frozen:
- hctx sysfs attributes show() are drained when removing kobject, and
there isn't store() implementation for hctx sysfs attributes
- debugfs entry read() is drained too when removing debugfs directory,
and there isn't write() implementation for hctx debugfs too
- so it is safe to register/unregister hctx sysfs/debugfs without
freezing queue because the cod paths changes nothing, and we just
need to keep hctx live
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-23-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing,
so we can kill many lockdep warnings.
elv_register[unregister]_queue() is serialized, and just dealing with sysfs/
debugfs things, no need to be done with queue frozen:
- when it is called from adding disk, elevator switch isn't possible
because ->queue_kobj isn't added yet
- when it is called from deleting disk, disable_elv_switch() is
responsible for preventing new elevator switch and draining old
elevator switch.
- when it is called from blk_mq_update_nr_hw_queues(), adding/removing
disk and elevator switch can't be allowed or in-progress
With this change, elevator's ->exit() is called before calling
elv_unregister_queue, then user may call into ->show()/store() of elevator's
sysfs attributes, and we have covered this issue by adding `ELEVATOR_FLAG_DYNG`.
For blk-mq debugfs, hctx->sched_tags is always checked with ->elevator_lock by
debugfs code, meantime hctx->sched_tags is updated with ->elevator_lock, so
there isn't such issue.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-22-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add new helper disable_elv_switch() and new flag QUEUE_FLAG_NO_ELV_SWITCH
for disabling elevator switch before deleting disk:
- originally flag QUEUE_FLAG_REGISTERED is added for preventing elevator
switch during removing disk, but this flag has been used widely for
other purposes, so add one new flag for disabling elevator switch only
- for avoiding deadlock risk, we have to move elevator queue
register/unregister out of elevator lock and queue freeze, which will be
done in next patch. However, this way adds small race window between elevator
switch and deleting ->queue_kobj, in which elevator queue register/unregister
could be run concurrently. The added helper will be used for avoiding the race
in the following patch.
- drain in-progress elevator switch before deleting disk
Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-21-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Prepare for moving elv_register[unregister]_queue out of elevator_lock
& queue freezing, so we may have to call elv_unregister_queue() after
elevator ->exit() is called, then there is small window for user to
call into ->show()/store(), and user-after-free can be caused.
Fail to show/store elevator sysfs attribute if elevator is dying by
adding one new flag of ELEVATOR_FLAG_DYNG, which is protected by
elevator ->sysfs_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-20-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
elevatore queue's type is assigned since its allocation, and never
get cleared until it is released.
So its ->type is always not NULL, remove the unnecessary check.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-19-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass elevator_queue reference to elv_register_queue() & elv_unregister_queue().
No functional change, and prepare for moving the two out of elevator
lock & freezing queue, when we need to store the old & new elevator
queue in `struct elv_change_ctx` instance, then both two can co-exist
for short while, so we have to pass the exact elevator_queue instance
to elv_register_queue & unregister_queue.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-18-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Elevator change is one well-define behavior:
- tear down current elevator if it exists
- setup new elevator
It is supposed to cover any case for changing elevator by single
internal API, typically the following cases:
- setup default elevator in add_disk()
- switch to none in del_disk()
- reset elevator in blk_mq_update_nr_hw_queues()
- switch elevator in sysfs `store` elevator attribute
This patch uses elevator_change() to cover all above cases:
- every elevator switch is serialized with each other: add_disk/del_disk/
store elevator is serialized already, blk_mq_update_nr_hw_queues() uses
srcu for syncing with the other three cases
- for both add_disk()/del_disk(), queue freeze works at atomic mode
or has been froze, so the freeze in elevator_change() won't add extra
delay
- `struct elev_change_ctx` instance holds any info for changing elevator
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-17-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add `struct elv_change_ctx` and prepare for unifying elevator change by
elevator_change(). With this way, any input & output parameter can
be provided & observed in top helper.
This way helps to move kobject add/delete & debugfs register/unregister
out of ->elevator_lock & freezing queue.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-16-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move queue freezing & elevator_lock into elevator_change(), and prepare
for using elevator_change() for setting up & tearing down default elevator
too.
Also add lockdep_assert_held() in __elevator_change() because either
read or write lock is required for changing elevator.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-15-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In blk_mq_update_nr_hw_queues(), nr_hw_queues changes and elevator data
depends on it, and elevator has to be reattached, so call elevator_switch()
to force attachment.
Add elv_update_nr_hw_queues() simply for blk_mq_update_nr_hw_queues() to
reattach elevator, since elevator switch isn't likely when running
blk_mq_update_nr_hw_queues(). This way removes the current switch
none and switch back code.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-14-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move blk_queue_registered() check into elv_iosched_store() and prepare
for using elevator_change() for covering any kind of elevator change in
adding/deleting disk and updating nr_hw_queue.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-13-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This removes duplicate code, and keeps the callers tidy.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-12-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
That makes the function nicely self-contained and can be used
to avoid code duplication.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-11-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both adding/deleting disk code are reader of `nr_hw_queues`, so we can't
allow them in-progress when updating nr_hw_queues, kernel panic and
kasan has been reported in [1].
Prevent adding/deleting disk during updating nr_hw_queues by adding
rw_semaphore to tagset, write lock is grabbed in blk_mq_update_nr_hw_queues(),
and read lock is acquired when adding/deleting disk.
Also mark GFP_NOIO allocation scope for adding/deleting disk because
blk_mq_update_nr_hw_queues() is part of some driver's error handler.
This way avoids lot of trouble.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add helper add_disk_final() for scanning partitions, announcing disk and
handling the last thing for adding disk.
No functional change, and prepare for prevent adding disk from happening
when updating nr_hw_queues.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
sched debugfs shares same lifetime with scheduler's kobject, and same
lock(elevator lock), so move sched debugfs register/unregister into
elevator_register_queue() and elevator_unregister_queue().
Then we needn't blk_mq_debugfs_register() for us to register sched
debugfs any more.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add blk_mq_sched_reg_debugfs()/blk_mq_sched_unreg_debugfs() to clean up
sched init/exit code a bit.
Register & unregister debugfs for sched & sched_hctx order is changed a
bit, but it is safe because sched & sched_hctx is guaranteed to be ready
when exporting via debugfs.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use q->elevator with ->elevator_lock held in elv_iosched_show(), since
the local cached elevator reference may become stale after getting
->elevator_lock.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both elevator_switch() and elevator_disable() are only called from the
two code paths, in which queue is guaranteed to be frozen.
So don't call freeze queue in the two functions, also add asserts for
queue freeze.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ELEVATOR_FLAG_DISABLE_WBT is only used by BFQ to disallow wbt when BFQ is
in use. The flag is set in BFQ's init(), and cleared in BFQ's exit().
Making it as request queue flag, so that we can avoid to deal with elevator
switch race. Also it isn't graceful to checking one scheduler flag in
wbt_enable_default().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue(), and publish
this request queue to tagset after everything is setup.
This way is safe because BLK_MQ_F_TAG_QUEUE_SHARED isn't used by
blk_mq_map_swqueue(), and this flag is mainly checked in fast IO code
path.
Prepare for removing ->elevator_lock from blk_mq_map_swqueue() which
is supposed to be called when elevator switch can't be done.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Closes: https://lore.kernel.org/linux-block/567cb7ab-23d6-4cee-a915-c8cdac903ddd@linux.ibm.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now the tg->[bytes/io]_disp type is signed, and calculate_bytes/io_allowed
return type is unsigned. Even if the bps/iops limit is not set to max, the
return value of the function may still exceed INT_MAX or LLONG_MAX, which
can cause overflow in outer variables. In such cases, we can add additional
checks accordingly.
And in throtl_trim_slice(), if the BPS/IOPS limit is set to max, there's
no need to call calculate_bytes/io_allowed(). Introduces the helper
functions throtl_trim_bps/iops to simplifies the process. For cases when
the calculated trim value exceeds INT_MAX (causing an overflow), we reset
tg->[bytes/io]_disp to zero, so return original tg->[bytes/io]_disp because
it is the size that is actually trimmed.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250417132054.2866409-4-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We no longer need carryover_[bytes/ios] in tg, so it is removed. The
related comments about carryover in tg are also merged into
[bytes/io]_disp, and modify other related comments.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250417132054.2866409-3-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In commit 6cc477c368 ("blk-throttle: carry over directly"), the carryover
bytes/ios was be carried to [bytes/io]_disp. However, its update mechanism
has some issues.
In __tg_update_carryover(), we calculate "bytes" and "ios" to represent the
carryover, but the computation when updating [bytes/io]_disp is incorrect.
And if the sq->nr_queued is empty, we may not update tg->[bytes/io]_disp to
0 in tg_update_carryover(). We should set it to 0 in non carryover case.
This patch fixes the issue.
Fixes: 6cc477c368 ("blk-throttle: carry over directly")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250417132054.2866409-2-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer bounce buffering support is unused now, remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use writeback_iter instead of the deprecated write_cache_pages wrapper
in blkdev_writepages. This removes an indirect call per folio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250424082752.1967679-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_flush_plug_list() has a fast path if all requests in the plug
are destined for the same request_queue. It calls ->queue_rqs() with the
whole batch of requests, falling back on ->queue_rq() for any requests
not handled by ->queue_rqs(). However, if the requests are destined for
multiple queues, blk_mq_flush_plug_list() has a slow path that calls
blk_mq_dispatch_list() repeatedly to filter the requests by ctx/hctx.
Each queue's requests are inserted into the hctx's dispatch list under a
spinlock, then __blk_mq_sched_dispatch_requests() takes them out of the
dispatch list (taking the spinlock again), and finally
blk_mq_dispatch_rq_list() calls ->queue_rq() on each request.
Acquiring the hctx spinlock twice and calling ->queue_rq() instead of
->queue_rqs() makes the slow path significantly more expensive. Thus,
batching more requests into a single plug (e.g. io_uring_enter syscall)
can counterintuitively hurt performance by causing the plug to span
multiple queues. We have observed 2-3% of CPU time spent acquiring the
hctx spinlock alone on workloads issuing requests to multiple NVMe
devices in the same io_uring SQE batches.
Add a medium path in blk_mq_flush_plug_list() for plugs that don't have
elevators or come from a schedule, but do span multiple queues. Filter
the requests by queue and call ->queue_rqs()/->queue_rq() on the list of
requests destined to each request_queue.
With this change, we no longer see any CPU time spent in _raw_spin_lock
from blk_mq_flush_plug_list and throughput increases accordingly.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250426011728.4189119-4-csander@purestorage.com
[axboe: fix whitespace damage]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Factor out the logic from blk_mq_flush_plug_list() that calls
->queue_rqs() with a fallback to ->queue_rq() into a helper function
blk_mq_dispatch_queue_requests(). This is in preparation for using this
code with other lists of requests.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250426011728.4189119-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_plug_issue_direct(), __blk_mq_flush_plug_list(), and
blk_mq_dispatch_plug_list() take a struct blk_plug * but only use its
mq_list. Pass the struct rq_list * instead in preparation for calling
them with other lists of requests.
Drop "plug" from the function names as they are no longer plug-specific.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250426011728.4189119-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need two patches inside linux-block tree as dependencies of the patch
which will follow this merge.
Specifically, we need:
block: fix race between set_blocksize and read paths
block: hoist block size validation code to a separate function
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgK7wMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppNwD/46vpEWhwGLeNXxFic5CCNMbUBUl04+Rc9I
p22BY9rp1+7ooXJGJJTGaQAjwTFKP/kxyaQWZrJFXK8t2wYhJ8E2PWPVolk8jsOT
0wSYaF9iW4kw5twcmWq+VqPM+joLGKxkwojDTvz4CiorKrq2J14yHkrtfp81R3d3
rR7VzeggglSxEJAKkIBkbRWtMwTQ6WvImm4uufccI3AwfPJcM3qxSXGqq3wryA0O
PyqFlkOdjDIbNP3Zu0QvqQ0xyefGCyGyAfKEPNEAn1oOpD8Y/SUvdMdlPzA9pJ93
9+8F9pAg6fo8vgBEMavVGNjFOw4OrxNBNL9St3vlz+VMpid+HMyflolLwCTdQXCz
HEZ+H75uwMwh3mskHp5paitdE4Y70tqXW6LWgr/5wXOsl8Lh5p1A7Ll0tP27gUe1
vV1Yh+nwbg5TQ1qi+NmjhUThivT96hop+5nK9p5r7GHSZP1xiJdb7RsQqOhDXBmP
I5sjc5Dny8S9b87ehX6b4VfpTbk3aRVhOaEJ4l6k4dwFFkTRP0ODa0/bWPf3Reb0
4HI2/NYRsdLNyut2896P19RxcpcXz5PKEKvcCt5pAehwv4urIdpLoeNLgMdgwfcc
qbtNbTkV2V/fOKG7pS6yUQmWGR/XXQZDFJ4gCFZemiYYVAXJCqjBB6dliMy8roFz
p5Rc31AEMw==
=s0NF
-----END PGP SIGNATURE-----
Merge tag 'block-6.15-20250424' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix autoloading of drivers from stat*(2)
- Fix losing read-ahead setting one suspend/resume, when a device is
re-probed.
- Fix race between setting the block size and page cache updates.
Includes a helper that a coming XFS fix will use as well.
- ublk cancelation fixes.
- ublk selftest additions and fixes.
- NVMe pull via Christoph:
- fix an out-of-bounds access in nvmet_enable_port (Richard
Weinberger)
* tag 'block-6.15-20250424' of git://git.kernel.dk/linux:
ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd
ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA
block: don't autoload drivers on blk-cgroup configuration
block: don't autoload drivers on stat
block: remove the backing_inode variable in bdev_statx
block: move blkdev_{get,put} _no_open prototypes out of blkdev.h
block: never reduce ra_pages in blk_apply_bdi_limits
selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils
selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx'
selftests: ublk: fix recover test
block: hoist block size validation code to a separate function
block: fix race between set_blocksize and read paths
nvmet: fix out-of-bounds access in nvmet_enable_port
Merge 6.15 block fixes - both to get the fixes causing issues with
XFS testing, but also to make it easier for 6.16 ublk patches to avoid
conflicts.
* block-6.15:
ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd
ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA
block: don't autoload drivers on blk-cgroup configuration
block: don't autoload drivers on stat
block: remove the backing_inode variable in bdev_statx
block: move blkdev_{get,put} _no_open prototypes out of blkdev.h
block: never reduce ra_pages in blk_apply_bdi_limits
selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils
selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx'
selftests: ublk: fix recover test
block: hoist block size validation code to a separate function
block: fix race between set_blocksize and read paths
nvmet: fix out-of-bounds access in nvmet_enable_port
Loading a driver just to configure blk-cgroup doesn't make sense, as that
assumes and already existing device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkdev_get_no_open can trigger the legacy autoload of block drivers. A
simple stat of a block device has not historically done that, so disable
this behavior again.
Fixes: 9abcfbd235 ("block: Add atomic write support for statx")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
backing_inode is only used once, so remove it and update the comment
describing the bdev lookup to be a bit more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These are only to be used by block internal code. Remove the comment
as we grew more users due to reworking block device node opening.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the user increased the read-ahead size through sysfs this value
currently get lost if the device is reprobe, including on a resume
from suspend.
As there is no hardware limitation for the read-ahead size there is
no real need to reset it or track a separate hardware limitation
like for max_sectors.
This restores the pre-atomic queue limit behavior in the sd driver as
sd did not use blk_queue_io_opt and thus never updated the read ahead
size to the value based of the optimal I/O, but changes behavior for
all other drivers. As the new behavior seems useful and sd is the
driver for which the readahead size tweaks are most useful that seems
like a worthwhile trade off.
Fixes: 804e498e04 ("sd: convert to the atomic queue limits API")
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250424082521.1967286-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hoist the block size validation code to bdev_validate_blocksize so that
we can call it from filesystems that don't care about the bdev pagecache
manipulations of set_blocksize.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/174543795720.4139148.840349813093799165.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the new large sector size support, it's now the case that
set_blocksize can change i_blksize and the folio order in a manner that
conflicts with a concurrent reader and causes a kernel crash.
Specifically, let's say that udev-worker calls libblkid to detect the
labels on a block device. The read call can create an order-0 folio to
read the first 4096 bytes from the disk. But then udev is preempted.
Next, someone tries to mount an 8k-sectorsize filesystem from the same
block device. The filesystem calls set_blksize, which sets i_blksize to
8192 and the minimum folio order to 1.
Now udev resumes, still holding the order-0 folio it allocated. It then
tries to schedule a read bio and do_mpage_readahead tries to create
bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because
the page size is 4096 but the blocksize is 8192 so no bufferheads are
attached and the bh walk never sets bdev. We then submit the bio with a
NULL block device and crash.
Therefore, truncate the page cache after flushing but before updating
i_blksize. However, that's not enough -- we also need to lock out file
IO and page faults during the update. Take both the i_rwsem and the
invalidate_lock in exclusive mode for invalidations, and in shared mode
for read/write operations.
I don't know if this is the correct fix, but xfs/259 found it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/174543795699.4139148.2086129139322431423.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Even if blk-rq-qos isn't used or configured, dipping into the queue to
fetch ->rq_qos is a noticeable slowdown and visible in profiles. Add an
unlikely static key around blk-rq-qos, to avoid fetching this cacheline
if blk-iolatency or blk-wbt isn't configured or used.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On x86, rep stos will be emitted to clear the the blk_mq_alloc_data
struct, as not all members are being explicitly initialied. Depending on
the type of CPU, this is a noticeable slowdown compared to just ensuring
that the struct is fully initialized when setup.
For the 4 spots that setup a struct blk_mq_alloc_data on the stack,
ensure all members are being initialized.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The 'nr_budgets' argument of blk_mq_dispatch_rq_list() is either the
number of elements in the 'list' argument or zero. Instead of passing
the number of list elements to blk_mq_dispatch_rq_list(), pass a boolean
argument that indicates whether or not blk_mq_dispatch_rq_list() should
request the block driver for a budget for each request in 'list'.
Remove the code for counting list elements from blk_mq_dispatch_rq_list()
callers where possible. Remove the code that decrements nr_budgets from
blk_mq_dispatch_rq_list() because it is superfluous. Each request that
is processed by blk_mq_dispatch_rq_list() is in one of these two states
if 'get_budget' is false:
* Either the request is on 'list' and the budget for the request has to
be released from the error path.
* Or the request is not on 'list' and q->mq_ops->queue_rq() has already
released the budget (ret != BLK_STS_OK) or q->mq_ops->queue_rq() will
release the budget asynchronously (ret == BLK_STS_OK).
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: John Garry <john.g.garry@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20250415205134.3650042-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaAQM5QAKCRCRxhvAZXjc
olcwAP0RETZn15Jkt5+mKjcx99fuVE7je3lp56UH4Y4XjZmthgEA1n65RDr4Tq6E
548A2/9Hnt4NWdvoi9VhrG4+5dNRowM=
=cFFa
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Revert the hfs{plus} deprecation warning that's also included in this
pull request. The commit introducing the deprecation warning resides
rather early in this branch. So simply dropping it would've rebased
all other commits which I decided to avoid. Hence the revert in the
same branch
[ Background - the deprecation warning discussion resulted in people
stepping up, and so hfs{plus} will have a maintainer taking care of
it after all.. - Linus ]
- Switch CONFIG_SYSFS_SYCALL default to n and decouple from
CONFIG_EXPERT
- Fix an audit bug caused by changes to our kernel path lookup helpers
this cycle. Audit needs the parent path even if the dentry it tried
to look up is negative
- Ensure that the kernel path lookup helpers leave the passed in path
argument clean when they return an error. This is consistent with all
our other helpers
- Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant
information is available to kernel consumers as well
- Don't set a timer and call schedule() if the timer will expire
immediately in epoll
- Make netfs lookup tables with __nonstring
* tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Revert "hfs{plus}: add deprecation warning"
fs: move the bdex_statx call to vfs_getattr_nosec
netfs: Mark __nonstring lookup tables
eventpoll: Set epoll timeout if it's in the future
fs: ensure that *path_locked*() helpers leave passed path pristine
fs: add kern_path_locked_negative()
hfs{plus}: add deprecation warning
Kconfig: switch CONFIG_SYSFS_SYCALL default to n
Currently bdex_statx is only called from the very high-level
vfs_statx_path function, and thus bypassing it for in-kernel calls
to vfs_getattr or vfs_getattr_nosec.
This breaks querying the block ѕize of the underlying device in the
loop driver and also is a pitfall for any other new kernel caller.
Move the call into the lowest level helper to ensure all callers get
the right results.
Fixes: 2d985f8c6b ("vfs: support STATX_DIOALIGN on block devices")
Fixes: f4774e92aa ("loop: take the file system minimum dio alignment into account")
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250417064042.712140-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Placing multiple protection information buffers inside the same page
can lead to oopses because set_page_dirty_lock() can't be called from
interrupt context.
Since a protection information buffer is not backed by a file there is
no point in setting its page dirty, there is nothing to synchronize.
Drop the call to set_page_dirty_lock() and remove the last argument to
bio_integrity_unpin_bvec().
Cc: stable@vger.kernel.org
Fixes: 492c5d4559 ("block: bio-integrity: directly map user buffers")
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/yq1v7r3ev9g.fsf@ca-mkp.ca.oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When registering a queue fails after blk_mq_sysfs_register() is
successful but the function later encounters an error, we need
to clean up the blk_mq_sysfs resources.
Add the missing blk_mq_sysfs_unregister() call in the error path
to properly clean up these resources and prevent a memory leak.
Fixes: 320ae51fee ("blk-mq: new multi-queue block IO queueing mechanism")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250412092554.475218-1-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add an SPDX license identifier line to blk-throttle.h
Use 'GPL-2.0' as the identifier, since blk-throttle.c uses
that, and blk.h (from which some material was copied when
blk-throttle.h was created) also uses that identifier.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/MW5PR13MB5632EE4645BCA24ED111EC0EFDB62@MW5PR13MB5632.namprd13.prod.outlook.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This non-functional change serves as preparation for moving to
subsystem-based rstat trees. To simplify future commits, change the
signatures of existing cgroup-based rstat functions to become css-based and
rename them to reflect that.
Though the signatures have changed, the implementations have not. Within
these functions use the css->cgroup pointer to obtain the associated cgroup
and allow code to function the same just as it did before this patch. At
applicable call sites, pass the subsystem-specific css pointer as an
argument or pass a pointer to cgroup::self if not in subsystem context.
Note that cgroup_rstat_updated_list() and cgroup_rstat_push_children()
are not altered yet since there would be a larger amount of css to
cgroup conversions which may overcomplicate the code at this
intermediate phase.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c.
queue freeze lock depends on fs_reclaim.
So don't grab elevator lock during queue initialization which needs to
call kmalloc(GFP_KERNEL), and we can cut the dependency between
->elevator_lock and fs_reclaim, then the lockdep warning can be killed.
This way is safe because elevator setting isn't ready to run during
queue initialization.
There isn't such issue in __blk_mq_update_nr_hw_queues() because
memalloc_noio_save() is called before acquiring elevator lock.
Fixes the following lockdep warning:
https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/
Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com
Cc: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250403105402.1334206-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Another set of improvements to the kernel's CRC (cyclic redundancy
check) code:
- Rework the CRC64 library functions to be directly optimized, like what
I did last cycle for the CRC32 and CRC-T10DIF library functions.
- Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
support and acceleration for crc64_be and crc64_nvme.
- Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
crc_t10dif, crc64_be, and crc64_nvme.
- Remove crc_t10dif and crc64_rocksoft from the crypto API, since they
are no longer needed there.
- Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect.
- Add kunit test cases for crc64_nvme and crc7.
- Eliminate redundant functions for calculating the Castagnoli CRC32,
settling on just crc32c().
- Remove unnecessary prompts from some of the CRC kconfig options.
- Further optimize the x86 crc32c code.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ+CGGhQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK3wRAP4tbnzawUmlIHIF0hleoADXehUgAhMt
NZn15mGvyiuwIQEA8W9qvnLdFXZkdxhxAEvDDFjyrRauL6eGtr/GvCx4AQY=
=wmKG
-----END PGP SIGNATURE-----
Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC updates from Eric Biggers:
"Another set of improvements to the kernel's CRC (cyclic redundancy
check) code:
- Rework the CRC64 library functions to be directly optimized, like
what I did last cycle for the CRC32 and CRC-T10DIF library
functions
- Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
support and acceleration for crc64_be and crc64_nvme
- Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
crc_t10dif, crc64_be, and crc64_nvme
- Remove crc_t10dif and crc64_rocksoft from the crypto API, since
they are no longer needed there
- Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect
- Add kunit test cases for crc64_nvme and crc7
- Eliminate redundant functions for calculating the Castagnoli CRC32,
settling on just crc32c()
- Remove unnecessary prompts from some of the CRC kconfig options
- Further optimize the x86 crc32c code"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits)
x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512
lib/crc: remove unnecessary prompt for CONFIG_CRC64
lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C
lib/crc: remove unnecessary prompt for CONFIG_CRC8
lib/crc: remove unnecessary prompt for CONFIG_CRC7
lib/crc: remove unnecessary prompt for CONFIG_CRC4
lib/crc7: unexport crc7_be_syndrome_table
lib/crc_kunit.c: update comment in crc_benchmark()
lib/crc_kunit.c: add test and benchmark for crc7_be()
x86/crc32: optimize tail handling for crc32c short inputs
riscv/crc64: add Zbc optimized CRC64 functions
riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function
riscv/crc32: reimplement the CRC32 functions using new template
riscv/crc: add "template" for Zbc optimized CRC functions
x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings
x86/crc32: improve crc32c_arch() code generation with clang
x86/crc64: implement crc64_be and crc64_nvme using new template
x86/crc-t10dif: implement crc_t10dif using new template
x86/crc32: implement crc32_le using new template
x86/crc: add "template" for [V]PCLMULQDQ based CRC functions
...
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the upcoming
Rust integration and is obviously a silly implementation to begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and the hrtimer::function becomes a private member.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
iJbvLJeisXr3GA==
=bYAX
-----END PGP SIGNATURE-----
Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the
upcoming Rust integration and is obviously a silly implementation to
begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and the hrtimer::function becomes a private member"
* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
wifi: rt2x00: Switch to use hrtimer_update_function()
io_uring: Use helper function hrtimer_update_function()
serial: xilinx_uartps: Use helper function hrtimer_update_function()
ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
RDMA: Switch to use hrtimer_setup()
virtio: mem: Switch to use hrtimer_setup()
drm/vmwgfx: Switch to use hrtimer_setup()
drm/xe/oa: Switch to use hrtimer_setup()
drm/vkms: Switch to use hrtimer_setup()
drm/msm: Switch to use hrtimer_setup()
drm/i915/request: Switch to use hrtimer_setup()
drm/i915/uncore: Switch to use hrtimer_setup()
drm/i915/pmu: Switch to use hrtimer_setup()
drm/i915/perf: Switch to use hrtimer_setup()
drm/i915/gvt: Switch to use hrtimer_setup()
drm/i915/huc: Switch to use hrtimer_setup()
drm/amdgpu: Switch to use hrtimer_setup()
stm class: heartbeat: Switch to use hrtimer_setup()
i2c: Switch to use hrtimer_setup()
iio: Switch to use hrtimer_setup()
...
- Add deprecation info messages to cgroup1-only features.
- rstat updates including a bug fix and breaking up a critical section to
reduce interrupt latency impact.
- Other misc and doc updates.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ9xO2g4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGQz4AQDeWKmngRsnddEMkqOV1ArwXSr+8xUQrvCBx0RL
vcjOQQEAusGCTeGXWJ96kw+N9BXvGwFsfSeoxjOqAnvrBS1EgAc=
=WvJg
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- Add deprecation info messages to cgroup1-only features
- rstat updates including a bug fix and breaking up a critical section
to reduce interrupt latency impact
- Other misc and doc updates
* tag 'cgroup-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: Cleanup flushing functions and locking
cgroup/rstat: avoid disabling irqs for O(num_cpu)
mm: Fix a build breakage in memcontrol-v1.c
blk-cgroup: Simplify policy files registration
cgroup: Update file naming comment
cgroup: Add deprecation message to legacy freezer controller
mm: Add transformation message for per-memcg swappiness
RFC cgroup/cpuset-v1: Add deprecation messages to sched_relax_domain_level
cgroup/cpuset-v1: Add deprecation messages to memory_migrate
cgroup/cpuset-v1: Add deprecation messages to mem_exclusive and mem_hardwall
cgroup: Print message when /proc/cgroups is read on v2-only system
cgroup/blkio: Add deprecation messages to reset_stats
cgroup/cpuset-v1: Add deprecation messages to memory_spread_page and memory_spread_slab
cgroup/cpuset-v1: Add deprecation messages to sched_load_balance and memory_pressure_enabled
cgroup, docs: Be explicit about independence of RT_GROUP_SCHED and non-cpu controllers
cgroup/rstat: Fix forceidle time in cpu.stat
cgroup/misc: Remove unused misc_cg_res_total_usage
cgroup/cpuset: Move procfs cpuset attribute under cgroup-v1.c
cgroup: update comment about dropping cgroup kn refs
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rxAAKCRCRxhvAZXjc
ooIPAQCwMjDjtWegvBy8kefiRw+fa4z3ZWHrwRT9DJrD/K9WyAD+JVd0ou27SVpQ
jKpRSRct2eTbyxdYiGydHQGm5F5sLg4=
=0FyQ
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs pagesize updates from Christian Brauner:
"This enables block sizes greater than the page size for block devices.
With this we can start supporting block devices with logical block
sizes larger than 4k.
It also allows to lift the device cache sector size support to 64k.
This allows filesystems which can use larger sector sizes up to 64k to
ensure that the filesystem will not generate writes that are smaller
than the specified sector size"
* tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()
bdev: use bdev_io_min() for statx block size
block/bdev: lift block size restrictions to 64k
block/bdev: enable large folio support for large logical block sizes
fs/buffer fs/mpage: remove large folio restriction
fs/mpage: use blocks_per_folio instead of blocks_per_page
fs/mpage: avoid negative shift for large blocksize
fs/buffer: remove batching from async read
fs/buffer: simplify block_read_full_folio() with bh_offset()
In case blkg_conf_open_bdev_frozen() fails, ioc_qos_write() jumps to the
error path without assigning a value to 'ret'. Ensure that it inherits
the error from the passed back error value.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202503200454.QWpwKeJu-lkp@intel.com/
Fixes: 9730763f47 ("block: correct locking order for protecting blk-wbt parameters")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The commit '245618f8e45f ("block: protect wbt_lat_usec using q->
elevator_lock")' introduced q->elevator_lock to protect updates
to blk-wbt parameters when writing to the sysfs attribute wbt_
lat_usec and the cgroup attribute io.cost.qos. However, both
these attributes also acquire q->rq_qos_mutex, leading to the
following lockdep warning:
======================================================
WARNING: possible circular locking dependency detected
6.14.0-rc5+ #138 Not tainted
------------------------------------------------------
bash/5902 is trying to acquire lock:
c000000085d495a0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: wbt_init+0x164/0x238
but task is already holding lock:
c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock+0xf0/0xa58
ioc_qos_write+0x16c/0x85c
cgroup_file_write+0xc4/0x32c
kernfs_fop_write_iter+0x1b8/0x29c
vfs_write+0x410/0x584
ksys_write+0x84/0x140
system_call_exception+0x134/0x360
system_call_vectored_common+0x15c/0x2ec
-> #0 (&q->rq_qos_mutex){+.+.}-{4:4}:
__lock_acquire+0x1b6c/0x2ae0
lock_acquire+0x140/0x430
__mutex_lock+0xf0/0xa58
wbt_init+0x164/0x238
queue_wb_lat_store+0x1dc/0x20c
queue_attr_store+0x12c/0x164
sysfs_kf_write+0x6c/0xb0
kernfs_fop_write_iter+0x1b8/0x29c
vfs_write+0x410/0x584
ksys_write+0x84/0x140
system_call_exception+0x134/0x360
system_call_vectored_common+0x15c/0x2ec
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->elevator_lock);
lock(&q->rq_qos_mutex);
lock(&q->elevator_lock);
lock(&q->rq_qos_mutex);
*** DEADLOCK ***
6 locks held by bash/5902:
#0: c000000051122400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x84/0x140
#1: c00000007383f088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x174/0x29c
#2: c000000008550428 (kn->active#182){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x180/0x29c
#3: c000000085d493a8 (&q->q_usage_counter(io)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
#4: c000000085d493e0 (&q->q_usage_counter(queue)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
#5: c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c
stack backtrace:
CPU: 17 UID: 0 PID: 5902 Comm: bash Kdump: loaded Not tainted 6.14.0-rc5+ #138
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Call Trace:
[c0000000721ef590] [c00000000118f8a8] dump_stack_lvl+0x108/0x18c (unreliable)
[c0000000721ef5c0] [c00000000022563c] print_circular_bug+0x448/0x604
[c0000000721ef670] [c000000000225a44] check_noncircular+0x24c/0x26c
[c0000000721ef740] [c00000000022bf28] __lock_acquire+0x1b6c/0x2ae0
[c0000000721ef870] [c000000000229240] lock_acquire+0x140/0x430
[c0000000721ef970] [c0000000011cfbec] __mutex_lock+0xf0/0xa58
[c0000000721efaa0] [c00000000096c46c] wbt_init+0x164/0x238
[c0000000721efaf0] [c0000000008f8cd8] queue_wb_lat_store+0x1dc/0x20c
[c0000000721efb50] [c0000000008f8fa0] queue_attr_store+0x12c/0x164
[c0000000721efc60] [c0000000007c11cc] sysfs_kf_write+0x6c/0xb0
[c0000000721efca0] [c0000000007bfa4c] kernfs_fop_write_iter+0x1b8/0x29c
[c0000000721efcf0] [c0000000006a281c] vfs_write+0x410/0x584
[c0000000721efdc0] [c0000000006a2cc8] ksys_write+0x84/0x140
[c0000000721efe10] [c000000000031b64] system_call_exception+0x134/0x360
[c0000000721efe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>From the above log it's apparent that method which writes to sysfs attr
wbt_lat_usec acquires q->elevator_lock first, and then acquires q->rq_
qos_mutex. However the another method which writes to io.cost.qos,
acquires q->rq_qos_mutex first, and then acquires q->rq_qos_mutex. So
this could potentially cause the deadlock.
A closer look at ioc_qos_write shows that correcting the lock order is
non-trivial because q->rq_qos_mutex is acquired in blkg_conf_open_bdev
and released in blkg_conf_exit. The function blkg_conf_open_bdev is
responsible for parsing user input and finding the corresponding block
device (bdev) from the user provided major:minor number.
Since we do not know the bdev until blkg_conf_open_bdev completes, we
cannot simply move q->elevator_lock acquisition before blkg_conf_open_
bdev. So to address this, we intoduce new helpers blkg_conf_open_bdev_
frozen and blkg_conf_exit_frozen which are just wrappers around blkg_
conf_open_bdev and blkg_conf_exit respectively. The helper blkg_conf_
open_bdev_frozen is similar to blkg_conf_open_bdev, but additionally
freezes the queue, acquires q->elevator_lock and ensures the correct
locking order is followed between q->elevator_lock and q->rq_qos_mutex.
Similarly another helper blkg_conf_exit_frozen in addition to unfreezing
the queue ensures that we release the locks in correct order.
By using these helpers, now we maintain the same locking order in all
code paths where we update blk-wbt parameters.
Fixes: 245618f8e4 ("block: protect wbt_lat_usec using q->elevator_lock")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202503171650.cc082b66-lkp@intel.com
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250319105518.468941-3-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The ioc_qos_write method acquires q->elevator_lock to protect
updates to blk-wbt parameters. Once these updates are complete,
the lock should be released before returning from ioc_qos_write.
However, in one code path, the release of q->elevator_lock was
mistakenly omitted, potentially leading to a lock leak. This commit
fixes the issue by ensuring that q->elevator_lock is properly
released in all return paths of ioc_qos_write.
Fixes: 245618f8e4 ("block: protect wbt_lat_usec using q->elevator_lock")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202503171650.cc082b66-lkp@intel.com
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250319105518.468941-2-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch improve the returned error code of blkcg_policy_register().
1. Move the validation check for cpd/pd_alloc_fn and cpd/pd_free_fn
function pairs to the start of blkcg_policy_register(). This ensures
we immediately return -EINVAL if the function pairs are not correctly
provided, rather than returning -ENOSPC after locking and unlocking
mutexes unnecessarily.
Those locks should not contention any problems, as error of policy
registration is a super cold path.
2. Return -ENOMEM when cpd_alloc_fn() failed.
Co-authored-by: Wen Tao <wentao@uniontech.com>
Signed-off-by: Wen Tao <wentao@uniontech.com>
Signed-off-by: Chen Linxuan <chenlinxuan@uniontech.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/3E333A73B6B6DFC0+20250317022924.150907-1-chenlinxuan@uniontech.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The hctx_busy_show method in debugfs is currently unprotected. This
method iterates over all started requests in a tagset and prints them.
However, the tags can be updated concurrently via the sysfs attributes
'nr_requests' or 'scheduler' (elevator switch), leading to potential
race conditions.
Since sysfs attributes 'nr_requests' and 'scheduler' are already
protected using q->elevator_lock, extend this protection to the debugfs
'busy' attribute as well to ensure consistency.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250313115235.3707600-4-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In some debugfs attribute read methods, failure to acquire the mutex
lock results in jumping to a label before returning an error code.
However this is unnecessary, as we can return the failure code directly,
improving code readability and reducing complexity.
This commit removes the goto labels and ensures that the method returns
immediately upon failing to acquire the mutex lock.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250313115235.3707600-3-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, the block debugfs attributes (tags, tags_bitmap, sched_tags,
and sched_tags_bitmap) are protected using q->sysfs_lock. However, these
attributes are updated in multiple scenarios:
- During driver probe method
- During an elevator switch/update
- During an nr_hw_queues update
- When writing to the sysfs attribute nr_requests
All these update paths (except driver probe method, which doesn't
require any protection) are already protected using q->elevator_lock. To
ensure consistency and proper synchronization, replace q->sysfs_lock
with q->elevator_lock for protecting these debugfs attributes.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250313115235.3707600-2-nilay@linux.ibm.com
[axboe: some commit message rewording/fixes]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
request_queue param is no longer used by blk_rq_map_sg and
__blk_rq_map_sg. Remove it.
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250313035322.243239-1-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
>4GB folio is possible on some ARCHs, such as aarch64, 16GB hugepage
is supported, then 'offset' of folio can't be held in 'unsigned int',
cause warning in bio_add_folio_nofail() and IO failure.
Fix it by adjusting 'page' & trimming 'offset' so that `->bi_offset` won't
be overflow, and folio can be added to bio successfully.
Fixes: ed9832bc08 ("block: introduce folio awareness and add a bigger size from folio")
Cc: Kundan Kumar <kundan.kumar@samsung.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Gavin Shan <gshan@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20250312145136.2891229-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use one set of files when there is no difference between default and
legacy files, similar to regular subsys files registration. No
functional change.
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
It is difficult to sync with stat updaters, stats are (should be)
monotonic so users can calculate differences from a reference.
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
In _badblocks_check(), there are lines of code like this,
1246 sectors -= len;
[snipped]
1251 WARN_ON(sectors < 0);
The WARN_ON() at line 1257 doesn't make sense because sectors is
unsigned long long type and never to be <0.
Fix it by checking directly checking whether sectors is less than len.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Coly Li <colyli@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250309160556.42854-1-colyli@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make sure ->nr_integrity_segments is cloned in blk_rq_prep_clone(),
otherwise requests cloned by device-mapper multipath will not have the
proper nr_integrity_segments values set, then BUG() is hit from
sg_alloc_table_chained().
Fixes: b0fd271d5f ("block: add request clone interface (v2)")
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250310115453.2271109-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, hctx attributes (nr_tags, nr_reserved_tags, and cpu_list)
are protected using `q->sysfs_lock`. However, these attributes can be
updated in multiple scenarios:
- During the driver's probe method.
- When updating nr_hw_queues.
- When writing to the sysfs attribute nr_requests,
which can modify nr_tags.
The nr_requests attribute is already protected using q->elevator_lock,
but none of the update paths actually use q->sysfs_lock to protect hctx
attributes. So to ensure proper synchronization, replace q->sysfs_lock
with q->elevator_lock when reading hctx attributes through sysfs.
Additionally, blk_mq_update_nr_hw_queues allocates and updates hctx.
The allocation of hctx is protected using q->elevator_lock, however,
updating hctx params happens without any protection, so safeguard hctx
param update path by also using q->elevator_lock.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250306093956.2818808-1-nilay@linux.ibm.com
[axboe: wrap comment at 80 chars]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bdi->ra_pages could be updated under q->limits_lock because it's
usually calculated from the queue limits by queue_limits_commit_update.
So protect reading/writing the sysfs attribute read_ahead_kb using
q->limits_lock instead of q->sysfs_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-8-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The wbt latency and state could be updated while initializing the
elevator or exiting the elevator. It could be also updated while
configuring IO latency QoS parameters using cgroup. The elevator
code path is now protected with q->elevator_lock. So we should
protect the access to sysfs attribute wbt_lat_usec using q->elevator
_lock instead of q->sysfs_lock. White we're at it, also protect
ioc_qos_write(), which configures wbt parameters via cgroup, using
q->elevator_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-7-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The sysfs attribute nr_requests could be simultaneously updated from
elevator switch/update or nr_hw_queue update code path. The update to
nr_requests for each of those code paths runs holding q->elevator_lock.
So we should protect access to sysfs attribute nr_requests using q->
elevator_lock instead of q->sysfs_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-6-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A queue's elevator can be updated either when modifying nr_hw_queues
or through the sysfs scheduler attribute. Currently, elevator switching/
updating is protected using q->sysfs_lock, but this has led to lockdep
splats[1] due to inconsistent lock ordering between q->sysfs_lock and
the freeze-lock in multiple block layer call sites.
As the scope of q->sysfs_lock is not well-defined, its (mis)use has
resulted in numerous lockdep warnings. To address this, introduce a new
q->elevator_lock, dedicated specifically for protecting elevator
switches/updates. And we'd now use this new q->elevator_lock instead of
q->sysfs_lock for protecting elevator switches/updates.
While at it, make elv_iosched_load_module() a static function, as it is
only called from elv_iosched_store(). Also, remove redundant parameters
from elv_iosched_load_module() function signature.
[1] https://lore.kernel.org/all/67637e70.050a0220.3157ee.000c.GAE@google.com/
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-5-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There're few sysfs attributes in block layer which don't really need
acquiring q->sysfs_lock while accessing it. The reason being, reading/
writing a value from/to such attributes are either atomic or could be
easily protected using READ_ONCE()/WRITE_ONCE(). Moreover, sysfs
attributes are inherently protected with sysfs/kernfs internal locking.
So this change help segregate all existing sysfs attributes for which
we could avoid acquiring q->sysfs_lock. For all read-only attributes
we removed the q->sysfs_lock from show method of such attributes. In
case attribute is read/write then we removed the q->sysfs_lock from
both show and store methods of these attributes.
We audited all block sysfs attributes and found following list of
attributes which shouldn't require q->sysfs_lock protection:
1. io_poll:
Write to this attribute is ignored. So, we don't need q->sysfs_lock.
2. io_poll_delay:
Write to this attribute is NOP, so we don't need q->sysfs_lock.
3. io_timeout:
Write to this attribute updates q->rq_timeout and read of this
attribute returns the value stored in q->rq_timeout Moreover, the
q->rq_timeout is set only once when we init the queue (under blk_mq_
init_allocated_queue()) even before disk is added. So that means
that we don't need to protect it with q->sysfs_lock. As this
attribute is not directly correlated with anything else simply using
READ_ONCE/WRITE_ONCE should be enough.
4. nomerges:
Write to this attribute file updates two q->flags : QUEUE_FLAG_
NOMERGES and QUEUE_FLAG_NOXMERGES. These flags are accessed during
bio-merge which anyways doesn't run with q->sysfs_lock held.
Moreover, the q->flags are updated/accessed with bitops which are
atomic. So, protecting it with q->sysfs_lock is not necessary.
5. rq_affinity:
Write to this attribute file makes atomic updates to q->flags:
QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE. These flags are
also accessed from blk_mq_complete_need_ipi() using test_bit macro.
As read/write to q->flags uses bitops which are atomic, protecting
it with q->stsys_lock is not necessary.
6. nr_zones:
Write to this attribute happens in the driver probe method (except
nvme) before disk is added and outside of q->sysfs_lock or any other
lock. Moreover nr_zones is defined as "unsigned int" and so reading
this attribute, even when it's simultaneously being updated on other
cpu, should not return torn value on any architecture supported by
linux. So we can avoid using q->sysfs_lock or any other lock/
protection while reading this attribute.
7. discard_zeroes_data:
Reading of this attribute always returns 0, so we don't require
holding q->sysfs_lock.
8. write_same_max_bytes
Reading of this attribute always returns 0, so we don't require
holding q->sysfs_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-4-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation to further simplify and group sysfs attributes which
don't require locking or require some form of locking other than q->
limits_lock, move acquire/release of q->sysfs_lock and queue freeze/
unfreeze under each attributes' respective show/store method.
While we are at it, also remove ->load_module() as it's used to load
the module before queue is freezed. Now as we moved queue-freeze under
->store(), we could load module directly from the attributes' store
method before we actually start freezing the queue. Currently, the
->load_module() is only used by "scheduler" attribute, so we now load
the relevant elevator module before we start freezing the queue in
elv_iosched_store().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-3-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There're few sysfs attributes(RW) whose store method is protected
with q->limits_lock, however the corresponding show method of these
attributes run holding q->sysfs_lock and that doesn't make sense
as ideally the show method of these attributes should also run
holding q->limits_lock instead of q->sysfs_lock. Hence update the
show method of these sysfs attributes so that reading of these
attributes acquire q->limits_lock instead of q->sysfs_lock.
Similarly, there're few sysfs attributes(RO) whose show method is
currently protected with q->sysfs_lock however updates to these
attributes could occur using atomic limit update APIs such as queue_
limits_start_update() and queue_limits_commit_update() which run
holding q->limits_lock. So that means that reading these attributes
holding q->sysfs_lock doesn't make sense. Hence update the show method
of these sysfs attributes(RO) such that they run with holding q->
limits_lock instead of q->sysfs_lock.
We have defined a new macro QUEUE_LIM_RO_ENTRY() which uses new ->show_
limit() method and it runs holding q->limits_lock. All existing sysfs
attributes(RO) which needs protection using q->limits_lock while
reading have been now updated to use this new macro for initialization.
Also, the existing QUEUE_LIM_RW_ENTRY() is updated to use new ->show_
limit() method for reading attributes instead of existing ->show()
method. As ->show_limit() runs holding q->limits_lock, the existing
sysfs attributes(RW) requiring protection are now inherently protected
using q->limits_lock instead of q->sysfs_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-2-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfKQvsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnBCD/9bVSGHNnXakVwdpQmtU5zy54cyWd7VaYsz
qeM+Vrl1m5nf8q5ZdEXcM11Ruib3YJiW0GN9d9sWpTwt8C5n8g+8F63koS7GordZ
jcv77nO6FlnwWpm3YlNxAeLuxkl15e4MQIKj/jb540iFygzT8H2lygE816K4kpCX
XuMxNxdSMksntovZufzxo3Sfkm6e6GChCkkqvBxuXiEWFhvbFQ/ZLEsEMtoH4hkI
3Nj1VB3B3pLVCZhWr2uVvcZCiYUDyBslu+SA3RRoX0W6beK1cVI4OQdS8GtnkJf3
qFnLQz0Ib3EVDtugqg7ZGSAAov6Z8waA2MrFeZkG8uIfl4WT3kBfoan7jRX3Mknl
VnFkThyJOzB83OKqlZKjCzYmEzBhKJrRJVtneIrxT+gvEpevFvAQil6SQfyPDwno
4YcUD+IfU/daTdVR58QQ/iLzkQ7stQWYCtZSrICKfcAGy6zswKM5P5uoWltMBwQh
aHsyz9xbmsMrxch1DPRb0T2GD2h9BsiL6rT8JCrOgucMuOYeZL9pNRgz16D/hael
wBCxPcanSdap0N9kiMX8fLYYdmRxpJHzTbeNRsPhZe8HKUPu1sYTbisOou1XSdAW
Dv7zeQWVlw+1cn/S1Y6Oc4mdlPzPTA9szuBXVpbe9Gd7ZqO7sbbKEkGu5w6MGSZ1
oubnZKCNvA==
=jKDe
-----END PGP SIGNATURE-----
Merge tag 'block-6.14-20250306' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- TCP use after free fix on polling (Sagi)
- Controller memory buffer cleanup fixes (Icenowy)
- Free leaking requests on bad user passthrough commands (Keith)
- TCP error message fix (Maurizio)
- TCP corruption fix on partial PDU (Maurizio)
- TCP memory ordering fix for weakly ordered archs (Meir)
- Type coercion fix on message error for TCP (Dan)
- Name the RQF flags enum, fixing issues with anon enums and BPF import
of it
- ublk parameter setting fix
- GPT partition 7-bit conversion fix
* tag 'block-6.14-20250306' of git://git.kernel.dk/linux:
block: Name the RQF flags enum
nvme-tcp: fix signedness bug in nvme_tcp_init_connection()
block: fix conversion of GPT partition name to 7-bit
ublk: set_params: properly check if parameters can be applied
nvmet-tcp: Fix a possible sporadic response drops in weakly ordered arch
nvme-tcp: fix potential memory corruption in nvme_tcp_recv_pdu()
nvme-tcp: Fix a C2HTermReq error message
nvmet: remove old function prototype
nvme-ioctl: fix leaked requests on mapping error
nvme-pci: skip CMB blocks incompatible with PCI P2P DMA
nvme-pci: clean up CMBMSC when registering CMB fails
nvme-tcp: fix possible UAF in nvme_tcp_poll
The commit titled "block/bdev: lift block size restrictions to 64k"
lifted the block layer's max supported block size to 64k inside the
helper blk_validate_block_size() now that we support large folios.
However in lifting the block size we also removed the silly use
cases many filesystems have to use sb_set_blocksize() to *verify*
that the block size <= PAGE_SIZE. The call to sb_set_blocksize() was
used to check the block size <= PAGE_SIZE since historically we've
always supported userspace to create for example 64k block size
filesystems even on 4k page size systems, but what we didn't allow
was mounting them. Older filesystems have been using the check with
sb_set_blocksize() for years.
While, we could argue that such checks should be filesystem specific,
there are much more users of sb_set_blocksize() than LBS enabled
filesystem on upstream, so just do the easier thing and bring back
the PAGE_SIZE check for sb_set_blocksize() users and only skip it
for LBS enabled filesystems.
This will ensure that tests such as generic/466 when run in a loop
against say, ext4, won't try to try to actually mount a filesystem with
a block size larger than your filesystem supports given your PAGE_SIZE
and in the worst case crash.
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250307020403.3068567-1-mcgrof@kernel.org
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
There is a truncation of badblocks length issue when set badblocks as
follow:
echo "2055 4294967299" > bad_blocks
cat bad_blocks
2055 3
Change 'sectors' argument type from 'int' to 'sector_t'.
This change avoids truncation of badblocks length for large sectors by
replacing 'int' with 'sector_t' (u64), enabling proper handling of larger
disk sizes and ensuring compatibility with 64-bit sector addressing.
Fixes: 9e0e252a04 ("badblocks: Add core badblock management code")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-13-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change the return type of badblocks_set() and badblocks_clear()
from int to bool, indicating success or failure. Specifically:
- _badblocks_set() and _badblocks_clear() functions now return
true for success and false for failure.
- All calls to these functions are updated to handle the new
boolean return type.
- This change improves code clarity and ensures a more consistent
handling of success and failure states.
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Acked-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20250227075507.151331-11-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bad blocks check would miss bad blocks when retrying under contention,
as checking parameters are not reset. These stale values from the previous
attempt could lead to incorrect scanning in the subsequent retry.
Move seqlock to outer function and reinitialize checking state for each
retry. This ensures a clean state for each check attempt, preventing any
missed bad blocks.
Fixes: 3ea3354cb9 ("badblocks: improve badblocks_check() for multiple ranges handling")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-10-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a merge issue when adding badblocks as follow:
echo 0 10 > bad_blocks
echo 30 10 > bad_blocks
echo 20 10 > bad_blocks
cat bad_blocks
0 10
20 10 //should be merged with (30 10)
30 10
In this case, if new badblocks does not intersect with prev, it is added
by insert_at(). If there is an intersection with prev+1, the merge will
be processed in the next re_insert loop.
However, when the end of the new badblocks is exactly equal to the offset
of prev+1, no further re_insert loop occurs, and the two badblocks are not
merge.
Fix it by inc prev, badblocks can be merged during the subsequent code.
Fixes: aa511ff821 ("badblocks: switch to the improved badblock handling code")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250227075507.151331-9-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Regardless of whether overlap_front() returns true or false,
can_merge_front() will be executed first. Therefore, move
can_merge_front() in front of can_merge_front() to simplify code.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250227075507.151331-8-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The number of badblocks cannot exceed MAX_BADBLOCKS, but it should be
allowed to equal MAX_BADBLOCKS.
Fixes: aa511ff821 ("badblocks: switch to the improved badblock handling code")
Fixes: c3c6a86e9e ("badblocks: add helper routines for badblock ranges handling")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-7-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
_badblocks_set() returns success if at least one badblock is set
successfully, even if others fail. This can lead to data inconsistencies
in raid, where a failed badblock set should trigger the disk to be kicked
out to prevent future reads from failed write areas.
_badblocks_set() should return error if any badblock set fails. Instead
of relying on 'rv', directly returning 'sectors' for clearer logic. If all
badblocks are successfully set, 'sectors' will be 0, otherwise it
indicates the number of badblocks that have not been set yet, thus
signaling failure.
By the way, it can also fix an issue: when a newly set unack badblock is
included in an existing ack badblock, the setting will return an error.
···
echo "0 100" /sys/block/md0/md/dev-loop1/bad_blocks
echo "0 100" /sys/block/md0/md/dev-loop1/unacknowledged_bad_blocks
-bash: echo: write error: No space left on device
```
After fix, it will return success.
Fixes: aa511ff821 ("badblocks: switch to the improved badblock handling code")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-6-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the current handling of badblocks settings, a lot of processing has
been done for scenarios where the number of badblocks exceeds 512.
This makes the code look quite complex and also introduces some issues,
For example, if there is 512 badblocks already:
for((i=0; i<510; i++)); do ((sector=i*2)); echo "$sector 1" > bad_blocks; done
echo 2100 10 > bad_blocks
echo 2200 10 > bad_blocks
Set new one, exceed 512:
echo 2000 500 > bad_blocks
Expected:
2000 500
Actual:
2100 400
In fact, a disk shouldn't have too many badblocks, and for disks with
512 badblocks, attempting to set more bad blocks doesn't make much sense.
At that point, the more appropriate action would be to replace the disk.
Therefore, to resolve these issues and simplify the code somewhat, return
error directly when setting badblocks exceeds 512.
Fixes: aa511ff821 ("badblocks: switch to the improved badblock handling code")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250227075507.151331-5-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If ack and unack badblocks are adjacent, they will not be merged and will
remain as two separate badblocks. Even after the bad blocks are written
to disk and both become ack, they will still remain as two independent
bad blocks. This is not ideal as it wastes the limited space for
badblocks. Therefore, during ack_all_badblocks(), attempt to merge
badblocks if they are adjacent.
Fixes: aa511ff821 ("badblocks: switch to the improved badblock handling code")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-4-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
'bb->shift' is used directly in badblocks. It is wrong, fix it.
Fixes: 3ea3354cb9 ("badblocks: improve badblocks_check() for multiple ranges handling")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-2-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, BLK_INTEGRITY_NOGENERATE and BLK_INTEGRITY_NOVERIFY are not
explicitly set during integrity initialization. This can lead to
incorrect reporting of read_verify and write_generate sysfs values,
particularly when a device does not support integrity. Ensure that these
flags are correctly initialized by default.
Reported-by: M Nikhil <nikh1092@linux.ibm.com>
Link: https://lore.kernel.org/linux-block/f6130475-3ccd-45d2-abde-3ccceada0f0a@linux.ibm.com/
Fixes: 9f4aa46f2a ("block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags")
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250305063033.1813-3-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
queue_limits_stack_integrity() incorrectly sets
BLK_INTEGRITY_DEVICE_CAPABLE for a DM device even when none of its
underlying devices support integrity. This happens because the flag is
inherited unconditionally. Ensure that integrity capabilities are
correctly propagated only when the underlying devices actually support
integrity.
Reported-by: M Nikhil <nikh1092@linux.ibm.com>
Link: https://lore.kernel.org/linux-block/f6130475-3ccd-45d2-abde-3ccceada0f0a@linux.ibm.com/
Fixes: c6e56cf6b2 ("block: move integrity information into queue_limits")
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250305063033.1813-2-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now ->carryover_bytes[] and ->carryover_ios[] only covers limit/config
update.
Actually the carryover bytes/ios can be carried to ->bytes_disp[] and
->io_disp[] directly, since the carryover is one-shot thing and only valid
in current slice.
Then we can remove the two fields and simplify code much.
Type of ->bytes_disp[] and ->io_disp[] has to change as signed because the
two fields may become negative when updating limits or config, but both are
big enough for holding bytes/ios dispatched in single slice
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250305043123.3938491-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 29390bb566 ("blk-throttle: support prioritized processing of metadata")
takes bytes/ios carryover for prioritized processing of metadata. Turns out
we can support it by charging it directly without trimming slice, and the
result is same with carryover.
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250305043123.3938491-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The two fields are not used any more, so remove them.
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250305043123.3938491-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bio submission time may be a few jiffies more than the expected
waiting time, due to 'extra_bytes' can't be divided in
tg_within_bps_limit(), and also due to timer wakeup delay.
In this case, adjust slice_start to jiffies will discard the extra wait time,
causing lower rate than expected.
Current in-tree code already covers deviation by rounddown(), but turns
out it is not enough, because jiffies - slice_start can be a multiple of
throtl_slice.
For example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and
slice is 20ms(2 jiffies), expected rate is 1000 / 1000 * 20 = 20 bytes
per slice.
If user issues two 21 bytes IO, then wait time will be 30ms for the
first IO:
bytes_allowed = 20, extra_bytes = 1;
jiffy_wait = 1 + 2 = 3 jiffies
and consider
extra 1 jiffies by timer, throtl_trim_slice() will be called at:
jiffies = 40ms
slice_start = 0ms, slice_end= 40ms
bytes_disp = 21
In this case, before the patch, real rate in the first two slices is
10.5 bytes per slice, and slice will be updated to:
jiffies = 40ms
slice_start = 40ms, slice_end = 60ms,
bytes_disp = 0;
Hence the second IO will have to wait another 30ms;
With the patch, the real rate in the first slice is 20 bytes per slice,
which is the same as expected, and slice will be updated:
jiffies=40ms,
slice_start = 20ms, slice_end = 60ms,
bytes_disp = 1;
And now, there is still 19 bytes allowed in the second slice, and the
second IO will only have to wait 10ms;
This problem will cause blktests throtl/001 failure in case of
CONFIG_HZ_100=y, fix it by preserving one extra finished slice in
throtl_trim_slice().
Fixes: e43473b7f2 ("blkio: Core implementation of throttle policy")
Reported-by: Ming Lei <ming.lei@redhat.com>
Closes: https://lore.kernel.org/linux-block/20250222092823.210318-3-yukuai1@huaweicloud.com/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250227120645.812815-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The utf16_le_to_7bit function claims to, naively, convert a UTF-16
string to a 7-bit ASCII string. By naively, we mean that it:
* drops the first byte of every character in the original UTF-16 string
* checks if all characters are printable, and otherwise replaces them
by exclamation mark "!".
This means that theoretically, all characters outside the 7-bit ASCII
range should be replaced by another character. Examples:
* lower-case alpha (ɒ) 0x0252 becomes 0x52 (R)
* ligature OE (œ) 0x0153 becomes 0x53 (S)
* hangul letter pieup (ㅂ) 0x3142 becomes 0x42 (B)
* upper-case gamma (Ɣ) 0x0194 becomes 0x94 (not printable) so gets
replaced by "!"
The result of this conversion for the GPT partition name is passed to
user-space as PARTNAME via udev, which is confusing and feels questionable.
However, there is a flaw in the conversion function itself. By dropping
one byte of each character and using isprint() to check if the remaining
byte corresponds to a printable character, we do not actually guarantee
that the resulting character is 7-bit ASCII.
This happens because we pass 8-bit characters to isprint(), which
in the kernel returns 1 for many values > 0x7f - as defined in ctype.c.
This results in many values which should be replaced by "!" to be kept
as-is, despite not being valid 7-bit ASCII. Examples:
* e with acute accent (é) 0x00E9 becomes 0xE9 - kept as-is because
isprint(0xE9) returns 1.
* euro sign (€) 0x20AC becomes 0xAC - kept as-is because isprint(0xAC)
returns 1.
This way has broken pyudev utility[1], fixes it by using a mask of 7 bits
instead of 8 bits before calling isprint.
Link: https://github.com/pyudev/pyudev/issues/490#issuecomment-2685794648 [1]
Link: https://lore.kernel.org/linux-block/4cac90c2-e414-4ebb-ae62-2a4589d9dc6e@canonical.com/
Cc: Mulhern <amulhern@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: stable@vger.kernel.org
Signed-off-by: Olivier Gayot <olivier.gayot@canonical.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250305022154.3903128-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Many of the fields in struct bio_integrity_payload are only needed for
the default integrity buffer in the block layer, and the variable
sized array at the end of the structure makes it very hard to embed
into caller allocated structures.
Reduce struct bio_integrity_payload to the minimal structure needed in
common code and create two separate containing structures for the
automatically generated payload and the caller allocated payload.
The latter is a simple wrapper for struct bio_integrity_payload and
the bvecs, while the former contains the additional fields moved out
of struct bio_integrity_payload.
Always use a dedicated mempool for automatic integrity metadata
instead of depending on bio_set that is submitter controlled and thus
often doesn't have the mempool initialized and stop using mempools for
the submitter buffers as they aren't in the NOIO I/O submission path
where we need to guarantee forward progress.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250225154449.422989-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The code that automatically creates a integrity payload and generates and
verifies the checksums for bios that don't have submitter-provided
integrity payload currently sits right in the middle of the block
integrity metadata infrastructure. Split it into a separate file to
make the different layers clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250225154449.422989-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
None of the few drivers still using the legacy block layer bounce
buffering support integrity metadata. Explicitly mark the features as
incompatible and stop creating the slab and mempool for integrity
buffers for the bounce bio_set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250225154449.422989-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Device mapper bioset often has big bio_slab size, which can be more than
1000, then 8byte can't hold the slab name any more, cause the kmem_cache
allocation warning of 'kmem_cache of name 'bio-108' already exists'.
Fix the warning by extending bio_slab->name to 12 bytes, but fix output
of /proc/slabinfo
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250228132656.2838008-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For devices that natively support zone append operations,
REQ_OP_ZONE_APPEND BIOs are not processed through zone write plugging
and are immediately issued to the zoned device. This means that there is
no write pointer offset tracking done for these operations and that a
zone write plug is not necessary.
However, when receiving a zone append BIO, we may already have a zone
write plug for the target zone if that zone was previously partially
written using regular write operations. In such case, since the write
pointer offset of the zone write plug is not incremented by the amount
of sectors appended to the zone, 2 issues arise:
1) we risk leaving the plug in the disk hash table if the zone is fully
written using zone append or regular write operations, because the
write pointer offset will never reach the "zone full" state.
2) Regular write operations that are issued after zone append operations
will always be failed by blk_zone_wplug_prepare_bio() as the write
pointer alignment check will fail, even if the user correctly
accounted for the zone append operations and issued the regular
writes with a correct sector.
Avoid these issues by immediately removing the zone write plug of zones
that are the target of zone append operations when blk_zone_plug_bio()
is called. The new function blk_zone_wplug_handle_native_zone_append()
implements this for devices that natively support zone append. The
removal of the zone write plug using disk_remove_zone_wplug() requires
aborting all plugged regular write using disk_zone_wplug_abort() as
otherwise the plugged write BIOs would never be executed (with the plug
removed, the completion path will never see again the zone write plug as
disk_get_zone_wplug() will return NULL). Rate-limited warnings are added
to blk_zone_wplug_handle_native_zone_append() and to
disk_zone_wplug_abort() to signal this.
Since blk_zone_wplug_handle_native_zone_append() is called in the hot
path for operations that will not be plugged, disk_get_zone_wplug() is
optimized under the assumption that a user issuing zone append
operations is not at the same time issuing regular writes and that there
are no hashed zone write plugs. The struct gendisk atomic counter
nr_zone_wplugs is added to check this, with this counter incremented in
disk_insert_zone_wplug() and decremented in disk_remove_zone_wplug().
To be consistent with this fix, we do not need to fill the zone write
plug hash table with zone write plugs for zones that are partially
written for a device that supports native zone append operations.
So modify blk_revalidate_seq_zone() to return early to avoid allocating
and inserting a zone write plug for partially written sequential zones
if the device natively supports zone append.
Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Link: https://lore.kernel.org/r/20250214041434.82564-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
wbt_wait() no longer uses a spinlock as a parameter. Update the function
comments accordingly.
RWB_UNKNOWN_BUMP is used when we gradually adjust scale_steps toward the
center state, which is a value of 0.
Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250213100611.209997-2-yizhou.tang@shopee.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Using PAGE_SIZE as a minimum expected DMA segment size in consideration
of devices which have a max DMA segment size of < 64k when used on 64k
PAGE_SIZE systems leads to devices not being able to probe such as
eMMC and Exynos UFS controller [0] [1] you can end up with a probe failure
as follows:
WARNING: CPU: 2 PID: 397 at block/blk-settings.c:339 blk_validate_limits+0x364/0x3c0
Ensure we use min(max_seg_size, seg_boundary_mask + 1) as the new min segment
size when max segment size is < PAGE_SIZE for 16k and 64k base page size systems.
If anyone need to backport this patch, the following commits are depended:
commit 6aeb4f8364 ("block: remove bio_add_pc_page")
commit 02ee5d69e3 ("block: remove blk_rq_bio_prep")
commit b7175e24d6 ("block: add a dma mapping iterator")
Link: https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/ # [0]
Link: https://lore.kernel.org/linux-block/1d55e942-5150-de4c-3a02-c3d066f87028@acm.org/ # [1]
Cc: Yi Zhang <yi.zhang@redhat.com>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Keith Busch <kbusch@kernel.org>
Tested-by: Paul Bunyan <pbunyan@redhat.com>
Reviewed-by: Daniel Gomez <da.gomez@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250225022141.2154581-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We now can support blocksizes larger than PAGE_SIZE, so in theory
we should be able to lift the restriction up to the max supported page
cache order. However bound ourselves to what we can currently validate
and test. Through blktests and fstest we can validate up to 64k today.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250221223823.1680616-8-mcgrof@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Call mapping_set_folio_min_order() when modifying the logical block
size to ensure folios are allocated with the correct size.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250221223823.1680616-7-mcgrof@kernel.org
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAme4rUUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpj9+EADLPOFPa9hT1PGBbpnj74vBayoTO/M+w2Gp
+k2b8if3eGlY43WO2k+ytceWbA901iyvLPRqt1M1Ez8+BrNBg4NKcLv7q4O9NA3i
nDPLggugSc5sdRbLRimxiwHkkpSOBenkdb7R9XGmMXTCfSbRKl0kK01ivpgkbiG4
pbyPWYcoMyHaECBfPhazrJig4+rugXOYbkYoOM4NHsLqlTNfmowcMRPu+6czXt7q
ITHW2RTWK3ue8q+c3nwGPDk2ZKM8X/49rA/6bvD3voLNs+jQ8KFg2KULENf0Xaq6
1ZGrhLcr45iEHP0/+RORMzx27PqbTCSGIOTMZtwZNqh5+V+ybrGJq/F/T5rkrA3F
QqHld/WSSKWJ10RVAyjDP7NQ5vNZTwwGAEVagjyIFEfk7G7RTY2kIpSZiUgrZ9oD
4CkOKUGmVkUsKQW6gb0JQObtYyXXoNtmg8wQU2WwhISjFDkoYWw53LHwH/LnxLyi
Vg182amVBmERk4I5nTUiIML/7TzS69srb0Q7yaQS3eTwzLorDaB+3tPAxQmCTkGq
KeBfuBtbP3LTOy2Oek4YbKl8CA2KDYtK7FbCE6PECUbdTjpNrcAAgA/ZcgoKV8s7
EHWZFx7dZyFS6LFNWzT9VhTtgSZS92JIsZwgnjSJPV2UazyxmqHFChzVVGWJbaB3
agkMor3nVg==
=L+Lr
-----END PGP SIGNATURE-----
Merge tag 'block-6.14-20250221' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- FC controller state check fixes (Daniel)
- PCI Endpoint fixes (Damien)
- TCP connection failure fixe (Caleb)
- TCP handling C2HTermReq PDU (Maurizio)
- RDMA queue state check (Ruozhu)
- Apple controller fixes (Hector)
- Target crash on disbaled namespace (Hannes)
- MD pull request via Yu:
- Fix queue limits error handling for raid0, raid1 and raid10
- Fix for a NULL pointer deref in request data mapping
- Code cleanup for request merging
* tag 'block-6.14-20250221' of git://git.kernel.dk/linux:
nvme: only allow entering LIVE from CONNECTING state
nvme-fc: rely on state transitions to handle connectivity loss
apple-nvme: Support coprocessors left idle
apple-nvme: Release power domains when probe fails
nvmet: Use enum definitions instead of hardcoded values
nvme: Cleanup the definition of the controller config register fields
nvme/ioctl: add missing space in err message
nvme-tcp: fix connect failure on receiving partial ICResp PDU
nvme: tcp: Fix compilation warning with W=1
nvmet: pci-epf: Avoid RCU stalls under heavy workload
nvmet: pci-epf: Do not uselessly write the CSTS register
nvmet: pci-epf: Correctly initialize CSTS when enabling the controller
nvmet-rdma: recheck queue state is LIVE in state lock in recv done
nvmet: Fix crash when a namespace is disabled
nvme-tcp: add basic support for the C2HTermReq PDU
nvme-pci: quirk Acer FA100 for non-uniqueue identifiers
block: fix NULL pointer dereferenced within __blk_rq_map_sg
block/merge: remove unnecessary min() with UINT_MAX
md/raid*: Fix the set_queue_limits implementations
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Patch was created by using Coccinelle.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/196d487c925411923a2d59d4bf5e366b9dac2747.1738746821.git.namcao@linutronix.de
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Patch was created by using Coccinelle.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/d0d57e1dab46b617856dfb93c721d221cc31ab0b.1738746821.git.namcao@linutronix.de
The block layer internal flush request may not have bio attached, so the
request iterator has to be initialized from valid req->bio, otherwise NULL
pointer dereferenced is triggered.
Cc: Christoph Hellwig <hch@lst.de>
Reported-and-tested-by: Cheyenne Wills <cheyenne.wills@gmail.com>
Fixes: b7175e24d6 ("block: add a dma mapping iterator")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250217031626.461977-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In bvec_split_segs(), max_bytes is an unsigned, so it must be less than
or equal to UINT_MAX. Remove the unnecessary min().
Prior to commit 67927d2201 ("block/merge: count bytes instead of
sectors"), the min() was with UINT_MAX >> 9, so it did have an effect.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250214193637.234702-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix several issues in partition probing:
- The bailout for a bad partoffset must use put_dev_sector(), since the
preceding read_part_sector() succeeded.
- If the partition table claims a silly sector size like 0xfff bytes
(which results in partition table entries straddling sector boundaries),
bail out instead of accessing out-of-bounds memory.
- We must not assume that the partition table contains proper NUL
termination - use strnlen() and strncmp() instead of strlen() and
strcmp().
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20250214-partition-mac-v1-1-c1c626dffbd5@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When rq_qos_wait() is first introduced, it is easy to understand. But
with some bug fixes applied, it is not easy for newcomers to understand
the whole logic under those fixes. In this patch, rq_qos_wait() is
refactored and more comments are added for better understanding. There
are 3 points for the improvement:
1) Use waitqueue_active() instead of wq_has_sleeper() to eliminate
unnecessary memory barrier in wq_has_sleeper() which is supposed
to be used in waker side. In this case, we do need the barrier.
So use the cheaper one to locklessly test for waiters on the queue.
2) Remove acquire_inflight_cb() logic for the first waiter out of the
while loop to make the code clear.
3) Add more comments to explain how to sync with different waiters and
the waker.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250208090416.38642-2-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is already a macro DEFINE_WAIT_FUNC() to declare a wait_queue_entry
with a specified waking function. But there is not a counterpart for
initializing one wait_queue_entry with a specified waking function. So
introducing init_wait_func() for this, which also could be used in iocost
and rq-qos. Using default_wake_function() in rq_qos_wait() to wake up
waiters, which could remove ->task field from rq_qos_wait_data.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250208090416.38642-1-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Until this point, the kernel can use hardware-wrapped keys to do
encryption if userspace provides one -- specifically a key in
ephemerally-wrapped form. However, no generic way has been provided for
userspace to get such a key in the first place.
Getting such a key is a two-step process. First, the key needs to be
imported from a raw key or generated by the hardware, producing a key in
long-term wrapped form. This happens once in the whole lifetime of the
key. Second, the long-term wrapped key needs to be converted into
ephemerally-wrapped form. This happens each time the key is "unlocked".
In Android, these operations are supported in a generic way through
KeyMint, a userspace abstraction layer. However, that method is
Android-specific and can't be used on other Linux systems, may rely on
proprietary libraries, and also misleads people into supporting KeyMint
features like rollback resistance that make sense for other KeyMint keys
but don't make sense for hardware-wrapped inline encryption keys.
Therefore, this patch provides a generic kernel interface for these
operations by introducing new block device ioctls:
- BLKCRYPTOIMPORTKEY: convert a raw key to long-term wrapped form.
- BLKCRYPTOGENERATEKEY: have the hardware generate a new key, then
return it in long-term wrapped form.
- BLKCRYPTOPREPAREKEY: convert a key from long-term wrapped form to
ephemerally-wrapped form.
These ioctls are implemented using new operations in blk_crypto_ll_ops.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650
Link: https://lore.kernel.org/r/20250204060041.409950-4-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add sysfs files that indicate which type(s) of keys are supported by the
inline encryption hardware associated with a particular request queue:
/sys/block/$disk/queue/crypto/hw_wrapped_keys
/sys/block/$disk/queue/crypto/raw_keys
Userspace can use the presence or absence of these files to decide what
encyption settings to use.
Don't use a single key_type file, as devices might support both key
types at the same time.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650
Link: https://lore.kernel.org/r/20250204060041.409950-3-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To prevent keys from being compromised if an attacker acquires read
access to kernel memory, some inline encryption hardware can accept keys
which are wrapped by a per-boot hardware-internal key. This avoids
needing to keep the raw keys in kernel memory, without limiting the
number of keys that can be used. Such hardware also supports deriving a
"software secret" for cryptographic tasks that can't be handled by
inline encryption; this is needed for fscrypt to work properly.
To support this hardware, allow struct blk_crypto_key to represent a
hardware-wrapped key as an alternative to a raw key, and make drivers
set flags in struct blk_crypto_profile to indicate which types of keys
they support. Also add the ->derive_sw_secret() low-level operation,
which drivers supporting wrapped keys must implement.
For more information, see the detailed documentation which this patch
adds to Documentation/block/inline-encryption.rst.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650
Link: https://lore.kernel.org/r/20250204060041.409950-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>