Factor out the logic from blk_mq_flush_plug_list() that calls
->queue_rqs() with a fallback to ->queue_rq() into a helper function
blk_mq_dispatch_queue_requests(). This is in preparation for using this
code with other lists of requests.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250426011728.4189119-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_plug_issue_direct(), __blk_mq_flush_plug_list(), and
blk_mq_dispatch_plug_list() take a struct blk_plug * but only use its
mq_list. Pass the struct rq_list * instead in preparation for calling
them with other lists of requests.
Drop "plug" from the function names as they are no longer plug-specific.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250426011728.4189119-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need two patches inside linux-block tree as dependencies of the patch
which will follow this merge.
Specifically, we need:
block: fix race between set_blocksize and read paths
block: hoist block size validation code to a separate function
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgK7wMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppNwD/46vpEWhwGLeNXxFic5CCNMbUBUl04+Rc9I
p22BY9rp1+7ooXJGJJTGaQAjwTFKP/kxyaQWZrJFXK8t2wYhJ8E2PWPVolk8jsOT
0wSYaF9iW4kw5twcmWq+VqPM+joLGKxkwojDTvz4CiorKrq2J14yHkrtfp81R3d3
rR7VzeggglSxEJAKkIBkbRWtMwTQ6WvImm4uufccI3AwfPJcM3qxSXGqq3wryA0O
PyqFlkOdjDIbNP3Zu0QvqQ0xyefGCyGyAfKEPNEAn1oOpD8Y/SUvdMdlPzA9pJ93
9+8F9pAg6fo8vgBEMavVGNjFOw4OrxNBNL9St3vlz+VMpid+HMyflolLwCTdQXCz
HEZ+H75uwMwh3mskHp5paitdE4Y70tqXW6LWgr/5wXOsl8Lh5p1A7Ll0tP27gUe1
vV1Yh+nwbg5TQ1qi+NmjhUThivT96hop+5nK9p5r7GHSZP1xiJdb7RsQqOhDXBmP
I5sjc5Dny8S9b87ehX6b4VfpTbk3aRVhOaEJ4l6k4dwFFkTRP0ODa0/bWPf3Reb0
4HI2/NYRsdLNyut2896P19RxcpcXz5PKEKvcCt5pAehwv4urIdpLoeNLgMdgwfcc
qbtNbTkV2V/fOKG7pS6yUQmWGR/XXQZDFJ4gCFZemiYYVAXJCqjBB6dliMy8roFz
p5Rc31AEMw==
=s0NF
-----END PGP SIGNATURE-----
Merge tag 'block-6.15-20250424' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix autoloading of drivers from stat*(2)
- Fix losing read-ahead setting one suspend/resume, when a device is
re-probed.
- Fix race between setting the block size and page cache updates.
Includes a helper that a coming XFS fix will use as well.
- ublk cancelation fixes.
- ublk selftest additions and fixes.
- NVMe pull via Christoph:
- fix an out-of-bounds access in nvmet_enable_port (Richard
Weinberger)
* tag 'block-6.15-20250424' of git://git.kernel.dk/linux:
ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd
ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA
block: don't autoload drivers on blk-cgroup configuration
block: don't autoload drivers on stat
block: remove the backing_inode variable in bdev_statx
block: move blkdev_{get,put} _no_open prototypes out of blkdev.h
block: never reduce ra_pages in blk_apply_bdi_limits
selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils
selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx'
selftests: ublk: fix recover test
block: hoist block size validation code to a separate function
block: fix race between set_blocksize and read paths
nvmet: fix out-of-bounds access in nvmet_enable_port
Merge 6.15 block fixes - both to get the fixes causing issues with
XFS testing, but also to make it easier for 6.16 ublk patches to avoid
conflicts.
* block-6.15:
ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd
ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA
block: don't autoload drivers on blk-cgroup configuration
block: don't autoload drivers on stat
block: remove the backing_inode variable in bdev_statx
block: move blkdev_{get,put} _no_open prototypes out of blkdev.h
block: never reduce ra_pages in blk_apply_bdi_limits
selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils
selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx'
selftests: ublk: fix recover test
block: hoist block size validation code to a separate function
block: fix race between set_blocksize and read paths
nvmet: fix out-of-bounds access in nvmet_enable_port
Loading a driver just to configure blk-cgroup doesn't make sense, as that
assumes and already existing device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkdev_get_no_open can trigger the legacy autoload of block drivers. A
simple stat of a block device has not historically done that, so disable
this behavior again.
Fixes: 9abcfbd235 ("block: Add atomic write support for statx")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
backing_inode is only used once, so remove it and update the comment
describing the bdev lookup to be a bit more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These are only to be used by block internal code. Remove the comment
as we grew more users due to reworking block device node opening.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the user increased the read-ahead size through sysfs this value
currently get lost if the device is reprobe, including on a resume
from suspend.
As there is no hardware limitation for the read-ahead size there is
no real need to reset it or track a separate hardware limitation
like for max_sectors.
This restores the pre-atomic queue limit behavior in the sd driver as
sd did not use blk_queue_io_opt and thus never updated the read ahead
size to the value based of the optimal I/O, but changes behavior for
all other drivers. As the new behavior seems useful and sd is the
driver for which the readahead size tweaks are most useful that seems
like a worthwhile trade off.
Fixes: 804e498e04 ("sd: convert to the atomic queue limits API")
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250424082521.1967286-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hoist the block size validation code to bdev_validate_blocksize so that
we can call it from filesystems that don't care about the bdev pagecache
manipulations of set_blocksize.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/174543795720.4139148.840349813093799165.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the new large sector size support, it's now the case that
set_blocksize can change i_blksize and the folio order in a manner that
conflicts with a concurrent reader and causes a kernel crash.
Specifically, let's say that udev-worker calls libblkid to detect the
labels on a block device. The read call can create an order-0 folio to
read the first 4096 bytes from the disk. But then udev is preempted.
Next, someone tries to mount an 8k-sectorsize filesystem from the same
block device. The filesystem calls set_blksize, which sets i_blksize to
8192 and the minimum folio order to 1.
Now udev resumes, still holding the order-0 folio it allocated. It then
tries to schedule a read bio and do_mpage_readahead tries to create
bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because
the page size is 4096 but the blocksize is 8192 so no bufferheads are
attached and the bh walk never sets bdev. We then submit the bio with a
NULL block device and crash.
Therefore, truncate the page cache after flushing but before updating
i_blksize. However, that's not enough -- we also need to lock out file
IO and page faults during the update. Take both the i_rwsem and the
invalidate_lock in exclusive mode for invalidations, and in shared mode
for read/write operations.
I don't know if this is the correct fix, but xfs/259 found it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/174543795699.4139148.2086129139322431423.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Even if blk-rq-qos isn't used or configured, dipping into the queue to
fetch ->rq_qos is a noticeable slowdown and visible in profiles. Add an
unlikely static key around blk-rq-qos, to avoid fetching this cacheline
if blk-iolatency or blk-wbt isn't configured or used.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On x86, rep stos will be emitted to clear the the blk_mq_alloc_data
struct, as not all members are being explicitly initialied. Depending on
the type of CPU, this is a noticeable slowdown compared to just ensuring
that the struct is fully initialized when setup.
For the 4 spots that setup a struct blk_mq_alloc_data on the stack,
ensure all members are being initialized.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The 'nr_budgets' argument of blk_mq_dispatch_rq_list() is either the
number of elements in the 'list' argument or zero. Instead of passing
the number of list elements to blk_mq_dispatch_rq_list(), pass a boolean
argument that indicates whether or not blk_mq_dispatch_rq_list() should
request the block driver for a budget for each request in 'list'.
Remove the code for counting list elements from blk_mq_dispatch_rq_list()
callers where possible. Remove the code that decrements nr_budgets from
blk_mq_dispatch_rq_list() because it is superfluous. Each request that
is processed by blk_mq_dispatch_rq_list() is in one of these two states
if 'get_budget' is false:
* Either the request is on 'list' and the budget for the request has to
be released from the error path.
* Or the request is not on 'list' and q->mq_ops->queue_rq() has already
released the budget (ret != BLK_STS_OK) or q->mq_ops->queue_rq() will
release the budget asynchronously (ret == BLK_STS_OK).
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: John Garry <john.g.garry@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20250415205134.3650042-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaAQM5QAKCRCRxhvAZXjc
olcwAP0RETZn15Jkt5+mKjcx99fuVE7je3lp56UH4Y4XjZmthgEA1n65RDr4Tq6E
548A2/9Hnt4NWdvoi9VhrG4+5dNRowM=
=cFFa
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Revert the hfs{plus} deprecation warning that's also included in this
pull request. The commit introducing the deprecation warning resides
rather early in this branch. So simply dropping it would've rebased
all other commits which I decided to avoid. Hence the revert in the
same branch
[ Background - the deprecation warning discussion resulted in people
stepping up, and so hfs{plus} will have a maintainer taking care of
it after all.. - Linus ]
- Switch CONFIG_SYSFS_SYCALL default to n and decouple from
CONFIG_EXPERT
- Fix an audit bug caused by changes to our kernel path lookup helpers
this cycle. Audit needs the parent path even if the dentry it tried
to look up is negative
- Ensure that the kernel path lookup helpers leave the passed in path
argument clean when they return an error. This is consistent with all
our other helpers
- Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant
information is available to kernel consumers as well
- Don't set a timer and call schedule() if the timer will expire
immediately in epoll
- Make netfs lookup tables with __nonstring
* tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Revert "hfs{plus}: add deprecation warning"
fs: move the bdex_statx call to vfs_getattr_nosec
netfs: Mark __nonstring lookup tables
eventpoll: Set epoll timeout if it's in the future
fs: ensure that *path_locked*() helpers leave passed path pristine
fs: add kern_path_locked_negative()
hfs{plus}: add deprecation warning
Kconfig: switch CONFIG_SYSFS_SYCALL default to n
Currently bdex_statx is only called from the very high-level
vfs_statx_path function, and thus bypassing it for in-kernel calls
to vfs_getattr or vfs_getattr_nosec.
This breaks querying the block ѕize of the underlying device in the
loop driver and also is a pitfall for any other new kernel caller.
Move the call into the lowest level helper to ensure all callers get
the right results.
Fixes: 2d985f8c6b ("vfs: support STATX_DIOALIGN on block devices")
Fixes: f4774e92aa ("loop: take the file system minimum dio alignment into account")
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250417064042.712140-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Placing multiple protection information buffers inside the same page
can lead to oopses because set_page_dirty_lock() can't be called from
interrupt context.
Since a protection information buffer is not backed by a file there is
no point in setting its page dirty, there is nothing to synchronize.
Drop the call to set_page_dirty_lock() and remove the last argument to
bio_integrity_unpin_bvec().
Cc: stable@vger.kernel.org
Fixes: 492c5d4559 ("block: bio-integrity: directly map user buffers")
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/yq1v7r3ev9g.fsf@ca-mkp.ca.oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When registering a queue fails after blk_mq_sysfs_register() is
successful but the function later encounters an error, we need
to clean up the blk_mq_sysfs resources.
Add the missing blk_mq_sysfs_unregister() call in the error path
to properly clean up these resources and prevent a memory leak.
Fixes: 320ae51fee ("blk-mq: new multi-queue block IO queueing mechanism")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250412092554.475218-1-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add an SPDX license identifier line to blk-throttle.h
Use 'GPL-2.0' as the identifier, since blk-throttle.c uses
that, and blk.h (from which some material was copied when
blk-throttle.h was created) also uses that identifier.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/MW5PR13MB5632EE4645BCA24ED111EC0EFDB62@MW5PR13MB5632.namprd13.prod.outlook.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This non-functional change serves as preparation for moving to
subsystem-based rstat trees. To simplify future commits, change the
signatures of existing cgroup-based rstat functions to become css-based and
rename them to reflect that.
Though the signatures have changed, the implementations have not. Within
these functions use the css->cgroup pointer to obtain the associated cgroup
and allow code to function the same just as it did before this patch. At
applicable call sites, pass the subsystem-specific css pointer as an
argument or pass a pointer to cgroup::self if not in subsystem context.
Note that cgroup_rstat_updated_list() and cgroup_rstat_push_children()
are not altered yet since there would be a larger amount of css to
cgroup conversions which may overcomplicate the code at this
intermediate phase.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c.
queue freeze lock depends on fs_reclaim.
So don't grab elevator lock during queue initialization which needs to
call kmalloc(GFP_KERNEL), and we can cut the dependency between
->elevator_lock and fs_reclaim, then the lockdep warning can be killed.
This way is safe because elevator setting isn't ready to run during
queue initialization.
There isn't such issue in __blk_mq_update_nr_hw_queues() because
memalloc_noio_save() is called before acquiring elevator lock.
Fixes the following lockdep warning:
https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/
Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com
Cc: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250403105402.1334206-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Another set of improvements to the kernel's CRC (cyclic redundancy
check) code:
- Rework the CRC64 library functions to be directly optimized, like what
I did last cycle for the CRC32 and CRC-T10DIF library functions.
- Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
support and acceleration for crc64_be and crc64_nvme.
- Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
crc_t10dif, crc64_be, and crc64_nvme.
- Remove crc_t10dif and crc64_rocksoft from the crypto API, since they
are no longer needed there.
- Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect.
- Add kunit test cases for crc64_nvme and crc7.
- Eliminate redundant functions for calculating the Castagnoli CRC32,
settling on just crc32c().
- Remove unnecessary prompts from some of the CRC kconfig options.
- Further optimize the x86 crc32c code.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ+CGGhQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK3wRAP4tbnzawUmlIHIF0hleoADXehUgAhMt
NZn15mGvyiuwIQEA8W9qvnLdFXZkdxhxAEvDDFjyrRauL6eGtr/GvCx4AQY=
=wmKG
-----END PGP SIGNATURE-----
Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC updates from Eric Biggers:
"Another set of improvements to the kernel's CRC (cyclic redundancy
check) code:
- Rework the CRC64 library functions to be directly optimized, like
what I did last cycle for the CRC32 and CRC-T10DIF library
functions
- Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
support and acceleration for crc64_be and crc64_nvme
- Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
crc_t10dif, crc64_be, and crc64_nvme
- Remove crc_t10dif and crc64_rocksoft from the crypto API, since
they are no longer needed there
- Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect
- Add kunit test cases for crc64_nvme and crc7
- Eliminate redundant functions for calculating the Castagnoli CRC32,
settling on just crc32c()
- Remove unnecessary prompts from some of the CRC kconfig options
- Further optimize the x86 crc32c code"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits)
x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512
lib/crc: remove unnecessary prompt for CONFIG_CRC64
lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C
lib/crc: remove unnecessary prompt for CONFIG_CRC8
lib/crc: remove unnecessary prompt for CONFIG_CRC7
lib/crc: remove unnecessary prompt for CONFIG_CRC4
lib/crc7: unexport crc7_be_syndrome_table
lib/crc_kunit.c: update comment in crc_benchmark()
lib/crc_kunit.c: add test and benchmark for crc7_be()
x86/crc32: optimize tail handling for crc32c short inputs
riscv/crc64: add Zbc optimized CRC64 functions
riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function
riscv/crc32: reimplement the CRC32 functions using new template
riscv/crc: add "template" for Zbc optimized CRC functions
x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings
x86/crc32: improve crc32c_arch() code generation with clang
x86/crc64: implement crc64_be and crc64_nvme using new template
x86/crc-t10dif: implement crc_t10dif using new template
x86/crc32: implement crc32_le using new template
x86/crc: add "template" for [V]PCLMULQDQ based CRC functions
...
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the upcoming
Rust integration and is obviously a silly implementation to begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and the hrtimer::function becomes a private member.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
iJbvLJeisXr3GA==
=bYAX
-----END PGP SIGNATURE-----
Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the
upcoming Rust integration and is obviously a silly implementation to
begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and the hrtimer::function becomes a private member"
* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
wifi: rt2x00: Switch to use hrtimer_update_function()
io_uring: Use helper function hrtimer_update_function()
serial: xilinx_uartps: Use helper function hrtimer_update_function()
ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
RDMA: Switch to use hrtimer_setup()
virtio: mem: Switch to use hrtimer_setup()
drm/vmwgfx: Switch to use hrtimer_setup()
drm/xe/oa: Switch to use hrtimer_setup()
drm/vkms: Switch to use hrtimer_setup()
drm/msm: Switch to use hrtimer_setup()
drm/i915/request: Switch to use hrtimer_setup()
drm/i915/uncore: Switch to use hrtimer_setup()
drm/i915/pmu: Switch to use hrtimer_setup()
drm/i915/perf: Switch to use hrtimer_setup()
drm/i915/gvt: Switch to use hrtimer_setup()
drm/i915/huc: Switch to use hrtimer_setup()
drm/amdgpu: Switch to use hrtimer_setup()
stm class: heartbeat: Switch to use hrtimer_setup()
i2c: Switch to use hrtimer_setup()
iio: Switch to use hrtimer_setup()
...
- Add deprecation info messages to cgroup1-only features.
- rstat updates including a bug fix and breaking up a critical section to
reduce interrupt latency impact.
- Other misc and doc updates.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ9xO2g4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGQz4AQDeWKmngRsnddEMkqOV1ArwXSr+8xUQrvCBx0RL
vcjOQQEAusGCTeGXWJ96kw+N9BXvGwFsfSeoxjOqAnvrBS1EgAc=
=WvJg
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- Add deprecation info messages to cgroup1-only features
- rstat updates including a bug fix and breaking up a critical section
to reduce interrupt latency impact
- Other misc and doc updates
* tag 'cgroup-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: Cleanup flushing functions and locking
cgroup/rstat: avoid disabling irqs for O(num_cpu)
mm: Fix a build breakage in memcontrol-v1.c
blk-cgroup: Simplify policy files registration
cgroup: Update file naming comment
cgroup: Add deprecation message to legacy freezer controller
mm: Add transformation message for per-memcg swappiness
RFC cgroup/cpuset-v1: Add deprecation messages to sched_relax_domain_level
cgroup/cpuset-v1: Add deprecation messages to memory_migrate
cgroup/cpuset-v1: Add deprecation messages to mem_exclusive and mem_hardwall
cgroup: Print message when /proc/cgroups is read on v2-only system
cgroup/blkio: Add deprecation messages to reset_stats
cgroup/cpuset-v1: Add deprecation messages to memory_spread_page and memory_spread_slab
cgroup/cpuset-v1: Add deprecation messages to sched_load_balance and memory_pressure_enabled
cgroup, docs: Be explicit about independence of RT_GROUP_SCHED and non-cpu controllers
cgroup/rstat: Fix forceidle time in cpu.stat
cgroup/misc: Remove unused misc_cg_res_total_usage
cgroup/cpuset: Move procfs cpuset attribute under cgroup-v1.c
cgroup: update comment about dropping cgroup kn refs
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rxAAKCRCRxhvAZXjc
ooIPAQCwMjDjtWegvBy8kefiRw+fa4z3ZWHrwRT9DJrD/K9WyAD+JVd0ou27SVpQ
jKpRSRct2eTbyxdYiGydHQGm5F5sLg4=
=0FyQ
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs pagesize updates from Christian Brauner:
"This enables block sizes greater than the page size for block devices.
With this we can start supporting block devices with logical block
sizes larger than 4k.
It also allows to lift the device cache sector size support to 64k.
This allows filesystems which can use larger sector sizes up to 64k to
ensure that the filesystem will not generate writes that are smaller
than the specified sector size"
* tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()
bdev: use bdev_io_min() for statx block size
block/bdev: lift block size restrictions to 64k
block/bdev: enable large folio support for large logical block sizes
fs/buffer fs/mpage: remove large folio restriction
fs/mpage: use blocks_per_folio instead of blocks_per_page
fs/mpage: avoid negative shift for large blocksize
fs/buffer: remove batching from async read
fs/buffer: simplify block_read_full_folio() with bh_offset()
In case blkg_conf_open_bdev_frozen() fails, ioc_qos_write() jumps to the
error path without assigning a value to 'ret'. Ensure that it inherits
the error from the passed back error value.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202503200454.QWpwKeJu-lkp@intel.com/
Fixes: 9730763f47 ("block: correct locking order for protecting blk-wbt parameters")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The commit '245618f8e45f ("block: protect wbt_lat_usec using q->
elevator_lock")' introduced q->elevator_lock to protect updates
to blk-wbt parameters when writing to the sysfs attribute wbt_
lat_usec and the cgroup attribute io.cost.qos. However, both
these attributes also acquire q->rq_qos_mutex, leading to the
following lockdep warning:
======================================================
WARNING: possible circular locking dependency detected
6.14.0-rc5+ #138 Not tainted
------------------------------------------------------
bash/5902 is trying to acquire lock:
c000000085d495a0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: wbt_init+0x164/0x238
but task is already holding lock:
c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock+0xf0/0xa58
ioc_qos_write+0x16c/0x85c
cgroup_file_write+0xc4/0x32c
kernfs_fop_write_iter+0x1b8/0x29c
vfs_write+0x410/0x584
ksys_write+0x84/0x140
system_call_exception+0x134/0x360
system_call_vectored_common+0x15c/0x2ec
-> #0 (&q->rq_qos_mutex){+.+.}-{4:4}:
__lock_acquire+0x1b6c/0x2ae0
lock_acquire+0x140/0x430
__mutex_lock+0xf0/0xa58
wbt_init+0x164/0x238
queue_wb_lat_store+0x1dc/0x20c
queue_attr_store+0x12c/0x164
sysfs_kf_write+0x6c/0xb0
kernfs_fop_write_iter+0x1b8/0x29c
vfs_write+0x410/0x584
ksys_write+0x84/0x140
system_call_exception+0x134/0x360
system_call_vectored_common+0x15c/0x2ec
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->elevator_lock);
lock(&q->rq_qos_mutex);
lock(&q->elevator_lock);
lock(&q->rq_qos_mutex);
*** DEADLOCK ***
6 locks held by bash/5902:
#0: c000000051122400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x84/0x140
#1: c00000007383f088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x174/0x29c
#2: c000000008550428 (kn->active#182){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x180/0x29c
#3: c000000085d493a8 (&q->q_usage_counter(io)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
#4: c000000085d493e0 (&q->q_usage_counter(queue)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
#5: c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c
stack backtrace:
CPU: 17 UID: 0 PID: 5902 Comm: bash Kdump: loaded Not tainted 6.14.0-rc5+ #138
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Call Trace:
[c0000000721ef590] [c00000000118f8a8] dump_stack_lvl+0x108/0x18c (unreliable)
[c0000000721ef5c0] [c00000000022563c] print_circular_bug+0x448/0x604
[c0000000721ef670] [c000000000225a44] check_noncircular+0x24c/0x26c
[c0000000721ef740] [c00000000022bf28] __lock_acquire+0x1b6c/0x2ae0
[c0000000721ef870] [c000000000229240] lock_acquire+0x140/0x430
[c0000000721ef970] [c0000000011cfbec] __mutex_lock+0xf0/0xa58
[c0000000721efaa0] [c00000000096c46c] wbt_init+0x164/0x238
[c0000000721efaf0] [c0000000008f8cd8] queue_wb_lat_store+0x1dc/0x20c
[c0000000721efb50] [c0000000008f8fa0] queue_attr_store+0x12c/0x164
[c0000000721efc60] [c0000000007c11cc] sysfs_kf_write+0x6c/0xb0
[c0000000721efca0] [c0000000007bfa4c] kernfs_fop_write_iter+0x1b8/0x29c
[c0000000721efcf0] [c0000000006a281c] vfs_write+0x410/0x584
[c0000000721efdc0] [c0000000006a2cc8] ksys_write+0x84/0x140
[c0000000721efe10] [c000000000031b64] system_call_exception+0x134/0x360
[c0000000721efe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>From the above log it's apparent that method which writes to sysfs attr
wbt_lat_usec acquires q->elevator_lock first, and then acquires q->rq_
qos_mutex. However the another method which writes to io.cost.qos,
acquires q->rq_qos_mutex first, and then acquires q->rq_qos_mutex. So
this could potentially cause the deadlock.
A closer look at ioc_qos_write shows that correcting the lock order is
non-trivial because q->rq_qos_mutex is acquired in blkg_conf_open_bdev
and released in blkg_conf_exit. The function blkg_conf_open_bdev is
responsible for parsing user input and finding the corresponding block
device (bdev) from the user provided major:minor number.
Since we do not know the bdev until blkg_conf_open_bdev completes, we
cannot simply move q->elevator_lock acquisition before blkg_conf_open_
bdev. So to address this, we intoduce new helpers blkg_conf_open_bdev_
frozen and blkg_conf_exit_frozen which are just wrappers around blkg_
conf_open_bdev and blkg_conf_exit respectively. The helper blkg_conf_
open_bdev_frozen is similar to blkg_conf_open_bdev, but additionally
freezes the queue, acquires q->elevator_lock and ensures the correct
locking order is followed between q->elevator_lock and q->rq_qos_mutex.
Similarly another helper blkg_conf_exit_frozen in addition to unfreezing
the queue ensures that we release the locks in correct order.
By using these helpers, now we maintain the same locking order in all
code paths where we update blk-wbt parameters.
Fixes: 245618f8e4 ("block: protect wbt_lat_usec using q->elevator_lock")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202503171650.cc082b66-lkp@intel.com
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250319105518.468941-3-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The ioc_qos_write method acquires q->elevator_lock to protect
updates to blk-wbt parameters. Once these updates are complete,
the lock should be released before returning from ioc_qos_write.
However, in one code path, the release of q->elevator_lock was
mistakenly omitted, potentially leading to a lock leak. This commit
fixes the issue by ensuring that q->elevator_lock is properly
released in all return paths of ioc_qos_write.
Fixes: 245618f8e4 ("block: protect wbt_lat_usec using q->elevator_lock")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202503171650.cc082b66-lkp@intel.com
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250319105518.468941-2-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch improve the returned error code of blkcg_policy_register().
1. Move the validation check for cpd/pd_alloc_fn and cpd/pd_free_fn
function pairs to the start of blkcg_policy_register(). This ensures
we immediately return -EINVAL if the function pairs are not correctly
provided, rather than returning -ENOSPC after locking and unlocking
mutexes unnecessarily.
Those locks should not contention any problems, as error of policy
registration is a super cold path.
2. Return -ENOMEM when cpd_alloc_fn() failed.
Co-authored-by: Wen Tao <wentao@uniontech.com>
Signed-off-by: Wen Tao <wentao@uniontech.com>
Signed-off-by: Chen Linxuan <chenlinxuan@uniontech.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/3E333A73B6B6DFC0+20250317022924.150907-1-chenlinxuan@uniontech.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The hctx_busy_show method in debugfs is currently unprotected. This
method iterates over all started requests in a tagset and prints them.
However, the tags can be updated concurrently via the sysfs attributes
'nr_requests' or 'scheduler' (elevator switch), leading to potential
race conditions.
Since sysfs attributes 'nr_requests' and 'scheduler' are already
protected using q->elevator_lock, extend this protection to the debugfs
'busy' attribute as well to ensure consistency.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250313115235.3707600-4-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In some debugfs attribute read methods, failure to acquire the mutex
lock results in jumping to a label before returning an error code.
However this is unnecessary, as we can return the failure code directly,
improving code readability and reducing complexity.
This commit removes the goto labels and ensures that the method returns
immediately upon failing to acquire the mutex lock.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250313115235.3707600-3-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, the block debugfs attributes (tags, tags_bitmap, sched_tags,
and sched_tags_bitmap) are protected using q->sysfs_lock. However, these
attributes are updated in multiple scenarios:
- During driver probe method
- During an elevator switch/update
- During an nr_hw_queues update
- When writing to the sysfs attribute nr_requests
All these update paths (except driver probe method, which doesn't
require any protection) are already protected using q->elevator_lock. To
ensure consistency and proper synchronization, replace q->sysfs_lock
with q->elevator_lock for protecting these debugfs attributes.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250313115235.3707600-2-nilay@linux.ibm.com
[axboe: some commit message rewording/fixes]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
request_queue param is no longer used by blk_rq_map_sg and
__blk_rq_map_sg. Remove it.
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250313035322.243239-1-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
>4GB folio is possible on some ARCHs, such as aarch64, 16GB hugepage
is supported, then 'offset' of folio can't be held in 'unsigned int',
cause warning in bio_add_folio_nofail() and IO failure.
Fix it by adjusting 'page' & trimming 'offset' so that `->bi_offset` won't
be overflow, and folio can be added to bio successfully.
Fixes: ed9832bc08 ("block: introduce folio awareness and add a bigger size from folio")
Cc: Kundan Kumar <kundan.kumar@samsung.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Gavin Shan <gshan@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20250312145136.2891229-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use one set of files when there is no difference between default and
legacy files, similar to regular subsys files registration. No
functional change.
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
It is difficult to sync with stat updaters, stats are (should be)
monotonic so users can calculate differences from a reference.
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
In _badblocks_check(), there are lines of code like this,
1246 sectors -= len;
[snipped]
1251 WARN_ON(sectors < 0);
The WARN_ON() at line 1257 doesn't make sense because sectors is
unsigned long long type and never to be <0.
Fix it by checking directly checking whether sectors is less than len.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Coly Li <colyli@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250309160556.42854-1-colyli@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make sure ->nr_integrity_segments is cloned in blk_rq_prep_clone(),
otherwise requests cloned by device-mapper multipath will not have the
proper nr_integrity_segments values set, then BUG() is hit from
sg_alloc_table_chained().
Fixes: b0fd271d5f ("block: add request clone interface (v2)")
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250310115453.2271109-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, hctx attributes (nr_tags, nr_reserved_tags, and cpu_list)
are protected using `q->sysfs_lock`. However, these attributes can be
updated in multiple scenarios:
- During the driver's probe method.
- When updating nr_hw_queues.
- When writing to the sysfs attribute nr_requests,
which can modify nr_tags.
The nr_requests attribute is already protected using q->elevator_lock,
but none of the update paths actually use q->sysfs_lock to protect hctx
attributes. So to ensure proper synchronization, replace q->sysfs_lock
with q->elevator_lock when reading hctx attributes through sysfs.
Additionally, blk_mq_update_nr_hw_queues allocates and updates hctx.
The allocation of hctx is protected using q->elevator_lock, however,
updating hctx params happens without any protection, so safeguard hctx
param update path by also using q->elevator_lock.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250306093956.2818808-1-nilay@linux.ibm.com
[axboe: wrap comment at 80 chars]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bdi->ra_pages could be updated under q->limits_lock because it's
usually calculated from the queue limits by queue_limits_commit_update.
So protect reading/writing the sysfs attribute read_ahead_kb using
q->limits_lock instead of q->sysfs_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-8-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The wbt latency and state could be updated while initializing the
elevator or exiting the elevator. It could be also updated while
configuring IO latency QoS parameters using cgroup. The elevator
code path is now protected with q->elevator_lock. So we should
protect the access to sysfs attribute wbt_lat_usec using q->elevator
_lock instead of q->sysfs_lock. White we're at it, also protect
ioc_qos_write(), which configures wbt parameters via cgroup, using
q->elevator_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-7-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The sysfs attribute nr_requests could be simultaneously updated from
elevator switch/update or nr_hw_queue update code path. The update to
nr_requests for each of those code paths runs holding q->elevator_lock.
So we should protect access to sysfs attribute nr_requests using q->
elevator_lock instead of q->sysfs_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-6-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A queue's elevator can be updated either when modifying nr_hw_queues
or through the sysfs scheduler attribute. Currently, elevator switching/
updating is protected using q->sysfs_lock, but this has led to lockdep
splats[1] due to inconsistent lock ordering between q->sysfs_lock and
the freeze-lock in multiple block layer call sites.
As the scope of q->sysfs_lock is not well-defined, its (mis)use has
resulted in numerous lockdep warnings. To address this, introduce a new
q->elevator_lock, dedicated specifically for protecting elevator
switches/updates. And we'd now use this new q->elevator_lock instead of
q->sysfs_lock for protecting elevator switches/updates.
While at it, make elv_iosched_load_module() a static function, as it is
only called from elv_iosched_store(). Also, remove redundant parameters
from elv_iosched_load_module() function signature.
[1] https://lore.kernel.org/all/67637e70.050a0220.3157ee.000c.GAE@google.com/
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-5-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There're few sysfs attributes in block layer which don't really need
acquiring q->sysfs_lock while accessing it. The reason being, reading/
writing a value from/to such attributes are either atomic or could be
easily protected using READ_ONCE()/WRITE_ONCE(). Moreover, sysfs
attributes are inherently protected with sysfs/kernfs internal locking.
So this change help segregate all existing sysfs attributes for which
we could avoid acquiring q->sysfs_lock. For all read-only attributes
we removed the q->sysfs_lock from show method of such attributes. In
case attribute is read/write then we removed the q->sysfs_lock from
both show and store methods of these attributes.
We audited all block sysfs attributes and found following list of
attributes which shouldn't require q->sysfs_lock protection:
1. io_poll:
Write to this attribute is ignored. So, we don't need q->sysfs_lock.
2. io_poll_delay:
Write to this attribute is NOP, so we don't need q->sysfs_lock.
3. io_timeout:
Write to this attribute updates q->rq_timeout and read of this
attribute returns the value stored in q->rq_timeout Moreover, the
q->rq_timeout is set only once when we init the queue (under blk_mq_
init_allocated_queue()) even before disk is added. So that means
that we don't need to protect it with q->sysfs_lock. As this
attribute is not directly correlated with anything else simply using
READ_ONCE/WRITE_ONCE should be enough.
4. nomerges:
Write to this attribute file updates two q->flags : QUEUE_FLAG_
NOMERGES and QUEUE_FLAG_NOXMERGES. These flags are accessed during
bio-merge which anyways doesn't run with q->sysfs_lock held.
Moreover, the q->flags are updated/accessed with bitops which are
atomic. So, protecting it with q->sysfs_lock is not necessary.
5. rq_affinity:
Write to this attribute file makes atomic updates to q->flags:
QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE. These flags are
also accessed from blk_mq_complete_need_ipi() using test_bit macro.
As read/write to q->flags uses bitops which are atomic, protecting
it with q->stsys_lock is not necessary.
6. nr_zones:
Write to this attribute happens in the driver probe method (except
nvme) before disk is added and outside of q->sysfs_lock or any other
lock. Moreover nr_zones is defined as "unsigned int" and so reading
this attribute, even when it's simultaneously being updated on other
cpu, should not return torn value on any architecture supported by
linux. So we can avoid using q->sysfs_lock or any other lock/
protection while reading this attribute.
7. discard_zeroes_data:
Reading of this attribute always returns 0, so we don't require
holding q->sysfs_lock.
8. write_same_max_bytes
Reading of this attribute always returns 0, so we don't require
holding q->sysfs_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-4-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation to further simplify and group sysfs attributes which
don't require locking or require some form of locking other than q->
limits_lock, move acquire/release of q->sysfs_lock and queue freeze/
unfreeze under each attributes' respective show/store method.
While we are at it, also remove ->load_module() as it's used to load
the module before queue is freezed. Now as we moved queue-freeze under
->store(), we could load module directly from the attributes' store
method before we actually start freezing the queue. Currently, the
->load_module() is only used by "scheduler" attribute, so we now load
the relevant elevator module before we start freezing the queue in
elv_iosched_store().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-3-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There're few sysfs attributes(RW) whose store method is protected
with q->limits_lock, however the corresponding show method of these
attributes run holding q->sysfs_lock and that doesn't make sense
as ideally the show method of these attributes should also run
holding q->limits_lock instead of q->sysfs_lock. Hence update the
show method of these sysfs attributes so that reading of these
attributes acquire q->limits_lock instead of q->sysfs_lock.
Similarly, there're few sysfs attributes(RO) whose show method is
currently protected with q->sysfs_lock however updates to these
attributes could occur using atomic limit update APIs such as queue_
limits_start_update() and queue_limits_commit_update() which run
holding q->limits_lock. So that means that reading these attributes
holding q->sysfs_lock doesn't make sense. Hence update the show method
of these sysfs attributes(RO) such that they run with holding q->
limits_lock instead of q->sysfs_lock.
We have defined a new macro QUEUE_LIM_RO_ENTRY() which uses new ->show_
limit() method and it runs holding q->limits_lock. All existing sysfs
attributes(RO) which needs protection using q->limits_lock while
reading have been now updated to use this new macro for initialization.
Also, the existing QUEUE_LIM_RW_ENTRY() is updated to use new ->show_
limit() method for reading attributes instead of existing ->show()
method. As ->show_limit() runs holding q->limits_lock, the existing
sysfs attributes(RW) requiring protection are now inherently protected
using q->limits_lock instead of q->sysfs_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304102551.2533767-2-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfKQvsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnBCD/9bVSGHNnXakVwdpQmtU5zy54cyWd7VaYsz
qeM+Vrl1m5nf8q5ZdEXcM11Ruib3YJiW0GN9d9sWpTwt8C5n8g+8F63koS7GordZ
jcv77nO6FlnwWpm3YlNxAeLuxkl15e4MQIKj/jb540iFygzT8H2lygE816K4kpCX
XuMxNxdSMksntovZufzxo3Sfkm6e6GChCkkqvBxuXiEWFhvbFQ/ZLEsEMtoH4hkI
3Nj1VB3B3pLVCZhWr2uVvcZCiYUDyBslu+SA3RRoX0W6beK1cVI4OQdS8GtnkJf3
qFnLQz0Ib3EVDtugqg7ZGSAAov6Z8waA2MrFeZkG8uIfl4WT3kBfoan7jRX3Mknl
VnFkThyJOzB83OKqlZKjCzYmEzBhKJrRJVtneIrxT+gvEpevFvAQil6SQfyPDwno
4YcUD+IfU/daTdVR58QQ/iLzkQ7stQWYCtZSrICKfcAGy6zswKM5P5uoWltMBwQh
aHsyz9xbmsMrxch1DPRb0T2GD2h9BsiL6rT8JCrOgucMuOYeZL9pNRgz16D/hael
wBCxPcanSdap0N9kiMX8fLYYdmRxpJHzTbeNRsPhZe8HKUPu1sYTbisOou1XSdAW
Dv7zeQWVlw+1cn/S1Y6Oc4mdlPzPTA9szuBXVpbe9Gd7ZqO7sbbKEkGu5w6MGSZ1
oubnZKCNvA==
=jKDe
-----END PGP SIGNATURE-----
Merge tag 'block-6.14-20250306' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- TCP use after free fix on polling (Sagi)
- Controller memory buffer cleanup fixes (Icenowy)
- Free leaking requests on bad user passthrough commands (Keith)
- TCP error message fix (Maurizio)
- TCP corruption fix on partial PDU (Maurizio)
- TCP memory ordering fix for weakly ordered archs (Meir)
- Type coercion fix on message error for TCP (Dan)
- Name the RQF flags enum, fixing issues with anon enums and BPF import
of it
- ublk parameter setting fix
- GPT partition 7-bit conversion fix
* tag 'block-6.14-20250306' of git://git.kernel.dk/linux:
block: Name the RQF flags enum
nvme-tcp: fix signedness bug in nvme_tcp_init_connection()
block: fix conversion of GPT partition name to 7-bit
ublk: set_params: properly check if parameters can be applied
nvmet-tcp: Fix a possible sporadic response drops in weakly ordered arch
nvme-tcp: fix potential memory corruption in nvme_tcp_recv_pdu()
nvme-tcp: Fix a C2HTermReq error message
nvmet: remove old function prototype
nvme-ioctl: fix leaked requests on mapping error
nvme-pci: skip CMB blocks incompatible with PCI P2P DMA
nvme-pci: clean up CMBMSC when registering CMB fails
nvme-tcp: fix possible UAF in nvme_tcp_poll
The commit titled "block/bdev: lift block size restrictions to 64k"
lifted the block layer's max supported block size to 64k inside the
helper blk_validate_block_size() now that we support large folios.
However in lifting the block size we also removed the silly use
cases many filesystems have to use sb_set_blocksize() to *verify*
that the block size <= PAGE_SIZE. The call to sb_set_blocksize() was
used to check the block size <= PAGE_SIZE since historically we've
always supported userspace to create for example 64k block size
filesystems even on 4k page size systems, but what we didn't allow
was mounting them. Older filesystems have been using the check with
sb_set_blocksize() for years.
While, we could argue that such checks should be filesystem specific,
there are much more users of sb_set_blocksize() than LBS enabled
filesystem on upstream, so just do the easier thing and bring back
the PAGE_SIZE check for sb_set_blocksize() users and only skip it
for LBS enabled filesystems.
This will ensure that tests such as generic/466 when run in a loop
against say, ext4, won't try to try to actually mount a filesystem with
a block size larger than your filesystem supports given your PAGE_SIZE
and in the worst case crash.
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250307020403.3068567-1-mcgrof@kernel.org
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
There is a truncation of badblocks length issue when set badblocks as
follow:
echo "2055 4294967299" > bad_blocks
cat bad_blocks
2055 3
Change 'sectors' argument type from 'int' to 'sector_t'.
This change avoids truncation of badblocks length for large sectors by
replacing 'int' with 'sector_t' (u64), enabling proper handling of larger
disk sizes and ensuring compatibility with 64-bit sector addressing.
Fixes: 9e0e252a04 ("badblocks: Add core badblock management code")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-13-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change the return type of badblocks_set() and badblocks_clear()
from int to bool, indicating success or failure. Specifically:
- _badblocks_set() and _badblocks_clear() functions now return
true for success and false for failure.
- All calls to these functions are updated to handle the new
boolean return type.
- This change improves code clarity and ensures a more consistent
handling of success and failure states.
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Acked-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20250227075507.151331-11-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bad blocks check would miss bad blocks when retrying under contention,
as checking parameters are not reset. These stale values from the previous
attempt could lead to incorrect scanning in the subsequent retry.
Move seqlock to outer function and reinitialize checking state for each
retry. This ensures a clean state for each check attempt, preventing any
missed bad blocks.
Fixes: 3ea3354cb9 ("badblocks: improve badblocks_check() for multiple ranges handling")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-10-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a merge issue when adding badblocks as follow:
echo 0 10 > bad_blocks
echo 30 10 > bad_blocks
echo 20 10 > bad_blocks
cat bad_blocks
0 10
20 10 //should be merged with (30 10)
30 10
In this case, if new badblocks does not intersect with prev, it is added
by insert_at(). If there is an intersection with prev+1, the merge will
be processed in the next re_insert loop.
However, when the end of the new badblocks is exactly equal to the offset
of prev+1, no further re_insert loop occurs, and the two badblocks are not
merge.
Fix it by inc prev, badblocks can be merged during the subsequent code.
Fixes: aa511ff821 ("badblocks: switch to the improved badblock handling code")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250227075507.151331-9-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Regardless of whether overlap_front() returns true or false,
can_merge_front() will be executed first. Therefore, move
can_merge_front() in front of can_merge_front() to simplify code.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250227075507.151331-8-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The number of badblocks cannot exceed MAX_BADBLOCKS, but it should be
allowed to equal MAX_BADBLOCKS.
Fixes: aa511ff821 ("badblocks: switch to the improved badblock handling code")
Fixes: c3c6a86e9e ("badblocks: add helper routines for badblock ranges handling")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-7-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
_badblocks_set() returns success if at least one badblock is set
successfully, even if others fail. This can lead to data inconsistencies
in raid, where a failed badblock set should trigger the disk to be kicked
out to prevent future reads from failed write areas.
_badblocks_set() should return error if any badblock set fails. Instead
of relying on 'rv', directly returning 'sectors' for clearer logic. If all
badblocks are successfully set, 'sectors' will be 0, otherwise it
indicates the number of badblocks that have not been set yet, thus
signaling failure.
By the way, it can also fix an issue: when a newly set unack badblock is
included in an existing ack badblock, the setting will return an error.
···
echo "0 100" /sys/block/md0/md/dev-loop1/bad_blocks
echo "0 100" /sys/block/md0/md/dev-loop1/unacknowledged_bad_blocks
-bash: echo: write error: No space left on device
```
After fix, it will return success.
Fixes: aa511ff821 ("badblocks: switch to the improved badblock handling code")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-6-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the current handling of badblocks settings, a lot of processing has
been done for scenarios where the number of badblocks exceeds 512.
This makes the code look quite complex and also introduces some issues,
For example, if there is 512 badblocks already:
for((i=0; i<510; i++)); do ((sector=i*2)); echo "$sector 1" > bad_blocks; done
echo 2100 10 > bad_blocks
echo 2200 10 > bad_blocks
Set new one, exceed 512:
echo 2000 500 > bad_blocks
Expected:
2000 500
Actual:
2100 400
In fact, a disk shouldn't have too many badblocks, and for disks with
512 badblocks, attempting to set more bad blocks doesn't make much sense.
At that point, the more appropriate action would be to replace the disk.
Therefore, to resolve these issues and simplify the code somewhat, return
error directly when setting badblocks exceeds 512.
Fixes: aa511ff821 ("badblocks: switch to the improved badblock handling code")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250227075507.151331-5-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If ack and unack badblocks are adjacent, they will not be merged and will
remain as two separate badblocks. Even after the bad blocks are written
to disk and both become ack, they will still remain as two independent
bad blocks. This is not ideal as it wastes the limited space for
badblocks. Therefore, during ack_all_badblocks(), attempt to merge
badblocks if they are adjacent.
Fixes: aa511ff821 ("badblocks: switch to the improved badblock handling code")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-4-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
'bb->shift' is used directly in badblocks. It is wrong, fix it.
Fixes: 3ea3354cb9 ("badblocks: improve badblocks_check() for multiple ranges handling")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-2-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, BLK_INTEGRITY_NOGENERATE and BLK_INTEGRITY_NOVERIFY are not
explicitly set during integrity initialization. This can lead to
incorrect reporting of read_verify and write_generate sysfs values,
particularly when a device does not support integrity. Ensure that these
flags are correctly initialized by default.
Reported-by: M Nikhil <nikh1092@linux.ibm.com>
Link: https://lore.kernel.org/linux-block/f6130475-3ccd-45d2-abde-3ccceada0f0a@linux.ibm.com/
Fixes: 9f4aa46f2a ("block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags")
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250305063033.1813-3-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
queue_limits_stack_integrity() incorrectly sets
BLK_INTEGRITY_DEVICE_CAPABLE for a DM device even when none of its
underlying devices support integrity. This happens because the flag is
inherited unconditionally. Ensure that integrity capabilities are
correctly propagated only when the underlying devices actually support
integrity.
Reported-by: M Nikhil <nikh1092@linux.ibm.com>
Link: https://lore.kernel.org/linux-block/f6130475-3ccd-45d2-abde-3ccceada0f0a@linux.ibm.com/
Fixes: c6e56cf6b2 ("block: move integrity information into queue_limits")
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250305063033.1813-2-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now ->carryover_bytes[] and ->carryover_ios[] only covers limit/config
update.
Actually the carryover bytes/ios can be carried to ->bytes_disp[] and
->io_disp[] directly, since the carryover is one-shot thing and only valid
in current slice.
Then we can remove the two fields and simplify code much.
Type of ->bytes_disp[] and ->io_disp[] has to change as signed because the
two fields may become negative when updating limits or config, but both are
big enough for holding bytes/ios dispatched in single slice
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250305043123.3938491-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 29390bb566 ("blk-throttle: support prioritized processing of metadata")
takes bytes/ios carryover for prioritized processing of metadata. Turns out
we can support it by charging it directly without trimming slice, and the
result is same with carryover.
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250305043123.3938491-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The two fields are not used any more, so remove them.
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250305043123.3938491-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bio submission time may be a few jiffies more than the expected
waiting time, due to 'extra_bytes' can't be divided in
tg_within_bps_limit(), and also due to timer wakeup delay.
In this case, adjust slice_start to jiffies will discard the extra wait time,
causing lower rate than expected.
Current in-tree code already covers deviation by rounddown(), but turns
out it is not enough, because jiffies - slice_start can be a multiple of
throtl_slice.
For example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and
slice is 20ms(2 jiffies), expected rate is 1000 / 1000 * 20 = 20 bytes
per slice.
If user issues two 21 bytes IO, then wait time will be 30ms for the
first IO:
bytes_allowed = 20, extra_bytes = 1;
jiffy_wait = 1 + 2 = 3 jiffies
and consider
extra 1 jiffies by timer, throtl_trim_slice() will be called at:
jiffies = 40ms
slice_start = 0ms, slice_end= 40ms
bytes_disp = 21
In this case, before the patch, real rate in the first two slices is
10.5 bytes per slice, and slice will be updated to:
jiffies = 40ms
slice_start = 40ms, slice_end = 60ms,
bytes_disp = 0;
Hence the second IO will have to wait another 30ms;
With the patch, the real rate in the first slice is 20 bytes per slice,
which is the same as expected, and slice will be updated:
jiffies=40ms,
slice_start = 20ms, slice_end = 60ms,
bytes_disp = 1;
And now, there is still 19 bytes allowed in the second slice, and the
second IO will only have to wait 10ms;
This problem will cause blktests throtl/001 failure in case of
CONFIG_HZ_100=y, fix it by preserving one extra finished slice in
throtl_trim_slice().
Fixes: e43473b7f2 ("blkio: Core implementation of throttle policy")
Reported-by: Ming Lei <ming.lei@redhat.com>
Closes: https://lore.kernel.org/linux-block/20250222092823.210318-3-yukuai1@huaweicloud.com/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250227120645.812815-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The utf16_le_to_7bit function claims to, naively, convert a UTF-16
string to a 7-bit ASCII string. By naively, we mean that it:
* drops the first byte of every character in the original UTF-16 string
* checks if all characters are printable, and otherwise replaces them
by exclamation mark "!".
This means that theoretically, all characters outside the 7-bit ASCII
range should be replaced by another character. Examples:
* lower-case alpha (ɒ) 0x0252 becomes 0x52 (R)
* ligature OE (œ) 0x0153 becomes 0x53 (S)
* hangul letter pieup (ㅂ) 0x3142 becomes 0x42 (B)
* upper-case gamma (Ɣ) 0x0194 becomes 0x94 (not printable) so gets
replaced by "!"
The result of this conversion for the GPT partition name is passed to
user-space as PARTNAME via udev, which is confusing and feels questionable.
However, there is a flaw in the conversion function itself. By dropping
one byte of each character and using isprint() to check if the remaining
byte corresponds to a printable character, we do not actually guarantee
that the resulting character is 7-bit ASCII.
This happens because we pass 8-bit characters to isprint(), which
in the kernel returns 1 for many values > 0x7f - as defined in ctype.c.
This results in many values which should be replaced by "!" to be kept
as-is, despite not being valid 7-bit ASCII. Examples:
* e with acute accent (é) 0x00E9 becomes 0xE9 - kept as-is because
isprint(0xE9) returns 1.
* euro sign (€) 0x20AC becomes 0xAC - kept as-is because isprint(0xAC)
returns 1.
This way has broken pyudev utility[1], fixes it by using a mask of 7 bits
instead of 8 bits before calling isprint.
Link: https://github.com/pyudev/pyudev/issues/490#issuecomment-2685794648 [1]
Link: https://lore.kernel.org/linux-block/4cac90c2-e414-4ebb-ae62-2a4589d9dc6e@canonical.com/
Cc: Mulhern <amulhern@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: stable@vger.kernel.org
Signed-off-by: Olivier Gayot <olivier.gayot@canonical.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250305022154.3903128-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Many of the fields in struct bio_integrity_payload are only needed for
the default integrity buffer in the block layer, and the variable
sized array at the end of the structure makes it very hard to embed
into caller allocated structures.
Reduce struct bio_integrity_payload to the minimal structure needed in
common code and create two separate containing structures for the
automatically generated payload and the caller allocated payload.
The latter is a simple wrapper for struct bio_integrity_payload and
the bvecs, while the former contains the additional fields moved out
of struct bio_integrity_payload.
Always use a dedicated mempool for automatic integrity metadata
instead of depending on bio_set that is submitter controlled and thus
often doesn't have the mempool initialized and stop using mempools for
the submitter buffers as they aren't in the NOIO I/O submission path
where we need to guarantee forward progress.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250225154449.422989-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The code that automatically creates a integrity payload and generates and
verifies the checksums for bios that don't have submitter-provided
integrity payload currently sits right in the middle of the block
integrity metadata infrastructure. Split it into a separate file to
make the different layers clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250225154449.422989-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
None of the few drivers still using the legacy block layer bounce
buffering support integrity metadata. Explicitly mark the features as
incompatible and stop creating the slab and mempool for integrity
buffers for the bounce bio_set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250225154449.422989-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Device mapper bioset often has big bio_slab size, which can be more than
1000, then 8byte can't hold the slab name any more, cause the kmem_cache
allocation warning of 'kmem_cache of name 'bio-108' already exists'.
Fix the warning by extending bio_slab->name to 12 bytes, but fix output
of /proc/slabinfo
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250228132656.2838008-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For devices that natively support zone append operations,
REQ_OP_ZONE_APPEND BIOs are not processed through zone write plugging
and are immediately issued to the zoned device. This means that there is
no write pointer offset tracking done for these operations and that a
zone write plug is not necessary.
However, when receiving a zone append BIO, we may already have a zone
write plug for the target zone if that zone was previously partially
written using regular write operations. In such case, since the write
pointer offset of the zone write plug is not incremented by the amount
of sectors appended to the zone, 2 issues arise:
1) we risk leaving the plug in the disk hash table if the zone is fully
written using zone append or regular write operations, because the
write pointer offset will never reach the "zone full" state.
2) Regular write operations that are issued after zone append operations
will always be failed by blk_zone_wplug_prepare_bio() as the write
pointer alignment check will fail, even if the user correctly
accounted for the zone append operations and issued the regular
writes with a correct sector.
Avoid these issues by immediately removing the zone write plug of zones
that are the target of zone append operations when blk_zone_plug_bio()
is called. The new function blk_zone_wplug_handle_native_zone_append()
implements this for devices that natively support zone append. The
removal of the zone write plug using disk_remove_zone_wplug() requires
aborting all plugged regular write using disk_zone_wplug_abort() as
otherwise the plugged write BIOs would never be executed (with the plug
removed, the completion path will never see again the zone write plug as
disk_get_zone_wplug() will return NULL). Rate-limited warnings are added
to blk_zone_wplug_handle_native_zone_append() and to
disk_zone_wplug_abort() to signal this.
Since blk_zone_wplug_handle_native_zone_append() is called in the hot
path for operations that will not be plugged, disk_get_zone_wplug() is
optimized under the assumption that a user issuing zone append
operations is not at the same time issuing regular writes and that there
are no hashed zone write plugs. The struct gendisk atomic counter
nr_zone_wplugs is added to check this, with this counter incremented in
disk_insert_zone_wplug() and decremented in disk_remove_zone_wplug().
To be consistent with this fix, we do not need to fill the zone write
plug hash table with zone write plugs for zones that are partially
written for a device that supports native zone append operations.
So modify blk_revalidate_seq_zone() to return early to avoid allocating
and inserting a zone write plug for partially written sequential zones
if the device natively supports zone append.
Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Link: https://lore.kernel.org/r/20250214041434.82564-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
wbt_wait() no longer uses a spinlock as a parameter. Update the function
comments accordingly.
RWB_UNKNOWN_BUMP is used when we gradually adjust scale_steps toward the
center state, which is a value of 0.
Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250213100611.209997-2-yizhou.tang@shopee.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Using PAGE_SIZE as a minimum expected DMA segment size in consideration
of devices which have a max DMA segment size of < 64k when used on 64k
PAGE_SIZE systems leads to devices not being able to probe such as
eMMC and Exynos UFS controller [0] [1] you can end up with a probe failure
as follows:
WARNING: CPU: 2 PID: 397 at block/blk-settings.c:339 blk_validate_limits+0x364/0x3c0
Ensure we use min(max_seg_size, seg_boundary_mask + 1) as the new min segment
size when max segment size is < PAGE_SIZE for 16k and 64k base page size systems.
If anyone need to backport this patch, the following commits are depended:
commit 6aeb4f8364 ("block: remove bio_add_pc_page")
commit 02ee5d69e3 ("block: remove blk_rq_bio_prep")
commit b7175e24d6 ("block: add a dma mapping iterator")
Link: https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/ # [0]
Link: https://lore.kernel.org/linux-block/1d55e942-5150-de4c-3a02-c3d066f87028@acm.org/ # [1]
Cc: Yi Zhang <yi.zhang@redhat.com>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Keith Busch <kbusch@kernel.org>
Tested-by: Paul Bunyan <pbunyan@redhat.com>
Reviewed-by: Daniel Gomez <da.gomez@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250225022141.2154581-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We now can support blocksizes larger than PAGE_SIZE, so in theory
we should be able to lift the restriction up to the max supported page
cache order. However bound ourselves to what we can currently validate
and test. Through blktests and fstest we can validate up to 64k today.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250221223823.1680616-8-mcgrof@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Call mapping_set_folio_min_order() when modifying the logical block
size to ensure folios are allocated with the correct size.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250221223823.1680616-7-mcgrof@kernel.org
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAme4rUUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpj9+EADLPOFPa9hT1PGBbpnj74vBayoTO/M+w2Gp
+k2b8if3eGlY43WO2k+ytceWbA901iyvLPRqt1M1Ez8+BrNBg4NKcLv7q4O9NA3i
nDPLggugSc5sdRbLRimxiwHkkpSOBenkdb7R9XGmMXTCfSbRKl0kK01ivpgkbiG4
pbyPWYcoMyHaECBfPhazrJig4+rugXOYbkYoOM4NHsLqlTNfmowcMRPu+6czXt7q
ITHW2RTWK3ue8q+c3nwGPDk2ZKM8X/49rA/6bvD3voLNs+jQ8KFg2KULENf0Xaq6
1ZGrhLcr45iEHP0/+RORMzx27PqbTCSGIOTMZtwZNqh5+V+ybrGJq/F/T5rkrA3F
QqHld/WSSKWJ10RVAyjDP7NQ5vNZTwwGAEVagjyIFEfk7G7RTY2kIpSZiUgrZ9oD
4CkOKUGmVkUsKQW6gb0JQObtYyXXoNtmg8wQU2WwhISjFDkoYWw53LHwH/LnxLyi
Vg182amVBmERk4I5nTUiIML/7TzS69srb0Q7yaQS3eTwzLorDaB+3tPAxQmCTkGq
KeBfuBtbP3LTOy2Oek4YbKl8CA2KDYtK7FbCE6PECUbdTjpNrcAAgA/ZcgoKV8s7
EHWZFx7dZyFS6LFNWzT9VhTtgSZS92JIsZwgnjSJPV2UazyxmqHFChzVVGWJbaB3
agkMor3nVg==
=L+Lr
-----END PGP SIGNATURE-----
Merge tag 'block-6.14-20250221' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- FC controller state check fixes (Daniel)
- PCI Endpoint fixes (Damien)
- TCP connection failure fixe (Caleb)
- TCP handling C2HTermReq PDU (Maurizio)
- RDMA queue state check (Ruozhu)
- Apple controller fixes (Hector)
- Target crash on disbaled namespace (Hannes)
- MD pull request via Yu:
- Fix queue limits error handling for raid0, raid1 and raid10
- Fix for a NULL pointer deref in request data mapping
- Code cleanup for request merging
* tag 'block-6.14-20250221' of git://git.kernel.dk/linux:
nvme: only allow entering LIVE from CONNECTING state
nvme-fc: rely on state transitions to handle connectivity loss
apple-nvme: Support coprocessors left idle
apple-nvme: Release power domains when probe fails
nvmet: Use enum definitions instead of hardcoded values
nvme: Cleanup the definition of the controller config register fields
nvme/ioctl: add missing space in err message
nvme-tcp: fix connect failure on receiving partial ICResp PDU
nvme: tcp: Fix compilation warning with W=1
nvmet: pci-epf: Avoid RCU stalls under heavy workload
nvmet: pci-epf: Do not uselessly write the CSTS register
nvmet: pci-epf: Correctly initialize CSTS when enabling the controller
nvmet-rdma: recheck queue state is LIVE in state lock in recv done
nvmet: Fix crash when a namespace is disabled
nvme-tcp: add basic support for the C2HTermReq PDU
nvme-pci: quirk Acer FA100 for non-uniqueue identifiers
block: fix NULL pointer dereferenced within __blk_rq_map_sg
block/merge: remove unnecessary min() with UINT_MAX
md/raid*: Fix the set_queue_limits implementations
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Patch was created by using Coccinelle.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/196d487c925411923a2d59d4bf5e366b9dac2747.1738746821.git.namcao@linutronix.de
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Patch was created by using Coccinelle.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/d0d57e1dab46b617856dfb93c721d221cc31ab0b.1738746821.git.namcao@linutronix.de
The block layer internal flush request may not have bio attached, so the
request iterator has to be initialized from valid req->bio, otherwise NULL
pointer dereferenced is triggered.
Cc: Christoph Hellwig <hch@lst.de>
Reported-and-tested-by: Cheyenne Wills <cheyenne.wills@gmail.com>
Fixes: b7175e24d6 ("block: add a dma mapping iterator")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250217031626.461977-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In bvec_split_segs(), max_bytes is an unsigned, so it must be less than
or equal to UINT_MAX. Remove the unnecessary min().
Prior to commit 67927d2201 ("block/merge: count bytes instead of
sectors"), the min() was with UINT_MAX >> 9, so it did have an effect.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250214193637.234702-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix several issues in partition probing:
- The bailout for a bad partoffset must use put_dev_sector(), since the
preceding read_part_sector() succeeded.
- If the partition table claims a silly sector size like 0xfff bytes
(which results in partition table entries straddling sector boundaries),
bail out instead of accessing out-of-bounds memory.
- We must not assume that the partition table contains proper NUL
termination - use strnlen() and strncmp() instead of strlen() and
strcmp().
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20250214-partition-mac-v1-1-c1c626dffbd5@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When rq_qos_wait() is first introduced, it is easy to understand. But
with some bug fixes applied, it is not easy for newcomers to understand
the whole logic under those fixes. In this patch, rq_qos_wait() is
refactored and more comments are added for better understanding. There
are 3 points for the improvement:
1) Use waitqueue_active() instead of wq_has_sleeper() to eliminate
unnecessary memory barrier in wq_has_sleeper() which is supposed
to be used in waker side. In this case, we do need the barrier.
So use the cheaper one to locklessly test for waiters on the queue.
2) Remove acquire_inflight_cb() logic for the first waiter out of the
while loop to make the code clear.
3) Add more comments to explain how to sync with different waiters and
the waker.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250208090416.38642-2-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is already a macro DEFINE_WAIT_FUNC() to declare a wait_queue_entry
with a specified waking function. But there is not a counterpart for
initializing one wait_queue_entry with a specified waking function. So
introducing init_wait_func() for this, which also could be used in iocost
and rq-qos. Using default_wake_function() in rq_qos_wait() to wake up
waiters, which could remove ->task field from rq_qos_wait_data.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250208090416.38642-1-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Until this point, the kernel can use hardware-wrapped keys to do
encryption if userspace provides one -- specifically a key in
ephemerally-wrapped form. However, no generic way has been provided for
userspace to get such a key in the first place.
Getting such a key is a two-step process. First, the key needs to be
imported from a raw key or generated by the hardware, producing a key in
long-term wrapped form. This happens once in the whole lifetime of the
key. Second, the long-term wrapped key needs to be converted into
ephemerally-wrapped form. This happens each time the key is "unlocked".
In Android, these operations are supported in a generic way through
KeyMint, a userspace abstraction layer. However, that method is
Android-specific and can't be used on other Linux systems, may rely on
proprietary libraries, and also misleads people into supporting KeyMint
features like rollback resistance that make sense for other KeyMint keys
but don't make sense for hardware-wrapped inline encryption keys.
Therefore, this patch provides a generic kernel interface for these
operations by introducing new block device ioctls:
- BLKCRYPTOIMPORTKEY: convert a raw key to long-term wrapped form.
- BLKCRYPTOGENERATEKEY: have the hardware generate a new key, then
return it in long-term wrapped form.
- BLKCRYPTOPREPAREKEY: convert a key from long-term wrapped form to
ephemerally-wrapped form.
These ioctls are implemented using new operations in blk_crypto_ll_ops.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650
Link: https://lore.kernel.org/r/20250204060041.409950-4-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add sysfs files that indicate which type(s) of keys are supported by the
inline encryption hardware associated with a particular request queue:
/sys/block/$disk/queue/crypto/hw_wrapped_keys
/sys/block/$disk/queue/crypto/raw_keys
Userspace can use the presence or absence of these files to decide what
encyption settings to use.
Don't use a single key_type file, as devices might support both key
types at the same time.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650
Link: https://lore.kernel.org/r/20250204060041.409950-3-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To prevent keys from being compromised if an attacker acquires read
access to kernel memory, some inline encryption hardware can accept keys
which are wrapped by a per-boot hardware-internal key. This avoids
needing to keep the raw keys in kernel memory, without limiting the
number of keys that can be used. Such hardware also supports deriving a
"software secret" for cryptographic tasks that can't be handled by
inline encryption; this is needed for fscrypt to work properly.
To support this hardware, allow struct blk_crypto_key to represent a
hardware-wrapped key as an alternative to a raw key, and make drivers
set flags in struct blk_crypto_profile to indicate which types of keys
they support. Also add the ->derive_sw_secret() low-level operation,
which drivers supporting wrapped keys must implement.
For more information, see the detailed documentation which this patch
adds to Documentation/block/inline-encryption.rst.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650
Link: https://lore.kernel.org/r/20250204060041.409950-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This CRC64 variant comes from the NVME NVM Command Set Specification
(https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-1.0e-2024.07.29-Ratified.pdf).
The "Rocksoft Model CRC Algorithm", published in 1993 and available at
https://www.zlib.net/crc_v3.txt, is a generalized CRC algorithm that can
calculate any variant of CRC, given a list of parameters such as
polynomial, bit order, etc. It is not a CRC variant.
The NVME NVM Command Set Specification has a table that gives the
"Rocksoft Model Parameters" for the CRC variant it uses. When support
for this CRC variant was added to Linux, this table seems to have been
misinterpreted as naming the CRC variant the "Rocksoft" CRC. In fact,
the table names the CRC variant as the "NVM Express 64b CRC".
Most implementations of this CRC variant outside Linux have been calling
it CRC64-NVME. Therefore, update Linux to match.
While at it, remove the superfluous "update" from the function name, so
crc64_rocksoft_update() is now just crc64_nvme(), matching most of the
other CRC library functions.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250130035130.180676-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Following what was done for the CRC32 and CRC-T10DIF library functions,
get rid of the pointless use of the crypto API and make
crc64_rocksoft_update() call into the library directly. This is faster
and simpler.
Remove crc64_rocksoft() (the version of the function that did not take a
'crc' argument) since it is unused.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250130035130.180676-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmec72kQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpm0pD/sENbY3LKqN8JipH8fWjLmdcimB7STB8MpB
LcuWiiuhDQuU1Yjr7c1h0Nkb6qwrb+gP+NQJrhlyUO7JVlvzFivJqsRscibNj+MF
hX/vUM2zKo0La2YxRv7aygTTCs8JfRmLKkTtfo0H4geljbX8/yMKSEJA+7x4j60X
TeJjmAHEKBGm2cNA7QMGqhzoUfjW1WQyn9x6/YOlLsYLYB8Cd7LFJqxM1tiDzUay
ys/QJ833qctdqBK5iLIh5voqZA58N3koidw536NLNgyvJn77CSAmC5jF1Vr0kFS6
Zb+dE9qRXDsND83dq3qvpSG5gtE42DygEG0g7s49zztd1rPOWE45RCxQtcZqHSvu
N9hN+/4SNX+02AfrgZFEk8+yGbtIlsb0mHRxqIKh8261tb1RTYVmtI4tqC3GVMjj
Qo41rDLDhaPOoKayo/KZO55n9U9M8BIkoF73hIIr5l9wHf5iOwi+EjzULlYhGigc
xypMvwWFLu3Of9YCJo38fTxGO6fL4eOJihoqqxNfeRACyQqbVeLWAT4wgUr2G8la
MIp2l5Vb7MmFXsES39fARLoN/8jjbgaXdxnSWhZlo4SPQaAlVA5ScXC4tKI3Gte3
O7fPVAX76PrZ+y6x4enfooI3mrpkeK676BxsokOX0m8WrVyVmIc4cfSaayc1gC5m
Uy298X93IA==
=molg
-----END PGP SIGNATURE-----
Merge tag 'block-6.14-20250131' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- MD pull request via Song:
- Fix a md-cluster regression introduced
- More sysfs race fixes
- Mark anything inside queue freezing as not being able to do IO for
memory allocations
- Fix for a regression introduced in loop in this merge window
- Fix for a regression in queue mapping setups introduced in this merge
window
- Fix for the block dio fops attempting an iov_iter revert upton
getting -EIOCBQUEUED on the read side. This one is going to stable as
well
* tag 'block-6.14-20250131' of git://git.kernel.dk/linux:
block: force noio scope in blk_mq_freeze_queue
block: fix nr_hw_queue update racing with disk addition/removal
block: get rid of request queue ->sysfs_dir_lock
loop: don't clear LO_FLAGS_PARTSCAN on LOOP_SET_STATUS{,64}
md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime
blk-mq: create correct map for fallback case
block: don't revert iter for -EIOCBQUEUED
When block drivers or the core block code perform allocations with a
frozen queue, this could try to recurse into the block device to
reclaim memory and deadlock. Thus all allocations done by a process
that froze a queue need to be done without __GFP_IO and __GFP_FS.
Instead of tying to track all of them down, force a noio scope as
part of freezing the queue.
Note that nvme is a bit of a mess here due to the non-owner freezes,
and they will be addressed separately.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250131120352.1315351-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The nr_hw_queue update could potentially race with disk addtion/removal
while registering/unregistering hctx sysfs files. The __blk_mq_update_
nr_hw_queues() runs with q->tag_list_lock held and so to avoid it racing
with disk addition/removal we should acquire q->tag_list_lock while
registering/unregistering hctx sysfs files.
With this patch, blk_mq_sysfs_register() (called during disk addition)
and blk_mq_sysfs_unregister() (called during disk removal) now runs
with q->tag_list_lock held so that it avoids racing with __blk_mq_update
_nr_hw_queues().
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250128143436.874357-3-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The request queue uses ->sysfs_dir_lock for protecting the addition/
deletion of kobject entries under sysfs while we register/unregister
blk-mq. However kobject addition/deletion is already protected with
kernfs/sysfs internal synchronization primitives. So use of q->sysfs_
dir_lock seems redundant.
Moreover, q->sysfs_dir_lock is also used at few other callsites along
with q->sysfs_lock for protecting the addition/deletion of kojects.
One such example is when we register with sysfs a set of independent
access ranges for a disk. Here as well we could get rid off q->sysfs_
dir_lock and only use q->sysfs_lock.
The only variable which q->sysfs_dir_lock appears to protect is q->
mq_sysfs_init_done which is set/unset while registering/unregistering
blk-mq with sysfs. But use of q->mq_sysfs_init_done could be easily
replaced using queue registered bit QUEUE_FLAG_REGISTERED.
So with this patch we remove q->sysfs_dir_lock from each callsite
and replace q->mq_sysfs_init_done using QUEUE_FLAG_REGISTERED.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250128143436.874357-2-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Here is the big set of driver core and debugfs updates for 6.14-rc1.
It's coming late in the merge cycle as there are a number of merge
conflicts with your tree now, and I wanted to make sure they were
working properly. To resolve them, look in linux-next, and I will send
the "fixup" patch as a response to the pull request.
Included in here is a bunch of driver core, PCI, OF, and platform rust
bindings (all acked by the different subsystem maintainers), hence the
merge conflict with the rust tree, and some driver core api updates to
mark things as const, which will also require some fixups due to new
stuff coming in through other trees in this merge window.
There are also a bunch of debugfs updates from Al, and there is at least
one user that does have a regression with these, but Al is working on
tracking down the fix for it. In my use (and everyone else's linux-next
use), it does not seem like a big issue at the moment.
Here's a short list of the things in here:
- driver core bindings for PCI, platform, OF, and some i/o functions.
We are almost at the "write a real driver in rust" stage now,
depending on what you want to do.
- misc device rust bindings and a sample driver to show how to use
them
- debugfs cleanups in the fs as well as the users of the fs api for
places where drivers got it wrong or were unnecessarily doing things
in complex ways.
- driver core const work, making more of the api take const * for
different parameters to make the rust bindings easier overall.
- other small fixes and updates
All of these have been in linux-next with all of the aforementioned
merge conflicts, and the one debugfs issue, which looks to be resolved
"soon".
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ5koPA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymFHACfT5acDKf2Bov2Lc/5u3vBW/R6ChsAnj+LmgVI
hcDSPodj4szR40RRnzBd
=u5Ey
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and debugfs updates from Greg KH:
"Here is the big set of driver core and debugfs updates for 6.14-rc1.
Included in here is a bunch of driver core, PCI, OF, and platform rust
bindings (all acked by the different subsystem maintainers), hence the
merge conflict with the rust tree, and some driver core api updates to
mark things as const, which will also require some fixups due to new
stuff coming in through other trees in this merge window.
There are also a bunch of debugfs updates from Al, and there is at
least one user that does have a regression with these, but Al is
working on tracking down the fix for it. In my use (and everyone
else's linux-next use), it does not seem like a big issue at the
moment.
Here's a short list of the things in here:
- driver core rust bindings for PCI, platform, OF, and some i/o
functions.
We are almost at the "write a real driver in rust" stage now,
depending on what you want to do.
- misc device rust bindings and a sample driver to show how to use
them
- debugfs cleanups in the fs as well as the users of the fs api for
places where drivers got it wrong or were unnecessarily doing
things in complex ways.
- driver core const work, making more of the api take const * for
different parameters to make the rust bindings easier overall.
- other small fixes and updates
All of these have been in linux-next with all of the aforementioned
merge conflicts, and the one debugfs issue, which looks to be resolved
"soon""
* tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits)
rust: device: Use as_char_ptr() to avoid explicit cast
rust: device: Replace CString with CStr in property_present()
devcoredump: Constify 'struct bin_attribute'
devcoredump: Define 'struct bin_attribute' through macro
rust: device: Add property_present()
saner replacement for debugfs_rename()
orangefs-debugfs: don't mess with ->d_name
octeontx2: don't mess with ->d_parent or ->d_parent->d_name
arm_scmi: don't mess with ->d_parent->d_name
slub: don't mess with ->d_name
sof-client-ipc-flood-test: don't mess with ->d_name
qat: don't mess with ->d_name
xhci: don't mess with ->d_iname
mtu3: don't mess wiht ->d_iname
greybus/camera - stop messing with ->d_iname
mediatek: stop messing with ->d_iname
netdevsim: don't embed file_operations into your structs
b43legacy: make use of debugfs_get_aux()
b43: stop embedding struct file_operations into their objects
carl9170: stop embedding file_operations into their objects
...
The fallback code in blk_mq_map_hw_queues is original from
blk_mq_pci_map_queues and was added to handle the case where
pci_irq_get_affinity will return NULL for !SMP configuration.
blk_mq_map_hw_queues replaces besides blk_mq_pci_map_queues also
blk_mq_virtio_map_queues which used to use blk_mq_map_queues for the
fallback.
It's possible to use blk_mq_map_queues for both cases though.
blk_mq_map_queues creates the same map as blk_mq_clear_mq_map for !SMP
that is CPU 0 will be mapped to hctx 0.
The WARN_ON_ONCE has to be dropped for virtio as the fallback is also
taken for certain configuration on default. Though there is still a
WARN_ON_ONCE check in lib/group_cpus.c:
WARN_ON(nr_present + nr_others < numgrps);
which will trigger if the caller tries to create more hardware queues
than CPUs. It tests the same as the WARN_ON_ONCE in
blk_mq_pci_map_queues did.
Fixes: a5665c3d15 ("virtio: blk/scsi: replace blk_mq_virtio_map_queues with blk_mq_map_hw_queues")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Closes: https://lore.kernel.org/all/20250122093020.6e8a4e5b@gandalf.local.home/
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Link: https://lore.kernel.org/r/20250123-fix-blk_mq_map_hw_queues-v1-1-08dbd01f2c39@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkdev_read_iter() has a few odd checks, like gating the position and
count adjustment on whether or not the result is bigger-than-or-equal to
zero (where bigger than makes more sense), and not checking the return
value of blkdev_direct_IO() before doing an iov_iter_revert(). The
latter can lead to attempting to revert with a negative value, which
when passed to iov_iter_revert() as an unsigned value will lead to
throwing a WARN_ON() because unroll is bigger than MAX_RW_COUNT.
Be sane and don't revert for -EIOCBQUEUED, like what is done in other
spots.
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeNDEUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpl5hD/4t7kWWNQDeQG9CiA3QStMJ5Yow2AgYtK8f
sJBr5/6PGEsbTreX//Kh8DtPZPRGcjG9elCo58QxWaPZ2mg3fTOR3/QYLMlaGXU2
hSht58lj32utpuzMjMo9bG3aesi03bLf+buaq7V1FaMlcTV8rXqK1s/HGtphDBRo
8tNLEk3JDJDs3vlWbNp/5Hqh9+Ro6DU8df1zWWH4Vbu8RXaGIPyJyjKvvcbfuuCf
k7Ay45XNAmTZg+rSNGv1H3Yn1LNzPMVFLWBfzRahPCzlKy2+mJMWz1PWu9naaUK+
WTM+kgiBLF24k59G/9xuxC5bYtsTjTbr4GsEE5ZvFBnhKPzLzzaJj7iQHRj83vtv
tqxNmAbA3wJoNk48Zr8+cYbfDX9Q9Pl32wIaS/LxRgF9MT4lem6pyKY7Skd12oK3
rnQ8moGtnOBxp3QUU6BZ7IX3ipb+Bgw7FhZbtVYJdlqKeKyi1QO0MuITwGXpMwk/
EWDDTsspIf+QaTu+fmO8byJavugKljW8t7hM1JpvlfOLl+rsh6/+AYz42fCvcaA0
Tu4bpUk8SuwALvZfU2R6bLkorGG6MFuGI8g3eixOcGir3YAcHBMfdg6ItpZi5qVt
ToM87BMaezOZZvSwX1JBaQ0AR5HBQYmHaiLWgPsORf3PjJ0kz+u21SK9D+yJkUtU
rT6+HvoVXA==
=ufpE
-----END PGP SIGNATURE-----
Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
"Not a lot in terms of features this time around, mostly just cleanups
and code consolidation:
- Support for PI meta data read/write via io_uring, with NVMe and
SCSI covered
- Cleanup the per-op structure caching, making it consistent across
various command types
- Consolidate the various user mapped features into a concept called
regions, making the various users of that consistent
- Various cleanups and fixes"
* tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits)
io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname
io_uring: reuse io_should_terminate_tw() for cmds
io_uring: Factor out a function to parse restrictions
io_uring/rsrc: require cloned buffers to share accounting contexts
io_uring: simplify the SQPOLL thread check when cancelling requests
io_uring: expose read/write attribute capability
io_uring/rw: don't gate retry on completion context
io_uring/rw: handle -EAGAIN retry at IO completion time
io_uring/rw: use io_rw_recycle() from cleanup path
io_uring/rsrc: simplify the bvec iter count calculation
io_uring: ensure io_queue_deferred() is out-of-line
io_uring/rw: always clear ->bytes_done on io_async_rw setup
io_uring/rw: use NULL for rw->free_iovec assigment
io_uring/rw: don't mask in f_iocb_flags
io_uring/msg_ring: Drop custom destructor
io_uring: Move old async data allocation helper to header
io_uring/rw: Allocate async data through helper
io_uring/net: Allocate msghdr async data through helper
io_uring/uring_cmd: Allocate async data through generic helper
io_uring/poll: Allocate apoll with generic alloc_cache helper
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeL6hoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppw2EADQV8nDgLRggZR+il4U03yKHXcQEdAX1GrB
Erowx+dasIJuh6kp3n6qRe9QD/pRqt1DKyLvXoWF8Qfuwq85j7oDnDDYxutNYT27
hDgrLJriJ3VeKYtTu+andHWt8P29b5h57UayInDOUJurEPA6rXyFZ5YVIti8n21K
uDOrQXiACG3qRWS2+p2f3UNhX0MkFNFdN/lxi13WMIJtRWF5bXAP+JOgIWCID4Ze
QuSY6rQD4dp4Q6M2erpX6tn0YZb7Hvw3rPjsd91n6jvYfTUVLH375zg8jCBpi6Wi
Syufbb8xcTtriVPTDRNu0ekjebkc8wD8ax/h86g0z9v3Ua4DlNmsx9eXrtv6r5nu
YXqDODOad6stI0+owFquW2vas0gHmfNSfyfGdlk2g24PMtP5Yx0V6FIEvwIeqnje
ghgxQvBuKUsdhqakByfNnc+XvXi3+RUJek8kvMeUSUQWT1IyMQqPOOk0yp9WdyWD
bY1f2ECP5BR1b37zYOyawewsI5xTupHUswn5a4r4qtGn3O15rGDkX98Nab5aLCnR
rW/DvX7+wT6gW9EwrRHiwjwfNDZbsJ9Ggu3lMhtUl5GUWdk58yTiVgKaHJLnlX9/
CKFKfyyIR1Vl8+gYIpemyFhhcoN+dCSf06ISkrg0jeS0/tYwydaAaCBPL5J4kxZA
h3Rtbh+Pgg==
=EXYs
-----END PGP SIGNATURE-----
Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull requests via Keith:
- Target support for PCI-Endpoint transport (Damien)
- TCP IO queue spreading fixes (Sagi, Chaitanya)
- Target handling for "limited retry" flags (Guixen)
- Poll type fix (Yongsoo)
- Xarray storage error handling (Keisuke)
- Host memory buffer free size fix on error (Francis)
- MD pull requests via Song:
- Reintroduce md-linear (Yu Kuai)
- md-bitmap refactor and fix (Yu Kuai)
- Replace kmap_atomic with kmap_local_page (David Reaver)
- Quite a few queue freeze and debugfs deadlock fixes
Ming introduced lockdep support for this in the 6.13 kernel, and it
has (unsurprisingly) uncovered quite a few issues
- Use const attributes for IO schedulers
- Remove bio ioprio wrappers
- Fixes for stacked device atomic write support
- Refactor queue affinity helpers, in preparation for better supporting
isolated CPUs
- Cleanups of loop O_DIRECT handling
- Cleanup of BLK_MQ_F_* flags
- Add rotational support for null_blk
- Various fixes and cleanups
* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
block: Don't trim an atomic write
block: Add common atomic writes enable flag
md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
block: limit disk max sectors to (LLONG_MAX >> 9)
block: Change blk_stack_atomic_writes_limits() unit_min check
block: Ensure start sector is aligned for stacking atomic writes
blk-mq: Move more error handling into blk_mq_submit_bio()
block: Reorder the request allocation code in blk_mq_submit_bio()
nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
md/md-bitmap: move bitmap_{start, end}write to md upper layer
md/raid5: implement pers->bitmap_sector()
md: add a new callback pers->bitmap_sector()
md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
md: Replace deprecated kmap_atomic() with kmap_local_page()
md: reintroduce md-linear
partitions: ldm: remove the initial kernel-doc notation
blk-cgroup: rwstat: fix kernel-doc warnings in header file
blk-cgroup: fix kernel-doc warnings in header file
nbd: fix partial sending
...
This is disallowed.
This check will now be relevant since the device mapper personalities
will start to support atomic writes, and they use this function.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20250116170301.474130-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently only stacked devices need to explicitly enable atomic writes by
setting BLK_FEAT_ATOMIC_WRITES_STACKED flag.
This does not work well for device mapper stacking devices, as there many
sets of limits are stacked and what is the 'bottom' and 'top' device can
swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED needs to be set
for many queue limits, which is messy.
Generalize enabling atomic writes enabling by ensuring that all devices
must explicitly set a flag - that includes NVMe, SCSI sd, and md raid.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20250116170301.474130-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kernel `loff_t` is defined as `long long int`, so we can't support disk
which size is > LLONG_MAX.
There are many virtual block drivers, and hardware may report bad capacity
too, so limit max sectors to (LLONG_MAX >> 9) for avoiding potential
trouble.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250115092648.1104452-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current check in blk_stack_atomic_writes_limits() for a bottom device
supporting atomic writes is to verify that limit atomic_write_unit_min is
non-zero.
This would cause a problem for device mapper queue limits calculation. This
is because it uses a temporary queue_limits structure to stack the limits,
before finally commiting the limits update.
The value of atomic_write_unit_min for the temporary queue_limits
structure is never evaluated and so cannot be used, so use limit
atomic_write_hw_unit_min.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250109114000.2299896-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For stacking atomic writes, ensure that the start sector is aligned with
the device atomic write unit min and any boundary. Otherwise, we may
permit misaligned atomic writes.
Rework bdev_can_atomic_write() into a common helper to resuse the
alignment check. There also use atomic_write_hw_unit_min, which is more
proper (than atomic_write_unit_min).
Fixes: d7f36dc446 ("block: Support atomic writes limits for stacked devices")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250109114000.2299896-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The error handling code in blk_mq_get_new_requests() cannot be understood
without knowing that this function is only called by blk_mq_submit_bio().
Hence move the code for handling blk_mq_get_new_requests() failures into
blk_mq_submit_bio().
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241218212246.1073149-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Help the CPU branch predictor in case of a cache hit by handling the cache
hit scenario first.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241218212246.1073149-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the file's first comment describing what the file is.
This comment is not in kernel-doc format so it causes a kernel-doc
warning.
ldm.h:13: warning: expecting prototype for ldm(). Prototype was for _FS_PT_LDM_H_() instead
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Richard Russon (FlatCap) <ldm@flatcap.org>
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20250111062758.910458-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Correct the function parameters to eliminate kernel-doc warnings:
blk-cgroup-rwstat.h:63: warning: Function parameter or struct member 'opf' not described in 'blkg_rwstat_add'
blk-cgroup-rwstat.h:63: warning: Excess function parameter 'op' description in 'blkg_rwstat_add'
blk-cgroup-rwstat.h:91: warning: Function parameter or struct member 'result' not described in 'blkg_rwstat_read'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: cgroups@vger.kernel.org
Link: https://lore.kernel.org/r/20250111062748.910442-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Correct the function parameters and function names to eliminate
kernel-doc warnings:
blk-cgroup.h:238: warning: Function parameter or struct member 'bio' not described in 'bio_issue_as_root_blkg'
blk-cgroup.h:248: warning: bad line:
blk-cgroup.h:279: warning: expecting prototype for blkg_to_pdata(). Prototype was for blkg_to_pd() instead
blk-cgroup.h:296: warning: expecting prototype for pdata_to_blkg(). Prototype was for pd_to_blkg() instead
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: cgroups@vger.kernel.org
Link: https://lore.kernel.org/r/20250111062736.910383-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need the debugfs / driver-core fixes in here as well for testing and
to build on top of.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
queue_attr_store() always freezes a device queue before calling the
attribute store operation. For attributes that control queue limits, the
store operation will also lock the queue limits with a call to
queue_limits_start_update(). However, some drivers (e.g. SCSI sd) may
need to issue commands to a device to obtain limit values from the
hardware with the queue limits locked. This creates a potential ABBA
deadlock situation if a user attempts to modify a limit (thus freezing
the device queue) while the device driver starts a revalidation of the
device queue limits.
Avoid such deadlock by not freezing the queue before calling the
->store_limit() method in struct queue_sysfs_entry and instead use the
queue_limits_commit_update_frozen helper to freeze the queue after taking
the limits lock.
This also removes taking the sysfs lock for the store_limit method as
it doesn't protect anything here, but creates even more nesting.
Hopefully it will go away from the actual sysfs methods entirely soon.
(commit log adapted from a similar patch from Damien Le Moal)
Fixes: ff956a3be9 ("block: use queue_limits_commit_update in queue_discard_max_store")
Fixes: 0327ca9d53 ("block: use queue_limits_commit_update in queue_max_sectors_store")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
De-duplicate the code for updating queue limits by adding a store_limit
method that allows having common code handle the actual queue limits
update.
Note that this is a pure refactoring patch and does not address the
existing freeze vs limits lock order problem in the refactored code,
which will be addressed next.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When __blk_mq_update_nr_hw_queues changes the number of tag sets, it
might have to disable poll queues. Currently it does so by adjusting
the BLK_FEAT_POLL, which is a bit against the intent of features that
describe hardware / driver capabilities, but more importantly causes
nasty lock order problems with the broadly held freeze when updating the
number of hardware queues and the limits lock. Fix this by leaving
BLK_FEAT_POLL alone, and instead check for the number of poll queues in
the bio submission and poll handlers. While this adds extra work to the
fast path, the variables are in cache lines used by these operations
anyway, so it should be cheap enough.
Fixes: 8023e144f9 ("block: move the poll flag to queue_limits")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Otherwise feature reconfiguration can race with I/O submission.
Also drop the bio_clear_polled in the error path, as the flag does not
matter for instant error completions, it is a left over from when we
allowed polled I/O to proceed unpolled in this case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper that freezes the queue, updates the queue limits and
unfreezes the queue and convert all open coded versions of that to the
new helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
queue_limits_commit_update is the function that needs to operate on a
frozen queue, not queue_limits_start_update. Update the kerneldoc
comments to reflect that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkcg_fill_root_iostats() iterates over @block_class's devices by
class_dev_iter_(init|next)(), but does not end iterating with
class_dev_iter_exit(), so causes the class's subsystem refcount leakage.
Fix by ending the iterating with class_dev_iter_exit().
Fixes: ef45fe470e ("blk-cgroup: show global disk stats in root cgroup io.stat")
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://lore.kernel.org/r/20250105-class_fix-v6-2-3a2f1768d4d4@quicinc.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Use a plain BLK_MQ_F_* flag to select the round robin tag selection
instead of overlaying an enum with just two possible values into the
flags space.
Doing so allows adding a BLK_MQ_F_MAX sentinel for simplified overflow
checking in the messy debugfs helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250106083531.799976-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only queues that really can't support a scheduler are those that
do not have a gendisk associated with them, and thus can't be used for
non-passthrough commands. In addition to those null_blk can optionally
set the flag, which is a bad odd. Replace the null_blk usage with
BLK_MQ_F_NO_SCHED_BY_DEFAULT to keep the expected semantics and then
remove BLK_MQ_F_NO_SCHED as the non-disk queues never call into
elevator_init_mq or blk_register_queue which adds the sysfs attributes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250106083531.799976-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The little work done in blk_mq_init_bitmaps is easier done in the only
caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250106083531.799976-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a big conditional for blk-mq vs not mq at the beginning of
add_disk_fwnode so that elevator_init_mq is only called for blk-mq disks,
and add checks that the right methods or set or not set based on the
queue type.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250106083531.799976-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_rq_map_sg is maze of nested loops. Untangle it by creating an
iterator that returns [paddr,len] tuples for DMA mapping, and then
implement the DMA logic on top of this. This not only removes code
at the source level, but also generates nicer binary code:
$ size block/blk-merge.o.*
text data bss dec hex filename
10001 432 0 10433 28c1 block/blk-merge.o.new
10317 468 0 10785 2a21 block/blk-merge.o.old
Last but not least it will be used as a building block for a new
DMA mapping helper that doesn't rely on struct scatterlist.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250106081609.798289-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is not real point in a helper just to assign three values to four
fields, especially when the surrounding code is working on the
neighbor fields directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20250103073417.459715-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lift bio_split_rw_at into blk_rq_append_bio so that it validates the
hardware limits. With this all passthrough callers can simply add
bio_add_page to build the bio and delay checking for exceeding of limits
to this point instead of doing it for each page.
While this looks like adding a new expensive loop over all bio_vecs,
blk_rq_append_bio is already doing that just to counter the number of
segments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20250103073417.459715-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set kernel config:
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_LOOP_MIN_COUNT=0
Do latter:
mknod loop0 b 7 0
exec 4<> loop0
Before commit e418de3abc ("block: switch gendisk lookup to a simple
xarray"), lookup_gendisk will first use base_probe to load module loop,
and then the retry will call loop_probe to prepare the loop disk. Finally
open for this disk will success. However, after this commit, we lose the
retry logic, and open will fail with ENXIO. Block device autoloading is
deprecated and will be removed soon, but maybe we should keep open success
until we really remove it. So, give a retry to fix it.
Fixes: e418de3abc ("block: switch gendisk lookup to a simple xarray")
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241209110435.3670985-1-yangerkun@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The elevator core now allows instances of 'struct elv_fs_entry' to be
moved into read-only memory. Make use of that to protect them against
accidental or malicious modifications.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20250102-sysfs-const-attr-elevator-v1-4-9837d2058c60@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The elevator core now allows instances of 'struct elv_fs_entry' to be
moved into read-only memory. Make use of that to protect them against
accidental or malicious modifications.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20250102-sysfs-const-attr-elevator-v1-3-9837d2058c60@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The elevator core now allows instances of 'struct elv_fs_entry' to be
moved into read-only memory. Make use of that to protect them against
accidental or malicious modifications.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20250102-sysfs-const-attr-elevator-v1-2-9837d2058c60@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reduce the indentation level of the code in queue_zone_wplugs_show() by
moving the body of the loop in that function into a new function.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241217210310.645966-5-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For the blk_queue_exit() calls, document where the corresponding code can
be found that increases q->q_usage_counter.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241217210310.645966-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Document which functions expect that their callers must hold a lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241217210310.645966-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only include those header files that are necessary.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241217210310.645966-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BLK_MQ_F_SHOULD_MERGE is set for all tag_sets except those that purely
process passthrough commands (bsg-lib, ufs tmf, various nvme admin
queues) and thus don't even check the flag. Remove it to simplify the
driver interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241219060214.1928848-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are no users left of the pci and virtio queue mapping helpers.
Thus remove them.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Link: https://lore.kernel.org/r/20241202-refactor-blk-affinity-helpers-v6-8-27211e9c2cd5@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_pci_map_queues and blk_mq_virtio_map_queues will create a CPU to
hardware queue mapping based on affinity information. These two function
share common code and only differ on how the affinity information is
retrieved. Also, those functions are located in the block subsystem
where it doesn't really fit in. They are virtio and pci subsystem
specific.
Thus introduce provide a generic mapping function which uses the
irq_get_affinity callback from bus_type.
Originally idea from Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Link: https://lore.kernel.org/r/20241202-refactor-blk-affinity-helpers-v6-4-27211e9c2cd5@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now we only verify the outmost freeze & unfreeze in current context in case
that !q->mq_freeze_depth, so it is reliable to save queue lying state when
we want to lock the freeze queue since the state is one per-task variable
now.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241127135133.3952153-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now blk_freeze_queue_start() can track disk state automatically, and
it isn't necessary to verify queue freeze manually in elevator_init_mq()
any more.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241127135133.3952153-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If an iocb contains metadata, extract that and prepare the bip.
Based on flags specified by the user, set corresponding guard/app/ref
tags to be checked in bip.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20241128112240.8867-11-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch introduces BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags which
indicate how the hardware should check the integrity payload.
BIP_CHECK_GUARD/REFTAG are conversion of existing semantics, while
BIP_CHECK_APPTAG is a new flag. The driver can now just rely on block
layer flags, and doesn't need to know the integrity source. Submitter
of PI decides which tags to check. This would also give us a unified
interface for user and kernel generated integrity.
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20241128112240.8867-8-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch refactors bio_integrity_map_user to accept iov_iter as
argument. This is a prep patch.
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20241128112240.8867-4-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Copy back the bounce buffer to user-space in entirety when the parent
bio completes. The existing code uses bip_iter.bi_size for sizing the
copy, which can be modified. So move away from that and fetch it from
the vector passed to the block layer. While at it, switch to using
better variable names.
Fixes: 492c5d4559 ("block: bio-integrity: directly map user buffers")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20241128112240.8867-3-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce BIP_CLONE_FLAGS describing integrity flags that should be
inherited in the cloned bip from the parent.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20241128112240.8867-2-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit be26ba9642.
Commit be26ba9642 ("block: Fix potential deadlock while freezing queue and
acquiring sysfs_loc") actually reverts commit 22465bbac5 ("blk-mq: move cpuhp
callback registering out of q->sysfs_lock"), and causes the original resctrl
lockdep warning.
So revert it and we need to fix the issue in another way.
Cc: Nilay Shroff <nilay@linux.ibm.com>
Fixes: be26ba9642 ("block: Fix potential deadlock while freezing queue and acquiring sysfs_loc")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241218101617.3275704-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We already have a helper for checking the limits on the block size
both low and high, just use that.
No functional changes.
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241218020212.3657139-2-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make queue_iostats_passthrough_show() report 0/1 in sysfs instead of 0/4.
This patch fixes the following sparse warning:
block/blk-sysfs.c:266:31: warning: incorrect type in argument 1 (different base types)
block/blk-sysfs.c:266:31: expected unsigned long var
block/blk-sysfs.c:266:31: got restricted blk_flags_t
Cc: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 110234da18 ("block: enable passthrough command statistics")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241212212941.1268662-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move a statement that occurs in both branches of an if-statement in front
of the if-statement. Fix a typo in a source code comment. No functionality
has been changed.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241212212941.1268662-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit fde02699c2 ("block: mq-deadline: Remove support for zone
write locking"), the local variable 'insert_before' is assigned once and
is used once. Hence remove this local variable.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241212212941.1268662-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After a recent change to clamp() and its variants [1] that increases the
coverage of the check that high is greater than low because it can be
done through inlining, certain build configurations (such as s390
defconfig) fail to build with clang with:
block/blk-iocost.c:1101:11: error: call to '__compiletime_assert_557' declared with 'error' attribute: clamp() low limit 1 greater than high limit active
1101 | inuse = clamp_t(u32, inuse, 1, active);
| ^
include/linux/minmax.h:218:36: note: expanded from macro 'clamp_t'
218 | #define clamp_t(type, val, lo, hi) __careful_clamp(type, val, lo, hi)
| ^
include/linux/minmax.h:195:2: note: expanded from macro '__careful_clamp'
195 | __clamp_once(type, val, lo, hi, __UNIQUE_ID(v_), __UNIQUE_ID(l_), __UNIQUE_ID(h_))
| ^
include/linux/minmax.h:188:2: note: expanded from macro '__clamp_once'
188 | BUILD_BUG_ON_MSG(statically_true(ulo > uhi), \
| ^
__propagate_weights() is called with an active value of zero in
ioc_check_iocgs(), which results in the high value being less than the
low value, which is undefined because the value returned depends on the
order of the comparisons.
The purpose of this expression is to ensure inuse is not more than
active and at least 1. This could be written more simply with a ternary
expression that uses min(inuse, active) as the condition so that the
value of that condition can be used if it is not zero and one if it is.
Do this conversion to resolve the error and add a comment to deter
people from turning this back into clamp().
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Link: https://lore.kernel.org/r/34d53778977747f19cce2abb287bb3e6@AcuMS.aculab.com/ [1]
Suggested-by: David Laight <david.laight@aculab.com>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Closes: https://lore.kernel.org/llvm/CA+G9fYsD7mw13wredcZn0L-KBA3yeoVSTuxnss-AEWMN3ha0cA@mail.gmail.com/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412120322.3GfVe3vF-lkp@intel.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make bio_iov_bvec_set() accept a pointer to const iov_iter, which means
that we can drop the undesirable casting to struct iov_iter pointer in
blk_rq_map_user_bvec().
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241202115727.2320401-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkcg_unpin_online() walks up the blkcg hierarchy putting the online pin. To
walk up, it uses blkcg_parent(blkcg) but it was calling that after
blkcg_destroy_blkgs(blkcg) which could free the blkcg, leading to the
following UAF:
==================================================================
BUG: KASAN: slab-use-after-free in blkcg_unpin_online+0x15a/0x270
Read of size 8 at addr ffff8881057678c0 by task kworker/9:1/117
CPU: 9 UID: 0 PID: 117 Comm: kworker/9:1 Not tainted 6.13.0-rc1-work-00182-gb8f52214c61a-dirty #48
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 02/02/2022
Workqueue: cgwb_release cgwb_release_workfn
Call Trace:
<TASK>
dump_stack_lvl+0x27/0x80
print_report+0x151/0x710
kasan_report+0xc0/0x100
blkcg_unpin_online+0x15a/0x270
cgwb_release_workfn+0x194/0x480
process_scheduled_works+0x71b/0xe20
worker_thread+0x82a/0xbd0
kthread+0x242/0x2c0
ret_from_fork+0x33/0x70
ret_from_fork_asm+0x1a/0x30
</TASK>
...
Freed by task 1944:
kasan_save_track+0x2b/0x70
kasan_save_free_info+0x3c/0x50
__kasan_slab_free+0x33/0x50
kfree+0x10c/0x330
css_free_rwork_fn+0xe6/0xb30
process_scheduled_works+0x71b/0xe20
worker_thread+0x82a/0xbd0
kthread+0x242/0x2c0
ret_from_fork+0x33/0x70
ret_from_fork_asm+0x1a/0x30
Note that the UAF is not easy to trigger as the free path is indirected
behind a couple RCU grace periods and a work item execution. I could only
trigger it with artifical msleep() injected in blkcg_unpin_online().
Fix it by reading the parent pointer before destroying the blkcg's blkg's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Abagail ren <renzezhongucas@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Fixes: 4308a434e5 ("blkcg: don't offline parent blkcg first")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zone write plugging for handling writes to zones of a zoned block
device always execute a zone report whenever a write BIO to a zone
fails. The intent of this is to ensure that the tracking of a zone write
pointer is always correct to ensure that the alignment to a zone write
pointer of write BIOs can be checked on submission and that we can
always correctly emulate zone append operations using regular write
BIOs.
However, this error recovery scheme introduces a potential deadlock if a
device queue freeze is initiated while BIOs are still plugged in a zone
write plug and one of these write operation fails. In such case, the
disk zone write plug error recovery work is scheduled and executes a
report zone. This in turn can result in a request allocation in the
underlying driver to issue the report zones command to the device. But
with the device queue freeze already started, this allocation will
block, preventing the report zone execution and the continuation of the
processing of the plugged BIOs. As plugged BIOs hold a queue usage
reference, the queue freeze itself will never complete, resulting in a
deadlock.
Avoid this problem by completely removing from the zone write plugging
code the use of report zones operations after a failed write operation,
instead relying on the device user to either execute a report zones,
reset the zone, finish the zone, or give up writing to the device (which
is a fairly common pattern for file systems which degrade to read-only
after write failures). This is not an unreasonnable requirement as all
well-behaved applications, FSes and device mapper already use report
zones to recover from write errors whenever possible by comparing the
current position of a zone write pointer with what their assumption
about the position is.
The changes to remove the automatic error recovery are as follows:
- Completely remove the error recovery work and its associated
resources (zone write plug list head, disk error list, and disk
zone_wplugs_work work struct). This also removes the functions
disk_zone_wplug_set_error() and disk_zone_wplug_clear_error().
- Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into
BLK_ZONE_WPLUG_NEED_WP_UPDATE. This new flag is set for a zone write
plug whenever a write opration targetting the zone of the zone write
plug fails. This flag indicates that the zone write pointer offset is
not reliable and that it must be updated when the next report zone,
reset zone, finish zone or disk revalidation is executed.
- Modify blk_zone_write_plug_bio_endio() to set the
BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed
write BIO.
- Modify the function disk_zone_wplug_set_wp_offset() to clear this
new flag, thus implementing recovery of a correct write pointer
offset with the reset (all) zone and finish zone operations.
- Modify blkdev_report_zones() to always use the disk_report_zones_cb()
callback so that disk_zone_wplug_sync_wp_offset() can be called for
any zone marked with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag.
This implements recovery of a correct write pointer offset for zone
write plugs marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE and within
the range of the report zones operation executed by the user.
- Modify blk_revalidate_seq_zone() to call
disk_zone_wplug_sync_wp_offset() for all sequential write required
zones when a zoned block device is revalidated, thus always resolving
any inconsistency between the write pointer offset of zone write
plugs and the actual write pointer position of sequential zones.
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-5-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The zone reclaim processing of the dm-zoned device mapper uses
blkdev_issue_zeroout() to align the write pointer of a zone being used
for reclaiming another zone, to write the valid data blocks from the
zone being reclaimed at the same position relative to the zone start in
the reclaim target zone.
The first call to blkdev_issue_zeroout() will try to use hardware
offload using a REQ_OP_WRITE_ZEROES operation if the device reports a
non-zero max_write_zeroes_sectors queue limit. If this operation fails
because of the lack of hardware support, blkdev_issue_zeroout() falls
back to using a regular write operation with the zero-page as buffer.
Currently, such REQ_OP_WRITE_ZEROES failure is automatically handled by
the block layer zone write plugging code which will execute a report
zones operation to ensure that the write pointer of the target zone of
the failed operation has not changed and to "rewind" the zone write
pointer offset of the target zone as it was advanced when the write zero
operation was submitted. So the REQ_OP_WRITE_ZEROES failure does not
cause any issue and blkdev_issue_zeroout() works as expected.
However, since the automatic recovery of zone write pointers by the zone
write plugging code can potentially cause deadlocks with queue freeze
operations, a different recovery must be implemented in preparation for
the removal of zone write plugging report zones based recovery.
Do this by introducing the new function blk_zone_issue_zeroout(). This
function first calls blkdev_issue_zeroout() with the flag
BLKDEV_ZERO_NOFALLBACK to intercept failures on the first execution
which attempt to use the device hardware offload with the
REQ_OP_WRITE_ZEROES operation. If this attempt fails, a report zone
operation is issued to restore the zone write pointer offset of the
target zone to the correct position and blkdev_issue_zeroout() is called
again without the BLKDEV_ZERO_NOFALLBACK flag. The report zones
operation performing this recovery is implemented using the helper
function disk_zone_sync_wp_offset() which calls the gendisk report_zones
file operation with the callback disk_report_zones_cb(). This callback
updates the target write pointer offset of the target zone using the new
function disk_zone_wplug_sync_wp_offset().
dmz_reclaim_align_wp() is modified to change its call to
blkdev_issue_zeroout() to a call to blk_zone_issue_zeroout() without any
other change needed as the two functions are functionnally equivalent.
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-4-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are currently any issuer of REQ_OP_ZONE_RESET and
REQ_OP_ZONE_FINISH operations that set REQ_NOWAIT. However, as we cannot
handle this flag correctly due to the potential request allocation
failure that may happen in blk_mq_submit_bio() after blk_zone_plug_bio()
has handled the zone write plug write pointer updates for the targeted
zones, modify blk_zone_wplug_handle_reset_or_finish() to warn if this
flag is set and ignore it.
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For zoned block devices, a write BIO issued to a zone that has no
on-going writes will be prepared for execution and allowed to execute
immediately by blk_zone_wplug_handle_write() (called from
blk_zone_plug_bio()). However, if this BIO specifies REQ_NOWAIT, the
allocation of a request for its execution in blk_mq_submit_bio() may
fail after blk_zone_plug_bio() completed, marking the target zone of the
BIO as plugged. When this BIO is retried later on, it will be blocked as
the zone write plug of the target zone is in a plugged state without any
on-going write operation (completion of write operations trigger
unplugging of the next write BIOs for a zone). This leads to a BIO that
is stuck in a zone write plug and never completes, which results in
various issues such as hung tasks.
Avoid this problem by always executing REQ_NOWAIT write BIOs using the
BIO work of a zone write plug. This ensure that we never block the BIO
issuer and can thus safely ignore the REQ_NOWAIT flag when executing the
BIO from the zone write plug BIO work.
Since such BIO may be the first write BIO issued to a zone with no
on-going write, modify disk_zone_wplug_add_bio() to schedule the zone
write plug BIO work if the write plug is not already marked with the
BLK_ZONE_WPLUG_PLUGGED flag. This scheduling is otherwise not necessary
as the completion of the on-going write for the zone will schedule the
execution of the next plugged BIOs.
blk_zone_wplug_handle_write() is also fixed to better handle zone write
plug allocation failures for REQ_NOWAIT BIOs by failing a write BIO
using bio_wouldblock_error() instead of bio_io_error().
Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Registering and unregistering cpuhp callback requires global cpu hotplug lock,
which is used everywhere. Meantime q->sysfs_lock is used in block layer
almost everywhere.
It is easy to trigger lockdep warning[1] by connecting the two locks.
Fix the warning by moving blk-mq's cpuhp callback registering out of
q->sysfs_lock. Add one dedicated global lock for covering registering &
unregistering hctx's cpuhp, and it is safe to do so because hctx is
guaranteed to be live if our request_queue is live.
[1] https://lore.kernel.org/lkml/Z04pz3AlvI4o0Mr8@agluck-desk3/
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Newman <peternewman@google.com>
Cc: Babu Moger <babu.moger@amd.com>
Reported-by: Luck Tony <tony.luck@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20241206111611.978870-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to retrieve 'hctx' from xarray table in the cpuhp callback, so the
callback should be registered after this 'hctx' is added to xarray table.
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Newman <peternewman@google.com>
Cc: Babu Moger <babu.moger@amd.com>
Cc: Luck Tony <tony.luck@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20241206111611.978870-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmdJ6jwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvgeEADaPL+qmsRyp070K1OI/yTA9Jhf6liEyJ31
0GPVar5Vt6ObH6/POObrJqbBtAo5asQanFvwyVyLztKYxPHU7sdQaRcD+vvj7q3+
EhmQZSKM7Spp77awWhRWbeQfUBvdhTeGHjQH/0e60eIrF9KtEL9sM9hVqc8hBD9F
YtDNWPCk7Rz1PPNYlGEkQ2JmYmaxh3Gn29c/k1cSSo3InEQOFj6x+0Cgz6RjbTx3
9HfpLhVG3WV5MlZCCwp7KG36aJzlc0nq53x/sC9cg+F17RvL2EwNAOUfLl75/Kp/
t7PCQSd2ODciiDN9qZW71KGtVtlJ07W048Rk0nB+ogneC0uh4fuIYTidP9D7io7D
bBMrhDuUpnlPzlOqg0aeedXePQL7TRfT3CTentol6xldqg14n7C4QTQFQMSJCgJf
gr4YCTwl0RTknXo0A3ja16XwsUq5+2xsSoCTU25TY+wgKiAcc5lN9fhbvPRzbCQC
u9EQ9I9IFAMqEdnE51sw0x16fLtN2w4/zOkvTF+gD/KooEjSn9lcfeNue7jt1O0/
gFvFJCdXK/2GgxwHihvsEVdcNeaS8JowNafKUsfOM2G0qWQbY+l2vl/b5PfwecWi
0knOaqNWlGMwrQ+z+fgsEeFG7X98ninC7tqVZpzoZ7j0x65anH+Jq4q1Egongj0H
90zclclxjg==
=6cbB
-----END PGP SIGNATURE-----
Merge tag 'block-6.13-20242901' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- NVMe pull request via Keith:
- Use correct srcu list traversal (Breno)
- Scatter-gather support for metadata (Keith)
- Fabrics shutdown race condition fix (Nilay)
- Persistent reservations updates (Guixin)
- Add the required bits for MD atomic write support for raid0/1/10
- Correct return value for unknown opcode in ublk
- Fix deadlock with zone revalidation
- Fix for the io priority request vs bio cleanups
- Use the correct unsigned int type for various limit helpers
- Fix for a race in loop
- Cleanup blk_rq_prep_clone() to prevent uninit-value warning and make
it easier for actual humans to read
- Fix potential UAF when iterating tags
- A few fixes for bfq-iosched UAF issues
- Fix for brd discard not decrementing the allocated page count
- Various little fixes and cleanups
* tag 'block-6.13-20242901' of git://git.kernel.dk/linux: (36 commits)
brd: decrease the number of allocated pages which discarded
block, bfq: fix bfqq uaf in bfq_limit_depth()
block: Don't allow an atomic write be truncated in blkdev_write_iter()
mq-deadline: don't call req_get_ioprio from the I/O completion handler
block: Prevent potential deadlock in blk_revalidate_disk_zones()
block: Remove extra part pointer NULLify in blk_rq_init()
nvme: tuning pr code by using defined structs and macros
nvme: introduce change ptpl and iekey definition
block: return bool from get_disk_ro and bdev_read_only
block: remove a duplicate definition for bdev_read_only
block: return bool from blk_rq_aligned
block: return unsigned int from blk_lim_dma_alignment_and_pad
block: return unsigned int from queue_dma_alignment
block: return unsigned int from bdev_io_opt
block: req->bio is always set in the merge code
block: don't bother checking the data direction for merges
block: blk-mq: fix uninit-value in blk_rq_prep_clone and refactor
Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()"
md/raid10: Atomic write support
md/raid1: Atomic write support
...
Set new allocated bfqq to bic or remove freed bfqq from bic are both
protected by bfqd->lock, however bfq_limit_depth() is deferencing bfqq
from bic without the lock, this can lead to UAF if the io_context is
shared by multiple tasks.
For example, test bfq with io_uring can trigger following UAF in v6.6:
==================================================================
BUG: KASAN: slab-use-after-free in bfqq_group+0x15/0x50
Call Trace:
<TASK>
dump_stack_lvl+0x47/0x80
print_address_description.constprop.0+0x66/0x300
print_report+0x3e/0x70
kasan_report+0xb4/0xf0
bfqq_group+0x15/0x50
bfqq_request_over_limit+0x130/0x9a0
bfq_limit_depth+0x1b5/0x480
__blk_mq_alloc_requests+0x2b5/0xa00
blk_mq_get_new_requests+0x11d/0x1d0
blk_mq_submit_bio+0x286/0xb00
submit_bio_noacct_nocheck+0x331/0x400
__block_write_full_folio+0x3d0/0x640
writepage_cb+0x3b/0xc0
write_cache_pages+0x254/0x6c0
write_cache_pages+0x254/0x6c0
do_writepages+0x192/0x310
filemap_fdatawrite_wbc+0x95/0xc0
__filemap_fdatawrite_range+0x99/0xd0
filemap_write_and_wait_range.part.0+0x4d/0xa0
blkdev_read_iter+0xef/0x1e0
io_read+0x1b6/0x8a0
io_issue_sqe+0x87/0x300
io_wq_submit_work+0xeb/0x390
io_worker_handle_work+0x24d/0x550
io_wq_worker+0x27f/0x6c0
ret_from_fork_asm+0x1b/0x30
</TASK>
Allocated by task 808602:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_slab_alloc+0x83/0x90
kmem_cache_alloc_node+0x1b1/0x6d0
bfq_get_queue+0x138/0xfa0
bfq_get_bfqq_handle_split+0xe3/0x2c0
bfq_init_rq+0x196/0xbb0
bfq_insert_request.isra.0+0xb5/0x480
bfq_insert_requests+0x156/0x180
blk_mq_insert_request+0x15d/0x440
blk_mq_submit_bio+0x8a4/0xb00
submit_bio_noacct_nocheck+0x331/0x400
__blkdev_direct_IO_async+0x2dd/0x330
blkdev_write_iter+0x39a/0x450
io_write+0x22a/0x840
io_issue_sqe+0x87/0x300
io_wq_submit_work+0xeb/0x390
io_worker_handle_work+0x24d/0x550
io_wq_worker+0x27f/0x6c0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x1b/0x30
Freed by task 808589:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x27/0x40
__kasan_slab_free+0x126/0x1b0
kmem_cache_free+0x10c/0x750
bfq_put_queue+0x2dd/0x770
__bfq_insert_request.isra.0+0x155/0x7a0
bfq_insert_request.isra.0+0x122/0x480
bfq_insert_requests+0x156/0x180
blk_mq_dispatch_plug_list+0x528/0x7e0
blk_mq_flush_plug_list.part.0+0xe5/0x590
__blk_flush_plug+0x3b/0x90
blk_finish_plug+0x40/0x60
do_writepages+0x19d/0x310
filemap_fdatawrite_wbc+0x95/0xc0
__filemap_fdatawrite_range+0x99/0xd0
filemap_write_and_wait_range.part.0+0x4d/0xa0
blkdev_read_iter+0xef/0x1e0
io_read+0x1b6/0x8a0
io_issue_sqe+0x87/0x300
io_wq_submit_work+0xeb/0x390
io_worker_handle_work+0x24d/0x550
io_wq_worker+0x27f/0x6c0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x1b/0x30
Fix the problem by protecting bic_to_bfqq() with bfqd->lock.
CC: Jan Kara <jack@suse.cz>
Fixes: 76f1df88bb ("bfq: Limit number of requests consumed by each cgroup")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20241129091509.2227136-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A write which goes past the end of the bdev in blkdev_write_iter() will
be truncated. Truncating cannot tolerated for an atomic write, so error
that condition.
Fixes: caf336f81b ("block: Add fops atomic write support")
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241127092318.632790-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req_get_ioprio looks at req->bio to find the I/O priority, which is not
set when completing bios that the driver fully iterated through.
Stash away the dd_per_prio in the elevator private data instead of looking
it up again to optimize the code a bit while fixing the regression from
removing the per-request ioprio value.
Fixes: 6975c1a486 ("block: remove the ioprio field from struct request")
Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Reported-by: Sam Protsenko <semen.protsenko@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Tested-by: Sam Protsenko <semen.protsenko@linaro.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241126102136.619067-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function blk_revalidate_disk_zones() calls the function
disk_update_zone_resources() after freezing the device queue. In turn,
disk_update_zone_resources() calls queue_limits_start_update() which
takes a queue limits mutex lock, resulting in the ordering:
q->q_usage_counter check -> q->limits_lock. However, the usual ordering
is to always take a queue limit lock before freezing the queue to commit
the limits updates, e.g., the code pattern:
lim = queue_limits_start_update(q);
...
blk_mq_freeze_queue(q);
ret = queue_limits_commit_update(q, &lim);
blk_mq_unfreeze_queue(q);
Thus, blk_revalidate_disk_zones() introduces a potential circular
locking dependency deadlock that lockdep sometimes catches with the
splat:
[ 51.934109] ======================================================
[ 51.935916] WARNING: possible circular locking dependency detected
[ 51.937561] 6.12.0+ #2107 Not tainted
[ 51.938648] ------------------------------------------------------
[ 51.940351] kworker/u16:4/157 is trying to acquire lock:
[ 51.941805] ffff9fff0aa0bea8 (&q->limits_lock){+.+.}-{4:4}, at: disk_update_zone_resources+0x86/0x170
[ 51.944314]
but task is already holding lock:
[ 51.945688] ffff9fff0aa0b890 (&q->q_usage_counter(queue)#3){++++}-{0:0}, at: blk_revalidate_disk_zones+0x15f/0x340
[ 51.948527]
which lock already depends on the new lock.
[ 51.951296]
the existing dependency chain (in reverse order) is:
[ 51.953708]
-> #1 (&q->q_usage_counter(queue)#3){++++}-{0:0}:
[ 51.956131] blk_queue_enter+0x1c9/0x1e0
[ 51.957290] blk_mq_alloc_request+0x187/0x2a0
[ 51.958365] scsi_execute_cmd+0x78/0x490 [scsi_mod]
[ 51.959514] read_capacity_16+0x111/0x410 [sd_mod]
[ 51.960693] sd_revalidate_disk.isra.0+0x872/0x3240 [sd_mod]
[ 51.962004] sd_probe+0x2d7/0x520 [sd_mod]
[ 51.962993] really_probe+0xd5/0x330
[ 51.963898] __driver_probe_device+0x78/0x110
[ 51.964925] driver_probe_device+0x1f/0xa0
[ 51.965916] __driver_attach_async_helper+0x60/0xe0
[ 51.967017] async_run_entry_fn+0x2e/0x140
[ 51.968004] process_one_work+0x21f/0x5a0
[ 51.968987] worker_thread+0x1dc/0x3c0
[ 51.969868] kthread+0xe0/0x110
[ 51.970377] ret_from_fork+0x31/0x50
[ 51.970983] ret_from_fork_asm+0x11/0x20
[ 51.971587]
-> #0 (&q->limits_lock){+.+.}-{4:4}:
[ 51.972479] __lock_acquire+0x1337/0x2130
[ 51.973133] lock_acquire+0xc5/0x2d0
[ 51.973691] __mutex_lock+0xda/0xcf0
[ 51.974300] disk_update_zone_resources+0x86/0x170
[ 51.975032] blk_revalidate_disk_zones+0x16c/0x340
[ 51.975740] sd_zbc_revalidate_zones+0x73/0x160 [sd_mod]
[ 51.976524] sd_revalidate_disk.isra.0+0x465/0x3240 [sd_mod]
[ 51.977824] sd_probe+0x2d7/0x520 [sd_mod]
[ 51.978917] really_probe+0xd5/0x330
[ 51.979915] __driver_probe_device+0x78/0x110
[ 51.981047] driver_probe_device+0x1f/0xa0
[ 51.982143] __driver_attach_async_helper+0x60/0xe0
[ 51.983282] async_run_entry_fn+0x2e/0x140
[ 51.984319] process_one_work+0x21f/0x5a0
[ 51.985873] worker_thread+0x1dc/0x3c0
[ 51.987289] kthread+0xe0/0x110
[ 51.988546] ret_from_fork+0x31/0x50
[ 51.989926] ret_from_fork_asm+0x11/0x20
[ 51.991376]
other info that might help us debug this:
[ 51.994127] Possible unsafe locking scenario:
[ 51.995651] CPU0 CPU1
[ 51.996694] ---- ----
[ 51.997716] lock(&q->q_usage_counter(queue)#3);
[ 51.998817] lock(&q->limits_lock);
[ 52.000043] lock(&q->q_usage_counter(queue)#3);
[ 52.001638] lock(&q->limits_lock);
[ 52.002485]
*** DEADLOCK ***
Prevent this issue by moving the calls to blk_mq_freeze_queue() and
blk_mq_unfreeze_queue() around the call to queue_limits_commit_update()
in disk_update_zone_resources(). In case of revalidation failure, the
call to disk_free_zone_resources() in blk_revalidate_disk_zones()
is still done with the queue frozen as before.
Fixes: 843283e96e ("block: Fake max open zones limit when there is no limit")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241126104705.183996-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As smatch, which is a lot smarter than me noticed. So remove the checks
for it, and condense these checks a bit including the comments stating
the obvious.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241119161157.1328171-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Because it already is encoded in the opcode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241119161157.1328171-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix an issue detected by the `smatch` tool:
block/blk-mq.c:3314 blk_rq_prep_clone() error: uninitialized
symbol 'bio'.
This patch refactors `blk_rq_prep_clone()` to improve code
readability and ensure safety by addressing potential misuse of
the `bio` variable:
- Move the bio_put(bio); call to the bio_ctr error handling block,
which is the only place where it can be triggered.
- Move the bio variable into the __rq_for_each_bio loop scope.
This change removes the need to set bio to NULL at the loop's
end.
discussion on why bio remains uninitialized:
https://lore.kernel.org/lkml/20241004141037.43277-1-surajsonawane0215@gmail.com
Summary of above discussion:
- I pointed out that `bio` can remain uninitialized if the
allocation with `bio_alloc_clone` fails.
- Keith Busch explained that `bio` is initialized to `NULL` when
`bio_alloc_clone()` fails, preventing uninitialized usage.
- John Garry questioned whether `rq_src->bio` being `NULL` could
leave `bio` uninitialized. Keith clarified that in such cases,
`bio` is not referenced, so it does not need initialization.
- Christoph Hellwig recommended code improvements:
- move the bio_put to the bio_ctr error handling, which is the only
case where it can happen
- move the bio variable into the __rq_for_each_bio scope, which
also removed the need to zero it at the end of the loop
These changes enhance code clarity, address static analysis tool
warnings, and make the function more maintainable.
thread of previous version patch discussion:
https://lore.kernel.org/lkml/20241004100842.9052-1-surajsonawane0215@gmail.com
Signed-off-by: Suraj Sonawane <surajsonawane0215@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241119164412.37609-1-surajsonawane0215@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow stacked devices to support atomic writes by aggregating the minimum
capability of all bottom devices.
Flag BLK_FEAT_ATOMIC_WRITES_STACKED is set for stacked devices which
have been enabled to support atomic writes.
Some things to note on the implementation:
- For simplicity, all bottom devices must have same atomic write boundary
value (if any)
- The atomic write boundary must be a power-of-2 already, but this
restriction could be relaxed. Furthermore, it is now required that the
chunk sectors for a top device must be aligned with this boundary.
- If a bottom device atomic write unit min/max are not aligned with the
top device chunk sectors, the top device atomic write unit min/max are
reduced to a value which works for the chunk sectors.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241118105018.1870052-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is so far expected that the limits passed are valid.
In future atomic writes will be supported for stacked block devices, and
calculating the limits there will be complicated, so add extra sanity
checks to ensure that the values are always valid.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241118105018.1870052-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
lim->discard_granularity is always at least SECTOR_SIZE, so drop the
pointless check for granularity less than SECTOR_SIZE.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241112092144.4059847-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmc7S40QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjHVD/43rDZ8ehs+IAAr6S0RemNX1SRG0mK2UOEb
kMoNogS7StO/c4JYW3JuzCyLRn5ZsgeWV/muqxwDEWQrmTGrvi+V45KikrZPwm3k
p0ump33qV9EU2jiR1MKZjtwK2P0CI7/DD3W8ww6IOvKbTT7RcqQcdHznvXArFBtc
xCuQPpayFG7ZasC+N9VaBwtiUEVgU3Ek9AFT7UVZRWajjHPNalQwaooJWayO0rEG
KdoW5yG0ryLrgCY2ACSvRLS+2s14EJtb8hgT08WKHTNgd5LxhSKxfsTapamua+7U
FdVS6Ij0tEkgu2jpvgj7QKO0Uw10Cnep2gj7RHts/LVewvkliS6XcheOzqRS1jWU
I2EI+UaGOZ11OUiw52VIveEVS5zV/NWhgy5BSP9LYEvXw0BUAHRDYGMem8o5G1V1
SWqjIM1UWvcQDlAnMF9FDVzojvjVUmYWvcAlFFztO8J0B7SavHR3NcfHwEf57reH
rNoUbi/9c4/wjJJF33gejiR5pU+ewy/Mk75GrtX3xpEqlztfRbf9/FbPCMEAO1KR
DF/b3lkUV9i2/BRW6a0SpZ5RDSmSYMnateel6TrPyVSRnpiSSFO8FrbynwUOa17b
6i49YDFWzzXOrR1YWDg6IEtTrcmBEmvi7F6aoDs020qUnL0hwLn1ZuoIxuiFEpor
Z0iFF1B/nw==
=PWTH
-----END PGP SIGNATURE-----
Merge tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe updates via Keith:
- Use uring_cmd helper (Pavel)
- Host Memory Buffer allocation enhancements (Christoph)
- Target persistent reservation support (Guixin)
- Persistent reservation tracing (Guixen)
- NVMe 2.1 specification support (Keith)
- Rotational Meta Support (Matias, Wang, Keith)
- Volatile cache detection enhancment (Guixen)
- MD updates via Song:
- Maintainers update
- raid5 sync IO fix
- Enhance handling of faulty and blocked devices
- raid5-ppl atomic improvement
- md-bitmap fix
- Support for manually defining embedded partition tables
- Zone append fixes and cleanups
- Stop sending the queued requests in the plug list to the driver
->queue_rqs() handle in reverse order.
- Zoned write plug cleanups
- Cleanups disk stats tracking and add support for disk stats for
passthrough IO
- Add preparatory support for file system atomic writes
- Add lockdep support for queue freezing. Already found a bunch of
issues, and some fixes for that are in here. More will be coming.
- Fix race between queue stopping/quiescing and IO queueing
- ublk recovery improvements
- Fix ublk mmap for 64k pages
- Various fixes and cleanups
* tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits)
MAINTAINERS: Update git tree for mdraid subsystem
block: make struct rq_list available for !CONFIG_BLOCK
block/genhd: use seq_put_decimal_ull for diskstats decimal values
block: don't reorder requests in blk_mq_add_to_batch
block: don't reorder requests in blk_add_rq_to_plug
block: add a rq_list type
block: remove rq_list_move
virtio_blk: reverse request order in virtio_queue_rqs
nvme-pci: reverse request order in nvme_queue_rqs
btrfs: validate queue limits
block: export blk_validate_limits
nvmet: add tracing of reservation commands
nvme: parse reservation commands's action and rtype to string
nvmet: report ns's vwc not present
md/raid5: Increase r5conf.cache_name size
block: remove the ioprio field from struct request
block: remove the write_hint field from struct request
nvme: check ns's volatile write cache not present
nvme: add rotational support
nvme: use command set independent id ns if available
...
There was a bug report [1] where the user got a warning alignment
inconsistency. The user has optimal I/O 16776704 (0xFFFE00) and physical
block size 4096. Note that the optimal I/O size may be set by the DMA
engines or SCSI controllers and they have no knowledge about the disks
attached to them, so the situation with optimal I/O not aligned to
physical block size may happen.
This commit makes blk_validate_limits round down optimal I/O size to the
physical block size of the block device.
Closes: https://lore.kernel.org/dm-devel/1426ad71-79b4-4062-b2bf-84278be66a5d@redhat.com/T/ [1]
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: a23634644a ("block: take io_opt and io_min into account for max_sectors")
Cc: stable@vger.kernel.org # v6.11+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/3dc0014b-9690-dc38-81c9-4a316a2d4fb2@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcopwAKCRCRxhvAZXjc
oitWAQD68PGFI6/ES9x+qGsDFEZBH08icuO+a9dyaZXyNRosDgD/ex2zHj6F7IzS
Ghgb9jiqWQ8l2+PDYfisxa/0jiqCbAk=
=DmXf
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs untorn write support from Christian Brauner:
"An atomic write is a write issed with torn-write protection. This
means for a power failure or any hardware failure all or none of the
data from the write will be stored, never a mix of old and new data.
This work is already supported for block devices. If a block device is
opened with O_DIRECT and the block device supports atomic write, then
FMODE_CAN_ATOMIC_WRITE is added to the file of the opened block
device.
This contains the work to expand atomic write support to filesystems,
specifically ext4 and XFS. Currently, only support for writing exactly
one filesystem block atomically is added.
Since it's now possible to have filesystem block size > page size for
XFS, it's possible to write 4K+ blocks atomically on x86"
* tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: drop an obsolete comment in iomap_dio_bio_iter
ext4: Do not fallback to buffered-io for DIO atomic write
ext4: Support setting FMODE_CAN_ATOMIC_WRITE
ext4: Check for atomic writes support in write iter
ext4: Add statx support for atomic writes
xfs: Support setting FMODE_CAN_ATOMIC_WRITE
xfs: Validate atomic writes
xfs: Support atomic write for statx
fs: iomap: Atomic write support
fs: Export generic_atomic_write_valid()
block: Add bdev atomic write limits helpers
fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()
block/fs: Pass an iocb to generic_atomic_write_valid()
seq_printf is costly. For each block device, 19 decimal values are
yielded in /proc/diskstats via seq_printf. On a system with 16 logical
block devices, profiling for open/read/close sequences shows seq_printf
took ~75% samples of diskstats_show:
diskstats_show(92.626% 2269372/2450040)
seq_printf(76.026% 1725313/2269372)
vsnprintf(99.163% 1710866/1725313)
format_decode(26.597% 455040/1710866)
number(19.554% 334542/1710866)
memcpy_orig(4.183% 71570/1710866)
...
srso_return_thunk(0.009% 148/1725313)
part_stat_read_all(8.030% 182236/2269372)
One million rounds of open/read/close /proc/diskstats takes:
real 0m37.687s
user 0m0.264s
sys 0m32.911s
On average, each sequence tooks ~0.032ms
With this patch, most decimal values are yield via seq_put_decimal_ull,
performance is significantly improved:
real 0m20.792s
user 0m0.316s
sys 0m20.463s
On average, each sequence tooks ~0.020ms, a ~37.5% improvement.
Signed-off-by: David Wang <00107082@163.com>
Link: https://lore.kernel.org/r/20241108054500.4251-1-00107082@163.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add requests to the tail of the list instead of the front so that they
are queued up in submission order.
Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs
and nvme_queue_rqs now that the list is ordered as expected.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the semi-open coded request list helpers with a proper rq_list
type that mirrors the bio_list and has head and tail pointers. Besides
better type safety this actually allows to insert at the tail of the
list, which will be useful soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While block drivers do the validation as part of committing them to the
queue, users that use the limit outside of a block device context have
to validate the limits and fill in the calculated values as well.
So far btrfs is the only user of queue limits without a block device,
and it has gotten away with that more or less by accident. But with
commit 559218d43e ("block: pre-calculate max_zone_append_sectors")
this became fatal for setups that have small max zone append size,
as it won't be limited now.
Export blk_validate_limits so that it can be called directly from btrfs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241113084541.34315-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The request ioprio is only initialized from the first attached bio,
so requests without a bio already never set it. Directly use the
bio field instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241112170050.1612998-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The write_hint is only used for read/write requests, which must have a
bio attached to them. Just use the bio field instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241112170050.1612998-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper. This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.
Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>