Commit Graph

7906 Commits

Author SHA1 Message Date
David Wang
bda9c7d92f block/genhd: use seq_put_decimal_ull for diskstats decimal values
seq_printf is costly. For each block device, 19 decimal values are
yielded in /proc/diskstats via seq_printf. On a system with 16 logical
block devices, profiling for open/read/close sequences shows seq_printf
took ~75% samples of diskstats_show:

	diskstats_show(92.626% 2269372/2450040)
	    seq_printf(76.026% 1725313/2269372)
		vsnprintf(99.163% 1710866/1725313)
		    format_decode(26.597% 455040/1710866)
		    number(19.554% 334542/1710866)
		    memcpy_orig(4.183% 71570/1710866)
			...
		srso_return_thunk(0.009% 148/1725313)
	    part_stat_read_all(8.030% 182236/2269372)

One million rounds of open/read/close /proc/diskstats takes:

	real	0m37.687s
	user	0m0.264s
	sys	0m32.911s
On average, each sequence tooks ~0.032ms

With this patch, most decimal values are yield via seq_put_decimal_ull,
performance is significantly improved:

	real	0m20.792s
	user	0m0.316s
	sys	0m20.463s
On average, each sequence tooks ~0.020ms, a ~37.5% improvement.

Signed-off-by: David Wang <00107082@163.com>
Link: https://lore.kernel.org/r/20241108054500.4251-1-00107082@163.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 14:02:19 -07:00
Christoph Hellwig
e70c301fae block: don't reorder requests in blk_add_rq_to_plug
Add requests to the tail of the list instead of the front so that they
are queued up in submission order.

Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs
and nvme_queue_rqs now that the list is ordered as expected.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 12:04:58 -07:00
Christoph Hellwig
a3396b9999 block: add a rq_list type
Replace the semi-open coded request list helpers with a proper rq_list
type that mirrors the bio_list and has head and tail pointers.  Besides
better type safety this actually allows to insert at the tail of the
list, which will be useful soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 12:04:58 -07:00
Christoph Hellwig
470d2bc3a0 block: export blk_validate_limits
While block drivers do the validation as part of committing them to the
queue, users that use the limit outside of a block device context have
to validate the limits and fill in the calculated values as well.

So far btrfs is the only user of queue limits without a block device,
and it has gotten away with that more or less by accident.  But with
commit 559218d43e ("block: pre-calculate max_zone_append_sectors")
this became fatal for setups that have small max zone append size,
as it won't be limited now.

Export blk_validate_limits so that it can be called directly from btrfs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241113084541.34315-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 11:40:11 -07:00
Christoph Hellwig
6975c1a486 block: remove the ioprio field from struct request
The request ioprio is only initialized from the first attached bio,
so requests without a bio already never set it.  Directly use the
bio field instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241112170050.1612998-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-12 14:42:02 -07:00
Christoph Hellwig
61952bb734 block: remove the write_hint field from struct request
The write_hint is only used for read/write requests, which must have a
bio attached to them.  Just use the bio field instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241112170050.1612998-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-12 14:42:02 -07:00
Christoph Hellwig
559218d43e block: pre-calculate max_zone_append_sectors
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper.  This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.

Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11 09:20:36 -07:00
Christoph Hellwig
0ef2b9e698 block: lift bio_is_zone_append to bio.h
Make bio_is_zone_append globally available, because file systems need
to use to check for a zone append bio in their end_io handlers to deal
with the block layer emulation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241104062647.91160-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11 09:06:45 -07:00
Christoph Hellwig
7ecd2cd4fa block: fix bio_split_rw_at to take zone_write_granularity into account
Otherwise it can create unaligned writes on zoned devices.

Fixes: a805a4fa4f ("block: introduce zone_write_granularity limit")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241104062647.91160-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11 09:06:45 -07:00
Christoph Hellwig
60dc5ea6bc block: take chunk_sectors into account in bio_split_write_zeroes
For zoned devices, write zeroes must be split at the zone boundary
which is represented as chunk_sectors.  For other uses like the
internally RAIDed NVMe devices it is probably at least useful.

Enhance get_max_io_size to know about write zeroes and use it in
bio_split_write_zeroes.  Also add a comment about the seemingly
nonsensical zero max_write_zeroes limit.

Fixes: 885fa13f65 ("block: implement splitting of REQ_OP_WRITE_ZEROES bios")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241104062647.91160-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11 09:06:45 -07:00
John Garry
6eb0968588 block: Handle bio_split() errors in bio_submit_split()
bio_split() may error, so check this.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-4-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11 08:35:46 -07:00
John Garry
27b26f09a7 block: Error an attempt to split an atomic write in bio_split()
This is disallowed.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11 08:35:46 -07:00
John Garry
e546fe1da9 block: Rework bio_split() return value
Instead of returning an inconclusive value of NULL for an error in calling
bio_split(), return a ERR_PTR() always.

Also remove the BUG_ON() calls, and WARN_ON_ONCE() instead. Indeed, since
almost all callers don't check the return code from bio_split(), we'll
crash anyway (for those failures).

Fix up the only user which checks bio_split() return code today (directly
or indirectly), blk_crypto_fallback_split_bio_if_needed(). The md/bcache
code does check the return code in cached_dev_cache_miss() ->
bio_next_split() -> bio_split(), but only to see if there was a split, so
there would be no change in behaviour here (when returning a ERR_PTR()).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11 08:35:46 -07:00
Ming Lei
357e1b7f73 block: don't verify IO lock for freeze/unfreeze in elevator_init_mq()
elevator_init_mq() is only called at the entry of add_disk_fwnode() when
disk IO isn't allowed yet.

So not verify io lock(q->io_lockdep_map) for freeze & unfreeze in
elevator_init_mq().

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Lai Yi <yi1.lai@linux.intel.com>
Fixes: f1be1788a3 ("block: model freeze & enter queue as lock for supporting lockdep")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 16:27:22 -07:00
Ming Lei
6a78699838 block: always verify unfreeze lock on the owner task
commit f1be1788a3 ("block: model freeze & enter queue as lock for
supporting lockdep") tries to apply lockdep for verifying freeze &
unfreeze. However, the verification is only done the outmost freeze and
unfreeze. This way is actually not correct because q->mq_freeze_depth
still may drop to zero on other task instead of the freeze owner task.

Fix this issue by always verifying the last unfreeze lock on the owner
task context, and make sure both the outmost freeze & unfreeze are
verified in the current task.

Fixes: f1be1788a3 ("block: model freeze & enter queue as lock for supporting lockdep")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 16:27:22 -07:00
Ming Lei
54027869df block: remove blk_freeze_queue()
No one use blk_freeze_queue(), so remove it and the obsolete comment.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 16:27:22 -07:00
Damien Le Moal
f3d9bf0514 block: Add a public bdev_zone_is_seq() helper
Turn the private disk_zone_is_conv() function in blk-zoned.c into a
public and documented bdev_zone_is_seq() helper with the inverse
polarity of the original function, also adding a check for non-zoned
devices so that all file systems can use the helper, even with a regular
block device.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241107064300.227731-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 15:30:54 -07:00
Damien Le Moal
d7cb6d7414 block: RCU protect disk->conv_zones_bitmap
Ensure that a disk revalidation changing the conventional zones bitmap
of a disk does not cause invalid memory references when using the
disk_zone_is_conv() helper by RCU protecting the disk->conv_zones_bitmap
pointer.

disk_zone_is_conv() is modified to operate under the RCU read lock and
the function disk_set_conv_zones_bitmap() is added to update a disk
conv_zones_bitmap pointer using rcu_replace_pointer() with the disk
zone_wplugs_lock spinlock held.

disk_free_zone_resources() is modified to call
disk_update_zone_resources() with a NULL bitmap pointer to free the disk
conv_zones_bitmap. disk_set_conv_zones_bitmap() is also used in
disk_update_zone_resources() to set the new (revalidated) bitmap and
free the old one.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241107064300.227731-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 15:30:54 -07:00
zhangguopeng
8e71afb94d block: Replace sprintf() with sysfs_emit()
Per Documentation/filesystems/sysfs.rst, show() should only use
sysfs_emit() or sysfs_emit_at() when formatting the value to be
returned to user space.

No functional change intended.

Signed-off-by: zhangguopeng <zhangguopeng@kylinos.cn>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241107104258.29742-1-zhangguopeng@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 15:30:30 -07:00
Damien Le Moal
4122fef16b block: Switch to using refcount_t for zone write plugs
Replace the raw atomic_t reference counting of zone write plugs with a
refcount_t.  No functional changes.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202411050650.ilIZa8S7-lkp@intel.com/
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241107065438.236348-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 11:21:52 -07:00
Jens Axboe
ab9bc81c1c Revert "block: pre-calculate max_zone_append_sectors"
This causes issue on, at least, nvme-mpath where my boot fails with:

WARNING: CPU: 354 PID: 2729 at block/blk-settings.c:75 blk_validate_limits+0x356/0x380
Modules linked in: tg3(+) nvme usbcore scsi_mod ptp i2c_piix4 libphy nvme_core crc32c_intel scsi_common usb_common pps_core i2c_smbus
CPU: 354 UID: 0 PID: 2729 Comm: kworker/u2061:1 Not tainted 6.12.0-rc6+ #181
Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.8.3 04/02/2024
Workqueue: async async_run_entry_fn
RIP: 0010:blk_validate_limits+0x356/0x380
Code: f6 47 01 04 75 28 83 bf 94 00 00 00 00 75 39 83 bf 98 00 00 00 00 75 34 83 7f 68 00 75 32 31 c0 83 7f 5c 00 0f 84 9b fd ff ff <0f> 0b eb 13 0f 0b eb 0f 48 c7 c0 74 12 58 92 48 89 c7 e8 13 76 46
RSP: 0018:ffffa8a1dfb93b30 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff9232829c8388 RCX: 0000000000000088
RDX: 0000000000000080 RSI: 0000000000000200 RDI: ffffa8a1dfb93c38
RBP: 000000000000000c R08: 00000000ffffffff R09: 000000000000ffff
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9232829b9000
R13: ffff9232829b9010 R14: ffffa8a1dfb93c38 R15: ffffa8a1dfb93c38
FS:  0000000000000000(0000) GS:ffff923867c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055c1b92480a8 CR3: 0000002484ff0002 CR4: 0000000000370ef0
Call Trace:
 <TASK>
 ? __warn+0xca/0x1a0
 ? blk_validate_limits+0x356/0x380
 ? report_bug+0x11a/0x1a0
 ? handle_bug+0x5e/0x90
 ? exc_invalid_op+0x16/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? blk_validate_limits+0x356/0x380
 blk_alloc_queue+0x7a/0x250
 __blk_alloc_disk+0x39/0x80
 nvme_mpath_alloc_disk+0x13d/0x1b0 [nvme_core]
 nvme_scan_ns+0xcc7/0x1010 [nvme_core]
 async_run_entry_fn+0x27/0x120
 process_scheduled_works+0x1a0/0x360
 worker_thread+0x2bc/0x350
 ? pr_cont_work+0x1b0/0x1b0
 kthread+0x111/0x120
 ? kthread_unuse_mm+0x90/0x90
 ret_from_fork+0x30/0x40
 ? kthread_unuse_mm+0x90/0x90
 ret_from_fork_asm+0x11/0x20
 </TASK>
---[ end trace 0000000000000000 ]---

presumably due to max_zone_append_sectors not being cleared to zero,
resulting in blk_validate_zoned_limits() complaining and failing.

This reverts commit 2a8f6153e1.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 05:45:34 -07:00
Christoph Hellwig
2a8f6153e1 block: pre-calculate max_zone_append_sectors
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper.  This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.

Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-04 10:34:07 -07:00
Christoph Hellwig
e494c3dce6 block: remove the max_zone_append_sectors check in blk_revalidate_disk_zones
With the lock layer zone append emulation, we are now always setting a
max_zone_append_sectors value for zoned devices and this check can't
ever trigger.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-04 10:34:06 -07:00
Christoph Hellwig
05df016684 block: update blk_stack_limits documentation
Listing every single features that needs to be pre-set by stacking
drivers does not scale.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104054218.45596-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-04 10:33:20 -07:00
Linus Torvalds
f4a1e8e369 block-6.12-20241101
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmclGQUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkKsEADWz2dU1Edfbg5lez9Kf2SZwLKUbh9VVBQY
 kuIeGI/lQ1eeIdIgEJtCmx6ePlbk9D5WYu7lJvXfODAh5gUBLMpI4t6p4L1TkU2o
 yZ9UDoy5SK8o0UAOZhrHw4fE70EdI31Olm2mz3ToIevmpKOabNr/joVEfpEYaoPV
 t9FIGH1DvUZQSWUgw6G2lxWA7gTwfzvAfOegXF0VlFyhCEXjLXwJSUq+lWaShL33
 oDxETRYGaoBVfNVLIaZ6LFVUrvmgzFQ1Q6q1iY+DAUgMh9NXV2mzQ+/17HZjWaeU
 jsEeeOMnvtxhvQQpRyEvGA93Fh+gDkx/wljB4EmK2oqETs+oDVzts+L+YJk2ef7K
 n71dhuj04BRfduu6DFNg6ufAcMJOTgd0odgN9h9hPa0z3y0sx9hOSW6XZfepABl1
 +XDDyI4p/lVTMH/vYlS2ay2RUo7KrxfX26qSYbz6UItzo6dMl/3YjtKbzYpP0mZM
 4+Yu5YxDXirbnxpk9uhzOA4CdwiXXWxLIVIX1KeUWkJZ5wzmH2rFqmpzH8OL91uH
 J4SIMLgMCm4e45nDOLH5ZSLIc1MO2df0sJ1gZZD6MoT0xbxULyIQuj19sBszeHwH
 YgV4sQlPead4NT2zoSDeXSpeIWmv5dd15To7wVpeInyj/DHyXSHzycHYSehgdmv1
 FBQ+LMffyA==
 =e3uK
 -----END PGP SIGNATURE-----

Merge tag 'block-6.12-20241101' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - Fixup for a recent blk_rq_map_user_bvec() patch

 - NVMe pull request via Keith:
     - Spec compliant identification fix (Keith)
     - Module parameter to enable backward compatibility on unusual
       namespace formats (Keith)
     - Target double free fix when using keys (Vitaliy)
     - Passthrough command error handling fix (Keith)

* tag 'block-6.12-20241101' of git://git.kernel.dk/linux:
  nvme: re-fix error-handling for io_uring nvme-passthrough
  nvmet-auth: assign dh_key to NULL after kfree_sensitive
  nvme: module parameter to disable pi with offsets
  block: fix queue limits checks in blk_rq_map_user_bvec for real
  nvme: enhance cns version checking
2024-11-01 13:41:55 -10:00
Christoph Hellwig
f187b9bf1a block: remove bio_add_zone_append_page
This is only used by the nvmet zns passthrough code, which can trivially
just use bio_add_pc_page and do the sanity check for the max zone append
limit itself.

All future zoned file systems should follow the btrfs lead and let the
upper layers fill up bios unlimited by hardware constraints and split
them to the limits in the I/O submission handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241030051859.280923-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-31 10:54:25 -06:00
Christoph Hellwig
cafd00d0e9 block: remove zone append special casing from the direct I/O path
This code is unused, and all future zoned file systems should follow
the btrfs lead of splitting the bios themselves to the zoned limits
in the I/O submission handler, because if they didn't they would be
hit by commit ed9832bc08 ("block: introduce folio awareness and add
a bigger size from folio") breaking this code when the zone append
limit (that is usually the max_hw_sectors limit) is smaller than the
largest possible folio size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241030051859.280923-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-31 10:54:25 -06:00
Keith Busch
133008e84b blk-integrity: remove seed for user mapped buffers
The seed is only used for kernel generation and verification. That
doesn't happen for user buffers, so passing the seed around doesn't
accomplish anything.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20241016201309.1090320-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-30 07:49:32 -06:00
Christoph Hellwig
2f5a65ef30 block: add a bdev_limits helper
Add a helper to get the queue_limits from the bdev without having to
poke into the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241029141937.249920-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 09:15:00 -06:00
Christoph Hellwig
be0e822bb3 block: fix queue limits checks in blk_rq_map_user_bvec for real
blk_rq_map_user_bvec currently only has ad-hoc checks for queue limits,
and the last fix to it enabled valid NVMe I/O to pass, but also allowed
invalid one for drivers that set a max_segment_size or seg_boundary
limit.

Fix it once for all by using the bio_split_rw_at helper from the I/O
path that indicates if and where a bio would be have to be split to
adhere to the queue limits, and it returns a positive value, turn that
into -EREMOTEIO to retry using the copy path.

Fixes: 2ff9494418 ("block: fix sanity checks in blk_rq_map_user_bvec")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241028090840.446180-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-28 12:35:05 -06:00
Linus Torvalds
75f8b2f526 block-6.12-20241026
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmcc66sQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjplD/9/LUtySoj29m8jF4cTgnysGsuAjqcgNyU+
 ykykZPQca+cWVhQzgFHob7N09C4y2gF2h/wokKM2cS8gaWzsSKvRaZBTeiQODJrf
 yAqG47BRXo6KJIpqT+A+FB0eDgRitCFweq5Is7Jh/rQooqJNvZb6W3hmK/eIfKxM
 BcY98/v02/eA/hry+IqAUzhoKHASxc/iFJJ8u+lk1fJyNZvQeIgzdy6RJwp/101L
 hCA1grIQRLJ86hhvbqrrMCmKfZeeuXKvx106YFhRlG0TCpPOGCeYMeqowdH5JlX6
 inzt2NfciqncQmnKp8m3DCi2keT7AT+D1QX92JuTBAxa99qkaoqoC6b/EjbAIRpc
 0cTR+G13LbyKlUuGMSRxa50EQtG4lkkIj3VlKAkxHPtEqy9y2+mK0JA33myYunTG
 wzOL8LKl0seLKtC8zHpcBZi5KNZt1MEu7GiibJVFdouje3X/VtDs00KymOboL7Uk
 W5YmpOSpLa1kh4U1FvdT0U1/xaV0Tb4UB3xjF0Qqhtqe1js1Vq86r5u/aiX3F3oZ
 0emqwd/lMCGEzqRY7qeBN0zEj4LLXU/3Lxn6k+1LjX4exxjMaS5loZ6tPq5czxoC
 M5Qh2JmEP7zLx9hNg6QjOO+cCmLrG/oWCZRyxSsHguNeEgdEdKYZ8yDOPXtKuqkE
 Qc6YkxIOsQ==
 =zznR
 -----END PGP SIGNATURE-----

Merge tag 'block-6.12-20241026' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - Pull request for MD via Song fixing a few issues

 - Fix a wrong check in blk_rq_map_user_bvec(), causing IO errors on
   passthrough IO (Xinyu)

* tag 'block-6.12-20241026' of git://git.kernel.dk/linux:
  block: fix sanity checks in blk_rq_map_user_bvec
  md/raid10: fix null ptr dereference in raid10_size()
  md: ensure child flush IO does not affect origin bio->bi_status
2024-10-27 08:29:36 -10:00
Ming Lei
f1be1788a3 block: model freeze & enter queue as lock for supporting lockdep
Recently we got several deadlock report[1][2][3] caused by
blk_mq_freeze_queue and blk_enter_queue().

Turns out the two are just like acquiring read/write lock, so model them
as read/write lock for supporting lockdep:

1) model q->q_usage_counter as two locks(io and queue lock)

- queue lock covers sync with blk_enter_queue()

- io lock covers sync with bio_enter_queue()

2) make the lockdep class/key as per-queue:

- different subsystem has very different lock use pattern, shared lock
 class causes false positive easily

- freeze_queue degrades to no lock in case that disk state becomes DEAD
  because bio_enter_queue() won't be blocked any more

- freeze_queue degrades to no lock in case that request queue becomes dying
  because blk_enter_queue() won't be blocked any more

3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
- it is exclusive lock, so dependency with blk_enter_queue() is covered

- it is trylock because blk_mq_freeze_queue() are allowed to run
  concurrently

4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
- nested blk_enter_queue() are allowed

- dependency with blk_mq_freeze_queue() is covered

- blk_queue_exit() is often called from other contexts(such as irq), and
it can't be annotated as lock_release(), so simply do it in
blk_enter_queue(), this way still covered cases as many as possible

With lockdep support, such kind of reports may be reported asap and
needn't wait until the real deadlock is triggered.

For example, lockdep report can be triggered in the report[3] with this
patch applied.

[1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
https://bugzilla.kernel.org/show_bug.cgi?id=219166

[2] del_gendisk() vs blk_queue_enter() race condition
https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/

[3] queue_freeze & queue_enter deadlock in scsi
https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-26 07:14:53 -06:00
Ming Lei
8acdd0e7bf blk-mq: add non_owner variant of start_freeze/unfreeze queue APIs
Add non_owner variant of start_freeze/unfreeze queue APIs, so that the
caller knows that what they are doing, and we can skip lockdep support
for non_owner variant in per-call level.

Prepare for supporting lockdep for freezing/unfreezing queue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-26 07:14:53 -06:00
Xinyu Zhang
2ff9494418 block: fix sanity checks in blk_rq_map_user_bvec
blk_rq_map_user_bvec contains a check bytes + bv->bv_len > nr_iter which
causes unnecessary failures in NVMe passthrough I/O, reproducible as
follows:

- register a 2 page, page-aligned buffer against a ring
- use that buffer to do a 1 page io_uring NVMe passthrough read

The second (i = 1) iteration of the loop in blk_rq_map_user_bvec will
then have nr_iter == 1 page, bytes == 1 page, bv->bv_len == 1 page, so
the check bytes + bv->bv_len > nr_iter will succeed, causing the I/O to
fail. This failure is unnecessary, as when the check succeeds, it means
we've checked the entire buffer that will be used by the request - i.e.
blk_rq_map_user_bvec should complete successfully. Therefore, terminate
the loop early and return successfully when the check bytes + bv->bv_len
> nr_iter succeeds.

While we're at it, also remove the check that all segments in the bvec
are single-page. While this seems to be true for all users of the
function, it doesn't appear to be required anywhere downstream.

CC: stable@vger.kernel.org
Signed-off-by: Xinyu Zhang <xizhang@purestorage.com>
Co-developed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Fixes: 3798754793 ("block: extend functionality to map bvec iterator")
Link: https://lore.kernel.org/r/20241023211519.4177873-1-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-23 17:02:48 -06:00
Bart Van Assche
e203e20a8b blk-mq: Unexport blk_mq_flush_busy_ctxs()
Commit a6088845c2 ("block: kyber: make kyber more friendly with merging")
removed the only blk_mq_flush_busy_ctxs() call from outside the block layer
core. Hence unexport blk_mq_flush_busy_ctxs().

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241023202850.3469279-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-23 16:30:40 -06:00
Bart Van Assche
ccd9e252c5 blk-mq: Make blk_mq_quiesce_tagset() hold the tag list mutex less long
Make sure that the tag_list_lock mutex is not held any longer than
necessary. This change reduces latency if e.g. blk_mq_quiesce_tagset()
is called concurrently from more than one thread. This function is used
by the NVMe core and also by the UFS driver.

Reported-by: Peter Wang <peter.wang@mediatek.com>
Cc: Chao Leng <lengchao@huawei.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 414dd48e88 ("blk-mq: add tagset quiesce interface")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20241022181617.2716173-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 19:11:25 -06:00
Muchun Song
904ebd2527 block: remove redundant explicit memory barrier from rq_qos waiter and waker
The memory barriers in list_del_init_careful() and list_empty_careful()
in pairs already handle the proper ordering between data.got_token
and data.wq.entry. So remove the redundant explicit barriers. And also
change a "break" statement to "return" to avoid redundant calling of
finish_wait().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Link: https://lore.kernel.org/r/20241021085251.73353-1-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 14:05:09 -06:00
Jens Axboe
fdad1a20cd Merge branch 'for-6.13/block-atomic' into for-6.13/block
Merge in block/fs prep patches for the atomic write support.

* for-6.13/block-atomic:
  block: Add bdev atomic write limits helpers
  fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()
  block/fs: Pass an iocb to generic_atomic_write_valid()
2024-10-22 08:21:51 -06:00
Li Lingfeng
919b5139bd block: flush all throttled bios when deleting the cgroup
When a process migrates to another cgroup and the original cgroup is deleted,
the restrictions of throttled bios cannot be removed. If the restrictions
are set too low, it will take a long time to complete these bios.

Refer to the process of deleting a disk to remove the restrictions and
issue bios when deleting the cgroup.

This makes difference on the behavior of throttled bios:
Before: the limit of the throttled bios can't be changed and the bios will
complete under this limit;
Now: the limit will be canceled and the throttled bios will be flushed
immediately.

References:
[1] https://lore.kernel.org/r/20220318130144.1066064-4-ming.lei@redhat.com
[2] https://lore.kernel.org/all/da861d63-58c6-3ca0-2535-9089993e9e28@huaweicloud.com/

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240817071108.1919729-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:43 -06:00
Muchun Song
96a9fe64bf block: fix ordering between checking BLK_MQ_S_STOPPED request adding
Supposing first scenario with a virtio_blk driver.

CPU0                        CPU1

blk_mq_try_issue_directly()
  __blk_mq_issue_directly()
    q->mq_ops->queue_rq()
      virtio_queue_rq()
        blk_mq_stop_hw_queue()
                            virtblk_done()
  blk_mq_request_bypass_insert()  1) store
                              blk_mq_start_stopped_hw_queue()
                                clear_bit(BLK_MQ_S_STOPPED)       3) store
                                blk_mq_run_hw_queue()
                                  if (!blk_mq_hctx_has_pending()) 4) load
                                    return
                                  blk_mq_sched_dispatch_requests()
  blk_mq_run_hw_queue()
    if (!blk_mq_hctx_has_pending())
      return
    blk_mq_sched_dispatch_requests()
      if (blk_mq_hctx_stopped())  2) load
        return
      __blk_mq_sched_dispatch_requests()

Supposing another scenario.

CPU0                        CPU1

blk_mq_requeue_work()
  blk_mq_insert_request() 1) store
                            virtblk_done()
                              blk_mq_start_stopped_hw_queue()
  blk_mq_run_hw_queues()        clear_bit(BLK_MQ_S_STOPPED)       3) store
                                blk_mq_run_hw_queue()
                                  if (!blk_mq_hctx_has_pending()) 4) load
                                    return
                                  blk_mq_sched_dispatch_requests()
    if (blk_mq_hctx_stopped())  2) load
      continue
    blk_mq_run_hw_queue()

Both scenarios are similar, the full memory barrier should be inserted
between 1) and 2), as well as between 3) and 4) to make sure that either
CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1 sees dispatch list.
Otherwise, either CPU will not rerun the hardware queue causing
starvation of the request.

The easy way to fix it is to add the essential full memory barrier into
helper of blk_mq_hctx_stopped(). In order to not affect the fast path
(hardware queue is not stopped most of the time), we only insert the
barrier into the slow path. Actually, only slow path needs to care about
missing of dispatching the request to the low-level device driver.

Fixes: 320ae51fee ("blk-mq: new multi-queue block IO queueing mechanism")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241014092934.53630-4-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:40 -06:00
Muchun Song
6bda857bcb block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding
Supposing the following scenario.

CPU0                        CPU1

blk_mq_insert_request()     1) store
                            blk_mq_unquiesce_queue()
                            blk_queue_flag_clear()                3) store
                              blk_mq_run_hw_queues()
                                blk_mq_run_hw_queue()
                                  if (!blk_mq_hctx_has_pending()) 4) load
                                    return
blk_mq_run_hw_queue()
  if (blk_queue_quiesced()) 2) load
    return
  blk_mq_sched_dispatch_requests()

The full memory barrier should be inserted between 1) and 2), as well as
between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED
is cleared or CPU1 sees dispatch list or setting of bitmap of software
queue. Otherwise, either CPU will not rerun the hardware queue causing
starvation.

So the first solution is to 1) add a pair of memory barrier to fix the
problem, another solution is to 2) use hctx->queue->queue_lock to
synchronize QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since
memory barrier is not easy to be maintained.

Fixes: f4560ffe8c ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241014092934.53630-3-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:40 -06:00
Muchun Song
2003ee8a9a block: fix missing dispatching request when queue is started or unquiesced
Supposing the following scenario with a virtio_blk driver.

CPU0                    CPU1                    CPU2

blk_mq_try_issue_directly()
  __blk_mq_issue_directly()
    q->mq_ops->queue_rq()
      virtio_queue_rq()
        blk_mq_stop_hw_queue()
                                                virtblk_done()
                        blk_mq_try_issue_directly()
                          if (blk_mq_hctx_stopped())
  blk_mq_request_bypass_insert()                  blk_mq_run_hw_queue()
  blk_mq_run_hw_queue()     blk_mq_run_hw_queue()
                            blk_mq_insert_request()
                            return

After CPU0 has marked the queue as stopped, CPU1 will see the queue is
stopped. But before CPU1 puts the request on the dispatch list, CPU2
receives the interrupt of completion of request, so it will run the
hardware queue and marks the queue as non-stopped. Meanwhile, CPU1 also
runs the same hardware queue. After both CPU1 and CPU2 complete
blk_mq_run_hw_queue(), CPU1 just puts the request to the same hardware
queue and returns. It misses dispatching a request. Fix it by running
the hardware queue explicitly. And blk_mq_request_issue_directly()
should handle a similar situation. Fix it as well.

Fixes: d964f04a8f ("blk-mq: fix direct issue")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241014092934.53630-2-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:40 -06:00
Xiuhong Wang
732312e183 Revert "blk-throttle: Fix IO hang for a corner case"
This reverts commit 5b7048b897.

The main purpose of this patch is cleanup.
The throtl_adjusted_limit function was removed after
commit bf20ab538c ("blk-throttle: remove
CONFIG_BLK_DEV_THROTTLING_LOW"), so the problem of not being
able to scale after setting bps or iops to 1 will not occur.
So revert this commit that bps/iops can be set to 1.

Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiuhong Wang <xiuhong.wang@unisoc.com>
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20241016024508.3340330-1-xiuhong.wang@unisoc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:40 -06:00
Julia Lawall
28878733ca block: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
Since SLOB was removed and since
commit 6c6c47b063 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()"),
it is not necessary to use call_rcu when the callback only performs
kmem_cache_free. Use kfree_rcu() directly.

The changes were made using Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20241013201704.49576-10-Julia.Lawall@inria.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:40 -06:00
Greg Joyce
b21d948f4c block: sed-opal: add ioctl IOC_OPAL_SET_SID_PW
After a SED drive is provisioned, there is no way to change the SID
password via the ioctl() interface. A new ioctl IOC_OPAL_SET_SID_PW
will allow the password to be changed. The valid current password is
required.

Signed-off-by: Greg Joyce <gjoyce@linux.ibm.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Link: https://lore.kernel.org/r/20240829175639.6478-2-gjoyce@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:40 -06:00
Keith Busch
110234da18 block: enable passthrough command statistics
Applications using the passthrough interfaces for IO want to continue
seeing the disk stats. These requests had been fenced off from this
block layer feature. While the block layer doesn't necessarily know what
a passthrough command does, we do know the data size and direction,
which is enough to account for the command's stats.

Since tracking these has the potential to produce unexpected results,
the passthrough stats are locked behind a new queue flag that needs to
be enabled with the /sys/block/<dev>/queue/iostats_passthrough
attribute.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241007153236.2818562-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:32 -06:00
Christoph Hellwig
d51c9cdfc2 block: return void from the queue_sysfs_entry load_module method
Requesting a module either succeeds or does nothing, return an error from
this method does not make sense.

Also move the load_module after the store method in the struct
declaration to keep the important show and store methods together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20241008050841.104602-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:22 -06:00
Konstantin Khlebnikov
758737d86f block: add partition uuid into uevent as "PARTUUID"
Both most common formats have uuid in addition to partition name:
GPT: standard uuid xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
DOS: 4 byte disk signature and 1 byte partition xxxxxxxx-xx

Tools from util-linux use the same notation for them.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
[dianders: rebased to modern kernels]
Signed-off-by: Douglas Anderson <dianders@google.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241004171340.v2.1.I938c91d10e454e841fdf5d64499a8ae8514dc004@changeid
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:15:17 -06:00
Jens Axboe
746fc7e9d4 block: move issue side time stamping to blk_account_io_start()
It's known needed at that point, and it's cleaner to just assign it
there rather than rely on it being reliably set before hitting the
IO accounting. Hence, move it out of blk_mq_rq_time_init(), which is
now only doing the allocation side timing.

While at it, get rid of the '0' time passing to blk_mq_rq_time_init(),
just pass in blk_time_get_ns() for the two cases where 0 is being
explicitly passed in. The rest pass in the previously cached allocation
time.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:15:17 -06:00
Jens Axboe
148e6968f6 block: set issue time stamp based on queue state
A previous commit moved RQF_IO_STAT into blk_account_io_done(), where
it's being set rather than at allocation time. Unfortunately we do check
for that flag in blk_mq_rq_time_init(), and hence setting the
start_time_ns wasn't being done. This lead to unwieldy inflight IO counts
and times, as IO completion accounting would a 0 value rather than the
issue time for it's subtraction math.

Fix this by switching the blk_mq_rq_time_init() check to use the queue
state rather than the request state.

Fixes: b8f762400ae8 ("block: move iostat check into blk_acount_io_start()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202410062110.512391df-oliver.sang@intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:15:17 -06:00
Christian Marangi
2e3a191e89 block: add support for partition table defined in OF
Add support for partition table defined in Device Tree. Similar to how
it's done with MTD, add support for defining a fixed partition table in
device tree.

A common scenario for this is fixed block (eMMC) embedded devices that
have no MBR or GPT partition table to save storage space. Bootloader
access the block device with absolute address of data.

This is to complete the functionality with an equivalent implementation
with providing partition table with bootargs, for case where the booargs
can't be modified and tweaking the Device Tree is the only solution to
have an usabe partition table.

The implementation follow the fixed-partitions parser used on MTD
devices where a "partitions" node is expected to be declared with
"fixed-partitions" compatible in the OF node of the disk device
(mmc-card for eMMC for example) and each child node declare a label
and a reg with offset and size. If label is not declared, the node name
is used as fallback. Eventually is also possible to declare the read-only
property to flag the partition as read-only.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241002221306.4403-6-ansuelsmth@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:14:56 -06:00
Christian Marangi
9dfd9ea93a block: introduce add_disk_fwnode()
Introduce add_disk_fwnode() as a replacement of device_add_disk() that
permits to pass and attach a fwnode to disk dev.

This variant can be useful for eMMC that might have the partition table
for the disk defined in DT. A parser can later make use of the attached
fwnode to parse the related table and init the hardcoded partition for
the disk.

device_add_disk() is converted to a simple wrapper of add_disk_fwnode()
with the fwnode entry set as NULL.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241002221306.4403-4-ansuelsmth@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:14:56 -06:00
Christian Marangi
ba40f4c590 block: add support for defining read-only partitions
Add support for defining read-only partitions and complete support for
it in the cmdline partition parser as the additional "ro" after a
partition is scanned but never actually applied.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241002221306.4403-2-ansuelsmth@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:14:56 -06:00
Jens Axboe
e3569ecae4 block: kill blk_do_io_stat() helper
It's now just checking whether or not RQF_IO_STAT is set, so let's get
rid of it and just open-code the specific flag that is being checked.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:14:56 -06:00
Jens Axboe
fd0a63bcda block: remove 'req->part' check for stats accounting
If RQF_IO_STAT is set, then accounting is enabled. There's no need to
further gate this on req->part being set or not, RQF_IO_STAT should
never be set if accounting is not being done for this request.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:14:56 -06:00
Jens Axboe
8933805623 block: move iostat check into blk_acount_io_start()
Rather than have blk_do_io_stat() check for both RQF_IO_STAT and whether
the request is a passthrough requests every time, move both of those
checks into blk_account_io_start(). Then blk_do_io_stat() can be reduced
to just checking for RQF_IO_STAT.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:14:56 -06:00
John Garry
c3be7ebbbc fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()
Currently FMODE_CAN_ATOMIC_WRITE is set if the bdev can atomic write and
the file is open for direct IO. This does not work if the file is not
opened for direct IO, yet fcntl(O_DIRECT) is used on the fd later.

Change to check for direct IO on a per-IO basis in
generic_atomic_write_valid(). Since we want to report -EOPNOTSUPP for
non-direct IO for an atomic write, change to return an error code.

Relocate the block fops atomic write checks to the common write path, as to
catch non-direct IO.

Fixes: c34fc6f26a ("fs: Initial atomic write support")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241019125113.369994-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-19 16:48:22 -06:00
John Garry
9a8dbdadae block/fs: Pass an iocb to generic_atomic_write_valid()
Darrick and Hannes both thought it better that generic_atomic_write_valid()
should be passed a struct iocb, and not just the member of that struct
which is referenced; see [0] and [1].

I think that makes a more generic and clean API, so make that change.

[0] https://lore.kernel.org/linux-block/680ce641-729b-4150-b875-531a98657682@suse.de/
[1] https://lore.kernel.org/linux-xfs/20240620212401.GA3058325@frogsfrogsfrogs/

Fixes: c34fc6f26a ("fs: Initial atomic write support")
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241019125113.369994-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-19 16:48:21 -06:00
Linus Torvalds
f8eacd8ad7 block-6.12-20241018
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmcSk4AQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuuXD/0UERdP+djJNoXBW5Mv7U5a4rJ7ZgfPL7ku
 z3ZfdnNYGitZhYkVjNQ60TLzXRQyUaIIxMVBWzkb59I6ixmuQbzm/lC55B6s/FIR
 bfT3afe1WRgLCaFbStu91qRs/44Mq4yK6wXcIU7LwutRT/5cqZwelqRZLK7DMFln
 zlGX4zNrCMRUDTr6PLa6CvyY4dmQSL17Ib1ypcKXjGs5YjDntzSrIsKVT1Wayans
 WroGGPG6W7r2c2kn8pe4uPIjZVfMUF2vrdIs0KEYaAQOC7ppEucCgDZMEWRs7kdH
 63hheudJjVSwLF/qYnXNHe/Bz12QCZohPp6UsqRpC8o96Ralgo6Q+FxkXsVelMXW
 JKhtDqYGBDHOQrjrEWN1rnYw/DauEQAgvOtdVfEx2IBzPsG07cB8yv8MNA90H9QH
 KStI7h9qnBEMMNcXX8prOymCHNWAeuF4mbitVrRfSfEVm/0BbQ19qoyGrvwNFgEf
 6T+4Xj/P+FsiLVe8vsgBZDaxEEU5Ifd/rki/QFVk/2z72BBZxmdf2nm51SOM28V7
 HGMHwJI3H8rdmPXvt5Q/ve6GWNOYLO5PSAJgSSe96UStvtsAHGB4eM+LykdnE7cI
 SoytU5KfAM8DD6wnyHIgYuvJyZWrmLoVDrRjym8emc2KrJOe7qg+Ah4ERcNTCnhl
 nw50f27G4w==
 =waNY
 -----END PGP SIGNATURE-----

Merge tag 'block-6.12-20241018' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
     - Fix target passthrough identifier (Nilay)
     - Fix tcp locking (Hannes)
     - Replace list with sbitmap for tracking RDMA rsp tags (Guixen)
     - Remove unnecessary fallthrough statements (Tokunori)
     - Remove ready-without-media support (Greg)
     - Fix multipath partition scan deadlock (Keith)
     - Fix concurrent PCI reset and remove queue mapping (Maurizio)
     - Fabrics shutdown fixes (Nilay)

 - Fix for a kerneldoc warning (Keith)

 - Fix a race with blk-rq-qos and wakeups (Omar)

 - Cleanup of checking for always-set tag_set (SurajSonawane2415)

 - Fix for a crash with CPU hotplug notifiers (Ming)

 - Don't allow zero-copy ublk on unprivileged device (Ming)

 - Use array_index_nospec() for CDROM (Josh)

 - Remove dead code in drbd (David)

 - Tweaks to elevator loading (Breno)

* tag 'block-6.12-20241018' of git://git.kernel.dk/linux:
  cdrom: Avoid barrier_nospec() in cdrom_ioctl_media_changed()
  nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function
  nvme: make keep-alive synchronous operation
  nvme-loop: flush off pending I/O while shutting down loop controller
  nvme-pci: fix race condition between reset and nvme_dev_disable()
  ublk: don't allow user copy for unprivileged device
  blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race
  nvme-multipath: defer partition scanning
  blk-mq: setup queue ->tag_set before initializing hctx
  elevator: Remove argument from elevator_find_get
  elevator: do not request_module if elevator exists
  drbd: Remove unused conn_lowest_minor
  nvme: disable CC.CRIME (NVME_CC_CRIME)
  nvme: delete unnecessary fallthru comment
  nvmet-rdma: use sbitmap to replace rsp free list
  block: Fix elevator_get_default() checking for NULL q->tag_set
  nvme: tcp: avoid race between queue_lock lock and destroy
  nvmet-passthru: clear EUID/NGUID/UUID while using loop target
  block: fix blk_rq_map_integrity_sg kernel-doc
2024-10-18 15:53:00 -07:00
Omar Sandoval
e972b08b91 blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race
We're seeing crashes from rq_qos_wake_function that look like this:

  BUG: unable to handle page fault for address: ffffafe180a40084
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 100000067 P4D 100000067 PUD 10027c067 PMD 10115d067 PTE 0
  Oops: Oops: 0002 [#1] PREEMPT SMP PTI
  CPU: 17 UID: 0 PID: 0 Comm: swapper/17 Not tainted 6.12.0-rc3-00013-geca631b8fe80 #11
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
  RIP: 0010:_raw_spin_lock_irqsave+0x1d/0x40
  Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 54 9c 41 5c fa 65 ff 05 62 97 30 4c 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 0a 4c 89 e0 41 5c c3 cc cc cc cc 89 c6 e8 2c 0b 00
  RSP: 0018:ffffafe180580ca0 EFLAGS: 00010046
  RAX: 0000000000000000 RBX: ffffafe180a3f7a8 RCX: 0000000000000011
  RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffafe180a40084
  RBP: 0000000000000000 R08: 00000000001e7240 R09: 0000000000000011
  R10: 0000000000000028 R11: 0000000000000888 R12: 0000000000000002
  R13: ffffafe180a40084 R14: 0000000000000000 R15: 0000000000000003
  FS:  0000000000000000(0000) GS:ffff9aaf1f280000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffafe180a40084 CR3: 000000010e428002 CR4: 0000000000770ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  PKRU: 55555554
  Call Trace:
   <IRQ>
   try_to_wake_up+0x5a/0x6a0
   rq_qos_wake_function+0x71/0x80
   __wake_up_common+0x75/0xa0
   __wake_up+0x36/0x60
   scale_up.part.0+0x50/0x110
   wb_timer_fn+0x227/0x450
   ...

So rq_qos_wake_function() calls wake_up_process(data->task), which calls
try_to_wake_up(), which faults in raw_spin_lock_irqsave(&p->pi_lock).

p comes from data->task, and data comes from the waitqueue entry, which
is stored on the waiter's stack in rq_qos_wait(). Analyzing the core
dump with drgn, I found that the waiter had already woken up and moved
on to a completely unrelated code path, clobbering what was previously
data->task. Meanwhile, the waker was passing the clobbered garbage in
data->task to wake_up_process(), leading to the crash.

What's happening is that in between rq_qos_wake_function() deleting the
waitqueue entry and calling wake_up_process(), rq_qos_wait() is finding
that it already got a token and returning. The race looks like this:

rq_qos_wait()                           rq_qos_wake_function()
==============================================================
prepare_to_wait_exclusive()
                                        data->got_token = true;
                                        list_del_init(&curr->entry);
if (data.got_token)
        break;
finish_wait(&rqw->wait, &data.wq);
  ^- returns immediately because
     list_empty_careful(&wq_entry->entry)
     is true
... return, go do something else ...
                                        wake_up_process(data->task)
                                          (NO LONGER VALID!)-^

Normally, finish_wait() is supposed to synchronize against the waker.
But, as noted above, it is returning immediately because the waitqueue
entry has already been removed from the waitqueue.

The bug is that rq_qos_wake_function() is accessing the waitqueue entry
AFTER deleting it. Note that autoremove_wake_function() wakes the waiter
and THEN deletes the waitqueue entry, which is the proper order.

Fix it by swapping the order. We also need to use
list_del_init_careful() to match the list_empty_careful() in
finish_wait().

Fixes: 38cfb5a45e ("blk-wbt: improve waking of tasks")
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/d3bee2463a67b1ee597211823bf7ad3721c26e41.1729014591.git.osandov@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-16 07:20:14 -06:00
Ming Lei
c25c0c9035 blk-mq: setup queue ->tag_set before initializing hctx
Commit 7b815817aa ("blk-mq: add helper for checking if one CPU is mapped to specified hctx")
needs to check queue mapping via tag set in hctx's cpuhp handler.

However, q->tag_set may not be setup yet when the cpuhp handler is
enabled, then kernel oops is triggered.

Fix the issue by setup queue tag_set before initializing hctx.

Cc: stable@vger.kernel.org
Reported-and-tested-by: Rick Koch <mr.rickkoch@gmail.com>
Closes: https://lore.kernel.org/linux-block/CANa58eeNDozLaBHKPLxSAhEy__FPfJT_F71W=sEQw49UCrC9PQ@mail.gmail.com
Fixes: 7b815817aa ("blk-mq: add helper for checking if one CPU is mapped to specified hctx")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241014005115.2699642-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-14 08:17:07 -06:00
Breno Leitao
ee7ff15bf5 elevator: Remove argument from elevator_find_get
Commit e4eb37cc0f ("block: Remove elevator required features")
removed the usage of `struct request_queue` from elevator_find_get(),
but didn't removed the argument.

Remove the "struct request_queue *q" argument from elevator_find_get()
given it is useless.

Fixes: e4eb37cc0f ("block: Remove elevator required features")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20241011155615.3361143-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-11 11:11:09 -06:00
Breno Leitao
b4ff6e93bf elevator: do not request_module if elevator exists
Whenever an I/O elevator is changed, the system attempts to load a
module for the new elevator. This occurs regardless of whether the
elevator is already loaded or built directly into the kernel. This
behavior introduces unnecessary overhead and potential issues.

This makes the operation slower, and more error-prone. For instance,
making the problem fixed by [1] visible for users that doesn't even rely
on modules being available through modules.

Do not try to load the ioscheduler if it is already visible.

This change brings two main benefits: it improves the performance of
elevator changes, and it reduces the likelihood of errors occurring
during this process.

[1] Commit e3accac1a9 ("block: Fix elv_iosched_local_module handling of "none" scheduler")

Fixes: 734e1a8603 ("block: Prevent deadlocks when switching elevators")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20241011170122.3880087-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-11 11:10:52 -06:00
SurajSonawane2415
b402328a24 block: Fix elevator_get_default() checking for NULL q->tag_set
elevator_get_default() and elv_support_iosched() both check for whether
or not q->tag_set is non-NULL, however it's not possible for them to be
NULL. This messes up some static checkers, as the checking of tag_set
isn't consistent.

Remove the checks, which both simplifies the logic and avoids checker
errors.

Signed-off-by: SurajSonawane2415 <surajsonawane0215@gmail.com>
Link: https://lore.kernel.org/r/20241007111416.13814-1-surajsonawane0215@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-07 14:24:36 -06:00
Linus Torvalds
360c1f1f24 block-6.12-20241004
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmcAA+4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprKuD/99KWHNFwDyhwPVRlK8ZiCo556DWJ9sT6aJ
 ICt0MAehx2KUgjQTcOoAE0gPAd2ASJA+hisvY5yiMtXNvOgyYXcUfH+UqYoFWSO8
 O7vm7/w/YK3jvuwRwTDSH2s3+L/j3oROHZQSZVAkoi9KcbAaPt5FbFbpqWmAuBgJ
 ymsGWybW27GSOCDh55Jnm4IDaHMUfpcsvEg8Ryl6DyD2C0ZkMy64DmGiopUKCyML
 ceeiVUboNzWYg9qeZswKm1OVSSzZWocy4zDbJYjueVPmr7juvQ4ILxHDqKhTOZpO
 JHBnI1c3i6pOyZqtyaU3Eb/I0b3ZebwFhY3p1+CxkW1FIatrML86/b0ZIxowEiJO
 a8tRHkk2fXaTyuaxTPlkaE9P5cIK/BpEvxn/v8mddVPTwjAAAULINfZAYoGhwWyF
 cSQuN+BOJnxU6DUp66OVKlxhLiZAttBy+kyCPaLUVLyJX7w/VaQz2PwQjzHybjy7
 zNz7M4x4/06bFC3ykmLgIRM+Hqo+El0SE6wx/GHi70GuC+cqkHnfUL4Yscj1t6SD
 Uadul3PYaxVBZ7lcdHy09iD8JxdF1wYxyqiGULi7DPyUqBJ2r4e/TH2x2MxZ3sKA
 CEWBVHZ8yNdfckhybLArbdkQYd+GaiFna/ySUvMjrAdF2Hes42uo9eF4L5tAFtxq
 pKg3IPp0oA==
 =ujN7
 -----END PGP SIGNATURE-----

Merge tag 'block-6.12-20241004' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - Fix another use-after-free in aoe

 - Fixup wrong nested non-saving irq disable/restore in blk-iocost

 - Fixup a kerneldoc complaint introduced by a merge window patch

* tag 'block-6.12-20241004' of git://git.kernel.dk/linux:
  aoe: fix the potential use-after-free problem in more places
  blk_iocost: remove some duplicate irq disable/enables
  block: fix blk_rq_map_integrity_sg kernel-doc
2024-10-04 10:43:44 -07:00
Al Viro
5f60d5f6bb move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02 17:23:23 -04:00
Dan Carpenter
14d57ec3b8 blk_iocost: remove some duplicate irq disable/enables
These are called from blkcg_print_blkgs() which already disables IRQs so
disabling it again is wrong.  It means that IRQs will be enabled slightly
earlier than intended, however, so far as I can see, this bug is harmless.

Fixes: 35198e3230 ("blk-iocost: read params inside lock in sysfs apis")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/Zv0kudA9xyGdaA4g@stanley.mountain
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-02 07:15:43 -06:00
Keith Busch
0ab4284300 block: fix blk_rq_map_integrity_sg kernel-doc
Fix the documentation to match the new function signature.

Fixes: 76c313f658 ("blk-integrity: improved sg segment mapping")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240922141800.3622319-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-02 07:15:33 -06:00
Linus Torvalds
11a299a793 for-6.12/block-20240925
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmb0T5AQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnfHEADCXqmqZC+xr3sHZH9T1lz9KaFp1FjuBhCw
 bGpUgXQ9aLcqQUWJxmYVer8N2x2+Ds+xq4fm/rP1BfvNgRupqheHBwuLxSrz14EX
 lYmKZ+krMIPTDaLFewmEWflDwmZX0WFgV6nKTMLiO5BMeI4zXCkFGtwYFys2+Cdd
 9zYCFPgGDZUR77Ws5PpyqPVz2MoiNtsjrGmHpEmNZ+rIDzlpVOYgYk27X9ZbvNxC
 /l0KTc9+ayAeG0Kx5jO+m6Hrj3I6ehvM9JZMgpS/tF/jtccD2oVkJFJDlU+Jciv6
 BwVzgyDPGV7sXFT1fnSqDBYYwr/73nzNH0Gk8wn4Jg2LhjmVANVo9eQSOXDTYZI+
 O4HfIHGTIrk75TQd4bhq3dqaylS78pKBI/eQJUli2UNoyLWMrMyE88yh2YJam2Fs
 vJ/MHGxvFRurYbAlqLr33nb3ajvpg+D7XuAYfqHPMc2ZUe28Kza50Dj+luNjfVCu
 3qfR6qBlsdWuABtUS3vneB9jZp5jDnOpVfuBgtcAqIboUjehTXsI7If09Ex/mxLq
 O0KqNwBMfunPOKd5kGXlAgY8LRMfOhNaAAFBlXYUZB2eAadQnqVselTFvHMZkXo7
 wH/l6trd+/Tf+7Rav0YduNIlpVr7IctC+A7ph4zPdIjQxFEySCrC7cvAjel29LyV
 zgWW0Mw/sA==
 =yiWu
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - Improve blk-integrity segment counting and merging (Keith)

 - NVMe pull request via Keith:
      - Multipath fixes (Hannes)
      - Sysfs attribute list NULL terminate fix (Shin'ichiro)
      - Remove problematic read-back (Keith)

 - Fix for a regression with the IO scheduler switching freezing from
   6.11 (Damien)

 - Use a raw spinlock for sbitmap, as it may get called from preempt
   disabled context (Ming)

 - Cleanup for bd_claiming waiting, using var_waitqueue() rather than
   the bit waitqueues, as that more accurately describes that it does
   (Neil)

 - Various cleanups (Kanchan, Qiu-ji, David)

* tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux:
  nvme: remove CC register read-back during enabling
  nvme: null terminate nvme_tls_attrs
  nvme-multipath: avoid hang on inaccessible namespaces
  nvme-multipath: system fails to create generic nvme device
  lib/sbitmap: define swap_lock as raw_spinlock_t
  block: Remove unused blk_limits_io_{min,opt}
  drbd: Fix atomicity violation in drbd_uuid_set_bm()
  block: Fix elv_iosched_local_module handling of "none" scheduler
  block: remove bogus union
  block: change wait on bd_claiming to use a var_waitqueue
  blk-integrity: improved sg segment mapping
  block: unexport blk_rq_count_integrity_sg
  nvme-rdma: use request to get integrity segments
  scsi: use request to get integrity segments
  block: provide a request helper for user integrity segments
  blk-integrity: consider entire bio list for merging
  blk-integrity: properly account for segments
  blk-mq: set the nr_integrity_segments from bio
  blk-mq: unconditional nr_integrity_segments
2024-09-25 14:56:40 -07:00
Linus Torvalds
171754c380 vfs-6.12.blocksize
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEvwAKCRCRxhvAZXjc
 ohg3APwJWQnqFlBddcRl4yrPJ/cgcYSYAOdHb+E+blomSwdxcwEAmwsnLPNQOtw2
 rxKvQfZqhVT437bl7RpPPZrHGxwTng8=
 =6v1r
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs blocksize updates from Christian Brauner:
 "This contains the vfs infrastructure as well as the xfs bits to enable
  support for block sizes (bs) larger than page sizes (ps) plus a few
  fixes to related infrastructure.

  There has been efforts over the last 16 years to enable enable Large
  Block Sizes (LBS), that is block sizes in filesystems where bs > page
  size. Through these efforts we have learned that one of the main
  blockers to supporting bs > ps in filesystems has been a way to
  allocate pages that are at least the filesystem block size on the page
  cache where bs > ps.

  Thanks to various previous efforts it is possible to support bs > ps
  in XFS with only a few changes in XFS itself. Most changes are to the
  page cache to support minimum order folio support for the target block
  size on the filesystem.

  A motivation for Large Block Sizes today is to support high-capacity
  (large amount of Terabytes) QLC SSDs where the internal Indirection
  Unit (IU) are typically greater than 4k to help reduce DRAM and so in
  turn cost and space. In practice this then allows different
  architectures to use a base page size of 4k while still enabling
  support for block sizes aligned to the larger IUs by relying on high
  order folios on the page cache when needed.

  It also allows to take advantage of the drive's support for atomics
  larger than 4k with buffered IO support in Linux. As described this
  year at LSFMM, supporting large atomics greater than 4k enables
  databases to remove the need to rely on their own journaling, so they
  can disable double buffered writes, which is a feature different cloud
  providers are already enabling through custom storage solutions"

* tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
  Documentation: iomap: fix a typo
  iomap: remove the iomap_file_buffered_write_punch_delalloc return value
  iomap: pass the iomap to the punch callback
  iomap: pass flags to iomap_file_buffered_write_punch_delalloc
  iomap: improve shared block detection in iomap_unshare_iter
  iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release
  docs:filesystems: fix spelling and grammar mistakes in iomap design page
  filemap: fix htmldoc warning for mapping_align_index()
  iomap: make zero range flush conditional on unwritten mappings
  iomap: fix handling of dirty folios over unwritten extents
  iomap: add a private argument for iomap_file_buffered_write
  iomap: remove set_memor_ro() on zero page
  xfs: enable block size larger than page size support
  xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  xfs: expose block size in stat
  xfs: use kvmalloc for xattr buffers
  iomap: fix iomap_dio_zero() for fs bs > system page size
  filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  mm: split a folio in minimum folio order chunks
  readahead: allocate folios with mapping_min_order in readahead
  ...
2024-09-20 17:53:17 -07:00
Dr. David Alan Gilbert
9ba5dcc722 block: Remove unused blk_limits_io_{min,opt}
blk_limits_io_min and blk_limits_io_opt are unused since the
recent commit
  0a94a469a4 ("dm: stop using blk_limits_io_{min,opt}")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://lore.kernel.org/r/20240920004817.676216-1-linux@treblig.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-20 00:19:48 -06:00
Linus Torvalds
a1d1eb2f57 SCSI misc on 20240919
Updates to the usual drivers (ufs, smartpqi, NCR5380, mac_scsi, lpfc,
 mpi3mr).  There are no user visible core changes and a whole series of
 minor updates and fixes.  The largest core change is probably the
 simplification of the workqueue allocation path.
 
 Signed-off-by: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCZuvd5yYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishV7dAQC+TSlv
 BeNm8W4yAFCXLCwnJh8rT6ZzuBsjsIHH1DPP3wD+IXuIOFf5gVRJGpCNJc/dI082
 /ehSrIdeJxwaNoOOt+Y=
 =SXZD
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "Updates to the usual drivers (ufs, smartpqi, NCR5380, mac_scsi, lpfc,
  mpi3mr).

  There are no user visible core changes and a whole series of minor
  updates and fixes. The largest core change is probably the
  simplification of the workqueue allocation path"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (86 commits)
  scsi: smartpqi: update driver version to 2.1.30-031
  scsi: smartpqi: fix volume size updates
  scsi: smartpqi: fix rare system hang during LUN reset
  scsi: smartpqi: add new controller PCI IDs
  scsi: smartpqi: add counter for parity write stream requests
  scsi: smartpqi: correct stream detection
  scsi: smartpqi: Add fw log to kdump
  scsi: bnx2fc: Remove some unused fields in struct bnx2fc_rport
  scsi: qla2xxx: Remove the unused 'del_list_entry' field in struct fc_port
  scsi: ufs: core: Remove ufshcd_urgent_bkops()
  scsi: core: Remove obsoleted declaration for scsi_driverbyte_string()
  scsi: bnx2i: Remove unused declarations
  scsi: core: Simplify an alloc_workqueue() invocation
  scsi: ufs: Simplify alloc*_workqueue() invocation
  scsi: stex: Simplify an alloc_ordered_workqueue() invocation
  scsi: scsi_transport_fc: Simplify alloc_workqueue() invocations
  scsi: snic: Simplify alloc_workqueue() invocations
  scsi: qedi: Simplify an alloc_workqueue() invocation
  scsi: qedf: Simplify alloc_workqueue() invocations
  scsi: myrs: Simplify an alloc_ordered_workqueue() invocation
  ...
2024-09-19 11:28:51 +02:00
Damien Le Moal
e3accac1a9 block: Fix elv_iosched_local_module handling of "none" scheduler
Commit 734e1a8603 ("block: Prevent deadlocks when switching
elevators") introduced the function elv_iosched_load_module() to allow
loading an elevator module outside of elv_iosched_store() with the
target device queue not frozen, to avoid deadlocks. However, the "none"
scheduler does not have a module and as a result,
elv_iosched_load_module() always returns an error when trying to switch
to this valid scheduler.

Fix this by ignoring the return value of the request_module() call
done by elv_iosched_load_module(). This restores the behavior before
commit 734e1a8603, which was to ignore the request_module() result and
instead rely on elevator_change() to handle the "none" scheduler case.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 734e1a8603 ("block: Prevent deadlocks when switching elevators")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240917133231.134806-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-17 08:34:00 -06:00
Jens Axboe
42b16d3ac3 Linux 6.11
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmbm9fQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGXcwH/A8+IXnrGv+VzYgD
 +mE4hgGGHt4dClcUZ31gQetkkT6xktEVp6pB6JkFO7oEgBiTkJBbYGl6VZtsAIOd
 Fi3jic8ik0uhZLFcxDJcHTceh6Pw8bkhWoh0tkF3bkDRwbppJdG7Khyk8DxTl24w
 ldqh9om2cC7w9IPVx93xTgKgMMZ63qiJyUdTvxEZI3BG8F70smlgZSPskLp2Iktd
 FIJZPcyKM0bhJYwZOpXK0vx5C2cA4oIW4xriHUw4aklv646OBxNKevB2JJAft2uA
 6LyvuLgnYn/OpdFGZ8slvdmhm6hLWft5B1/bWKorUkz7p5YGiySFzpkMVAkNJ6mS
 cRwHJNc=
 =flw3
 -----END PGP SIGNATURE-----

Merge tag 'v6.11' into for-6.12/block

Merge in 6.11 final to get the fix for preventing deadlocks on an
elevator switch, as there's a fixup for that patch.

* tag 'v6.11': (1788 commits)
  Linux 6.11
  Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
  pinctrl: pinctrl-cy8c95x0: Fix regcache
  cifs: Fix signature miscalculation
  mm: avoid leaving partial pfn mappings around in error case
  drm/xe/client: add missing bo locking in show_meminfo()
  drm/xe/client: fix deadlock in show_meminfo()
  drm/xe/oa: Enable Xe2+ PES disaggregation
  drm/xe/display: fix compat IS_DISPLAY_STEP() range end
  drm/xe: Fix access_ok check in user_fence_create
  drm/xe: Fix possible UAF in guc_exec_queue_process_msg
  drm/xe: Remove fence check from send_tlb_invalidation
  drm/xe/gt: Remove double include
  net: netfilter: move nf flowtable bpf initialization in nf_flow_table_module_init()
  PCI: Fix potential deadlock in pcim_intx()
  workqueue: Clear worker->pool in the worker thread context
  net: tighten bad gso csum offset check in virtio_net_hdr
  netlink: specs: mptcp: fix port endianness
  net: dpaa: Pad packets to ETH_ZLEN
  mptcp: pm: Fix uaf in __timer_delete_sync
  ...
2024-09-17 08:32:53 -06:00
Linus Torvalds
cb69d86550 Updates for the interrupt subsystem:
- Core:
 	- Remove a global lock in the affinity setting code
 
 	  The lock protects a cpumask for intermediate results and the lock
 	  causes a bottleneck on simultaneous start of multiple virtual
 	  machines. Replace the lock and the static cpumask with a per CPU
 	  cpumask which is nicely serialized by raw spinlock held when
 	  executing this code.
 
 	- Provide support for giving a suffix to interrupt domain names.
 
 	  That's required to support devices with subfunctions so that the
 	  domain names are distinct even if they originate from the same
 	  device node.
 
 	- The usual set of cleanups and enhancements all over the place
 
   - Drivers:
 
 	- Support for longarch AVEC interrupt chip
 
 	- Refurbishment of the Armada driver so it can be extended for new
           variants.
 
 	- The usual set of cleanups and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn5p8THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRFtD/43eB3h5usY2OPW0JmDqrE6qnzsvjPZ
 1H52BcmMcOuI6yCfTnbi/fBB52mwSEGq9Dmt1GXradyq9/CJDIqZ1ajI1rA2jzW2
 YdbeTDpKm1rS2ddzfp2LT2BryrNt+7etrRO7qHn4EKSuOcNuV2f58WPbIIqasvaK
 uPbUDVDPrvXxLNcjoab6SqaKrEoAaHSyKpd0MvDd80wHrtcSC/QouW7JDSUXv699
 RwvLebN1OF6mQ2J8Z3DLeCQpcbAs+UT8UvID7kYUJi1g71J/ZY+xpMLoX/gHiDNr
 isBtsuEAiZeNaFpksc7A6Jgu5ljZf2/aLCqbPLlHaduHFNmo94x9KUbIF2cpEMN+
 rsf5Ff7AVh1otz3cUwLLsm+cFLWRRoZdLuncn7rrgB4Yg0gll7qzyLO6YGvQHr8U
 Ocj1RXtvvWsMk4XzhgCt1AH/42cO6go+bhA4HspeYykNpsIldIUl1MeFbO8sWiDJ
 kybuwiwHp3oaMLjEK4Lpq65u7Ll8Lju2zRde65YUJN2nbNmJFORrOLmeC1qsr6ri
 dpend6n2qD9UD1oAt32ej/uXnG160nm7UKescyxiZNeTm1+ez8GW31hY128ifTY3
 4R3urGS38p3gazXBsfw6eqkeKx0kEoDNoQqrO5gBvb8kowYTvoZtkwMGAN9OADwj
 w6vvU0i+NIyVMA==
 =JlJ2
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "Core:

   - Remove a global lock in the affinity setting code

     The lock protects a cpumask for intermediate results and the lock
     causes a bottleneck on simultaneous start of multiple virtual
     machines. Replace the lock and the static cpumask with a per CPU
     cpumask which is nicely serialized by raw spinlock held when
     executing this code.

   - Provide support for giving a suffix to interrupt domain names.

     That's required to support devices with subfunctions so that the
     domain names are distinct even if they originate from the same
     device node.

   - The usual set of cleanups and enhancements all over the place

  Drivers:

   - Support for longarch AVEC interrupt chip

   - Refurbishment of the Armada driver so it can be extended for new
     variants.

   - The usual set of cleanups and enhancements all over the place"

* tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits)
  genirq: Use cpumask_intersects()
  genirq/cpuhotplug: Use cpumask_intersects()
  irqchip/apple-aic: Only access system registers on SoCs which provide them
  irqchip/apple-aic: Add a new "Global fast IPIs only" feature level
  irqchip/apple-aic: Skip unnecessary enabling of use_fast_ipi
  dt-bindings: apple,aic: Document A7-A11 compatibles
  irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy()
  genirq/msi: Use kmemdup_array() instead of kmemdup()
  genirq/proc: Change the return value for set affinity permission error
  genirq/proc: Use irq_move_pending() in show_irq_affinity()
  genirq/proc: Correctly set file permissions for affinity control files
  genirq: Get rid of global lock in irq_do_set_affinity()
  genirq: Fix typo in struct comment
  irqchip/loongarch-avec: Add AVEC irqchip support
  irqchip/loongson-pch-msi: Prepare get_pch_msi_handle() for AVECINTC
  irqchip/loongson-eiointc: Rename CPUHP_AP_IRQ_LOONGARCH_STARTING
  LoongArch: Architectural preparation for AVEC irqchip
  LoongArch: Move irqchip function prototypes to irq-loongson.h
  irqchip/loongson-pch-msi: Switch to MSI parent domains
  softirq: Remove unused 'action' parameter from action callback
  ...
2024-09-17 07:09:17 +02:00
NeilBrown
aa3d8a3678 block: change wait on bd_claiming to use a var_waitqueue
bd_prepare_to_claim() waits for a var to change, not for a bit to be
cleared. Change from bit_waitqueue() to __var_waitqueue() and
correspondingly use wake_up_var(). This will allow a future patch which
change the "bit" function to expect an "unsigned long *" instead of
"void *".

Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20240826063659.15327-2-neilb@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-16 21:12:33 -06:00
Linus Torvalds
a430d95c5e lsm/stable-6.12 PR 20240911
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmbiGGAUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPU8BAA1+A15pmS34I9pq7c8TmRz3rNEs/a
 zrW1aWJ0X/+axNS7sW3Pwtt1EKuaOhskKU8gNSieRhljC8rgXIVjZzLw6Atgcr5k
 upulGbU9TXyVisYN+PWv9/84ito6/nYsKb7Mg3nUVsdodtIFVnsk1fxYLPHQEBig
 Pl3i26U3VqH93Kz0W5vs/QR2uduPB8ZyscdTgcbrY9Vv1Y7IDZ2g9QsJVKLvbQKL
 qcPK1JkHa+sBPJxDqS9A40zgbLbdPQgWQzsXX3dz822w1Ga7FIHSqxMBA6HwHZ+L
 kV4P58wVfavhwt/cQSKMWI/yiGPMMd0B6yD+m8ojOvGfOfRCWxGMmEMqHNuZ3m7k
 Bfll5ZgZTY8phUUhiNf3nxO3F3MM/5bHdhPOj3RReqbAbS6uWr4/fThPDYY/zIo6
 NCY3HGxx3Ae64uQ01gC2p/czC50jDsMwlbXiZbrgdBhjBm/CVk5ozb80mLVcGrLB
 +6XMzzSbC8IaNAH2fDmUJ2ABdwyNPgsSOTGZVzIanpxu1SU2/yk3SMxkp8fv5s36
 wLeODUVcLgsjVV538Mkm6PGTE4TlXaH9yi6apMyJAGp0vPYx5c3Xxk2y5A5cur5p
 hcrbDiX2QgeqFbwsz36incmPmbef2NU2c8feR8XLtPJuwNIeRcMSje0pnkaFlRmb
 TAUJ1sDQAzZ8Fy0=
 =HIAO
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:

 - Move the LSM framework to static calls

   This transitions the vast majority of the LSM callbacks into static
   calls. Those callbacks which haven't been converted were left as-is
   due to the general ugliness of the changes required to support the
   static call conversion; we can revisit those callbacks at a future
   date.

 - Add the Integrity Policy Enforcement (IPE) LSM

   This adds a new LSM, Integrity Policy Enforcement (IPE). There is
   plenty of documentation about IPE in this patches, so I'll refrain
   from going into too much detail here, but the basic motivation behind
   IPE is to provide a mechanism such that administrators can restrict
   execution to only those binaries which come from integrity protected
   storage, e.g. a dm-verity protected filesystem. You will notice that
   IPE requires additional LSM hooks in the initramfs, dm-verity, and
   fs-verity code, with the associated patches carrying ACK/review tags
   from the associated maintainers. We couldn't find an obvious
   maintainer for the initramfs code, but the IPE patchset has been
   widely posted over several years.

   Both Deven Bowers and Fan Wu have contributed to IPE's development
   over the past several years, with Fan Wu agreeing to serve as the IPE
   maintainer moving forward. Once IPE is accepted into your tree, I'll
   start working with Fan to ensure he has the necessary accounts, keys,
   etc. so that he can start submitting IPE pull requests to you
   directly during the next merge window.

 - Move the lifecycle management of the LSM blobs to the LSM framework

   Management of the LSM blobs (the LSM state buffers attached to
   various kernel structs, typically via a void pointer named "security"
   or similar) has been mixed, some blobs were allocated/managed by
   individual LSMs, others were managed by the LSM framework itself.

   Starting with this pull we move management of all the LSM blobs,
   minus the XFRM blob, into the framework itself, improving consistency
   across LSMs, and reducing the amount of duplicated code across LSMs.
   Due to some additional work required to migrate the XFRM blob, it has
   been left as a todo item for a later date; from a practical
   standpoint this omission should have little impact as only SELinux
   provides a XFRM LSM implementation.

 - Fix problems with the LSM's handling of F_SETOWN

   The LSM hook for the fcntl(F_SETOWN) operation had a couple of
   problems: it was racy with itself, and it was disconnected from the
   associated DAC related logic in such a way that the LSM state could
   be updated in cases where the DAC state would not. We fix both of
   these problems by moving the security_file_set_fowner() hook into the
   same section of code where the DAC attributes are updated. Not only
   does this resolve the DAC/LSM synchronization issue, but as that code
   block is protected by a lock, it also resolve the race condition.

 - Fix potential problems with the security_inode_free() LSM hook

   Due to use of RCU to protect inodes and the placement of the LSM hook
   associated with freeing the inode, there is a bit of a challenge when
   it comes to managing any LSM state associated with an inode. The VFS
   folks are not open to relocating the LSM hook so we have to get
   creative when it comes to releasing an inode's LSM state.
   Traditionally we have used a single LSM callback within the hook that
   is triggered when the inode is "marked for death", but not actually
   released due to RCU.

   Unfortunately, this causes problems for LSMs which want to take an
   action when the inode's associated LSM state is actually released; so
   we add an additional LSM callback, inode_free_security_rcu(), that is
   called when the inode's LSM state is released in the RCU free
   callback.

 - Refactor two LSM hooks to better fit the LSM return value patterns

   The vast majority of the LSM hooks follow the "return 0 on success,
   negative values on failure" pattern, however, there are a small
   handful that have unique return value behaviors which has caused
   confusion in the past and makes it difficult for the BPF verifier to
   properly vet BPF LSM programs. This includes patches to
   convert two of these"special" LSM hooks to the common 0/-ERRNO pattern.

 - Various cleanups and improvements

   A handful of patches to remove redundant code, better leverage the
   IS_ERR_OR_NULL() helper, add missing "static" markings, and do some
   minor style fixups.

* tag 'lsm-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (40 commits)
  security: Update file_set_fowner documentation
  fs: Fix file_set_fowner LSM hook inconsistencies
  lsm: Use IS_ERR_OR_NULL() helper function
  lsm: remove LSM_COUNT and LSM_CONFIG_COUNT
  ipe: Remove duplicated include in ipe.c
  lsm: replace indirect LSM hook calls with static calls
  lsm: count the LSMs enabled at compile time
  kernel: Add helper macros for loop unrolling
  init/main.c: Initialize early LSMs after arch code, static keys and calls.
  MAINTAINERS: add IPE entry with Fan Wu as maintainer
  documentation: add IPE documentation
  ipe: kunit test for parser
  scripts: add boot policy generation program
  ipe: enable support for fs-verity as a trust provider
  fsverity: expose verified fsverity built-in signatures to LSMs
  lsm: add security_inode_setintegrity() hook
  ipe: add support for dm-verity as a trust provider
  dm-verity: expose root hash digest and signature data to LSMs
  block,lsm: add LSM blob and new LSM hooks for block devices
  ipe: add permissive toggle
  ...
2024-09-16 18:19:47 +02:00
Linus Torvalds
adfc3ded5c for-6.12/io_uring-discard-20240913
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkboUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj7DD/oDqQ13NOHuotVbufPRDWuG6+UEaN/Pukp/
 RYDWwYu/DB4v7LVWBV9COqN5jQqY2wrMpgBdZqtEnDtC7yjN6QYAT4TQdfIq/HNo
 NooN4ULmJzOOC6sR9MBGyzOsCbz7kmRt1nBZ7vdEXMrLXeX9JDX3bDrELf7jhKsk
 84lKE/Mxs530LSzxAtN9KaOQncK5gXen4WSrZsYraU2vJFAPBkJwQGAL5pOdmsp9
 NqvNE3QonPr4v99XnDJH80q44afuqffUITPjtGX52tBMO3CCUQFUpZp5fiUjfa1v
 Okz+SyeBE6gB7c008BGqTOgmKdQOMs3uwFDQ/xMw+pYwy+wHH4skzPP776DwAdgn
 C/SaVFsaXkqOXX4f+CiNJ01LmD4EOBy16LM5qE4NwLNpjQu/3EdHjNqaYfM/LCca
 YyQoUOsnYIRj21+oNFpKekscuEAPKG9ewyMyvfxbkk167j00lgwVwybb/2JfYvRJ
 i0GBY5phJnkeNUerU9SDm6RBTAjDOZ0stubTtFjugDZdrz2FmA4pBFGWjgYLiLhH
 3ZCyaCAOoYW8yxxkogTzKbLx6wXb5wgS7jTHgsk+eeSSWRBTnv2sd0fn/D5m3Uw7
 uBHKvauDp3zEd9MdF26QG7U6RlojEbVoyTYjnJskPsClxbch4WSpwvoEILdJRvls
 1dTczxgdyw==
 =wlzo
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12/io_uring-discard-20240913' of git://git.kernel.dk/linux

Pull io_uring async discard support from Jens Axboe:
 "Sitting on top of both the 6.12 block and io_uring core branches,
  here's support for async discard through io_uring.

  This allows applications to issue async discards, rather than rely on
  the blocking sync ioctl discards we already have. The sync support is
  difficult to use outside of idle/cleanup periods.

  On a real (but slow) device, testing shows the following results when
  compared to sync discard:

	qd64 sync discard: 21K IOPS, lat avg 3 msec (max 21 msec)
	qd64 async discard: 76K IOPS, lat avg 845 usec (max 2.2 msec)

	qd64 sync discard: 14K IOPS, lat avg 5 msec (max 25 msec)
	qd64 async discard: 56K IOPS, lat avg 1153 usec (max 3.6 msec)

  and synthetic null_blk testing with the same queue depth and block
  size settings as above shows:

	Type    Trim size       IOPS    Lat avg (usec)  Lat Max (usec)
	==============================================================
	sync    4k               144K       444            20314
	async   4k              1353K        47              595
	sync    1M                56K      1136            21031
	async   1M                94K       680              760"

* tag 'for-6.12/io_uring-discard-20240913' of git://git.kernel.dk/linux:
  block: implement async io_uring discard cmd
  block: introduce blk_validate_byte_range()
  filemap: introduce filemap_invalidate_pages
  io_uring/cmd: give inline space in request to cmds
  io_uring/cmd: expose iowq to cmds
2024-09-16 13:50:14 +02:00
Linus Torvalds
26bb0d3f38 for-6.12/block-20240913
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkZhQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjOKD/0fzd4yOcqxSI9W3OLGd04VrOTJIQa4CRbV
 GmoTq39pOeIDVGug5ekkTpqqHHnuGk+nQhCzD9vsN/eTmC7yZOIr847O2aWzvYEn
 PzFRgmJpoo2E9sr/IsTR5LnJjbaIZhQVkqLH6ZOj9tpKlVwN2SK0nIRVNrAi5zgT
 MaDrto/2OUld+vmA99Rgb23jxM6UBdCPIjuiVa+11Vg9Z3D1tWbBmrsG7OMysyIf
 FbASBeKHqFSO61/ipFCZv6VV1X8zoWEVyT8n4A1yUbbN5rLzPgoQJVbfSqQRXIdr
 cdrKeCbKxl+joSgKS6LKpvnfwRgGF+hgAfpZg4c0vrbZGTQcRhhLFECyh/aVI08F
 p5TOMArhVaX59664gHgSPq4KnGTXOO29dot9N3Jya/ZQnxinjY9r+GVOfLuduPPy
 1B04vab8oAsk4zK7fZbkDxgYUyifwzK/vQ6OqYq2mYdpdIS/AE7T2ou61Bz5mI7I
 /BuucNV0Z96OKlyLEXwXXZjZgNu1TFcq6ARIBJ8L08PY64Fesj5BXabRyXkeNH26
 0exyz9heeJs6OwRGfngXmS24tDSS0k74CeZX3KoePNj69u6KCn346KiU1qgntwwD
 E5F7AEHqCl5FjUEIWB4M1EPlfA8U0MzOL+tkx2xKJAjsU60wAy7jRSyOIcqodpMs
 6UlPcJzgYg==
 =uuLl
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - MD changes via Song:
      - md-bitmap refactoring (Yu Kuai)
      - raid5 performance optimization (Artur Paszkiewicz)
      - Other small fixes (Yu Kuai, Chen Ni)
      - Add a sysfs entry 'new_level' (Xiao Ni)
      - Improve information reported in /proc/mdstat (Mateusz Kusiak)

 - NVMe changes via Keith:
      - Asynchronous namespace scanning (Stuart)
      - TCP TLS updates (Hannes)
      - RDMA queue controller validation (Niklas)
      - Align field names to the spec (Anuj)
      - Metadata support validation (Puranjay)
      - A syntax cleanup (Shen)
      - Fix a Kconfig linking error (Arnd)
      - New queue-depth quirk (Keith)

 - Add missing unplug trace event (Keith)

 - blk-iocost fixes (Colin, Konstantin)

 - t10-pi modular removal and fixes (Alexey)

 - Fix for potential BLKSECDISCARD overflow (Alexey)

 - bio splitting cleanups and fixes (Christoph)

 - Deal with folios rather than rather than pages, speeding up how the
   block layer handles bigger IOs (Kundan)

 - Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike)

 - Reduce zoned device overhead in ublk (Ming)

 - Add and use sendpages_ok() for drbd and nvme-tcp (Ofir)

 - Fix regression in partition error pointer checking (Riyan)

 - Add support for write zeroes and rotational status in nbd (Wouter)

 - Add Yu Kuai as new BFQ maintainer. The scheduler has been
   unmaintained for quite a while.

 - Various sets of fixes for BFQ (Yu Kuai)

 - Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail,
   Yang)

* tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits)
  nvme-pci: qdepth 1 quirk
  block: fix potential invalid pointer dereference in blk_add_partition
  blk_iocost: make read-only static array vrate_adj_pct const
  block: unpin user pages belonging to a folio at once
  mm: release number of pages of a folio
  block: introduce folio awareness and add a bigger size from folio
  block: Added folio-ized version of bio_add_hw_page()
  block, bfq: factor out a helper to split bfqq in bfq_init_rq()
  block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
  block, bfq: remove local variable 'split' in bfq_init_rq()
  block, bfq: remove bfq_log_bfqg()
  block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
  block, bfq: fix procress reference leakage for bfqq in merge chain
  block, bfq: fix uaf for accessing waker_bfqq after splitting
  blk-throttle: support prioritized processing of metadata
  blk-throttle: remove last_low_overflow_time
  drbd: Add NULL check for net_conf to prevent dereference in state validation
  nvme-tcp: fix link failure for TCP auth
  blk-mq: add missing unplug trace event
  mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
  ...
2024-09-16 13:33:06 +02:00
Linus Torvalds
ee25861f26 vfs-6.12.fallocate
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEwAAKCRCRxhvAZXjc
 omD7AQCZuWPXkEGYFD37MJZuRXNEoq7Tuj6yd0O2b5khUpzvyAD+MPuthGiCMPsu
 voPpUP83x7T0D3JsEsCAXtNeVRcIBQI=
 =xTs6
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fallocate updates from Christian Brauner:
 "This contains work to try and cleanup some the fallocate mode
  handling. Currently, it confusingly mixes operation modes and an
  optional flag.

  The work here tries to better define operation modes and optional
  flags allowing the core and filesystem code to use switch statements
  to switch on the operation mode"

* tag 'vfs-6.12.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  xfs: refactor xfs_file_fallocate
  xfs: move the xfs_is_always_cow_inode check into xfs_alloc_file_space
  xfs: call xfs_flush_unmap_range from xfs_free_file_space
  fs: sort out the fallocate mode vs flag mess
  ext4: remove tracing for FALLOC_FL_NO_HIDE_STALE
  block: remove checks for FALLOC_FL_NO_HIDE_STALE
2024-09-16 09:34:08 +02:00
Linus Torvalds
2775df6e5e vfs-6.12.folio
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEvgAKCRCRxhvAZXjc
 ou77AQD3U1KjbdgzbUi6kaUmiiWOPhfYTlm8mho8dBjqvTCB+AD/XTWSFCWWhHB4
 KyQZTbjRD81xmVNbKjASazp0EA6Ahwc=
 =gIsD
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs folio updates from Christian Brauner:
 "This contains work to port write_begin and write_end to rely on folios
  for various filesystems.

  This converts ocfs2, vboxfs, orangefs, jffs2, hostfs, fuse, f2fs,
  ecryptfs, ntfs3, nilfs2, reiserfs, minixfs, qnx6, sysv, ufs, and
  squashfs.

  After this series lands a bunch of the filesystems in this list do not
  mention struct page anymore"

* tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (61 commits)
  Squashfs: Ensure all readahead pages have been used
  Squashfs: Rewrite and update squashfs_readahead_fragment() to not use page->index
  Squashfs: Update squashfs_readpage_block() to not use page->index
  Squashfs: Update squashfs_readahead() to not use page->index
  Squashfs: Update page_actor to not use page->index
  jffs2: Use a folio in jffs2_garbage_collect_dnode()
  jffs2: Convert jffs2_do_readpage_nolock to take a folio
  buffer: Convert __block_write_begin() to take a folio
  ocfs2: Convert ocfs2_write_zero_page to use a folio
  fs: Convert aops->write_begin to take a folio
  fs: Convert aops->write_end to take a folio
  vboxsf: Use a folio in vboxsf_write_end()
  orangefs: Convert orangefs_write_begin() to use a folio
  orangefs: Convert orangefs_write_end() to use a folio
  jffs2: Convert jffs2_write_begin() to use a folio
  jffs2: Convert jffs2_write_end() to use a folio
  hostfs: Convert hostfs_write_end() to use a folio
  fuse: Convert fuse_write_begin() to use a folio
  fuse: Convert fuse_write_end() to use a folio
  f2fs: Convert f2fs_write_begin() to use a folio
  ...
2024-09-16 08:54:30 +02:00
Keith Busch
76c313f658 blk-integrity: improved sg segment mapping
Make the integrity mapping more like data mapping, blk_rq_map_sg. Use
the request to validate the segment count, and update the callers so
they don't have to.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240913191746.2628196-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13 13:22:09 -06:00
Keith Busch
db5197b554 block: unexport blk_rq_count_integrity_sg
There are no external users of this.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240913182854.2445457-9-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13 12:31:45 -06:00
Keith Busch
d2c5b1facc block: provide a request helper for user integrity segments
Provide a helper to keep the request flags and nr_integrity_segments in
sync with the bio's integrity payload. This is an integrity equivalent
to the normal data helper function, 'blk_rq_map_user()'.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240913182854.2445457-6-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13 12:31:45 -06:00
Keith Busch
0d7cb52fe4 blk-integrity: consider entire bio list for merging
If a bio is merged to a request, the entire bio list is merged, so don't
temporarily detach it from its list when counting segments. In most
cases, bi_next will already be NULL, so detaching is usually a no-op.
But if the bio does have a list, the current code is miscounting the
segments for the resulting merge.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240913182854.2445457-5-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13 12:31:45 -06:00
Keith Busch
d148d75034 blk-integrity: properly account for segments
Both types of merging when integrity data is used are miscounting the
segments:

Merging two requests wasn't accounting for the new segment count, so add
the "next" segment count to the first on a successful merge to ensure
this value is accurate.

Merging a bio into an existing request was double counting the bio's
segments, even if the merge failed later on. Move the segment accounting
to the end when the merge is successful.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240913182854.2445457-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13 12:31:45 -06:00
Keith Busch
9c297eced5 blk-mq: set the nr_integrity_segments from bio
This value is used for merging considerations, so it needs to be
accurate.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240913182854.2445457-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13 12:31:45 -06:00
Keith Busch
2b01808614 blk-mq: unconditional nr_integrity_segments
Always defining the field will make using it easier and less error prone
in future patches.

There shouldn't be any downside to this: the field fits in what would
otherwise be a 2-byte hole, so we're not saving space by conditionally
leaving it out.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240913182854.2445457-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13 12:31:45 -06:00
Riyan Dhiman
26e197b7f9 block: fix potential invalid pointer dereference in blk_add_partition
The blk_add_partition() function initially used a single if-condition
(IS_ERR(part)) to check for errors when adding a partition. This was
modified to handle the specific case of -ENXIO separately, allowing the
function to proceed without logging the error in this case. However,
this change unintentionally left a path where md_autodetect_dev()
could be called without confirming that part is a valid pointer.

This commit separates the error handling logic by splitting the
initial if-condition, improving code readability and handling specific
error scenarios explicitly. The function now distinguishes the general
error case from -ENXIO without altering the existing behavior of
md_autodetect_dev() calls.

Fixes: b72053072c (block: allow partitions on host aware zone devices)
Signed-off-by: Riyan Dhiman <riyandhiman14@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240911132954.5874-1-riyandhiman14@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-12 08:46:40 -06:00
Colin Ian King
cc08968466 blk_iocost: make read-only static array vrate_adj_pct const
The static array vrate_adj_pct is read-only, so make it const as
well.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240911214124.197403-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11 16:05:26 -06:00
Pavel Begunkov
50c52250e2 block: implement async io_uring discard cmd
io_uring allows implementing custom file specific asynchronous
operations via the fops->uring_cmd callback, a.k.a. IORING_OP_URING_CMD
requests or just io_uring commands. Use it to add support for async
discards.

Normally, it first tries to queue up bios in a non-blocking context,
and if that fails, we'd retry from a blocking context by returning
-EAGAIN to the core io_uring. We always get the result from bios
asynchronously by setting a custom bi_end_io callback, at which point
we drag the request into the task context to either reissue or complete
it and post a completion to the user.

Unlike ioctl(BLKDISCARD) with stronger guarantees against races, we only
do a best effort attempt to invalidate page cache, and it can race with
any writes and reads and leave page cache stale. It's the same kind of
races we allow to direct writes.

Also, apart from cases where discarding is not allowed at all, e.g.
discards are not supported or the file/device is read only, the user
should assume that the sector range on disk is not valid anymore, even
when an error was returned to the user.

Suggested-by: Conrad Meyer <conradmeyer@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2b5210443e4fa0257934f73dfafcc18a77cd0e09.1726072086.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11 10:45:28 -06:00
Pavel Begunkov
7a07210bbc block: introduce blk_validate_byte_range()
In preparation to further changes extract a helper function out of
blk_ioctl_discard() that validates if we can do IO against the given
range of disk byte addresses.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/19a7779323c71e742a2f511e4cf49efcfd68cfd4.1726072086.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11 10:45:07 -06:00
Jens Axboe
318ad4283a Merge branch 'for-6.12/block' into for-6.12/io_uring-discard
* for-6.12/block: (115 commits)
  block: unpin user pages belonging to a folio at once
  mm: release number of pages of a folio
  block: introduce folio awareness and add a bigger size from folio
  block: Added folio-ized version of bio_add_hw_page()
  block, bfq: factor out a helper to split bfqq in bfq_init_rq()
  block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
  block, bfq: remove local variable 'split' in bfq_init_rq()
  block, bfq: remove bfq_log_bfqg()
  block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
  block, bfq: fix procress reference leakage for bfqq in merge chain
  block, bfq: fix uaf for accessing waker_bfqq after splitting
  blk-throttle: support prioritized processing of metadata
  blk-throttle: remove last_low_overflow_time
  drbd: Add NULL check for net_conf to prevent dereference in state validation
  blk-mq: add missing unplug trace event
  mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
  md: Add new_level sysfs interface
  zram: Shrink zram_table_entry::flags.
  zram: Remove ZRAM_LOCK
  zram: Replace bit spinlocks with a spinlock_t.
  ...
2024-09-11 10:42:37 -06:00
Kundan Kumar
eb1d46fcd5 block: unpin user pages belonging to a folio at once
Use newly added mm function unpin_user_folio() to put refs by npages
count.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20240911064935.5630-5-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11 07:24:01 -06:00
Kundan Kumar
ed9832bc08 block: introduce folio awareness and add a bigger size from folio
Add a bigger size from folio to bio and skip merge processing for pages.

Fetch the offset of page within a folio. Depending on the size of folio
and folio_offset, fetch a larger length. This length may consist of
multiple contiguous pages if folio is multiorder.

Using the length calculate number of pages which will be added to bio and
increment the loop counter to skip those pages.

This technique helps to avoid overhead of merging pages which belong to
same large order folio.

Also folio-ize the functions bio_iov_add_page() and
bio_iov_add_zone_append_page()

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240911064935.5630-3-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11 07:24:01 -06:00
Kundan Kumar
7de9895468 block: Added folio-ized version of bio_add_hw_page()
Added new bio_add_hw_folio() function as a wrapper around
bio_add_hw_page(). This is a prep patch.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240911064935.5630-2-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11 07:24:00 -06:00
Yu Kuai
a7609d2aec block, bfq: factor out a helper to split bfqq in bfq_init_rq()
Make code cleaner, there are no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-8-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10 16:32:09 -06:00
Yu Kuai
3c61429c29 block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
Now that 'bfqq_already_existing' is only used in one branch, it can be
removed. There are no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-7-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10 16:32:09 -06:00
Yu Kuai
e61e002a67 block, bfq: remove local variable 'split' in bfq_init_rq()
The local variable is used to call bfq_bfqq_resume_state() later,
since 'bfqd->lock' is held, and bfqq status will not change between
setting 'split' and calling bfq_bfqq_resume_state(), move forward
bfq_bfqq_resume_state() so that 'split' can be removed. There are no
functional chagnes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10 16:32:09 -06:00
Yu Kuai
553a606c25 block, bfq: remove bfq_log_bfqg()
It's not used, hence can be removed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10 16:32:09 -06:00
Yu Kuai
bc3b1e9e7c block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
Because bfq_put_cooperator() is always followed by
bfq_release_process_ref().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10 16:32:09 -06:00
Yu Kuai
73aeab3735 block, bfq: fix procress reference leakage for bfqq in merge chain
Original state:

        Process 1       Process 2       Process 3       Process 4
         (BIC1)          (BIC2)          (BIC3)          (BIC4)
          Λ                |               |               |
           \--------------\ \-------------\ \-------------\|
                           V               V               V
          bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
    ref    0               1               2               4

After commit 0e456dba86 ("block, bfq: choose the last bfqq from merge
chain in bfq_setup_cooperator()"), if P1 issues a new IO:

Without the patch:

        Process 1       Process 2       Process 3       Process 4
         (BIC1)          (BIC2)          (BIC3)          (BIC4)
          Λ                |               |               |
           \------------------------------\ \-------------\|
                                           V               V
          bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
    ref    0               0               2               4

bfqq3 will be used to handle IO from P1, this is not expected, IO
should be redirected to bfqq4;

With the patch:

          -------------------------------------------
          |                                         |
        Process 1       Process 2       Process 3   |   Process 4
         (BIC1)          (BIC2)          (BIC3)     |    (BIC4)
                           |               |        |      |
                            \-------------\ \-------------\|
                                           V               V
          bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
    ref    0               0               2               4

IO is redirected to bfqq4, however, procress reference of bfqq3 is still
2, while there is only P2 using it.

Fix the problem by calling bfq_merge_bfqqs() for each bfqq in the merge
chain. Also change bfqq_merge_bfqqs() to return new_bfqq to simplify
code.

Fixes: 0e456dba86 ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10 16:32:09 -06:00
Yu Kuai
1ba0403ac6 block, bfq: fix uaf for accessing waker_bfqq after splitting
After commit 42c306ed72 ("block, bfq: don't break merge chain in
bfq_split_bfqq()"), if the current procress is the last holder of bfqq,
the bfqq can be freed after bfq_split_bfqq(). Hence recored the bfqq and
then access bfqq->waker_bfqq may trigger UAF. What's more, the waker_bfqq
may in the merge chain of bfqq, hence just recored waker_bfqq is still
not safe.

Fix the problem by adding a helper bfq_waker_bfqq() to check if
bfqq->waker_bfqq is in the merge chain, and current procress is the only
holder.

Fixes: 42c306ed72 ("block, bfq: don't break merge chain in bfq_split_bfqq()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10 16:32:09 -06:00
Yu Kuai
29390bb566 blk-throttle: support prioritized processing of metadata
Currently, blk-throttle handle all IO fifo, hence if data IO is
throttled and then meta IO is dispatched, the meta IO will have to wait
for the data IO, causing priority inversion problems.

This patch support to handle metadata first and then pay debt while
throttling data.

Test script: use cgroup v1 to throttle root cgroup, then create new
dir and file while write back is throttled

test() {
  mkdir /mnt/test/xxx
  touch /mnt/test/xxx/1
  sync /mnt/test/xxx
  sync /mnt/test/xxx
}

mkfs.ext4 -F /dev/nvme0n1 -E lazy_itable_init=0,lazy_journal_init=0
mount /dev/nvme0n1 /mnt/test

echo "259:0 $((1024*1024))" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
dd if=/dev/zero of=/mnt/test/foo1 bs=16M count=1 conv=fdatasync status=none &
sleep 4

time test
echo "259:0 0" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device

sleep 1
umount /dev/nvme0n1

Test result: time cost for creating new dir and file
before this patch:  14s
after this patch:   0.1s

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240903135149.271857-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10 16:31:41 -06:00
Yu Kuai
3bf73e6283 blk-throttle: remove last_low_overflow_time
last_low_overflow_time is not used anymore after commit bf20ab538c
("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW").

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240903135149.271857-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10 16:31:41 -06:00
Damien Le Moal
734e1a8603 block: Prevent deadlocks when switching elevators
Commit af28141498 ("block: freeze the queue in queue_attr_store")
changed queue_attr_store() to always freeze a sysfs attribute queue
before calling the attribute store() method, to ensure that no IOs are
in-flight when an attribute value is being updated.

However, this change created a potential deadlock situation for the
scheduler queue attribute as changing the queue elevator with
elv_iosched_store() can result in a call to request_module() if the user
requested module is not already registered. If the file of the requested
module is stored on the block device of the frozen queue, a deadlock
will happen as the read operations triggered by request_module() will
wait for the queue freeze to end.

Solve this issue by introducing the load_module method in struct
queue_sysfs_entry, and to calling this method function in
queue_attr_store() before freezing the attribute queue.
The macro definition QUEUE_RW_LOAD_MODULE_ENTRY() is added to define a
queue sysfs attribute that needs loading a module.

The definition of the scheduler atrribute is changed to using
QUEUE_RW_LOAD_MODULE_ENTRY(), with the function
elv_iosched_load_module() defined as the load_module method.
elv_iosched_store() can then be simplified to remove the call to
request_module().

Reported-by: Richard W.M. Jones <rjones@redhat.com>
Reported-by: Jiri Jaburek <jjaburek@redhat.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219166
Fixes: af28141498 ("block: freeze the queue in queue_attr_store")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Link: https://lore.kernel.org/r/20240908000704.414538-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10 13:43:42 -06:00
Keith Busch
acc8c0a988 blk-mq: add missing unplug trace event
The single-queue optimized list flush doesn't have an unplug trace event
to pair with the plug event. Add one.

In the unlikely event an error occurs and falls back to the less
optimized plug flush path, it's possible a 2nd unplug trace event will
be logged, but it will show the remainig count that weren't previously
handled.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240906194540.3719642-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-07 07:41:12 -06:00
Alexey Dobriyan
697ba0b6ec block: fix integer overflow in BLKSECDISCARD
I independently rediscovered

	commit 22d24a544b
	block: fix overflow in blk_ioctl_discard()

but for secure erase.

Same problem:

	uint64_t r[2] = {512, 18446744073709551104ULL};
	ioctl(fd, BLKSECDISCARD, r);

will enter near infinite loop inside blkdev_issue_secure_erase():

	a.out: attempt to access beyond end of device
	loop0: rw=5, sector=3399043073, nr_sectors = 1024 limit=2048
	bio_check_eod: 3286214 callbacks suppressed

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Link: https://lore.kernel.org/r/9e64057f-650a-46d1-b9f7-34af391536ef@p183
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-04 08:01:30 -06:00
Alvaro Parker
2be6190cd7 block: fix comment to use set_current_state
The explanatory comment used `set_task_state` instead of
`set_current_state` which is the function actually used in the code.

Signed-off-by: Alvaro Parker <alparkerdf@gmail.com>
Link: https://lore.kernel.org/r/20240903172214.520086-1-alparkerdf@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-04 07:27:19 -06:00
Mikulas Patocka
b858a36fe9 bio-integrity: don't restrict the size of integrity metadata
bio_integrity_add_page restricts the size of the integrity metadata to
queue_max_hw_sectors(q). This restriction is not needed because oversized
bios are split automatically. This restriction causes problems with
dm-integrity 'inline' mode - if we send a large bio to dm-integrity and
the bio's metadata are larger than queue_max_hw_sectors(q),
bio_integrity_add_page fails and the bio is ended with BLK_STS_RESOURCE
error.

An example that triggers it:

dd: error writing '/dev/mapper/in2': Cannot allocate memory
1+0 records in
0+0 records out
0 bytes copied, 0.00169291 s, 0.0 kB/s

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: fb0987682c ("dm-integrity: introduce the Inline mode")
Fixes: 0ece1d649b ("bio-integrity: create multi-page bvecs in bio_integrity_add_page()")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/e41b3b8e-16c2-70cb-97cb-881234bb200d@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-04 07:17:00 -06:00
Yu Kuai
f45916ae60 block, bfq: use bfq_reassign_last_bfqq() in bfq_bfqq_move()
Instead of open coding it, there are no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03 09:51:54 -06:00
Yu Kuai
42c306ed72 block, bfq: don't break merge chain in bfq_split_bfqq()
Consider the following scenario:

    Process 1       Process 2       Process 3       Process 4
     (BIC1)          (BIC2)          (BIC3)          (BIC4)
      Λ               |               |                |
       \-------------\ \-------------\ \--------------\|
                      V               V                V
      bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref    0              1               2                4

If Process 1 issue a new IO and bfqq2 is found, and then bfq_init_rq()
decide to spilt bfqq2 by bfq_split_bfqq(). Howerver, procress reference
of bfqq2 is 1 and bfq_split_bfqq() just clear the coop flag, which will
break the merge chain.

Expected result: caller will allocate a new bfqq for BIC1

    Process 1       Process 2       Process 3       Process 4
     (BIC1)          (BIC2)          (BIC3)          (BIC4)
                      |               |                |
                       \-------------\ \--------------\|
                                      V                V
      bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref    0              0               1                3

Since the condition is only used for the last bfqq4 when the previous
bfqq2 and bfqq3 are already splited. Fix the problem by checking if
bfqq is the last one in the merge chain as well.

Fixes: 36eca89483 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03 09:51:54 -06:00
Yu Kuai
0e456dba86 block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()
Consider the following merge chain:

Process 1       Process 2       Process 3	Process 4
 (BIC1)          (BIC2)          (BIC3)		 (BIC4)
  Λ                |               |               |
   \--------------\ \-------------\ \-------------\|
                   V               V		   V
  bfqq1--------->bfqq2---------->bfqq3----------->bfqq4

IO from Process 1 will get bfqf2 from BIC1 first, then
bfq_setup_cooperator() will found bfqq2 already merged to bfqq3 and then
handle this IO from bfqq3. However, the merge chain can be much deeper
and bfqq3 can be merged to other bfqq as well.

Fix this problem by iterating to the last bfqq in
bfq_setup_cooperator().

Fixes: 36eca89483 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03 09:51:54 -06:00
Yu Kuai
18ad4df091 block, bfq: fix possible UAF for bfqq->bic with merge chain
1) initial state, three tasks:

		Process 1       Process 2	Process 3
		 (BIC1)          (BIC2)		 (BIC3)
		  |  Λ            |  Λ		  |  Λ
		  |  |            |  |		  |  |
		  V  |            V  |		  V  |
		  bfqq1           bfqq2		  bfqq3
process ref:	   1		    1		    1

2) bfqq1 merged to bfqq2:

		Process 1       Process 2	Process 3
		 (BIC1)          (BIC2)		 (BIC3)
		  |               |		  |  Λ
		  \--------------\|		  |  |
		                  V		  V  |
		  bfqq1--------->bfqq2		  bfqq3
process ref:	   0		    2		    1

3) bfqq2 merged to bfqq3:

		Process 1       Process 2	Process 3
		 (BIC1)          (BIC2)		 (BIC3)
	 here -> Λ                |		  |
		  \--------------\ \-------------\|
		                  V		  V
		  bfqq1--------->bfqq2---------->bfqq3
process ref:	   0		    1		    3

In this case, IO from Process 1 will get bfqq2 from BIC1 first, and then
get bfqq3 through merge chain, and finially handle IO by bfqq3.
Howerver, current code will think bfqq2 is owned by BIC1, like initial
state, and set bfqq2->bic to BIC1.

bfq_insert_request
-> by Process 1
 bfqq = bfq_init_rq(rq)
  bfqq = bfq_get_bfqq_handle_split
   bfqq = bic_to_bfqq
   -> get bfqq2 from BIC1
 bfqq->ref++
 rq->elv.priv[0] = bic
 rq->elv.priv[1] = bfqq
 if (bfqq_process_refs(bfqq) == 1)
  bfqq->bic = bic
  -> record BIC1 to bfqq2

  __bfq_insert_request
   new_bfqq = bfq_setup_cooperator
   -> get bfqq3 from bfqq2->new_bfqq
   bfqq_request_freed(bfqq)
   new_bfqq->ref++
   rq->elv.priv[1] = new_bfqq
   -> handle IO by bfqq3

Fix the problem by checking bfqq is from merge chain fist. And this
might fix a following problem reported by our syzkaller(unreproducible):

==================================================================
BUG: KASAN: slab-use-after-free in bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
BUG: KASAN: slab-use-after-free in bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
BUG: KASAN: slab-use-after-free in bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
Write of size 1 at addr ffff888123839eb8 by task kworker/0:1H/18595

CPU: 0 PID: 18595 Comm: kworker/0:1H Tainted: G             L     6.6.0-07439-gba2303cacfda #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Workqueue: kblockd blk_mq_requeue_work
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
 print_address_description mm/kasan/report.c:364 [inline]
 print_report+0x10d/0x610 mm/kasan/report.c:475
 kasan_report+0x8e/0xc0 mm/kasan/report.c:588
 bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
 bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
 bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
 bfq_get_bfqq_handle_split+0x169/0x5d0 block/bfq-iosched.c:6757
 bfq_init_rq block/bfq-iosched.c:6876 [inline]
 bfq_insert_request block/bfq-iosched.c:6254 [inline]
 bfq_insert_requests+0x1112/0x5cf0 block/bfq-iosched.c:6304
 blk_mq_insert_request+0x290/0x8d0 block/blk-mq.c:2593
 blk_mq_requeue_work+0x6bc/0xa70 block/blk-mq.c:1502
 process_one_work kernel/workqueue.c:2627 [inline]
 process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
 kthread+0x33c/0x440 kernel/kthread.c:388
 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
 </TASK>

Allocated by task 20776:
 kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
 kasan_set_track+0x25/0x30 mm/kasan/common.c:52
 __kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328
 kasan_slab_alloc include/linux/kasan.h:188 [inline]
 slab_post_alloc_hook mm/slab.h:763 [inline]
 slab_alloc_node mm/slub.c:3458 [inline]
 kmem_cache_alloc_node+0x1a4/0x6f0 mm/slub.c:3503
 ioc_create_icq block/blk-ioc.c:370 [inline]
 ioc_find_get_icq+0x180/0xaa0 block/blk-ioc.c:436
 bfq_prepare_request+0x39/0xf0 block/bfq-iosched.c:6812
 blk_mq_rq_ctx_init.isra.7+0x6ac/0xa00 block/blk-mq.c:403
 __blk_mq_alloc_requests+0xcc0/0x1070 block/blk-mq.c:517
 blk_mq_get_new_requests block/blk-mq.c:2940 [inline]
 blk_mq_submit_bio+0x624/0x27c0 block/blk-mq.c:3042
 __submit_bio+0x331/0x6f0 block/blk-core.c:624
 __submit_bio_noacct_mq block/blk-core.c:703 [inline]
 submit_bio_noacct_nocheck+0x816/0xb40 block/blk-core.c:732
 submit_bio_noacct+0x7a6/0x1b50 block/blk-core.c:826
 xlog_write_iclog+0x7d5/0xa00 fs/xfs/xfs_log.c:1958
 xlog_state_release_iclog+0x3b8/0x720 fs/xfs/xfs_log.c:619
 xlog_cil_push_work+0x19c5/0x2270 fs/xfs/xfs_log_cil.c:1330
 process_one_work kernel/workqueue.c:2627 [inline]
 process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
 kthread+0x33c/0x440 kernel/kthread.c:388
 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305

Freed by task 946:
 kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
 kasan_set_track+0x25/0x30 mm/kasan/common.c:52
 kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
 ____kasan_slab_free mm/kasan/common.c:236 [inline]
 __kasan_slab_free+0x12c/0x1c0 mm/kasan/common.c:244
 kasan_slab_free include/linux/kasan.h:164 [inline]
 slab_free_hook mm/slub.c:1815 [inline]
 slab_free_freelist_hook mm/slub.c:1841 [inline]
 slab_free mm/slub.c:3786 [inline]
 kmem_cache_free+0x118/0x6f0 mm/slub.c:3808
 rcu_do_batch+0x35c/0xe30 kernel/rcu/tree.c:2189
 rcu_core+0x819/0xd90 kernel/rcu/tree.c:2462
 __do_softirq+0x1b0/0x7a2 kernel/softirq.c:553

Last potentially related work creation:
 kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
 __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
 __call_rcu_common kernel/rcu/tree.c:2712 [inline]
 call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
 ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
 ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
 process_one_work kernel/workqueue.c:2627 [inline]
 process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
 kthread+0x33c/0x440 kernel/kthread.c:388
 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305

Second to last potentially related work creation:
 kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
 __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
 __call_rcu_common kernel/rcu/tree.c:2712 [inline]
 call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
 ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
 ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
 process_one_work kernel/workqueue.c:2627 [inline]
 process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
 kthread+0x33c/0x440 kernel/kthread.c:388
 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305

The buggy address belongs to the object at ffff888123839d68
 which belongs to the cache bfq_io_cq of size 1360
The buggy address is located 336 bytes inside of
 freed 1360-byte region [ffff888123839d68, ffff88812383a2b8)

The buggy address belongs to the physical page:
page:ffffea00048e0e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88812383f588 pfn:0x123838
head:ffffea00048e0e00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x17ffffc0000a40(workingset|slab|head|node=0|zone=2|lastcpupid=0x1fffff)
page_type: 0xffffffff()
raw: 0017ffffc0000a40 ffff88810588c200 ffffea00048ffa10 ffff888105889488
raw: ffff88812383f588 0000000000150006 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888123839d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888123839e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888123839e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                        ^
 ffff888123839f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888123839f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Fixes: 36eca89483 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03 09:51:54 -06:00
Josef Bacik
31754ea6cb iomap: add a private argument for iomap_file_buffered_write
In order to switch fuse over to using iomap for buffered writes we need
to be able to have the struct file for the original write, in case we
have to read in the page to make it uptodate.  Handle this by using the
existing private field in the iomap_iter, and add the argument to
iomap_file_buffered_write.  This will allow us to pass the file in
through the iomap buffered write path, and is flexible for any other
file systems needs.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/7f55c7c32275004ba00cddf862d970e6e633f750.1724755651.git.josef@toxicpanda.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-03 15:01:23 +02:00
Christoph Hellwig
1251580983 block: don't use bio_split_rw on misc operations
bio_split_rw is designed to split read and write bios with a payload.
Currently it is called by __bio_split_to_limits for all operations not
explicitly list, which works because bio_may_need_split explicitly checks
for bi_vcnt == 1 and thus skips the bypass if there is no payload and
bio_for_each_bvec loop will never execute it's body if bi_size is 0.

But all this is hard to understand, fragile and wasted pointless cycles.
Switch __bio_split_to_limits to only call bio_split_rw for READ and
WRITE command and don't attempt any kind split for operation that do not
require splitting.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29 04:32:32 -06:00
Christoph Hellwig
1e8a7f6af9 block: properly handle REQ_OP_ZONE_APPEND in __bio_split_to_limits
Currently REQ_OP_ZONE_APPEND is handled by the bio_split_rw case in
__bio_split_to_limits.  This is harmful because REQ_OP_ZONE_APPEND
bios do not adhere to the soft max_limits value but instead use their
own capped version of max_hw_sectors, leading to incorrect splits that
later blow up in bio_split.

We still need the bio_split_rw logic to count nr_segs for blk-mq code,
so add a new wrapper that passes in the right limit, and turns any bio
that would need a split into an error as an additional debugging aid.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29 04:32:32 -06:00
Christoph Hellwig
b35243a447 block: rework bio splitting
The current setup with bio_may_exceed_limit and __bio_split_to_limits
is a bit of a mess.

Change it so that __bio_split_to_limits does all the work and is just
a variant of bio_split_to_limits that returns nr_segs.  This is done
by inlining it and instead have the various bio_split_* helpers directly
submit the potentially split bios.

To support btrfs, the rw version has a lower level helper split out
that just returns the offset to split.  This turns out to nicely clean
up the btrfs flow as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29 04:32:32 -06:00
Christoph Hellwig
ad01dadaa9 block: remove checks for FALLOC_FL_NO_HIDE_STALE
While the FALLOC_FL_NO_HIDE_STALE value has been registered, it has
always been rejected by vfs_fallocate before making it into
blkdev_fallocate because it isn't in the supported mask.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240827065123.1762168-2-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28 16:53:57 +02:00
Darrick J. Wong
e33a97a830 block: fix detection of unsupported WRITE SAME in blkdev_issue_write_zeroes
On error, blkdev_issue_write_zeroes used to recheck the block device's
WRITE SAME queue limits after submitting WRITE SAME bios.  As stated in
the comment, the purpose of this was to collapse all IO errors to
EOPNOTSUPP if the effect of issuing bios was that WRITE SAME got turned
off in the queue limits.  Therefore, it does not make sense to reuse the
zeroes limit that was read earlier in the function because we only care
about the queue limit *now*, not what it was at the start of the
function.

Found by running generic/351 from fstests.

Fixes: 64b582ca88 ("block: Read max write zeroes once for __blkdev_issue_write_zeroes()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240827175340.GB1977952@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-28 08:49:25 -06:00
Konstantin Ovsepian
9bce8005ec blk_iocost: fix more out of bound shifts
Recently running UBSAN caught few out of bound shifts in the
ioc_forgive_debts() function:

UBSAN: shift-out-of-bounds in block/blk-iocost.c:2142:38
shift exponent 80 is too large for 64-bit type 'u64' (aka 'unsigned long
long')
...
UBSAN: shift-out-of-bounds in block/blk-iocost.c:2144:30
shift exponent 80 is too large for 64-bit type 'u64' (aka 'unsigned long
long')
...
Call Trace:
<IRQ>
dump_stack_lvl+0xca/0x130
__ubsan_handle_shift_out_of_bounds+0x22c/0x280
? __lock_acquire+0x6441/0x7c10
ioc_timer_fn+0x6cec/0x7750
? blk_iocost_init+0x720/0x720
? call_timer_fn+0x5d/0x470
call_timer_fn+0xfa/0x470
? blk_iocost_init+0x720/0x720
__run_timer_base+0x519/0x700
...

Actual impact of this issue was not identified but I propose to fix the
undefined behaviour.
The proposed fix to prevent those out of bound shifts consist of
precalculating exponent before using it the shift operations by taking
min value from the actual exponent and maximum possible number of bits.

Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Konstantin Ovsepian <ovs@ovs.to>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240822154137.2627818-1-ovs@ovs.to
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-26 14:13:22 -06:00
Deven Bowers
b55d26bd18 block,lsm: add LSM blob and new LSM hooks for block devices
This patch introduces a new LSM blob to the block_device structure,
enabling the security subsystem to store security-sensitive data related
to block devices. Currently, for a device mapper's mapped device containing
a dm-verity target, critical security information such as the roothash and
its signing state are not readily accessible. Specifically, while the
dm-verity volume creation process passes the dm-verity roothash and its
signature from userspace to the kernel, the roothash is stored privately
within the dm-verity target, and its signature is discarded
post-verification. This makes it extremely hard for the security subsystem
to utilize these data.

With the addition of the LSM blob to the block_device structure, the
security subsystem can now retain and manage important security metadata
such as the roothash and the signing state of a dm-verity by storing them
inside the blob. Access decisions can then be based on these stored data.

The implementation follows the same approach used for security blobs in
other structures like struct file, struct inode, and struct superblock.
The initialization of the security blob occurs after the creation of the
struct block_device, performed by the security subsystem. Similarly, the
security blob is freed by the security subsystem before the struct
block_device is deallocated or freed.

This patch also introduces a new hook security_bdev_setintegrity() to save
block device's integrity data to the new LSM blob. For example, for
dm-verity, it can use this hook to expose its roothash and signing state
to LSMs, then LSMs can save these data into the LSM blob.

Please note that the new hook should be invoked every time the security
information is updated to keep these data current. For example, in
dm-verity, if the mapping table is reloaded and configured to use a
different dm-verity target with a new roothash and signing information,
the previously stored data in the LSM blob will become obsolete. It is
crucial to re-invoke the hook to refresh these data and ensure they are up
to date. This necessity arises from the design of device-mapper, where a
device-mapper device is first created, and then targets are subsequently
loaded into it. These targets can be modified multiple times during the
device's lifetime. Therefore, while the LSM blob is allocated during the
creation of the block device, its actual contents are not initialized at
this stage and can change substantially over time. This includes
alterations from data that the LSM 'trusts' to those it does not, making
it essential to handle these changes correctly. Failure to address this
dynamic aspect could potentially allow for bypassing LSM checks.

Signed-off-by: Deven Bowers <deven.desai@linux.microsoft.com>
Signed-off-by: Fan Wu <wufan@linux.microsoft.com>
[PM: merge fuzz, subject line tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-08-20 14:02:33 -04:00
Caleb Sander Mateos
e68ac2b488 softirq: Remove unused 'action' parameter from action callback
When soft interrupt actions are called, they are passed a pointer to the
struct softirq action which contains the action's function pointer.

This pointer isn't useful, as the action callback already knows what
function it is. And since each callback handles a specific soft interrupt,
the callback also knows which soft interrupt number is running.

No soft interrupt action callback actually uses this parameter, so remove
it from the function pointer signature. This clarifies that soft interrupt
actions are global routines and makes it slightly cheaper to call them.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/all/20240815171549.3260003-1-csander@purestorage.com
2024-08-20 17:13:40 +02:00
John Garry
64b582ca88 block: Read max write zeroes once for __blkdev_issue_write_zeroes()
As reported in [0], we may get a hang when formatting a XFS FS on a RAID0
drive.

Commit 73a768d5f9 ("block: factor out a blk_write_zeroes_limit helper")
changed __blkdev_issue_write_zeroes() to read the max write zeroes
value in the loop. This is not safe as max write zeroes may change in
value. Specifically for the case of [0], the value goes to 0, and we get
an infinite loop.

Lift the limit reading out of the loop.

[0] https://lore.kernel.org/linux-xfs/4d31268f-310b-4220-88a2-e191c3932a82@oracle.com/T/#t

Fixes: 73a768d5f9 ("block: factor out a blk_write_zeroes_limit helper")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240815163228.216051-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-19 09:48:59 -06:00
Yue Haibing
b2261de752 blk-cgroup: Remove unused declaration blkg_path()
Commit bb7e5a193d ("block, bfq: remove blkg_path()") removed the
implementation but leave declaration.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20240816095821.877842-1-yuehaibing@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-16 15:07:27 -06:00
Li Lingfeng
b313a8c835 block: Fix lockdep warning in blk_mq_mark_tag_wait
Lockdep reported a warning in Linux version 6.6:

[  414.344659] ================================
[  414.345155] WARNING: inconsistent lock state
[  414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[  414.346221] --------------------------------
[  414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[  414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.351204] {IN-SOFTIRQ-W} state was registered at:
[  414.351751]   lock_acquire+0x18d/0x460
[  414.352218]   _raw_spin_lock_irqsave+0x39/0x60
[  414.352769]   __wake_up_common_lock+0x22/0x60
[  414.353289]   sbitmap_queue_wake_up+0x375/0x4f0
[  414.353829]   sbitmap_queue_clear+0xdd/0x270
[  414.354338]   blk_mq_put_tag+0xdf/0x170
[  414.354807]   __blk_mq_free_request+0x381/0x4d0
[  414.355335]   blk_mq_free_request+0x28b/0x3e0
[  414.355847]   __blk_mq_end_request+0x242/0xc30
[  414.356367]   scsi_end_request+0x2c1/0x830
[  414.345155] WARNING: inconsistent lock state
[  414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[  414.346221] --------------------------------
[  414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[  414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.351204] {IN-SOFTIRQ-W} state was registered at:
[  414.351751]   lock_acquire+0x18d/0x460
[  414.352218]   _raw_spin_lock_irqsave+0x39/0x60
[  414.352769]   __wake_up_common_lock+0x22/0x60
[  414.353289]   sbitmap_queue_wake_up+0x375/0x4f0
[  414.353829]   sbitmap_queue_clear+0xdd/0x270
[  414.354338]   blk_mq_put_tag+0xdf/0x170
[  414.354807]   __blk_mq_free_request+0x381/0x4d0
[  414.355335]   blk_mq_free_request+0x28b/0x3e0
[  414.355847]   __blk_mq_end_request+0x242/0xc30
[  414.356367]   scsi_end_request+0x2c1/0x830
[  414.356863]   scsi_io_completion+0x177/0x1610
[  414.357379]   scsi_complete+0x12f/0x260
[  414.357856]   blk_complete_reqs+0xba/0xf0
[  414.358338]   __do_softirq+0x1b0/0x7a2
[  414.358796]   irq_exit_rcu+0x14b/0x1a0
[  414.359262]   sysvec_call_function_single+0xaf/0xc0
[  414.359828]   asm_sysvec_call_function_single+0x1a/0x20
[  414.360426]   default_idle+0x1e/0x30
[  414.360873]   default_idle_call+0x9b/0x1f0
[  414.361390]   do_idle+0x2d2/0x3e0
[  414.361819]   cpu_startup_entry+0x55/0x60
[  414.362314]   start_secondary+0x235/0x2b0
[  414.362809]   secondary_startup_64_no_verify+0x18f/0x19b
[  414.363413] irq event stamp: 428794
[  414.363825] hardirqs last  enabled at (428793): [<ffffffff816bfd1c>] ktime_get+0x1dc/0x200
[  414.364694] hardirqs last disabled at (428794): [<ffffffff85470177>] _raw_spin_lock_irq+0x47/0x50
[  414.365629] softirqs last  enabled at (428444): [<ffffffff85474780>] __do_softirq+0x540/0x7a2
[  414.366522] softirqs last disabled at (428419): [<ffffffff813f65ab>] irq_exit_rcu+0x14b/0x1a0
[  414.367425]
               other info that might help us debug this:
[  414.368194]  Possible unsafe locking scenario:
[  414.368900]        CPU0
[  414.369225]        ----
[  414.369548]   lock(&sbq->ws[i].wait);
[  414.370000]   <Interrupt>
[  414.370342]     lock(&sbq->ws[i].wait);
[  414.370802]
                *** DEADLOCK ***
[  414.371569] 5 locks held by kworker/u10:3/1152:
[  414.372088]  #0: ffff88810130e938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_scheduled_works+0x357/0x13f0
[  414.373180]  #1: ffff88810201fdb8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x3a3/0x13f0
[  414.374384]  #2: ffffffff86ffbdc0 (rcu_read_lock){....}-{1:2}, at: blk_mq_run_hw_queue+0x637/0xa00
[  414.375342]  #3: ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.376377]  #4: ffff888106205a08 (&hctx->dispatch_wait_lock){+.-.}-{2:2}, at: blk_mq_dispatch_rq_list+0x1337/0x1ee0
[  414.378607]
               stack backtrace:
[  414.379177] CPU: 0 PID: 1152 Comm: kworker/u10:3 Not tainted 6.6.0-07439-gba2303cacfda #6
[  414.380032] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[  414.381177] Workqueue: writeback wb_workfn (flush-253:0)
[  414.381805] Call Trace:
[  414.382136]  <TASK>
[  414.382429]  dump_stack_lvl+0x91/0xf0
[  414.382884]  mark_lock_irq+0xb3b/0x1260
[  414.383367]  ? __pfx_mark_lock_irq+0x10/0x10
[  414.383889]  ? stack_trace_save+0x8e/0xc0
[  414.384373]  ? __pfx_stack_trace_save+0x10/0x10
[  414.384903]  ? graph_lock+0xcf/0x410
[  414.385350]  ? save_trace+0x3d/0xc70
[  414.385808]  mark_lock.part.20+0x56d/0xa90
[  414.386317]  mark_held_locks+0xb0/0x110
[  414.386791]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  414.387320]  lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.387901]  ? _raw_spin_unlock_irq+0x28/0x50
[  414.388422]  trace_hardirqs_on+0x58/0x100
[  414.388917]  _raw_spin_unlock_irq+0x28/0x50
[  414.389422]  __blk_mq_tag_busy+0x1d6/0x2a0
[  414.389920]  __blk_mq_get_driver_tag+0x761/0x9f0
[  414.390899]  blk_mq_dispatch_rq_list+0x1780/0x1ee0
[  414.391473]  ? __pfx_blk_mq_dispatch_rq_list+0x10/0x10
[  414.392070]  ? sbitmap_get+0x2b8/0x450
[  414.392533]  ? __blk_mq_get_driver_tag+0x210/0x9f0
[  414.393095]  __blk_mq_sched_dispatch_requests+0xd99/0x1690
[  414.393730]  ? elv_attempt_insert_merge+0x1b1/0x420
[  414.394302]  ? __pfx___blk_mq_sched_dispatch_requests+0x10/0x10
[  414.394970]  ? lock_acquire+0x18d/0x460
[  414.395456]  ? blk_mq_run_hw_queue+0x637/0xa00
[  414.395986]  ? __pfx_lock_acquire+0x10/0x10
[  414.396499]  blk_mq_sched_dispatch_requests+0x109/0x190
[  414.397100]  blk_mq_run_hw_queue+0x66e/0xa00
[  414.397616]  blk_mq_flush_plug_list.part.17+0x614/0x2030
[  414.398244]  ? __pfx_blk_mq_flush_plug_list.part.17+0x10/0x10
[  414.398897]  ? writeback_sb_inodes+0x241/0xcc0
[  414.399429]  blk_mq_flush_plug_list+0x65/0x80
[  414.399957]  __blk_flush_plug+0x2f1/0x530
[  414.400458]  ? __pfx___blk_flush_plug+0x10/0x10
[  414.400999]  blk_finish_plug+0x59/0xa0
[  414.401467]  wb_writeback+0x7cc/0x920
[  414.401935]  ? __pfx_wb_writeback+0x10/0x10
[  414.402442]  ? mark_held_locks+0xb0/0x110
[  414.402931]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  414.403462]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.404062]  wb_workfn+0x2b3/0xcf0
[  414.404500]  ? __pfx_wb_workfn+0x10/0x10
[  414.404989]  process_scheduled_works+0x432/0x13f0
[  414.405546]  ? __pfx_process_scheduled_works+0x10/0x10
[  414.406139]  ? do_raw_spin_lock+0x101/0x2a0
[  414.406641]  ? assign_work+0x19b/0x240
[  414.407106]  ? lock_is_held_type+0x9d/0x110
[  414.407604]  worker_thread+0x6f2/0x1160
[  414.408075]  ? __kthread_parkme+0x62/0x210
[  414.408572]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.409168]  ? __kthread_parkme+0x13c/0x210
[  414.409678]  ? __pfx_worker_thread+0x10/0x10
[  414.410191]  kthread+0x33c/0x440
[  414.410602]  ? __pfx_kthread+0x10/0x10
[  414.411068]  ret_from_fork+0x4d/0x80
[  414.411526]  ? __pfx_kthread+0x10/0x10
[  414.411993]  ret_from_fork_asm+0x1b/0x30
[  414.412489]  </TASK>

When interrupt is turned on while a lock holding by spin_lock_irq it
throws a warning because of potential deadlock.

blk_mq_prep_dispatch_rq
 blk_mq_get_driver_tag
  __blk_mq_get_driver_tag
   __blk_mq_alloc_driver_tag
    blk_mq_tag_busy -> tag is already busy
    // failed to get driver tag
 blk_mq_mark_tag_wait
  spin_lock_irq(&wq->lock) -> lock A (&sbq->ws[i].wait)
  __add_wait_queue(wq, wait) -> wait queue active
  blk_mq_get_driver_tag
  __blk_mq_tag_busy
-> 1) tag must be idle, which means there can't be inflight IO
   spin_lock_irq(&tags->lock) -> lock B (hctx->tags)
   spin_unlock_irq(&tags->lock) -> unlock B, turn on interrupt accidentally
-> 2) context must be preempt by IO interrupt to trigger deadlock.

As shown above, the deadlock is not possible in theory, but the warning
still need to be fixed.

Fix it by using spin_lock_irqsave to get lockB instead of spin_lock_irq.

Fixes: 4f1731df60 ("blk-mq: fix potential io hang by wrong 'wake_batch'")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240815024736.2040971-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-15 19:25:03 -06:00
Alexey Dobriyan
a28dc358e2 block: constify ext_pi_ref_escape()
This function doesn't mutate data.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/d24611b3-dddf-473a-903d-39290db03b11@p183
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-13 06:20:02 -06:00
Alexey Dobriyan
49923a0dff block: delete module stuff from t10-pi
It is not possible to build t10-pi.ko anymore.

This file doesn't even export functions.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/216ccc79-5b80-47b2-b507-990951aa810c@p183
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-13 06:19:39 -06:00
John Garry
ea6787c695 scsi: block: Don't check REQ_ATOMIC for reads
We check in submit_bio_noacct() if flag REQ_ATOMIC is set for both read and
write operations, and then validate the atomic operation if set. Flag
REQ_ATOMIC can only be set for writes, so don't bother checking for reads.

Fixes: 9da3d1e912 ("block: Add core atomic write support")
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240805113315.1048591-3-john.g.garry@oracle.com
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2024-08-12 18:03:38 -04:00
Matthew Wilcox (Oracle)
1da86618bd
fs: Convert aops->write_begin to take a folio
Convert all callers from working on a page to working on one page
of a folio (support for working on an entire folio can come later).
Removes a lot of folio->page->folio conversions.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 11:33:21 +02:00
Matthew Wilcox (Oracle)
a225800f32
fs: Convert aops->write_end to take a folio
Most callers have a folio, and most implementations operate on a folio,
so remove the conversion from folio->page->folio to fit through this
interface.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 11:32:02 +02:00
Matthew Wilcox (Oracle)
97edbc02b2
buffer: Convert block_write_end() to take a folio
All callers now have a folio, so pass it in instead of converting
from a folio to a page and back to a folio again.  Saves a call
to compound_head().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 11:31:59 +02:00
Matthew Wilcox (Oracle)
1262249d03
block: Use a folio in blkdev_write_end()
Replaces two hidden calls to compound_head() with one explicit one.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 11:31:58 +02:00
Yu Kuai
79c6c60a6c blk-ioprio: remove per-disk structure
ioprio works on the blk-cgroup level, all disks in the same cgroup
are the same, and the struct ioprio_blkg doesn't have anything in it.
Hence register the policy is enough, because cpd_alloc/free_fn will
be handled for each blk-cgroup, and there is no need to activate the
policy for disk. Hence remove blk_ioprio_init/exit and
ioprio_alloc/free_pd.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240719071506.158075-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-28 16:47:51 -06:00
Yu Kuai
d0e9279599 blk-ioprio: remove ioprio_blkcg_from_bio()
Currently, if config is enabled, then ioprio is always enabled by
default from blkcg_init_disk(), hence there is no point to check if
the policy is enabled from blkg in ioprio_blkcg_from_bio(). Hence remove
it and get blkcg directly from bio.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240719071506.158075-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-28 16:47:51 -06:00
Yu Kuai
ae8650b45d blk-cgroup: check for pd_(alloc|free)_fn in blkcg_activate_policy()
Currently all policies implement pd_(alloc|free)_fn, however, this is
not necessary for ioprio that only works for blkcg, not blkg.

There are no functional changes, prepare to cleanup activating ioprio
policy.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240719071506.158075-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-28 16:47:51 -06:00
Dr. David Alan Gilbert
01aa8c869d blk-throttle: remove more latency dead-code
The struct 'latency_bucket' and the #define 'request_bucket_index'
are unused since
commit bf20ab538c ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")

and the 'LATENCY_BUCKET_SIZE' #define was only used by the
'request_bucket_index' define.

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://lore.kernel.org/r/20240727155824.1000042-1-linux@treblig.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-27 10:33:25 -06:00
Yang Yang
7e04da2dc7 block: fix deadlock between sd_remove & sd_release
Our test report the following hung task:

[ 2538.459400] INFO: task "kworker/0:0":7 blocked for more than 188 seconds.
[ 2538.459427] Call trace:
[ 2538.459430]  __switch_to+0x174/0x338
[ 2538.459436]  __schedule+0x628/0x9c4
[ 2538.459442]  schedule+0x7c/0xe8
[ 2538.459447]  schedule_preempt_disabled+0x24/0x40
[ 2538.459453]  __mutex_lock+0x3ec/0xf04
[ 2538.459456]  __mutex_lock_slowpath+0x14/0x24
[ 2538.459459]  mutex_lock+0x30/0xd8
[ 2538.459462]  del_gendisk+0xdc/0x350
[ 2538.459466]  sd_remove+0x30/0x60
[ 2538.459470]  device_release_driver_internal+0x1c4/0x2c4
[ 2538.459474]  device_release_driver+0x18/0x28
[ 2538.459478]  bus_remove_device+0x15c/0x174
[ 2538.459483]  device_del+0x1d0/0x358
[ 2538.459488]  __scsi_remove_device+0xa8/0x198
[ 2538.459493]  scsi_forget_host+0x50/0x70
[ 2538.459497]  scsi_remove_host+0x80/0x180
[ 2538.459502]  usb_stor_disconnect+0x68/0xf4
[ 2538.459506]  usb_unbind_interface+0xd4/0x280
[ 2538.459510]  device_release_driver_internal+0x1c4/0x2c4
[ 2538.459514]  device_release_driver+0x18/0x28
[ 2538.459518]  bus_remove_device+0x15c/0x174
[ 2538.459523]  device_del+0x1d0/0x358
[ 2538.459528]  usb_disable_device+0x84/0x194
[ 2538.459532]  usb_disconnect+0xec/0x300
[ 2538.459537]  hub_event+0xb80/0x1870
[ 2538.459541]  process_scheduled_works+0x248/0x4dc
[ 2538.459545]  worker_thread+0x244/0x334
[ 2538.459549]  kthread+0x114/0x1bc

[ 2538.461001] INFO: task "fsck.":15415 blocked for more than 188 seconds.
[ 2538.461014] Call trace:
[ 2538.461016]  __switch_to+0x174/0x338
[ 2538.461021]  __schedule+0x628/0x9c4
[ 2538.461025]  schedule+0x7c/0xe8
[ 2538.461030]  blk_queue_enter+0xc4/0x160
[ 2538.461034]  blk_mq_alloc_request+0x120/0x1d4
[ 2538.461037]  scsi_execute_cmd+0x7c/0x23c
[ 2538.461040]  ioctl_internal_command+0x5c/0x164
[ 2538.461046]  scsi_set_medium_removal+0x5c/0xb0
[ 2538.461051]  sd_release+0x50/0x94
[ 2538.461054]  blkdev_put+0x190/0x28c
[ 2538.461058]  blkdev_release+0x28/0x40
[ 2538.461063]  __fput+0xf8/0x2a8
[ 2538.461066]  __fput_sync+0x28/0x5c
[ 2538.461070]  __arm64_sys_close+0x84/0xe8
[ 2538.461073]  invoke_syscall+0x58/0x114
[ 2538.461078]  el0_svc_common+0xac/0xe0
[ 2538.461082]  do_el0_svc+0x1c/0x28
[ 2538.461087]  el0_svc+0x38/0x68
[ 2538.461090]  el0t_64_sync_handler+0x68/0xbc
[ 2538.461093]  el0t_64_sync+0x1a8/0x1ac

  T1:				T2:
  sd_remove
  del_gendisk
  __blk_mark_disk_dead
  blk_freeze_queue_start
  ++q->mq_freeze_depth
  				bdev_release
 				mutex_lock(&disk->open_mutex)
  				sd_release
 				scsi_execute_cmd
 				blk_queue_enter
 				wait_event(!q->mq_freeze_depth)
  mutex_lock(&disk->open_mutex)

SCSI does not set GD_OWNS_QUEUE, so QUEUE_FLAG_DYING is not set in
this scenario. This is a classic ABBA deadlock. To fix the deadlock,
make sure we don't try to acquire disk->open_mutex after freezing
the queue.

Cc: stable@vger.kernel.org
Fixes: eec1be4c30 ("block: delete partitions later in del_gendisk")
Signed-off-by: Yang Yang <yang.yang@vivo.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: and Cc: stable tags are missing. Otherwise this patch looks fine
Link: https://lore.kernel.org/r/20240724070412.22521-1-yang.yang@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24 09:51:21 -06:00
Linus Torvalds
7d080fa867 for-6.11/block-20240722
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaeZBIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqI7D/9XPinZuuwiZ/670P8yjk1SHFzqzdwtuFuP
 +Dq2lcoYRkuwm5PvCvhs3QH2mnjS1vo1SIoAijGEy3V1bs41mw87T2knKMIn4g5v
 I5A4gC6i0IqxIkFm17Zx9yG+MivoOmPtqM4RMxze2xS/uJwWcvg4tjBHZfylY3d9
 oaIXyZj+0dTRf955K2x/5dpfE6qjtDG0bqrrJXnzaIKHBJk2HKezYFbTstAA4OY+
 MvMqRL7uJmJBd7384/WColIO0b8/UEchPl7qG+zy9pg+wzQGLFyF/Z/KdjrWdDMD
 IFs92uNDFQmiGoyujJmXdDV9xpKi94nqDAtUR+Qct0Mui5zz0w2RNcGvyTDjBMpv
 CAzTkTW48moYkwLPhPmy8Ge69elT82AC/9ZQAHbA7g3TYgJML5IT/7TtiaVe6Rc1
 podnTR3/e9XmZnc25aUZeAr6CG7b+0NBvB+XPO9lNyMEE38sfwShoPdAGdKX25oA
 mjnLHBc9grVOQzRGEx22E11k+1ChXf/o9H546PB2Pr9yvf/DQ3868a+QhHssxufL
 Xul1K5a+pUmOnaTLD3ESftYlFmcDOHQ6gDK697so7mU7lrD3ctN4HYZ2vwNk35YY
 2b4xrABrOEbAXlUo3Ht8F/ecg6qw4xTr9vAW5q4+L2H5+28RaZKYclHhLmR23yfP
 xJ/d5FfVFQ==
 =fqoV
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11/block-20240722' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - MD fixes via Song:
     - md-cluster fixes (Heming Zhao)
     - raid1 fix (Mateusz Jończyk)

 - s390/dasd module description (Jeff)

 - Series cleaning up and hardening the blk-mq debugfs flag handling
   (John, Christoph)

 - blk-cgroup cleanup (Xiu)

 - Error polled IO attempts if backend doesn't support it (hexue)

 - Fix for an sbitmap hang (Yang)

* tag 'for-6.11/block-20240722' of git://git.kernel.dk/linux: (23 commits)
  blk-cgroup: move congestion_count to struct blkcg
  sbitmap: fix io hung due to race on sbitmap_word::cleared
  block: avoid polling configuration errors
  block: Catch possible entries missing from rqf_name[]
  block: Simplify definition of RQF_NAME()
  block: Use enum to define RQF_x bit indexes
  block: Catch possible entries missing from cmd_flag_name[]
  block: Catch possible entries missing from alloc_policy_name[]
  block: Catch possible entries missing from hctx_flag_name[]
  block: Catch possible entries missing from hctx_state_name[]
  block: Catch possible entries missing from blk_queue_flag_name[]
  block: Make QUEUE_FLAG_x as an enum
  block: Relocate BLK_MQ_MAX_DEPTH
  block: Relocate BLK_MQ_CPU_WORK_BATCH
  block: remove QUEUE_FLAG_STOPPED
  block: Add missing entry to hctx_flag_name[]
  block: Add zone write plugging entry to rqf_name[]
  block: Add missing entries from cmd_flag_name[]
  s390/dasd: fix error checks in dasd_copy_pair_store()
  s390/dasd: add missing MODULE_DESCRIPTION() macros
  ...
2024-07-22 11:32:05 -07:00
Linus Torvalds
0256994887 for-6.11/block-post-20240722
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaeY00QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjPGD/9CPo93+V/ztfzY1J18KhA2CCUh1uuxZIjx
 dLfi07Bo+gyLwB1vaSf0bNy9gM8SzGFSMszSIDTErNq9/F6RvWjXN0CchyQf1Wii
 o2UyQg8JLjT2o1pJSsdJySZQRsG/daWUHzHaX1kD343Cd6OBV2YaVFdYTaXUGg4v
 G1AVh7qFvQhAIg1jV8q2z7QC7PSeuTnvyvY65Z8/iVJe95FayOrtGmDPTaJab8r2
 7uEFiWZk23erzNygVdcSoNIrwWFmRARz5o3IvwJJfEL08hkdoAqu6vD2oCUZspKU
 3g4wU6JrN0QYQpVwIJ9WcwYcoOm6iMm9xwCVMsp8R3KRUU107HjaiEazFDGk4HW4
 ozZTa7leTXnrRqnjVhcQpUvC+1uVLCFN8sSElNY7m2dg0IojnlMz+t3lMiTtaR9N
 Rt6wy5alVQFlb2uhzALuUh6HM1zA98swWySNoP0arTkOT9kjXwwAgn0I+M1s9Uxo
 FaQvM0YnAsb2C8LSpNtZWLaTlRSLTzUsGThLSJMBZueIJ9+BF23i7W7euklCNxjj
 Jl6CykEkEkacOxU6b9PG6qSnUq9JJ+W7gcJVing+ugAFrZDutxy6eJZXVv8wuvCC
 EOxaADpSs2xAaH9V0BMmwO51w0NDWySyGPHB5UBkhNjqOji/oG3FvAITiboQArgS
 FES4jtU1TA==
 =dn4l
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux

Pull block integrity mapping updates from Jens Axboe:
 "A set of cleanups and fixes for the block integrity support.

  Sent separately from the main block changes from last week, as they
  depended on later fixes in the 6.10-rc cycle"

* tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux:
  block: don't free the integrity payload in bio_integrity_unmap_free_user
  block: don't free submitter owned integrity payload on I/O completion
  block: call bio_integrity_unmap_free_user from blk_rq_unmap_user
  block: don't call bio_uninit from bio_endio
  block: also return bio_integrity_payload * from stubs
  block: split integrity support out of bio.h
2024-07-22 11:04:09 -07:00
Linus Torvalds
4305ca0087 SCSI misc on 20240718
Updates to the usual drivers (ufs, lpfc, qla2xxx, mpi3mr) plus some
 misc small fixes. The only core changes are to both bsg and scsi to
 pass in the device instead of setting it afterwards as q->queuedata,
 so no functional change.
 
 Signed-off-by: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCZpl6BiYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishVkAAQCOLA0V
 TFI4RfRjk7TW/6ZgKVS5A4NNLG8p8r9F7Y/QswEAlT4NrYnHiHQwBYEiTw6w02J8
 SqiHtHKv/SQ7LIwEJlQ=
 =WhCT
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "Updates to the usual drivers (ufs, lpfc, qla2xxx, mpi3mr) plus some
  misc small fixes.

  The only core changes are to both bsg and scsi to pass in the device
  instead of setting it afterwards as q->queuedata, so no functional
  change"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (69 commits)
  scsi: aha152x: Use DECLARE_COMPLETION_ONSTACK for non-constant completion
  scsi: qla2xxx: Convert comma to semicolon
  scsi: qla2xxx: Update version to 10.02.09.300-k
  scsi: qla2xxx: Use QP lock to search for bsg
  scsi: qla2xxx: Reduce fabric scan duplicate code
  scsi: qla2xxx: Fix optrom version displayed in FDMI
  scsi: qla2xxx: During vport delete send async logout explicitly
  scsi: qla2xxx: Complete command early within lock
  scsi: qla2xxx: Fix flash read failure
  scsi: qla2xxx: Return ENOBUFS if sg_cnt is more than one for ELS cmds
  scsi: qla2xxx: Fix for possible memory corruption
  scsi: qla2xxx: validate nvme_local_port correctly
  scsi: qla2xxx: Unable to act on RSCN for port online
  scsi: ufs: exynos: Add support for Flash Memory Protector (FMP)
  scsi: ufs: core: Add UFSHCD_QUIRK_KEYS_IN_PRDT
  scsi: ufs: core: Add fill_crypto_prdt variant op
  scsi: ufs: core: Add UFSHCD_QUIRK_BROKEN_CRYPTO_ENABLE
  scsi: ufs: core: fold ufshcd_clear_keyslot() into its caller
  scsi: ufs: core: Add UFSHCD_QUIRK_CUSTOM_CRYPTO_PROFILE
  scsi: ufs: mcq: Make .get_hba_mac() optional
  ...
2024-07-19 10:56:58 -07:00
Xiu Jianfeng
89ed6c9ac6 blk-cgroup: move congestion_count to struct blkcg
The congestion_count was introduced into the struct cgroup by
commit d09d8df3a2 ("blkcg: add generic throttling mechanism"),
but since it is closely related to the blkio subsys, it is not
appropriate to put it in the struct cgroup, so let's move it to
struct blkcg. There should be no functional changes because blkcg
is per cgroup.

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240716133058.3491350-1-xiujianfeng@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:40:07 -06:00
hexue
73e59d3eec block: avoid polling configuration errors
This patch adds a poll queue check, aiming to help users use polled IO
accurately.

If users do polled IO but the device doesn't have poll queues, they will
get suboptimal performance data and waste CPU resources. Add a poll queue
check batching this. If users don't have the device properly configured,
or if it simply doesn't support polled IO, it will error the IO with
-EOPNOTSUPP. This is similar to what we used to do for sync polled IO,
which is no longer supported.

Signed-off-by: hexue <xue01.he@samsung.com>
Link: https://lore.kernel.org/r/20240718070817.1031494-1-xue01.he@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:35:35 -06:00
John Garry
8a47e33f50 block: Catch possible entries missing from rqf_name[]
Add a BUILD_BUG_ON() call to ensure that we are not missing entries in
rqf_name[].

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240719112912.3830443-16-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:49 -06:00
John Garry
2d61a6c2ca block: Simplify definition of RQF_NAME()
Now that we have a bit index for RQF_x in __RQF_x, use __RQF_x to simplify
the definition of RQF_NAME() by not using ilog2((__force u32()).

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240719112912.3830443-15-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:49 -06:00
John Garry
6fa99325ec block: Catch possible entries missing from cmd_flag_name[]
Add a BUILD_BUG_ON() call to ensure that we are not missing entries in
cmd_flag_name[].

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240719112912.3830443-13-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:49 -06:00
John Garry
26d3bdb57e block: Catch possible entries missing from alloc_policy_name[]
Make BLK_TAG_ALLOC_x an enum and add a "max" entry.

Add a BUILD_BUG_ON() call to ensure that we are not missing entries in
hctx_flag_name[].

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240719112912.3830443-12-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:49 -06:00
John Garry
226f0f6afc block: Catch possible entries missing from hctx_flag_name[]
Refresh values in BLK_MQ_F_x enum, and then re-arrange members in
hctx_flag_name[] to match that enum. Renumber
BLK_MQ_F_ALLOC_POLICY_START_BIT to match the value refresh.

Add a BUILD_BUG_ON() call to ensure that we are not missing entries in
hctx_flag_name[].

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240719112912.3830443-11-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:49 -06:00
John Garry
23827310cc block: Catch possible entries missing from hctx_state_name[]
Add a build-time assert that we are not missing entries from
hctx_state_name[]. For this, create a separate enum for state flags and add
a "max" entry for BLK_MQ_S_x flags.

The numbering for those enum values is as default, so don't explicitly
number.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240719112912.3830443-10-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:49 -06:00
John Garry
cce496de06 block: Catch possible entries missing from blk_queue_flag_name[]
Assert that we are not missing flag entries in blk_queue_flag_name[].

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240719112912.3830443-9-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:49 -06:00
John Garry
3dff615573 block: Relocate BLK_MQ_CPU_WORK_BATCH
BLK_MQ_CPU_WORK_BATCH is defined in include/linux/blk-mq.h, but only used
in blk-mq.c, so relocate to block/blk-mq.h

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240719112912.3830443-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:48 -06:00
Christoph Hellwig
c8f51feee1 block: remove QUEUE_FLAG_STOPPED
QUEUE_FLAG_STOPPED is entirely unused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240719112912.3830443-5-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:48 -06:00
John Garry
1c83c5375e block: Add missing entry to hctx_flag_name[]
Add missing entry for NO_SCHED_BY_DEFAULT and reorder to match the enum.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240719112912.3830443-4-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:48 -06:00
John Garry
af54963f19 block: Add zone write plugging entry to rqf_name[]
Add missing entry.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240719112912.3830443-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:48 -06:00
John Garry
6b3789e6c5 block: Add missing entries from cmd_flag_name[]
Add missing entries for req_flag_bits.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240719112912.3830443-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:48 -06:00
Linus Torvalds
3e78198862 for-6.11/block-20240710
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaOTd8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppqIEACUr8Vv2FtezvT3OfVSlYWHHLXzkRhwEG5s
 vdk0o7Ow6U54sMjfymbHTgLD0ZOJf3uJ6BI95FQuW41jPzDFVbx4Hy8QzqonMkw9
 1D/YQ4zrVL2mOKBzATbKpoGJzMOzGeoXEueFZ1AYPAX7RrDtP4xPQNfrcfkdE2zF
 LycJN70Vp6lrZZMuI9yb9ts1tf7TFzK0HJANxOAKTgSiPmBmxesjkJlhrdUrgkAU
 qDVyjj7u/ssndBJAb9i6Bl95Do8s9t4DeJq5/6wgKqtf5hClMXzPVB8Wy084gr6E
 rTRsCEhOug3qEZSqfAgAxnd3XFRNc/p2KMUe5YZ4mAqux4hpSmIQQDM/5X5K9vEv
 f4MNqUGlqyqntZx+KPyFpf7kLHFYS1qK4ub0FojWJEY4GrbBPNjjncLJ9+ozR0c8
 kNDaFjMNAjalBee1FxNNH8LdVcd28rrCkPxRLEfO/gvBMUmvJf4ZyKmSED0v5DhY
 vZqKlBqG+wg0EXvdiWEHMDh9Y+q/2XBIkS6NN/Bhh61HNu+XzC838ts1X7lR+4o2
 AM5Vapw+v0q6kFBMRP3IcJI/c0UcIU8EQU7axMyzWtvhog8kx8x01hIj1L4UyYYr
 rUdWrkugBVXJbywFuH/QIJxWxS/z4JdSw5VjASJLIrXy+aANmmG9Wonv95eyhpUv
 5iv+EdRSNA==
 =wVi8
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe updates via Keith:
     - Device initialization memory leak fixes (Keith)
     - More constants defined (Weiwen)
     - Target debugfs support (Hannes)
     - PCIe subsystem reset enhancements (Keith)
     - Queue-depth multipath policy (Redhat and PureStorage)
     - Implement get_unique_id (Christoph)
     - Authentication error fixes (Gaosheng)

 - MD updates via Song
     - sync_action fix and refactoring (Yu Kuai)
     - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu
       Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li)

 - Fix loop detach/open race (Gulam)

 - Fix lower control limit for blk-throttle (Yu)

 - Add module descriptions to various drivers (Jeff)

 - Add support for atomic writes for block devices, and statx reporting
   for same. Includes SCSI and NVMe (John, Prasad, Alan)

 - Add IO priority information to block trace points (Dongliang)

 - Various zone improvements and tweaks (Damien)

 - mq-deadline tag reservation improvements (Bart)

 - Ignore direct reclaim swap writes in writeback throttling (Baokun)

 - Block integrity improvements and fixes (Anuj)

 - Add basic support for rust based block drivers. Has a dummy null_blk
   variant for now (Andreas)

 - Series converting driver settings to queue limits, and cleanups and
   fixes related to that (Christoph)

 - Cleanup for poking too deeply into the bvec internals, in preparation
   for DMA mapping API changes (Christoph)

 - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas,
   Ming, Zhu, Damien, Christophe, Chaitanya)

* tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits)
  floppy: add missing MODULE_DESCRIPTION() macro
  loop: add missing MODULE_DESCRIPTION() macro
  ublk_drv: add missing MODULE_DESCRIPTION() macro
  xen/blkback: add missing MODULE_DESCRIPTION() macro
  block/rnbd: Constify struct kobj_type
  block: take offset into account in blk_bvec_map_sg again
  block: fix get_max_segment_size() warning
  loop: Don't bother validating blocksize
  virtio_blk: Don't bother validating blocksize
  null_blk: Don't bother validating blocksize
  block: Validate logical block size in blk_validate_limits()
  virtio_blk: Fix default logical block size fallback
  nvmet-auth: fix nvmet_auth hash error handling
  nvme: implement ->get_unique_id
  block: pass a phys_addr_t to get_max_segment_size
  block: add a bvec_phys helper
  blk-lib: check for kill signal in ioctl BLKZEROOUT
  block: limit the Write Zeroes to manually writing zeroes fallback
  block: refacto blkdev_issue_zeroout
  block: move read-only and supported checks into (__)blkdev_issue_zeroout
  ...
2024-07-15 14:20:22 -07:00
Christoph Hellwig
61353a63a2 block: take offset into account in blk_bvec_map_sg again
The rebase of commit 09595e0c9d ("block: pass a phys_addr_t to
get_max_segment_size") lost adding the total to to the offset in
blk_bvec_map_sg.  Add it back.

Fixes: 09595e0c9d ("block: pass a phys_addr_t to get_max_segment_size")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Chaitanya Kulkarni <chaitanyak@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240709070126.3019940-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09 01:02:44 -06:00
Chaitanya Kulkarni
0ffc46eb1b block: fix get_max_segment_size() warning
Correct the parameter name in the comment of get_max_segment_size()
to fix following warning:-

block/blk-merge.c:220: warning: Function parameter or struct member 'len' not described in 'get_max_segment_size'
block/blk-merge.c:220: warning: Excess function parameter 'max_len' description in 'get_max_segment_size'

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240709045432.8688-1-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09 00:01:11 -06:00
John Garry
fe3d508ba9 block: Validate logical block size in blk_validate_limits()
Some drivers validate that their own logical block size. It is no harm to
always do this, so validate in blk_validate_limits().

This allows us to remove the validation in most of those drivers.

Add a comment to blk_validate_block_size() to inform users that self-
validation of LBS is usually unnecessary.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240708091651.177447-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09 00:00:17 -06:00
Christoph Hellwig
09595e0c9d block: pass a phys_addr_t to get_max_segment_size
Work on a single address to simplify the logic, and prepare the callers
from using better helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240706075228.2350978-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-08 01:51:05 -06:00
Christoph Hellwig
25f76c3db2 block: add a bvec_phys helper
Get callers out of poking into bvec internals a bit more.  Not a huge win
right now, but with the proposed new DMA mapping API we might end up with
a lot more of this otherwise.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240706075228.2350978-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-08 01:51:05 -06:00
Christoph Hellwig
bf86bcdb40 blk-lib: check for kill signal in ioctl BLKZEROOUT
Zeroout can access a significant capacity and take longer than the user
expected.  A user may change their mind about wanting to run that
command and attempt to kill the process and do something else with their
device. But since the task is uninterruptable, they have to wait for it
to finish, which could be many hours.

Add a new BLKDEV_ZERO_KILLABLE flag for blkdev_issue_zeroout that checks
for a fatal signal at each iteration so the user doesn't have to wait for
their regretted operation to complete naturally.

Heavily based on an earlier patch from Keith Busch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240701165219.1571322-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:53:15 -06:00
Christoph Hellwig
39722a2f2b block: limit the Write Zeroes to manually writing zeroes fallback
Only fall back from hardware Write Zeroes failures when
blkdev_issue_write_zeroes returns -EOPNOTSUPP;

Note that blkdev_issue_write_zeroes turns any failure into -EOPNOTSUPP
when the write zeroes queue limit has been cleared to 0, so this still
catches all I/O errors where the driver detected missing support
for the hardware acceleration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240701165219.1571322-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:53:15 -06:00
Christoph Hellwig
99800ced26 block: refacto blkdev_issue_zeroout
Split out two well-defined helpers for hardware supported Write Zeroes
and manually writing zeroes using the Write command.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240701165219.1571322-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:53:15 -06:00
Christoph Hellwig
f6eacb2654 block: move read-only and supported checks into (__)blkdev_issue_zeroout
Move these checks out of the lower level helpers and into the higher level
ones to prepare for refactoring.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240701165219.1571322-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:53:15 -06:00
Christoph Hellwig
ff760a8f0d block: remove the LBA alignment check in __blkdev_issue_zeroout
__blkdev_issue_zeroout is a purely kernel internal API and thus can rely
on the block layer sector alignment checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240701165219.1571322-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:53:15 -06:00
Christoph Hellwig
73a768d5f9 block: factor out a blk_write_zeroes_limit helper
Contrary to the comment in __blkdev_issue_write_zeroes, nothing here
checks for a potential bi_size overflow.  Add a helper mirroring
the secure erase code for the check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240701165219.1571322-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:53:15 -06:00
Damien Le Moal
2f20872ed4 block: Remove blk_alloc_zone_bitmap()
Remove the helper function blk_alloc_zone_bitmap() and replace its
single call site with a call to bitmap_alloc(). To be consistent with
this change, use bitmap_free() to free a disk convnetional zone bitmap.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240704052816.623865-6-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:42:04 -06:00
Damien Le Moal
f2a7bea237 block: Remove REQ_OP_ZONE_RESET_ALL emulation
Now that device mapper can handle resetting all zones of a mapped zoned
device using REQ_OP_ZONE_RESET_ALL, all zoned block device drivers
support this operation. With this, the request queue feature
BLK_FEAT_ZONE_RESETALL is not necessary and the emulation code in
blk-zone.c can be removed.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240704052816.623865-5-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:42:04 -06:00
Anuj Gupta
ba94223805 block: reuse original bio_vec array for integrity during clone
Modify bio_integrity_clone to reuse the original bvec array instead of
allocating and copying it, similar to how bio data path is cloned.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240702100753.2168-1-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-04 02:07:58 -06:00
Christoph Hellwig
74cc150282 block: don't free the integrity payload in bio_integrity_unmap_free_user
Now that the integrity payload is always freed in bio_uninit, don't
bother freeing it a little earlier in bio_integrity_unmap_free_user.
With that the separate bio_integrity_unmap_free_user can go away by
just passing the bio to bio_integrity_unmap_user.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240702151047.1746127-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-03 10:21:16 -06:00
Christoph Hellwig
85253bac4d block: don't free submitter owned integrity payload on I/O completion
Currently __bio_integrity_endio frees the integrity payload unless it is
explicitly marked as user-mapped.  This means in-kernel callers that
allocate their own integrity payload never get to see it on I/O
completion.  The current two users don't need it as they just pre-mapped
PI tuples received over the network, but this limits uses of integrity
data lot.

Change bio_integrity_endio to call __bio_integrity_endio for block layer
generated integrity data only, and leave freeing of submitter
allocated integrity data to bio_uninit which also gets called from
the final bio_put.  This requires that unmapping user mapped or copied
integrity data is now always done by the caller, and the special
BIP_INTEGRITY_USER flag can go away.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240702151047.1746127-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-03 10:21:16 -06:00
Christoph Hellwig
f8924374fd block: call bio_integrity_unmap_free_user from blk_rq_unmap_user
blk_rq_unmap_user always unmaps user space pass-through request.  If such
a request has integrity data attached it must come from a user mapping
as well.  Call bio_integrity_unmap_free_user from blk_rq_unmap_user
and remove the nvme_unmap_bio wrapper in the nvme driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240702151047.1746127-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-03 10:21:16 -06:00
Christoph Hellwig
bf4c89fc87 block: don't call bio_uninit from bio_endio
Commit b222dd2fdd ("block: call bio_uninit in bio_endio") added a call
to bio_uninit in bio_endio to work around callers that use bio_init but
fail to call bio_uninit after they are done to release the resources.
While this is an abuse of the bio_init API we still have quite a few of
those left.  But this early uninit causes a problem for integrity data,
as at least some users need the bio_integrity_payload.  Right now the
only one is the NVMe passthrough which archives this by adding a special
case to skip the freeing if the BIP_INTEGRITY_USER flag is set.

Sort this out by only putting bi_blkg in bio_endio as that is the cause
of the actual leaks - the few users of the crypto context and integrity
data all properly call bio_uninit, usually through bio_put for
dynamically allocated bios.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240702151047.1746127-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-03 10:21:16 -06:00
Christoph Hellwig
da042a3655 block: split integrity support out of bio.h
Split struct bio_integrity_payload and the related prototypes out of
bio.h into a separate bio-integrity.h header so that it is only pulled
in by the few places that need it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240702151047.1746127-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-03 10:21:15 -06:00
Jens Axboe
1a50d14670 Linux 6.10-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmaB0NweHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGkvwH/36UJRk/o6wvXnyH
 E6QjCSWo2226APyWks22NjtC3I/8Iqdvkneuh6wG0qL2sXAB078EMjUq5R81bF8H
 wWFBJwetjYTp8GEyLioMEb2wCH/J3R29dLFC4UYTplafXRGP6//xcpJaKmTxcgdR
 31IzvTPXbApZ7L3k1U6rA2bK9PNKcFCOvZlrNMUCuwMrabymHsDfOUt1DqXyg2xp
 zjqiWYBwlklozmgawSWt/mdEgkWuTcAbg+KyqDVQF59s9aj/OOwZ0j+HACq5V8CM
 quTPIAYL6CC9p7uxa69lGr/sgC0Is/BZLPX7RTZAwCgarGvnX+1HUsjDcaFCtrVg
 O6fPUV8=
 =pgUx
 -----END PGP SIGNATURE-----

Merge tag 'v6.10-rc6' into for-6.11/block-post

Pull in v6.10-rc6 to resolve a conflict for the integrity cleanups.

* tag 'v6.10-rc6': (778 commits)
  Linux 6.10-rc6
  ata: ahci: Clean up sysfs file on error
  ata: libata-core: Fix double free on error
  ata,scsi: libata-core: Do not leak memory for ata_port struct members
  ata: libata-core: Fix null pointer dereference on error
  x86-32: fix cmpxchg8b_emu build error with clang
  x86: stop playing stack games in profile_pc()
  i2c: testunit: discard write requests while old command is running
  i2c: testunit: don't erase registers after STOP
  tty: mxser: Remove __counted_by from mxser_board.ports[]
  randomize_kstack: Remove non-functional per-arch entropy filtering
  string: kunit: add missing MODULE_DESCRIPTION() macros
  ata: libata-core: Add ATA_HORKAGE_NOLPM for all Crucial BX SSD1 models
  MAINTAINERS: Update IOMMU tree location
  tools/power turbostat: Add local build_bug.h header for snapshot target
  tools/power turbostat: Fix unc freq columns not showing with '-q' or '-l'
  tools/power turbostat: option '-n' is ambiguous
  drm/drm_file: Fix pid refcounting race
  kallsyms: rework symbol lookup return codes
  gpiolib: cdev: Ignore reconfiguration without direction
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-03 10:20:05 -06:00
Bart Van Assche
39823b47bb block/mq-deadline: Fix the tag reservation code
The current tag reservation code is based on a misunderstanding of the
meaning of data->shallow_depth. Fix the tag reservation code as follows:
* By default, do not reserve any tags for synchronous requests because
  for certain use cases reserving tags reduces performance. See also
  Harshit Mogalapalli, [bug-report] Performance regression with fio
  sequential-write on a multipath setup, 2024-03-07
  (https://lore.kernel.org/linux-block/5ce2ae5d-61e2-4ede-ad55-551112602401@oracle.com/)
* Reduce min_shallow_depth to one because min_shallow_depth must be less
  than or equal any shallow_depth value.
* Scale dd->async_depth from the range [1, nr_requests] to [1,
  bits_per_sbitmap_word].

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Zhiguo Niu <zhiguo.niu@unisoc.com>
Fixes: 07757588e5 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240509170149.7639-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02 08:47:45 -06:00
Bart Van Assche
6259151c04 block: Call .limit_depth() after .hctx has been set
Call .limit_depth() after data->hctx has been set such that data->hctx can
be used in .limit_depth() implementations.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Zhiguo Niu <zhiguo.niu@unisoc.com>
Fixes: 07757588e5 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240509170149.7639-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02 08:47:45 -06:00
Christoph Hellwig
37105615f7 block: don't reduce max_sectors based on io_opt
Don't reduce the max_sectors value below the normal cap when the driver
advertsizes a very low io_opt.  This restores the behavior we had before
the recent changes to the max_sectors calculation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240701051800.1245240-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-01 06:52:42 -06:00
Christoph Hellwig
f62e8edc0a block: remove a duplicate io_min check in blk_validate_limits
If io_min is larger than the cap, it must by definition be non-zero.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240701051800.1245240-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-01 06:52:42 -06:00
Baokun Li
4e63aeb5d0 blk-wbt: don't throttle swap writes in direct reclaim
Now we avoid throttling swap writes by determining whether the current
process is kswapd (aka current_is_kswapd()), but swap writes can come
from either kswapd or direct reclaim, so the swap writes from direct
reclaim will still be throttled.

When a process holds a lock to allocate a free page, and enters direct
reclaim because there is no free memory, then it might trigger a hung
due to the wbt throttling that causes other processes to fail to get
the lock.

Both kswapd and direct reclaim set the REQ_SWAP flag, so use REQ_SWAP
instead of current_is_kswapd() to avoid throttling swap writes. Also
renamed WBT_KSWAPD to WBT_SWAP and WBT_RWQ_KSWAPD to WBT_RWQ_SWAP.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240604030522.3686177-1-libaokun@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-01 06:51:53 -06:00
Christoph Hellwig
62e35f9422 block: pass a gendisk to the queue_sysfs_entry methods
The kobject for the queue entries is embedded into a struct gendisk.
Pass it to the sysfs methods instead of the request_queue derived from
it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240627111407.476276-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 15:06:16 -06:00
Christoph Hellwig
319e8cfdf3 block: add helper macros to de-duplicate the queue sysfs attributes
A lof the code to implement the queue sysfs attributes is repetitive.
Add a few macros to generate the common cases.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240627111407.476276-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 15:06:16 -06:00
Yu Kuai
1beabab88e blk-throttle: fix lower control under super low iops limit
User will configure allowed iops limit in 1s, and calculate_io_allowed()
will calculate allowed iops in the slice by:

limit * HZ / throtl_slice

However, if limit is quite low, the result can be 0, then
allowed IO in the slice is 0, this will cause missing dispatch and
control will be lower than limit.

For example, set iops_limit to 5 with HD disk, and test will found that
iops will be 3.

This is usually not a big deal, because user will unlikely to configure
such low iops limit, however, this is still a problem in the extreme
scene.

Fix the problem by making sure the wait time calculated by
tg_within_iops_limit() should allow at least one IO to be dispatched.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240618062108.3680835-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 14:55:02 -06:00
Anuj Gupta
3991657ae7 block: set bip_vcnt correctly
Set the bip_vcnt correctly in bio_integrity_init_user and
bio_integrity_copy_user. If the bio gets split at a later point,
this value is required to set the right bip_vcnt in the cloned bio.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240626100700.3629-3-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 14:34:54 -06:00
Ming Lei
0676c434a9 block: check bio alignment in blk_mq_submit_bio
IO logical block size is one fundamental queue limit, and every IO has
to be aligned with logical block size because our bio split can't deal
with unaligned bio.

The check has to be done with queue usage counter grabbed because device
reconfiguration may change logical block size, and we can prevent the
reconfiguration from happening by holding queue usage counter.

logical_block_size stays in the 1st cache line of queue_limits, and this
cache line is always fetched in fast path via bio_may_exceed_limits(),
so IO perf won't be affected by this check.

Cc: Yi Zhang <yi.zhang@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240620030631.3114026-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 10:31:34 -06:00
Christoph Hellwig
d19b46340b block: remove bio_integrity_process
Move the bvec interation into the generate/verify helpers to avoid a bit
of argument passing churn.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240626045950.189758-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 10:29:42 -06:00
Christoph Hellwig
df3c485e0e block: switch on bio operation in bio_integrity_prep
Use a single switch to perform read and write specific checks and exit
early for other operations instead of having two checks using different
predicates.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240626045950.189758-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 10:29:41 -06:00
Christoph Hellwig
dac18fabba block: remove allocation failure warnings in bio_integrity_prep
Allocation failures already leave a stack trace, so don't spew another
warning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240626045950.189758-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 10:29:41 -06:00
Christoph Hellwig
c096df9083 block: simplify adding the payload in bio_integrity_prep
bio_integrity_add_page can add physically contiguous regions of any size,
so don't bother chunking up the kmalloced buffer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240626045950.189758-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 10:29:41 -06:00
Christoph Hellwig
c546d6f438 block: only zero non-PI metadata tuples in bio_integrity_prep
The PI generation helpers already zero the app tag, so relax the zeroing
to non-PI metadata.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240626045950.189758-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 10:29:41 -06:00
John Garry
63db4a1f79 block: Delete blk_queue_flag_test_and_set()
Since commit 70200574cc ("block: remove QUEUE_FLAG_DISCARD"),
blk_queue_flag_test_and_set() has not been used, so delete it.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240627160735.842189-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-27 12:43:27 -06:00
Li Nan
e269537e49 block: clean up the check in blkdev_iomap_begin()
It is odd to check the offset amidst a series of assignments. Moving this
check to the beginning of the function makes the code look better.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240625115517.1472120-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-27 05:56:35 -06:00
Christoph Hellwig
e94b45d08b block: move dma_pad_mask into queue_limits
dma_pad_mask is a queue_limits by all ways of looking at it, so move it
there and set it through the atomic queue limits APIs.

Add a little helper that takes the alignment and pad into account to
simplify the code that is touched a bit.

Note that there never was any need for the > check in
blk_queue_update_dma_pad, this probably was just copy and paste from
dma_update_dma_alignment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240626142637.300624-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26 09:37:35 -06:00
Christoph Hellwig
73781b3b81 block: remove disk_update_readahead
Mark blk_apply_bdi_limits non-static and open code disk_update_readahead
in the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240626142637.300624-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26 09:37:35 -06:00
Christoph Hellwig
3302f6f090 block: conding style fixup for blk_queue_max_guaranteed_bio
"static" never goes on a line of its own.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240626142637.300624-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26 09:37:35 -06:00
Christoph Hellwig
fcf865e357 block: convert features and flags to __bitwise types
... and let sparse help us catch mismatches or abuses.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240626142637.300624-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26 09:37:35 -06:00
Christoph Hellwig
ec9b1cf0b0 block: rename BLK_FEAT_MISALIGNED
This is a flag for ->flags and not a feature for ->features.  And fix the
one place that actually incorrectly cleared it from ->features.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240626142637.300624-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26 09:37:35 -06:00
Christoph Hellwig
78887d004f block: correctly report cache type
Check the features flag and the override flag using the
blk_queue_write_cache, helper otherwise we're going to always
report "write through".

Fixes: 1122c0c1cc ("block: move cache control settings out of queue->flags")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240626142637.300624-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26 09:37:35 -06:00
John Garry
8324bb755a block: Fix blk_validate_atomic_write_limits() build for arm32
For arm32, we get the following build warning:
 In file included from /tmp/next/build/include/linux/printk.h:10,
                  from /tmp/next/build/include/linux/kernel.h:31,
                  from /tmp/next/build/block/blk-settings.c:5:
 /tmp/next/build/block/blk-settings.c: In function 'blk_validate_atomic_write_limits':
 /tmp/next/build/include/asm-generic/div64.h:222:35: warning: comparison of distinct pointer types lacks a cast
   222 |         (void)(((typeof((n)) *)0) == ((uint64_t *)0));  \
       |                                   ^~

The divident for do_div() should be 64b, which it is not. Since we want to
check 2x unsigned ints, just use % operator. This allows us to drop the
chunk_sectors variable.

Fixes: 9da3d1e912 ("block: Add core atomic write support")
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/linux-next/b765d200-4e0f-48b1-a962-7dfa1c4aef9c@kernel.dk/T/#mbf067b1edd89c7f9d7dac6e258c516199953a108
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240621183016.3092518-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-21 12:31:55 -06:00
Damien Le Moal
b6cfe2287d block: Define bdev_nr_zones() as an inline function
There is no need for bdev_nr_zones() to be an exported function
calculating the number of zones of a block device. Instead, given that
all callers use this helper with a fully initialized block device that
has a gendisk, we can redefine this function as an inline helper in
blkdev.h.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240621031506.759397-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-21 08:26:35 -06:00
John Garry
caf336f81b block: Add fops atomic write support
Support atomic writes by submitting a single BIO with the REQ_ATOMIC set.

It must be ensured that the atomic write adheres to its rules, like
naturally aligned offset, so call blkdev_dio_invalid() ->
blkdev_atomic_write_valid() [with renaming blkdev_dio_unaligned() to
blkdev_dio_invalid()] for this purpose. The BIO submission path currently
checks for atomic writes which are too large, so no need to check here.

In blkdev_direct_IO(), if the nr_pages exceeds BIO_MAX_VECS, then we cannot
produce a single BIO, so error in this case.

Finally set FMODE_CAN_ATOMIC_WRITE when the bdev can support atomic writes
and the associated file flag is for O_DIRECT.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-8-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 15:19:17 -06:00
Prasad Singamsetty
9abcfbd235 block: Add atomic write support for statx
Extend statx system call to return additional info for atomic write support
support if the specified file is a block device.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-7-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 15:19:17 -06:00
John Garry
9da3d1e912 block: Add core atomic write support
Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag

New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
  of an atomic write which the device may support. It is not
  necessarily a power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
  max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
  and atomic_write_max_sectors would be the limit on a merged atomic write
  request size. This value is not capped at max_sectors, as the value in
  max_sectors can be controlled from userspace, and it would only cause
  trouble if userspace could limit atomic_write_unit_max_bytes and the
  other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
  min/max length of an atomic write unit which the device may support. They
  both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
  the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
  atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
  Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
  indicates an LBA space boundary at which an atomic write straddles no
  longer is atomically executed by the disk. The value must be a
  power-of-2. Note that it would be acceptable to enforce a rule that
  atomic_write_hw_boundary_sectors is a multiple of
  atomic_write_hw_unit_max, but the resultant code would be more
  complicated.

All atomic writes limits are by default set 0 to indicate no atomic write
support. Even though it is assumed by Linux that a logical block can always
be atomically written, we ignore this as it is not of particular interest.
Stacked devices are just not supported either for now.

An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
by the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated by request_queue max segments and the number
of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
segment containing PAGE_SIZE of data, apart from the first+last, which each
can fit logical block size of data. The first+last will be LBS
length/aligned as we rely on direct IO alignment rules also.

New sysfs files are added to report the following atomic write limits:
- atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
				bytes
- atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
				bytes
- atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
				bytes
- atomic_write_max_bytes      - same as atomic_write_max_sectors in bytes

Atomic writes may only be merged with other atomic writes and only under
the following conditions:
- total resultant request length <= atomic_write_max_bytes
- the merged write does not straddle a boundary

Helper function bdev_can_atomic_write() is added to indicate whether
atomic writes may be issued to a bdev. If a bdev is a partition, the
partition start must be aligned with both atomic_write_unit_min_sectors
and atomic_write_hw_boundary_sectors.

FSes will rely on the block layer to validate that an atomic write BIO
submitted will be of valid size, so add blk_validate_atomic_write_op_size()
for this purpose. Userspace expects an atomic write which is of invalid
size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
invalid size BIO.

Flag REQ_ATOMIC is used for indicating an atomic write.

Co-developed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 15:19:17 -06:00
John Garry
f70167a7a6 block: Generalize chunk_sectors support as boundary support
The purpose of the chunk_sectors limit is to ensure that a mergeble request
fits within the boundary of the chunck_sector value.

Such a feature will be useful for other request_queue boundary limits, so
generalize the chunk_sectors merge code.

This idea was proposed by Hannes Reinecke.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 15:19:17 -06:00
John Garry
8d1dfd51c8 block: Pass blk_queue_get_max_sectors() a request pointer
Currently blk_queue_get_max_sectors() is passed a enum req_op. In future
the value returned from blk_queue_get_max_sectors() may depend on certain
request flags, so pass a request pointer.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 15:19:17 -06:00
Jens Axboe
e821bcecdf Merge branch 'for-6.11/block-limits' into for-6.11/block
Merge in queue limits cleanups.

* for-6.11/block-limits:
  block: move the raid_partial_stripes_expensive flag into the features field
  block: remove the discard_alignment flag
  block: move the misaligned flag into the features field
  block: renumber and rename the cache disabled flag
  block: fix spelling and grammar for in writeback_cache_control.rst
  block: remove the unused blk_bounce enum
2024-06-20 06:55:20 -06:00
Christoph Hellwig
7d4dec525f block: move the raid_partial_stripes_expensive flag into the features field
Move the raid_partial_stripes_expensive flags into the features field to
reclaim a little bit of space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240619154623.450048-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 06:53:15 -06:00
Christoph Hellwig
4cac3d3a71 block: remove the discard_alignment flag
queue_limits.discard_alignment is never read except in the places
where it is stacked into another limit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240619154623.450048-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 06:53:14 -06:00
Christoph Hellwig
5543217be4 block: move the misaligned flag into the features field
Move the misaligned flags into the features field to reclaim a little
bit of space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240619154623.450048-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 06:53:14 -06:00
Christoph Hellwig
bae1c74316 block: renumber and rename the cache disabled flag
Start with the first bit, and drop the plural-S from the name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240619154623.450048-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 06:53:14 -06:00
Jens Axboe
69c34f07e4 Merge branch 'for-6.11/block-limits' into for-6.11/block
Merge in last round of queue limits changes from Christoph.

* for-6.11/block-limits: (26 commits)
  block: move the bounce flag into the features field
  block: move the skip_tagset_quiesce flag to queue_limits
  block: move the pci_p2pdma flag to queue_limits
  block: move the zone_resetall flag to queue_limits
  block: move the zoned flag into the features field
  block: move the poll flag to queue_limits
  block: move the dax flag to queue_limits
  block: move the nowait flag to queue_limits
  block: move the synchronous flag to queue_limits
  block: move the stable_writes flag to queue_limits
  block: move the io_stat flag setting to queue_limits
  block: move the add_random flag to queue_limits
  block: move the nonrot flag to queue_limits
  block: move cache control settings out of queue->flags
  block: remove blk_flush_policy
  block: freeze the queue in queue_attr_store
  nbd: move setting the cache control flags to __nbd_set_size
  virtio_blk: remove virtblk_update_cache_mode
  loop: fold loop_update_rotational into loop_reconfigure_limits
  loop: also use the default block size from an underlying block device
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 08:14:49 -06:00
Christoph Hellwig
339d3948c0 block: move the bounce flag into the features field
Move the bounce flag into the features field to reclaim a little bit of
space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-27-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
8c8f5c85b2 block: move the skip_tagset_quiesce flag to queue_limits
Move the skip_tagset_quiesce flag into the queue_limits feature field so
that it can be set atomically with the queue frozen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-26-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
9c1e42e3c8 block: move the pci_p2pdma flag to queue_limits
Move the pci_p2pdma flag into the queue_limits feature field so that it
can be set atomically with the queue frozen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
a52758a397 block: move the zone_resetall flag to queue_limits
Move the zone_resetall flag into the queue_limits feature field so that
it can be set atomically with the queue frozen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-24-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
b1fc937a55 block: move the zoned flag into the features field
Move the zoned flags into the features field to reclaim a little
bit of space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-23-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
8023e144f9 block: move the poll flag to queue_limits
Move the poll flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.

Stacking drivers are simplified in that they now can simply set the
flag, and blk_stack_limits will clear it when the features is not
supported by any of the underlying devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-22-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
f467fee48d block: move the dax flag to queue_limits
Move the dax flag into the queue_limits feature field so that it can be
set atomically with the queue frozen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-21-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
f76af42f8b block: move the nowait flag to queue_limits
Move the nowait flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.

Stacking drivers are simplified in that they now can simply set the
flag, and blk_stack_limits will clear it when the features is not
supported by any of the underlying devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
aadd5c59c9 block: move the synchronous flag to queue_limits
Move the synchronous flag into the queue_limits feature field so that it
can be set atomically with the queue frozen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
1a02f3a73f block: move the stable_writes flag to queue_limits
Move the stable_writes flag into the queue_limits feature field so that
it can be set atomically with the queue frozen.

The flag is now inherited by blk_stack_limits, which greatly simplifies
the code in dm, and fixed md which previously did not pass on the flag
set on lower devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
cdb2497918 block: move the io_stat flag setting to queue_limits
Move the io_stat flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.

Simplify md and dm to set the flag unconditionally instead of avoiding
setting a simple flag for cases where it already is set by other means,
which is a bit pointless.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
39a9f1c334 block: move the add_random flag to queue_limits
Move the add_random flag into the queue_limits feature field so that it
can be set atomically with the queue frozen.

Note that this also removes code from dm to clear the flag based on
the underlying devices, which can't be reached as dm devices will
always start out without the flag set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
bd4a633b6f block: move the nonrot flag to queue_limits
Move the nonrot flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.

Use the chance to switch to defaulting to non-rotational and require
the driver to opt into rotational, which matches the polarity of the
sysfs interface.

For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new
rotational flag is not set as they clearly are not rotational despite
this being a behavior change.  There are some other drivers that
unconditionally set the rotational flag to keep the existing behavior
as they arguably can be used on rotational devices even if that is
probably not their main use today (e.g. virtio_blk and drbd).

The flag is automatically inherited in blk_stack_limits matching the
existing behavior in dm and md.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
1122c0c1cc block: move cache control settings out of queue->flags
Move the cache control settings into the queue_limits so that the flags
can be set atomically with the device queue frozen.

Add new features and flags field for the driver set flags, and internal
(usually sysfs-controlled) flags in the block layer.  Note that we'll
eventually remove enough field from queue_limits to bring it back to the
previous size.

The disable flag is inverted compared to the previous meaning, which
means it now survives a rescan, similar to the max_sectors and
max_discard_sectors user limits.

The FLUSH and FUA flags are now inherited by blk_stack_limits, which
simplified the code in dm a lot, but also causes a slight behavior
change in that dm-switch and dm-unstripe now advertise a write cache
despite setting num_flush_bios to 0.  The I/O path will handle this
gracefully, but as far as I can tell the lack of num_flush_bios
and thus flush support is a pre-existing data integrity bug in those
targets that really needs fixing, after which a non-zero num_flush_bios
should be required in dm for targets that map to underlying devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
70905f8706 block: remove blk_flush_policy
Fold blk_flush_policy into the only caller to prepare for pending changes
to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240617060532.127975-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
af28141498 block: freeze the queue in queue_attr_store
queue_attr_store updates attributes used to control generating I/O, and
can cause malformed bios if changed with I/O in flight.  Freeze the queue
in common code instead of adding it to almost every attribute.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240617060532.127975-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Yu Kuai
bb7e5a193d block, bfq: remove blkg_path()
After commit 35fe6d7632 ("block: use standard blktrace API to output
cgroup info for debug notes"), the field 'bfqg->blkg_path' is not used
and hence can be removed, and therefor blkg_path() is not used anymore
and can be removed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240618032753.3502528-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-18 09:22:45 -06:00
Kanchan Joshi
b83bd486b4 block: cleanup flag_{show,store}
Remove a superfluous argument that flag_show and flag_store currently
take.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240617044918.374608-1-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-17 10:13:37 -06:00
John Garry
66088084fd block: BFQ: Refactor bfq_exit_icq() to silence sparse warning
Currently building for C=1 generates the following warning:
block/bfq-iosched.c:5498:9: warning: context imbalance in 'bfq_exit_icq' - different lock contexts for basic block

Refactor bfq_exit_icq() into a core part which loops for the actuators,
and only lock calling this routine when necessary.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240614090345.655716-4-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16 15:30:32 -06:00
John Garry
c3042a5403 block: Drop locking annotation for limits_lock
Currently compiling block/blk-settings.c with C=1 gives the following
warning:
block/blk-settings.c:262:9: warning: context imbalance in 'queue_limits_commit_update' - wrong count at exit

request_queue.limits_lock is a mutex. Sparse locking annotation for
mutexes are currently not supported - see [0] - so drop that locking
annotation.

[0] https://lore.kernel.org/lkml/cover.1579893447.git.jbi.octave@gmail.com/T/#mbb8bda6c0a7ca7ce19f46df976a8e3b489745488

Fixes: d690cb8ae1 ("block: add an API to atomically update queue limits")
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240614090345.655716-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16 15:30:26 -06:00
Jiapeng Chong
d9c2332199 bdev: make blockdev_mnt static
The blockdev_mnt are not used outside the file bdev.c, so the modification
is defined as static.

block/bdev.c:377:17: warning: symbol 'blockdev_mnt' was not declared. Should it be static?

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
jpg: Remove closes bugzilla link
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Fixes: 8f3a608827d1 ("bdev: open block device as files")
Tested-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240614090345.655716-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16 15:29:55 -06:00
Damien Le Moal
e21d12c7cd block: Improve checks on zone resource limits
Make sure that the zone resource limits of a zoned block device are
correct by checking that:
(a) If the device has a max active zones limit, make sure that the max
    open zones limit is lower than the max active zones limit.
(b) If the device has zone resource limits, check that the limits
    values are lower than the number of sequential zones of the device.
    If it is not, assume that the zoned device has no limits by setting
    the limits to 0.

For (a), a check is added to blk_validate_zoned_limits() and an error
returned if the max open zones limit exceeds the value of the max active
zone limit (if there is one).

For (b), given that we need the number of sequential zones of the zoned
device, this check is added to disk_update_zone_resources(). This is
safe to do as that function is executed with the disk queue frozen and
the check executed after queue_limits_start_update() which takes the
queue limits lock. Of note is that the early return in this function
for zoned devices that do not use zone write plugging (e.g. DM devices
using native zone append) is moved to after the new check and adjustment
of the zone resource limits so that the check applies to any zoned
device.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
Link: https://lore.kernel.org/r/20240611023639.89277-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-15 20:42:20 -06:00
Christoph Hellwig
c6e56cf6b2 block: move integrity information into queue_limits
Move the integrity information into the queue limits so that it can be
set atomically with other queue limits, and that the sysfs changes to
the read_verify and write_generate flags are properly synchronized.
This also allows to provide a more useful helper to stack the integrity
fields, although it still is separate from the main stacking function
as not all stackable devices want to inherit the integrity settings.
Even with that it greatly simplifies the code in md and dm.

Note that the integrity field is moved as-is into the queue limits.
While there are good arguments for removing the separate blk_integrity
structure, this would cause a lot of churn and might better be done at a
later time if desired.  However the integrity field in the queue_limits
structure is now unconditional so that various ifdefs can be avoided or
replaced with IS_ENABLED().  Given that tiny size of it that seems like
a worthwhile trade off.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:20:07 -06:00
Christoph Hellwig
9f4aa46f2a block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags
Invert the flags so that user set values will be able to persist
revalidating the integrity information once we switch the integrity
information to queue_limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240613084839.1044015-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:20:06 -06:00
Christoph Hellwig
3c3e85ddff block: bypass the STABLE_WRITES flag for protection information
Currently registering a checksum-enabled (aka PI) integrity profile sets
the QUEUE_FLAG_STABLE_WRITE flag, and unregistering it clears the flag.
This can incorrectly clear the flag when the driver requires stable
writes even without PI, e.g. in case of iSCSI or NVMe/TCP with data
digest enabled.

Fix this by looking at the csum_type directly in bdev_stable_writes and
not setting the queue flag.  Also remove the blk_queue_stable_writes
helper as the only user in nvme wants to only look at the actual
QUEUE_FLAG_STABLE_WRITE flag as it inherits the integrity configuration
by other means.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240613084839.1044015-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:20:06 -06:00
Christoph Hellwig
43c5dbe98a block: don't require stable pages for non-PI metadata
Non-PI metadata doesn't contain checksums and thus doesn't require
stable pages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240613084839.1044015-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:20:06 -06:00
Christoph Hellwig
1d59857ed2 block: use kstrtoul in flag_store
Use the text to integer helper that has error handling and doesn't modify
the input pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20240613084839.1044015-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:20:06 -06:00
Christoph Hellwig
1366251a79 block: factor out flag_{store,show} helper for integrity
Factor the duplicate code for the generate and verify attributes into
common helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240613084839.1044015-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:20:06 -06:00
Christoph Hellwig
e8bc14d116 block: remove the blk_flush_integrity call in blk_integrity_unregister
Now that there are no indirect calls for PI processing there is no
way to dereference a NULL pointer here.  Additionally drivers now always
freeze the queue (or in case of stacking drivers use their internal
equivalent) around changing the integrity profile.

This is effectively a revert of commit 3df49967f6 ("block: flush the
integrity workqueue in blk_integrity_unregister").

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240613084839.1044015-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:20:06 -06:00
Christoph Hellwig
e9f5f44ad3 block: remove the blk_integrity_profile structure
Block layer integrity configuration is a bit complex right now, as it
indirects through operation vectors for a simple two-dimensional
configuration:

 a) the checksum type of none, ip checksum, crc, crc64
 b) the presence or absence of a reference tag

Remove the integrity profile, and instead add a separate csum_type flag
which replaces the existing ip-checksum field and a new flag that
indicates the presence of the reference tag.

This removes up to two layers of indirect calls, remove the need to
offload the no-op verification of non-PI metadata to a workqueue and
generally simplifies the code. The downside is that block/t10-pi.c now
has to be built into the kernel when CONFIG_BLK_DEV_INTEGRITY is
supported.  Given that both nvme and SCSI require t10-pi.ko, it is loaded
for all usual configurations that enabled CONFIG_BLK_DEV_INTEGRITY
already, though.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240613084839.1044015-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:20:06 -06:00
Christoph Hellwig
899ee2c382 block: initialize integrity buffer to zero before writing it to media
Metadata added by bio_integrity_prep is using plain kmalloc, which leads
to random kernel memory being written media.  For PI metadata this is
limited to the app tag that isn't used by kernel generated metadata,
but for non-PI metadata the entire buffer leaks kernel memory.

Fix this by adding the __GFP_ZERO flag to allocations for writes.

Fixes: 7ba1ba12ee ("block: Block layer data integrity support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240613084839.1044015-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:20:06 -06:00
Christoph Hellwig
73e3715ed1 block: add special APIs for run-time disabling of discard and friends
A few drivers optimistically try to support discard, write zeroes and
secure erase and disable the features from the I/O completion handler
if the hardware can't support them.  This disable can't be done using
the atomic queue limits API because the I/O completion handlers can't
take sleeping locks or freeze the queue.  Keep the existing clearing
of the relevant field to zero, but replace the old blk_queue_max_*
APIs with new disable APIs that force the value to 0.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240531074837.1648501-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:19:44 -06:00
Christoph Hellwig
1652b0bafe block: remove unused queue limits API
Remove all APIs that are unused now that sd and sr have been converted
to the atomic queue limits API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240531074837.1648501-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:19:44 -06:00
Christoph Hellwig
a23634644a block: take io_opt and io_min into account for max_sectors
The soft max_sectors limit is normally capped by the hardware limits and
an arbitrary upper limit enforced by the kernel, but can be modified by
the user.  A few drivers want to increase this limit (nbd, rbd) or
adjust it up or down based on hardware capabilities (sd).

Change blk_validate_limits to default max_sectors to the optimal I/O
size, or upgrade it to the preferred minimal I/O size if that is
larger than the kernel default if no optimal I/O size is provided based
on the logic in the SD driver.

This keeps the existing kernel default for drivers that do not provide
an io_opt or very big io_min value, but picks a much more useful
default for those who provide these hints, and allows to remove the
hacks to set the user max_sectors limit in nbd, rbd and sd.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240531074837.1648501-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:19:44 -06:00
Anuj Gupta
e038ee6189 block: unmap and free user mapped integrity via submitter
The user mapped intergity is copied back and unpinned by
bio_integrity_free which is a low-level routine. Do it via the submitter
rather than doing it in the low-level block layer code, to split the
submitter side from the consumer side of the bio.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240610111144.14647-1-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-12 11:00:50 -06:00
Chengming Zhou
d0321c812d block: fix request.queuelist usage in flush
Friedrich Weber reported a kernel crash problem and bisected to commit
81ada09cc2 ("blk-flush: reuse rq queuelist in flush state machine").

The root cause is that we use "list_move_tail(&rq->queuelist, pending)"
in the PREFLUSH/POSTFLUSH sequences. But rq->queuelist.next == xxx since
it's popped out from plug->cached_rq in __blk_mq_alloc_requests_batch().
We don't initialize its queuelist just for this first request, although
the queuelist of all later popped requests will be initialized.

Fix it by changing to use "list_add_tail(&rq->queuelist, pending)" so
rq->queuelist doesn't need to be initialized. It should be ok since rq
can't be on any list when PREFLUSH or POSTFLUSH, has no move actually.

Please note the commit 81ada09cc2 ("blk-flush: reuse rq queuelist in
flush state machine") also has another requirement that no drivers would
touch rq->queuelist after blk_mq_end_request() since we will reuse it to
add rq to the post-flush pending list in POSTFLUSH. If this is not true,
we will have to revert that commit IMHO.

This updated version adds "list_del_init(&rq->queuelist)" in flush rq
callback since the dm layer may submit request of a weird invalid format
(REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH), which causes double list_add
if without this "list_del_init(&rq->queuelist)". The weird invalid format
problem should be fixed in dm layer.

Reported-by: Friedrich Weber <f.weber@proxmox.com>
Closes: https://lore.kernel.org/lkml/14b89dfb-505c-49f7-aebb-01c54451db40@proxmox.com/
Closes: https://lore.kernel.org/lkml/c9d03ff7-27c5-4ebd-b3f6-5a90d96f35ba@proxmox.com/
Fixes: 81ada09cc2 ("blk-flush: reuse rq queuelist in flush state machine")
Cc: Christoph Hellwig <hch@lst.de>
Cc: ming.lei@redhat.com
Cc: bvanassche@acm.org
Tested-by: Friedrich Weber <f.weber@proxmox.com>
Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240608143115.972486-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-12 10:58:11 -06:00
Damien Le Moal
1933192a91 block: Optimize disk zone resource cleanup
For zoned block devices using zone write plugging, an rcu_barrier() call
is needed in disk_free_zone_resources() to synchronize freeing of zone
write plugs and the destrution of the mempool used to allocate the
plugs. The barrier call does slow down a little teardown of zoned block
devices but should not affect teardown of regular block devices or zoned
block devices that do not use zone write plugging (e.g. zoned DM devices
that do not require zone append emulation).

Modify disk_free_zone_resources() to return early if we do not have a
mempool to start with, that is, if the device does not use zone write
plugging. This avoids the costly rcu_barrier() and speeds up disk
teardown.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Link: https://lore.kernel.org/r/20240607002126.104227-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-12 10:56:45 -06:00
Su Hui
9b1ebce6a1 block: sed-opal: avoid possible wrong address reference in read_sed_opal_key()
Clang static checker (scan-build) warning:
block/sed-opal.c:line 317, column 3
Value stored to 'ret' is never read.

Fix this problem by returning the error code when keyring_search() failed.
Otherwise, 'key' will have a wrong value when 'kerf' stores the error code.

Fixes: 3bfeb61256 ("block: sed-opal: keyring support for SED keys")
Signed-off-by: Su Hui <suhui@nfschina.com>
Link: https://lore.kernel.org/r/20240611073659.429582-1-suhui@nfschina.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-12 10:53:20 -06:00
Waiman Long
0a751df456 blk-throttle: Fix incorrect display of io.max
Commit bf20ab538c ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
attempts to revert the code change introduced by commit cd5ab1b0fc
("blk-throttle: add .low interface").  However, it leaves behind the
bps_conf[] and iops_conf[] fields in the throtl_grp structure which
aren't set anywhere in the new blk-throttle.c code but are still being
used by tg_prfill_limit() to display the limits in io.max. Now io.max
always displays the following values if a block queue is used:

	<m>:<n> rbps=0 wbps=0 riops=0 wiops=0

Fix this problem by removing bps_conf[] and iops_conf[] and use bps[]
and iops[] instead to complete the revert.

Fixes: bf20ab538c ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
Reported-by: Justin Forbes <jforbes@redhat.com>
Closes: https://github.com/containers/podman/issues/22701#issuecomment-2120627789
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240530134547.970075-1-longman@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-30 19:44:29 -06:00
John Garry
41b7574252 scsi: bsg: Pass dev to blk_mq_alloc_queue()
When calling bsg_setup_queue() -> blk_mq_alloc_queue(), we don't pass
the dev as the queuedata, but rather manually set it afterwards. Just
pass dev to blk_mq_alloc_queue() to have automatically set.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240524084829.2132555-3-john.g.garry@oracle.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2024-05-30 20:22:15 -04:00
Damien Le Moal
29459c3eaa block: Fix zone write plugging handling of devices with a runt zone
A zoned device may have a last sequential write required zone that is
smaller than other zones. However, all tests to check if a zone write
plug write offset exceeds the zone capacity use the same capacity
value stored in the gendisk zone_capacity field. This is incorrect for a
zoned device with a last runt (smaller) zone.

Add the new field last_zone_capacity to struct gendisk to store the
capacity of the last zone of the device. blk_revalidate_seq_zone() and
blk_revalidate_conv_zone() are both modified to get this value when
disk_zone_is_last() returns true. Similarly to zone_capacity, the value
is first stored using the last_zone_capacity field of struct
blk_revalidate_zone_args. Once zone revalidation of all zones is done,
this is used to set the gendisk last_zone_capacity field.

The checks to determine if a zone is full or if a sector offset in a
zone exceeds the zone capacity in disk_should_remove_zone_wplug(),
disk_zone_wplug_abort_unaligned(), blk_zone_write_plug_init_request(),
and blk_zone_wplug_prepare_bio() are modified to use the new helper
functions disk_zone_is_full() and disk_zone_wplug_is_full().
disk_zone_is_full() uses the zone index to determine if the zone being
tested is the last one of the disk and uses the either the disk
zone_capacity or last_zone_capacity accordingly.

Fixes: dd291d77cc ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Link: https://lore.kernel.org/r/20240530054035.491497-4-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-30 15:03:52 -06:00
Damien Le Moal
cd63999368 block: Fix validation of zoned device with a runt zone
Commit ecfe43b11b ("block: Remember zone capacity when revalidating
zones") introduced checks to ensure that the capacity of the zones of
a zoned device is constant for all zones. However, this check ignores
the possibility that a zoned device has a smaller last zone with a size
not equal to the capacity of other zones. Such device correspond in
practice to an SMR drive with a smaller last zone and all zones with a
capacity equal to the zone size, leading to the last zone capacity being
different than the capacity of other zones.

Correctly handle such device by fixing the check for the constant zone
capacity in blk_revalidate_seq_zone() using the new helper function
disk_zone_is_last(). This helper function is also used in
blk_revalidate_zone_cb() when checking the zone size.

Fixes: ecfe43b11b ("block: Remember zone capacity when revalidating zones")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Link: https://lore.kernel.org/r/20240530054035.491497-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-30 15:03:52 -06:00
Hannes Reinecke
e993db2d6e block: check for max_hw_sectors underflow
The logical block size need to be smaller than the max_hw_sector
setting, otherwise we can't even transfer a single LBA.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-28 06:55:23 -06:00
Christoph Hellwig
e528bede6f block: stack max_user_sectors
The max_user_sectors is one of the three factors determining the actual
max_sectors limit for READ/WRITE requests.  Because of that it needs to
be stacked at least for the device mapper multi-path case where requests
are directly inserted on the lower device.  For SCSI disks this is
important because the sd driver actually sets it's own advisory limit
that is lower than max_hw_sectors based on the block limits VPD page.
While this is a bit odd an unusual, the same effect can happen if a
user or udev script tweaks the value manually.

Fixes: 4f563a6473 ("block: add a max_user_discard_sectors queue limit")
Reported-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240523182618.602003-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-28 06:54:36 -06:00
hexue
30a0e3135f block: delete redundant function declaration
blk_stats_alloc_enable was used for block hybrid poll, the related
function definition was removed by patch:
commit 54bdd67d0f ("blk-mq: remove hybrid polling")
but the function declaration was not deleted.

Signed-off-by: hexue <xue01.he@samsung.com>
Link: https://lore.kernel.org/r/20240527084533.1485210-1-xue01.he@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-27 13:58:06 -06:00
Linus Torvalds
b4d88a60fe block-6.10-20240523
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmZPaegQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplkkD/4h1vxr2a6jg44TEUJ9f59rIOELuYHXJdpt
 5m7r8UWcy7LF6HfmMgSeHV/7Gr1bBw6jh1eMubZRt9pZJ1sSGnc6vQdrOU+RnG9k
 F9i0qogAD2WXClQPAxvHGC1KD1quSdeiKME0hNJdGA6SsV4cYnDVeR8O6SQbaomD
 KPeGGBdjvrygRFhyDBFDACWK3GuD5POlbswUOwASYNrAb4OrQsj+bX/QXkuOXir9
 n/NW/RfiQqAvI4m51yzaMqfFWw+s0irhXNfchl3i8RBMvDFBRNEkgtDN4y2rUynK
 +FaDeAwGXR51/qL9gr0ZScXAY6Q7f/B9FkrTUZR7S1lD3JsLXiS+uOefXEljKsDd
 RpNUc0sX3RjaSu1uNiUD/H4v+umvR+r3uuAyH6OXstCQt+98SJUbQvZuzphVGC60
 iM8W+NRsaYZUhjN4LBj0NBGgCiidHanm22GCPADWN1fxZbjRWUoA886sZXTqmmMj
 +GGqpPU3pbGtj09ysaJpLKxu1TbD3QmcCUVPWQ8+DKt8PGGDDa+vIRXV8xswwQDg
 DyZoq0s/s00DzCXiPsbvVyKwXCJ1XSB0sEq0gvjDfGXb+5h6T+lH2irbcjBxUlwq
 qbofAmk6PVjxeWMUP4NXE04oK5Itc/l20LT9ECFPWzMdc1ht31TsqmxldHLIpDqp
 KUeacOh94A==
 =Btam
 -----END PGP SIGNATURE-----

Merge tag 'block-6.10-20240523' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:
 "Followup block updates, mostly due to NVMe being a bit late to the
  party. But nothing major in there, so not a big deal.

  In detail, this contains:

   - NVMe pull request via Keith:
       - Fabrics connection retries (Daniel, Hannes)
       - Fabrics logging enhancements (Tokunori)
       - RDMA delete optimization (Sagi)

   - ublk DMA alignment fix (me)

   - null_blk sparse warning fixes (Bart)

   - Discard support for brd (Keith)

   - blk-cgroup list corruption fixes (Ming)

   - blk-cgroup stat propagation fix (Waiman)

   - Regression fix for plugging stall with md (Yu)

   - Misc fixes or cleanups (David, Jeff, Justin)"

* tag 'block-6.10-20240523' of git://git.kernel.dk/linux: (24 commits)
  null_blk: fix null-ptr-dereference while configuring 'power' and 'submit_queues'
  blk-throttle: remove unused struct 'avg_latency_bucket'
  block: fix lost bio for plug enabled bio based device
  block: t10-pi: add MODULE_DESCRIPTION()
  blk-mq: add helper for checking if one CPU is mapped to specified hctx
  blk-cgroup: Properly propagate the iostat update up the hierarchy
  blk-cgroup: fix list corruption from reorder of WRITE ->lqueued
  blk-cgroup: fix list corruption from resetting io stat
  cdrom: rearrange last_media_change check to avoid unintentional overflow
  nbd: Fix signal handling
  nbd: Remove a local variable from nbd_send_cmd()
  nbd: Improve the documentation of the locking assumptions
  nbd: Remove superfluous casts
  nbd: Use NULL to represent a pointer
  brd: implement discard support
  null_blk: Fix two sparse warnings
  ublk_drv: set DMA alignment mask to 3
  nvme-rdma, nvme-tcp: include max reconnects for reconnect logging
  nvmet-rdma: Avoid o(n^2) loop in delete_ctrl
  nvme: do not retry authentication failures
  ...
2024-05-23 13:44:47 -07:00
Dr. David Alan Gilbert
4a482e691c blk-throttle: remove unused struct 'avg_latency_bucket'
'avg_latency_bucket' is unused since
commit bf20ab538c ("blk-throttle: remove
CONFIG_BLK_DEV_THROTTLING_LOW")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://lore.kernel.org/r/20240522172458.334173-1-linux@treblig.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-22 11:30:47 -06:00
Yu Kuai
9a42891c35 block: fix lost bio for plug enabled bio based device
With the following two conditions, bio will be lost:

1) blk plug is not enabled, for example, __blkdev_direct_IO_simple() and
__blkdev_direct_IO_async();
2) bio plug is enabled, for example write IO for raid1/raid10 while
bitmap is enabled;

Root cause is that blk_finish_plug() will add the bio to
curent->bio_list, while such bio will not be handled:

__submit_bio_noacct
 current->bio_list = bio_list_on_stack;
 blk_start_plug

 do {
  dm_submit_bio
   md_handle_request
    raid10_write_request
     -> generate new bio for underlying disks
     raid1_add_bio_to_plug -> bio is added to plug
 } while ((bio = bio_list_pop(&bio_list_on_stack[0])))
 -> previous bio are all handled

 blk_finish_plug
  raid10_unplug
   raid1_submit_write
    submit_bio_noacct
     if (current->bio_list)
      bio_list_add(&current->bio_list[0], bio)
      -> add new bio

 current->bio_list = NULL
 -> new bio is lost

Fix the problem by moving the plug into the while loop, so that
current->bio_list will still be handled after blk_finish_plug().

By the way, enable plug for raid1/raid10 in this case will also prevent
delay IO handling into daemon thread, which should also improve IO
performance.

Fixes: 060406c61c ("block: add plug while submitting IO")
Reported-by: Changhui Zhong <czhong@redhat.com>
Closes: https://lore.kernel.org/all/CAGVVp+Xsmzy2G9YuEatfMT6qv1M--YdOCQ0g7z7OVmcTbBxQAg@mail.gmail.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Changhui Zhong <czhong@redhat.com>
Link: https://lore.kernel.org/r/20240521200308.983986-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-21 19:37:33 -06:00
Linus Torvalds
3413efa888 Compactifying bdev flags
We can easily have up to 24 flags with sane
 atomicity, _without_ pushing anything out
 of the first cacheline of struct block_device.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkznRwAKCRBZ7Krx/gZQ
 69XpAQDOZCyvYOZ/dlMOKKLf2vAojC/h++E/NjvGt3erbvVN2wEArXMi13ECsoCw
 JYJA3MsmvjuY6VNcm24icf2/p4TMIgo=
 =JyYi
 -----END PGP SIGNATURE-----

Merge tag 'pull-bd_flags-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull bdev flags update from Al Viro:
 "Compactifying bdev flags.

  We can easily have up to 24 flags with sane atomicity, _without_
  pushing anything out of the first cacheline of struct block_device"

* tag 'pull-bd_flags-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  bdev: move ->bd_make_it_fail to ->__bd_flags
  bdev: move ->bd_ro_warned to ->__bd_flags
  bdev: move ->bd_has_subit_bio to ->__bd_flags
  bdev: move ->bd_write_holder into ->__bd_flags
  bdev: move ->bd_read_only to ->__bd_flags
  bdev: infrastructure for flags
  wrapper for access to ->bd_partno
  Use bdev_is_paritition() instead of open-coding it
2024-05-21 13:02:56 -07:00
Linus Torvalds
38da32ee70 bd_inode series
Replacement of bdev->bd_inode with sane(r) set of primitives.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwjlgAKCRBZ7Krx/gZQ
 66OmAP9nhZLASn/iM2+979I6O0GW+vid+uLh48uW3d+LbsmVIgD9GYpR+cuLQ/xj
 mJESWfYKOVSpFFSrqlzKg9PQlU/GFgs=
 =6LRp
 -----END PGP SIGNATURE-----

Merge tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull bdev bd_inode updates from Al Viro:
 "Replacement of bdev->bd_inode with sane(r) set of primitives by me and
  Yu Kuai"

* tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RIP ->bd_inode
  dasd_format(): killing the last remaining user of ->bd_inode
  nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode
  block/bdev.c: use the knowledge of inode/bdev coallocation
  gfs2: more obvious initializations of mapping->host
  fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping
  blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here...
  grow_dev_folio(): we only want ->bd_inode->i_mapping there
  use ->bd_mapping instead of ->bd_inode->i_mapping
  block_device: add a pointer to struct address_space (page cache of bdev)
  missing helpers: bdev_unhash(), bdev_drop()
  block: move two helpers into bdev.c
  block2mtd: prevent direct access of bd_inode
  dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode)
  blkdev_write_iter(): saner way to get inode and bdev
  bcachefs: remove dead function bdev_sectors()
  ext4: remove block_device_ejected()
  erofs_buf: store address_space instead of inode
  erofs: switch erofs_bread() to passing offset instead of block number
2024-05-21 09:51:42 -07:00
Linus Torvalds
5ad8b6ad9a getting rid of bogus set_blocksize() uses, switching it
to struct file * and verifying that caller has device
 opened exclusively.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwkfQAKCRBZ7Krx/gZQ
 62C3AQDW5vuXNx2+KDPma5YStjFpPLC0xtSyAS5D3YANjtyRFgD/TOcCarq7rvBt
 KubxHVFsfW+eu6ASeaoMRB83w5OIzwk=
 =Liix
 -----END PGP SIGNATURE-----

Merge tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs blocksize updates from Al Viro:
 "This gets rid of bogus set_blocksize() uses, switches it over
  to be based on a 'struct file *' and verifies that the caller
  has the device opened exclusively"

* tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make set_blocksize() fail unless block device is opened exclusive
  set_blocksize(): switch to passing struct file *
  btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens
  swsusp: don't bother with setting block size
  zram: don't bother with reopening - just use O_EXCL for open
  swapon(2): open swap with O_EXCL
  swapon(2)/swapoff(2): don't bother with block size
  pktcdvd: sort set_blocksize() calls out
  bcache_register(): don't bother with set_blocksize()
2024-05-21 08:34:51 -07:00
Jeff Johnson
f0eab3e8d1 block: t10-pi: add MODULE_DESCRIPTION()
Fix the allmodconfig 'make W=1' issue:

WARNING: modpost: missing MODULE_DESCRIPTION() in block/t10-pi.o

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240516-md-t10-pi-v1-1-44a3469374aa@quicinc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-20 08:07:44 -06:00
Linus Torvalds
eb6a9339ef Mainly singleton patches, documented in their respective changelogs.
Notable series include:
 
 - Some maintenance and performance work for ocfs2 in Heming Zhao's
   series "improve write IO performance when fragmentation is high".
 
 - Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes
   exposed by fstests".
 
 - kfifo header rework from Andy Shevchenko in the series "kfifo: Clean
   up kfifo.h".
 
 - GDB script fixes from Florian Rommel in the series "scripts/gdb: Fixes
   for $lx_current and $lx_per_cpu".
 
 - After much discussion, a coding-style update from Barry Song
   explaining one reason why inline functions are preferred over macros.
   The series is "codingstyle: avoid unused parameters for a function-like
   macro".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZkpLYQAKCRDdBJ7gKXxA
 jo9NAQDctSD3TMXqxqCHLaEpCaYTYzi6TGAVHjgkqGzOt7tYjAD/ZIzgcmRwthjP
 R7SSiSgZ7UnP9JRn16DQILmFeaoG1gs=
 =lYhr
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-mm updates from Andrew Morton:
 "Mainly singleton patches, documented in their respective changelogs.
  Notable series include:

   - Some maintenance and performance work for ocfs2 in Heming Zhao's
     series "improve write IO performance when fragmentation is high".

   - Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes
     exposed by fstests".

   - kfifo header rework from Andy Shevchenko in the series "kfifo:
     Clean up kfifo.h".

   - GDB script fixes from Florian Rommel in the series "scripts/gdb:
     Fixes for $lx_current and $lx_per_cpu".

   - After much discussion, a coding-style update from Barry Song
     explaining one reason why inline functions are preferred over
     macros. The series is "codingstyle: avoid unused parameters for a
     function-like macro""

* tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (62 commits)
  fs/proc: fix softlockup in __read_vmcore
  nilfs2: convert BUG_ON() in nilfs_finish_roll_forward() to WARN_ON()
  scripts: checkpatch: check unused parameters for function-like macro
  Documentation: coding-style: ask function-like macros to evaluate parameters
  nilfs2: use __field_struct() for a bitwise field
  selftests/kcmp: remove unused open mode
  nilfs2: remove calls to folio_set_error() and folio_clear_error()
  kernel/watchdog_perf.c: tidy up kerneldoc
  watchdog: allow nmi watchdog to use raw perf event
  watchdog: handle comma separated nmi_watchdog command line
  nilfs2: make superblock data array index computation sparse friendly
  squashfs: remove calls to set the folio error flag
  squashfs: convert squashfs_symlink_read_folio to use folio APIs
  scripts/gdb: fix detection of current CPU in KGDB
  scripts/gdb: make get_thread_info accept pointers
  scripts/gdb: fix parameter handling in $lx_per_cpu
  scripts/gdb: fix failing KGDB detection during probe
  kfifo: don't use "proxy" headers
  media: stih-cec: add missing io.h
  media: rc: add missing io.h
  ...
2024-05-19 14:02:03 -07:00
Ming Lei
7b815817aa blk-mq: add helper for checking if one CPU is mapped to specified hctx
Commit a46c27026d ("blk-mq: don't schedule block kworker on isolated CPUs")
rules out isolated CPUs from hctx->cpumask, and hctx->cpumask should only be
used for scheduling kworker.

Add helper blk_mq_cpu_mapped_to_hctx() and apply it into cpuhp handlers.

This patch avoids to forget clearing INACTIVE of hctx state in case that one
isolated CPU becomes online, and fixes hang issue when allocating request
from this hctx's tags.

Cc: Raju Cheerla <rcheerla@redhat.com>
Fixes: a46c27026d ("blk-mq: don't schedule block kworker on isolated CPUs")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240517020514.149771-1-ming.lei@redhat.com
Tested-by: Raju Cheerla <rcheerla@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-17 09:40:26 -06:00
Waiman Long
9d230c0996 blk-cgroup: Properly propagate the iostat update up the hierarchy
During a cgroup_rstat_flush() call, the lowest level of nodes are flushed
first before their parents. Since commit 3b8cc62987 ("blk-cgroup:
Optimize blkcg_rstat_flush()"), iostat propagation was still done to
the parent. Grandparent, however, may not get the iostat update if the
parent has no blkg_iostat_set queued in its lhead lockless list.

Fix this iostat propagation problem by queuing the parent's global
blkg->iostat into one of its percpu lockless lists to make sure that
the delta will always be propagated up to the grandparent and so on
toward the root blkcg.

Note that successive calls to __blkcg_rstat_flush() are serialized by
the cgroup_rstat_lock. So no special barrier is used in the reading
and writing of blkg->iostat.lqueued.

Fixes: 3b8cc62987 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Closes: https://lore.kernel.org/lkml/ZkO6l%2FODzadSgdhC@dschatzberg-fedora-PF3DHTBV/
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240515143059.276677-1-longman@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-15 20:15:54 -06:00
Ming Lei
d0aac23635 blk-cgroup: fix list corruption from reorder of WRITE ->lqueued
__blkcg_rstat_flush() can be run anytime, especially when blk_cgroup_bio_start
is being executed.

If WRITE of `->lqueued` is re-ordered with READ of 'bisc->lnode.next' in
the loop of __blkcg_rstat_flush(), `next_bisc` can be assigned with one
stat instance being added in blk_cgroup_bio_start(), then the local
list in __blkcg_rstat_flush() could be corrupted.

Fix the issue by adding one barrier.

Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Fixes: 3b8cc62987 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240515013157.443672-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-15 20:14:20 -06:00
Ming Lei
6da6680632 blk-cgroup: fix list corruption from resetting io stat
Since commit 3b8cc62987 ("blk-cgroup: Optimize blkcg_rstat_flush()"),
each iostat instance is added to blkcg percpu list, so blkcg_reset_stats()
can't reset the stat instance by memset(), otherwise the llist may be
corrupted.

Fix the issue by only resetting the counter part.

Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Jay Shin <jaeshin@redhat.com>
Fixes: 3b8cc62987 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20240515013157.443672-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-15 20:14:20 -06:00
Linus Torvalds
113d1dd9c8 SCSI misc on 20240514
Updates to the usual drivers (ufs, lpfc, qla2xxx, mpi3mr, libsas).
 The major update (which causes a conflict with block, see below) is
 Christoph removing the queue limits and their associated block
 helpers.  The remaining patches are assorted minor fixes and
 deprecated function updates plus a bit of constification.
 
 Signed-off-by: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCZkOnWyYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishYe7AP93XRN/
 xnccJbSTTUL4FFGobq2CYXv58Na+FM/b/+/kEAD+PNi0LmHDdDTOaFUblMd9l4lj
 mpvYLRvJ6ifnHX6WXAg=
 =PVnL
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "Updates to the usual drivers (ufs, lpfc, qla2xxx, mpi3mr, libsas).

  The major update (which causes a conflict with block, see below) is
  Christoph removing the queue limits and their associated block
  helpers.

  The remaining patches are assorted minor fixes and deprecated function
  updates plus a bit of constification"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (141 commits)
  scsi: mpi3mr: Sanitise num_phys
  scsi: lpfc: Copyright updates for 14.4.0.2 patches
  scsi: lpfc: Update lpfc version to 14.4.0.2
  scsi: lpfc: Add support for 32 byte CDBs
  scsi: lpfc: Change lpfc_hba hba_flag member into a bitmask
  scsi: lpfc: Introduce rrq_list_lock to protect active_rrq_list
  scsi: lpfc: Clear deferred RSCN processing flag when driver is unloading
  scsi: lpfc: Update logging of protection type for T10 DIF I/O
  scsi: lpfc: Change default logging level for unsolicited CT MIB commands
  scsi: target: Remove unused list 'device_list'
  scsi: iscsi: Remove unused list 'connlist_err'
  scsi: ufs: exynos: Add support for Tensor gs101 SoC
  scsi: ufs: exynos: Add some pa_dbg_ register offsets into drvdata
  scsi: ufs: exynos: Allow max frequencies up to 267Mhz
  scsi: ufs: exynos: Add EXYNOS_UFS_OPT_TIMER_TICK_SELECT option
  scsi: ufs: exynos: Add EXYNOS_UFS_OPT_UFSPR_SECURE option
  scsi: ufs: dt-bindings: exynos: Add gs101 compatible
  scsi: qla2xxx: Fix debugfs output for fw_resource_count
  scsi: qedf: Ensure the copied buf is NUL terminated
  scsi: bfa: Ensure the copied buf is NUL terminated
  ...
2024-05-14 18:25:53 -07:00
Linus Torvalds
a3d1f54d7a for-6.10-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmZCE4MACgkQxWXV+ddt
 WDudtQ//WjXcHtY3I6NJtDhPsIOG3Qjg9mA0shp73X4djJtZoGCdgL7dq+fTp5lk
 Wu6/XY5g+CSttTgwF4eyHgUSJOptKWY0XQDWxX5VR8WCM2qmUZ7SedlrBED9GNDM
 rN/3egmc74OGwnqyQq3I/2qYLByXFj66tsvW3UBjLNB8vMHajjw1idj9ujipioHq
 ySStPCHkPMwuhEzw9+CTe3W47VUSb5Ug3XDhAZXvxT99oDHn1m+CxKQwcona/IPH
 1El8PmZ7JetaT9ZO3DICBICfCyo+2SSy/KXYypXXE+nzNZhbhC0V9N7Uqm1c91C0
 aRglsJZCXmHBD4BPLvkls6CqEIvMc7FvcNCqQlrbRT6PlfX91/XaeDq4l3RUcuPn
 mGShsdHUiwbPMWYVwqVUKd0IPiktF1R7yigTjYSkEFJTL6HFTrBqV/2fAMUsMfPc
 8gyzYMCPQld73WmrnXZQPKvmzO/LvE0gS5cPapokGwoXstq9n3iYd4ypN0wN6sif
 1jwy3efNzWXXMYV0WzcihKwFMm2fqp/pl9bXq/zwn2CunfIX4WTsaQ2NmJf81jqF
 qFNjlr8S3qO7AvIOs+R2XY9E3VjfzeDADzvjpQy5J/ZYbcHBcxxdYDhg+QGhe5nB
 eNmR51oL1pHSjU2M8PxATL8JxKkX2BvX6u64lVojaw4rxUlyFC0=
 =MMpE
 -----END PGP SIGNATURE-----

Merge tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "This update brings a few minor performance improvements, otherwise
  there's a lot of refactoring, cleanups and other sort of not user
  visible changes.

  Performance improvements:

   - inline b-tree locking functions, improvement in metadata-heavy
     changes

   - relax locking on a range that's being reflinked, allows read
     operations to run in parallel

   - speed up NOCOW write checks (throughput +9% on a sample test)

   - extent locking ranges have been reduced in several places, namely
     around delayed ref processing

  Core:

   - more page to folio conversions:
      - relocation
      - send
      - compression
      - inline extent handling
      - super block write and wait

   - extent_map structure optimizations:
      - reduced structure size
      - code simplifications
      - add shrinker for allocated objects, the numbers can go high and
        could exhaust memory on smaller systems (reported) as they may
        not get an opportunity to be freed fast enough

   - extent locking optimizations:
      - reduce locking ranges where it does not seem to be necessary and
        are safe due to other means of synchronization
      - potential improvements due to lower contention,
        allocation/freeing and state management operations of extent
        state tracking structures

   - delayed ref cleanups and simplifications

   - updated trace points

   - improved error handling, warnings and assertions

   - cleanups and refactoring, unification of error handling paths"

* tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (122 commits)
  btrfs: qgroup: fix initialization of auto inherit array
  btrfs: count super block write errors in device instead of tracking folio error state
  btrfs: use the folio iterator in btrfs_end_super_write()
  btrfs: convert super block writes to folio in write_dev_supers()
  btrfs: convert super block writes to folio in wait_dev_supers()
  bio: Export bio_add_folio_nofail to modules
  btrfs: remove duplicate included header from fs.h
  btrfs: add a cached state to extent_clear_unlock_delalloc
  btrfs: push extent lock down in submit_one_async_extent
  btrfs: push lock_extent down in cow_file_range()
  btrfs: move can_cow_file_range_inline() outside of the extent lock
  btrfs: push lock_extent into cow_file_range_inline
  btrfs: push extent lock into cow_file_range
  btrfs: push extent lock into run_delalloc_cow
  btrfs: remove unlock_extent from run_delalloc_compressed
  btrfs: push extent lock down in run_delalloc_nocow
  btrfs: adjust while loop condition in run_delalloc_nocow
  btrfs: push extent lock into run_delalloc_nocow
  btrfs: push the extent lock into btrfs_run_delalloc_range
  btrfs: lock extent when doing inline extent in compression
  ...
2024-05-14 17:25:36 -07:00
Linus Torvalds
0c9f4ac808 for-6.10/block-20240511
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmY/YgsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvi0EACwnFRtYioizBH0x7QUHTBcIr0IhACd5gfz
 bm+uwlDUtf6G6lupHdJT9gOVB2z2z1m2Pz//8RuUVWw3Eqw2+rfgG8iJd+yo7IaV
 DpX3WaM4NnBvB7FKOKHlMPvGuf7KgbZ3uPm3x8cbrn/axMmkZ6ljxTixJ3p5t4+s
 xRsef/lVdG71DkXIFgTKATB86yNRJNlRQTbL+sZW22vdXdtfyBbOgR1sBuFfp7Hd
 g/uocZM/z0ahM6JH/5R2IX2ttKXMIBZLA8HRkJdvYqg022cj4js2YyRCPU3N6jQN
 MtN4TpJV5I++8l6SPQOOhaDNrK/6zFtDQpwG0YBiKKj3nQDgVbWWb8ejYTIUv4MP
 SrEto4MVBEqg5N65VwYYhIf45rmueFyJp6z0Vqv6Owur5nuww/YIFknmoMa/WDMd
 V8dIU3zL72FZDbPjIBjxHeqAGz9OgzEVafled7pi0Xbw6wqiB4kZihlMGXlD+WBy
 Yd6xo8PX4i5+d2LLKKPxpW1X0eJlKYJ/4dnYCoFN8LmXSiPJnMx2pYrV+NqMxy4X
 Thr8lxswLQC7j9YBBuIeDl8NB9N5FZZLvaC6I25QKq045M2ckJ+VrounsQb3vGwJ
 72nlxxBZL8wz3sasgX9Pc1Cez9AqYbM+UZahq8ezPY5y3Jh0QfRw/MOk1ZaDNC8V
 CNOHBH0E+Q==
 =HnjE
 -----END PGP SIGNATURE-----

Merge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - Add a partscan attribute in sysfs, fixing an issue with systemd
   relying on an internal interface that went away.

 - Attempt #2 at making long running discards interruptible. The
   previous attempt went into 6.9, but we ended up mostly reverting it
   as it had issues.

 - Remove old ida_simple API in bcache

 - Support for zoned write plugging, greatly improving the performance
   on zoned devices.

 - Remove the old throttle low interface, which has been experimental
   since 2017 and never made it beyond that and isn't being used.

 - Remove page->index debugging checks in brd, as it hasn't caught
   anything and prepares us for removing in struct page.

 - MD pull request from Song

 - Don't schedule block workers on isolated CPUs

* tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits)
  blk-throttle: delay initialization until configuration
  blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW
  block: fix that util can be greater than 100%
  block: support to account io_ticks precisely
  block: add plug while submitting IO
  bcache: fix variable length array abuse in btree_iter
  bcache: Remove usage of the deprecated ida_simple_xx() API
  md: Revert "md: Fix overflow in is_mddev_idle"
  blk-lib: check for kill signal in ioctl BLKDISCARD
  block: add a bio_await_chain helper
  block: add a blk_alloc_discard_bio helper
  block: add a bio_chain_and_submit helper
  block: move discard checks into the ioctl handler
  block: remove the discard_granularity check in __blkdev_issue_discard
  block/ioctl: prefer different overflow check
  null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION()
  block: fix and simplify blkdevparts= cmdline parsing
  block: refine the EOF check in blkdev_iomap_begin
  block: add a partscan sysfs attribute for disks
  block: add a disk_has_partscan helper
  ...
2024-05-13 13:03:54 -07:00
Linus Torvalds
1b0aabcc9a vfs-6.10.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3HuwAKCRCRxhvAZXjc
 orYvAQCZOr68uJaEaXAArYTdnMdQ6HIzG+FVlwrqtrhz0BV07wEAqgmtSR9XKh+L
 0+DNepg4R8PZOHH371eSSsLNRCUCkAs=
 =SVsU
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That
     means we now have six free FMODE_* bits in total (but bit #6
     already got used for FMODE_WRITE_RESTRICTED)

   - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)

   - Add fd_raw cleanup class so we can make use of automatic cleanup
     provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well

   - Optimize seq_puts()

   - Simplify __seq_puts()

   - Add new anon_inode_getfile_fmode() api to allow specifying f_mode
     instead of open-coding it in multiple places

   - Annotate struct file_handle with __counted_by() and use
     struct_size()

   - Warn in get_file() whether f_count resurrection from zero is
     attempted (epoll/drm discussion)

   - Folio-sophize aio

   - Export the subvolume id in statx() for both btrfs and bcachefs

   - Relax linkat(AT_EMPTY_PATH) requirements

   - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors
     for dup*() equality replacing kcmp()

  Cleanups:

   - Compile out swapfile inode checks when swap isn't enabled

   - Use (1 << n) notation for FMODE_* bitshifts for clarity

   - Remove redundant variable assignment in fs/direct-io

   - Cleanup uses of strncpy in orangefs

   - Speed up and cleanup writeback

   - Move fsparam_string_empty() helper into header since it's currently
     open-coded in multiple places

   - Add kernel-doc comments to proc_create_net_data_write()

   - Don't needlessly read dentry->d_flags twice

  Fixes:

   - Fix out-of-range warning in nilfs2

   - Fix ecryptfs overflow due to wrong encryption packet size
     calculation

   - Fix overly long line in xfs file_operations (follow-up to FMODE_*
     cleanup)

   - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up
     to FMODE_* cleanup)

   - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_*
     cleanup)

   - Fix stable offset api to prevent endless loops

   - Fix afs file server rotations

   - Prevent xattr node from overflowing the eraseblock in jffs2

   - Move fdinfo PTRACE_MODE_READ procfs check into the .permission()
     operation instead of .open() operation since this caused userspace
     regressions"

* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
  afs: Fix fileserver rotation getting stuck
  selftests: add F_DUPDFD_QUERY selftests
  fcntl: add F_DUPFD_QUERY fcntl()
  file: add fd_raw cleanup class
  fs: WARN when f_count resurrection is attempted
  seq_file: Simplify __seq_puts()
  seq_file: Optimize seq_puts()
  proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
  fs: Create anon_inode_getfile_fmode()
  xfs: don't call xfs_file_open from xfs_dir_open
  xfs: drop fop_flags for directories
  xfs: fix overly long line in the file_operations
  shmem: Fix shmem_rename2()
  libfs: Add simple_offset_rename() API
  libfs: Fix simple_offset_rename_exchange()
  jffs2: prevent xattr node from overflowing the eraseblock
  vfs, swap: compile out IS_SWAPFILE() on swapless configs
  vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
  fs/direct-io: remove redundant assignment to variable retval
  fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
  ...
2024-05-13 11:40:06 -07:00
Linus Torvalds
f4345f05c0 block-6.9-20240510
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmY+Tr0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpok2EACLqHZ2faer07Rpmm9pYzQxPVAJ1iQn3+dX
 kQLVStabysazYC8ZRk8VrgUZDJAblHDEct/f4eMVIX0abQoXB6nghkAUPFXledso
 SjCJCzWzhZN7xbC3FAQoxldpFjQxlI3y5xiqcD33bfTouPCDAwkThqCtXIEMwyff
 OwRDvrB8SrGLHIJYPntfKFnI1DTpu41ZFY/508olxs/uZSuUbxdPFQj2rh8gC5on
 b86HvsCS8laGY6EO86bbTjjp9WJJ1MMMaFrPzwc9deWbh/lJDB70hptxrHBLLv5Z
 i+CctM+KEYB/KRL+YjXZSOS2tYmoeA9jEbrtcqiEX87h3F3bfhH4XAp2EDkNoeG9
 bzLRaR8tNsKoBkpauGgptjtxpJKHDs5ax4mJgsphOGGv6VBi+RfUlmIe76XFspzZ
 YJHRjpJ1FsBjyWtQ60W2FWdN+1IrvZ0GriN/Wk68ReHPGp9pBsnK6LXSUSAQ5cq2
 eCKppG7S7ZVMylXZhNOK2E8vgz1XqsaE0wf7jI5tPjQKyZBVFrlIQ0sVBr01mKkv
 ycOJEUx6dGnJrzJPaH3T2m0lhCrUDJaM1/eUWFWvdQEWXyDlGmF6cU0smiSYbeca
 LP4TU+iB5s5D5nltzHq9D+9E96MWm6axhKAfoJZLpPI6+I73xD3hOAw3ufKIlNi9
 jWjtMbArpQ==
 =k1hJ
 -----END PGP SIGNATURE-----

Merge tag 'block-6.9-20240510' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
     - nvme target fixes (Sagi, Dan, Maurizo)
     - new vendor quirk for broken MSI (Sean)

 - Virtual boundary fix for a regression in this merge window (Ming)

* tag 'block-6.9-20240510' of git://git.kernel.dk/linux:
  nvmet-rdma: fix possible bad dereference when freeing rsps
  nvmet: prevent sprintf() overflow in nvmet_subsys_nsid_exists()
  nvmet: make nvmet_wq unbound
  nvmet-auth: return the error code to the nvmet_auth_ctrl_hash() callers
  nvme-pci: Add quirk for broken MSIs
  block: set default max segment size in case of virt_boundary
2024-05-10 10:24:16 -07:00
Yu Kuai
a3166c5170 blk-throttle: delay initialization until configuration
Other cgroup policy like bfq, iocost are lazy-initialized when they are
configured for the first time for the device, but blk-throttle is
initialized unconditionally from blkcg_init_disk().

Delay initialization of blk-throttle as well, to save some cpu and
memory overhead if it's not configured.

Noted that once it's initialized, it can't be destroyed until disk
removal, even if it's disabled.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240509121107.3195568-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09 09:44:56 -06:00
Yu Kuai
bf20ab538c blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW
One the one hand, it's marked EXPERIMENTAL since 2017, and looks like
there are no users since then, and no testers and no developers, it's
just not active at all.

On the other hand, even if the config is disabled, there are still many
fields in throtl_grp and throtl_data and many functions that are only
used for throtl low.

At last, currently blk-throtl is initialized during disk initialization,
and destroyed during disk removal, and it exposes many functions to be
called directly from block layer.

Remove throtl low to make code much more cleaner and follow up work much
easier.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240509121107.3195568-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09 09:44:55 -06:00
Yu Kuai
7be835694d block: fix that util can be greater than 100%
util means the percentage that disk has IO, and theoretically it should
not be greater than 100%. However, there is a gap for rq-based disk:

io_ticks will be updated when rq is allocated, however, before such rq
dispatch to driver, it will not be account as inflight from
blk_mq_start_request() hence diskstats_show()/part_stat_show() will not
update io_ticks. For example:

1) at t0, issue a new IO, rq is allocated, and blk_account_io_start()
update io_ticks;

2) something is wrong with drivers, and the rq can't be dispatched;

3) at t0 + 10s, drivers recovers and rq is dispatched and done, io_ticks
is updated;

Then if user is using "iostat 1" to monitor "util", between t0 - t0+9s,
util will be zero, and between t0+9s - t0+10s, util will be 1000%.

Fix this problem by updating io_ticks from diskstats_show() and
part_stat_show() if there are rq allocated.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240509123717.3223892-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09 07:59:44 -06:00
Yu Kuai
99dc422335 block: support to account io_ticks precisely
Currently, io_ticks is accounted based on sampling, specifically
update_io_ticks() will always account io_ticks by 1 jiffies from
bdev_start_io_acct()/blk_account_io_start(), and the result can be
inaccurate, for example(HZ is 250):

Test script:
fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms

Test result: util is about 90%, while the disk is really idle.

This behaviour is introduced by commit 5b18b5a737 ("block: delete
part_round_stats and switch to less precise counting"), however, there
was a key point that is missed that this patch also improve performance
a lot:

Before the commit:
part_round_stats:
  if (part->stamp != now)
   stats |= 1;

  part_in_flight()
  -> there can be lots of task here in 1 jiffies.
  part_round_stats_single()
   __part_stat_add()
  part->stamp = now;

After the commit:
update_io_ticks:
  stamp = part->bd_stamp;
  if (time_after(now, stamp))
   if (try_cmpxchg())
    __part_stat_add()
    -> only one task can reach here in 1 jiffies.

Hence in order to account io_ticks precisely, we only need to know if
there are IO inflight at most once in one jiffies. Noted that for
rq-based device, iterating tags should not be used here because
'tags->lock' is grabbed in blk_mq_find_and_get_req(), hence
part_stat_lock_inc/dec() and part_in_flight() is used to trace inflight.
The additional overhead is quite little:

 - per cpu add/dec for each IO for rq-based device;
 - per cpu sum for each jiffies;

And it's verified by null-blk that there are no performance degration
under heavy IO pressure.

Fixes: 5b18b5a737 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240509123717.3223892-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09 07:59:44 -06:00
Yu Kuai
060406c61c block: add plug while submitting IO
So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
and __blkdev_direct_IO_async(), block layer can still benefit from caching
nsec time in the plug.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09 07:57:37 -06:00
Matthew Wilcox (Oracle)
8fde439b2d bio: Export bio_add_folio_nofail to modules
Several modules use __bio_add_page() today and may need to be converted
to bio_add_folio_nofail().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:10 +02:00
Christoph Hellwig
719c15a75e blk-lib: check for kill signal in ioctl BLKDISCARD
Discards can access a significant capacity and take longer than the user
expected.  A user may change their mind about wanting to run that command
and attempt to kill the process and do something else with their device.
But since the task is uninterruptable, they have to wait for it to
finish, which could be many hours.

Open code blkdev_issue_discard in the BLKDISCARD ioctl handler and check
for a fatal signal at each iteration so the user doesn't have to wait
for their regretted operation to complete naturally.

Heavily based on an earlier patch from Keith Busch.

Reported-by: Conrad Meyer <conradmeyer@meta.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240506042027.2289826-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07 07:29:42 -06:00
Keith Busch
0f8e9ecc46 block: add a bio_await_chain helper
Add a helper to wait for an entire chain of bios to complete.

[hch: split from a larger patch, moved and changed the name now that it
 is non-static]

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240506042027.2289826-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07 07:29:42 -06:00
Christoph Hellwig
e8b4869bc7 block: add a blk_alloc_discard_bio helper
Factor out a helper from __blkdev_issue_discard that chews off as much as
possible from a discard range and allocates a bio for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240506042027.2289826-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07 07:29:42 -06:00
Christoph Hellwig
81c2168c22 block: add a bio_chain_and_submit helper
This is basically blk_next_bio just with the bio allocation moved
to the caller to allow for more flexible bio handling in the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240506042027.2289826-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07 07:29:42 -06:00
Christoph Hellwig
30f1e72414 block: move discard checks into the ioctl handler
Most bio operations get basic sanity checking in submit_bio and anything
more complicated than that is done in the callers.  Discards are a bit
different from that in that a lot of checking is done in
__blkdev_issue_discard, and the specific errnos for that are returned
to userspace.  Move the checks that require specific errnos to the ioctl
handler instead, and just leave the basic sanity checking in submit_bio
for the other handlers.  This introduces two changes in behavior:

 1) the logical block size alignment check of the start and len is lost
    for non-ioctl callers.
    This matches what is done for other operations including reads and
    writes.  We should probably verify this for all bios, but for now
    make discards match the normal flow.
 2) for non-ioctl callers all errors are reported on I/O completion now
    instead of synchronously.  Callers in general mostly ignore or log
    errors so this will actually simplify the code once cleaned up

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240506042027.2289826-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07 07:29:42 -06:00
Christoph Hellwig
0942592045 block: remove the discard_granularity check in __blkdev_issue_discard
We now set a default granularity in the queue limits API, so don't
bother with this extra check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240506042027.2289826-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07 07:29:42 -06:00
Justin Stitt
ccb326b5f9 block/ioctl: prefer different overflow check
Running syzkaller with the newly reintroduced signed integer overflow
sanitizer shows this report:

[   62.982337] ------------[ cut here ]------------
[   62.985692] cgroup: Invalid name
[   62.986211] UBSAN: signed-integer-overflow in ../block/ioctl.c:36:46
[   62.989370] 9pnet_fd: p9_fd_create_tcp (7343): problem connecting socket to 127.0.0.1
[   62.992992] 9223372036854775807 + 4095 cannot be represented in type 'long long'
[   62.997827] 9pnet_fd: p9_fd_create_tcp (7345): problem connecting socket to 127.0.0.1
[   62.999369] random: crng reseeded on system resumption
[   63.000634] GUP no longer grows the stack in syz-executor.2 (7353): 20002000-20003000 (20001000)
[   63.000668] CPU: 0 PID: 7353 Comm: syz-executor.2 Not tainted 6.8.0-rc2-00035-gb3ef86b5a957 #1
[   63.000677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   63.000682] Call Trace:
[   63.000686]  <TASK>
[   63.000731]  dump_stack_lvl+0x93/0xd0
[   63.000919]  __get_user_pages+0x903/0xd30
[   63.001030]  __gup_longterm_locked+0x153e/0x1ba0
[   63.001041]  ? _raw_read_unlock_irqrestore+0x17/0x50
[   63.001072]  ? try_get_folio+0x29c/0x2d0
[   63.001083]  internal_get_user_pages_fast+0x1119/0x1530
[   63.001109]  iov_iter_extract_pages+0x23b/0x580
[   63.001206]  bio_iov_iter_get_pages+0x4de/0x1220
[   63.001235]  iomap_dio_bio_iter+0x9b6/0x1410
[   63.001297]  __iomap_dio_rw+0xab4/0x1810
[   63.001316]  iomap_dio_rw+0x45/0xa0
[   63.001328]  ext4_file_write_iter+0xdde/0x1390
[   63.001372]  vfs_write+0x599/0xbd0
[   63.001394]  ksys_write+0xc8/0x190
[   63.001403]  do_syscall_64+0xd4/0x1b0
[   63.001421]  ? arch_exit_to_user_mode_prepare+0x3a/0x60
[   63.001479]  entry_SYSCALL_64_after_hwframe+0x6f/0x77
[   63.001535] RIP: 0033:0x7f7fd3ebf539
[   63.001551] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
[   63.001562] RSP: 002b:00007f7fd32570c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   63.001584] RAX: ffffffffffffffda RBX: 00007f7fd3ff3f80 RCX: 00007f7fd3ebf539
[   63.001590] RDX: 4db6d1e4f7e43360 RSI: 0000000020000000 RDI: 0000000000000004
[   63.001595] RBP: 00007f7fd3f1e496 R08: 0000000000000000 R09: 0000000000000000
[   63.001599] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[   63.001604] R13: 0000000000000006 R14: 00007f7fd3ff3f80 R15: 00007ffd415ad2b8
...
[   63.018142] ---[ end trace ]---

Historically, the signed integer overflow sanitizer did not work in the
kernel due to its interaction with `-fwrapv` but this has since been
changed [1] in the newest version of Clang; It was re-enabled in the
kernel with Commit 557f8c582a ("ubsan: Reintroduce signed overflow
sanitizer").

Let's rework this overflow checking logic to not actually perform an
overflow during the check itself, thus avoiding the UBSAN splat.

[1]: https://github.com/llvm/llvm-project/pull/82432

Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240507-b4-sio-block-ioctl-v3-1-ba0c2b32275e@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07 07:29:20 -06:00
Ming Lei
ffd379c13f block: set default max segment size in case of virt_boundary
For devices with virt_boundary limit, the driver may provide zero max
segment size, we have to set it as UINT_MAX at default. Otherwise, it
may cause warning in driver when handling sglist.

Fix it by setting default max segment size as UINT_MAX.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@kernel.org>
Fixes: b561ea56a2 ("block: allow device to have both virt_boundary_mask and max segment size")
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
Closes: https://lore.kernel.org/linux-block/7e38b67c-9372-a42d-41eb-abdce33d3372@linux-m68k.org/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240424134722.2584284-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-06 20:27:51 -06:00
INAGAKI Hiroshi
bc2e07dfd2 block: fix and simplify blkdevparts= cmdline parsing
Fix the cmdline parsing of the "blkdevparts=" parameter using strsep(),
which makes the code simpler.

Before commit 146afeb235 ("block: use strscpy() to instead of
strncpy()"), we used a strncpy() to copy a block device name and partition
names. The commit simply replaced a strncpy() and NULL termination with
a strscpy(). It did not update calculations of length passed to strscpy().
While the length passed to strncpy() is just a length of valid characters
without NULL termination ('\0'), strscpy() takes it as a length of the
destination buffer, including a NULL termination.

Since the source buffer is not necessarily NULL terminated, the current
code copies "length - 1" characters and puts a NULL character in the
destination buffer. It replaces the last character with NULL and breaks
the parsing.

As an example, that buffer will be passed to parse_parts() and breaks
parsing sub-partitions due to the missing ')' at the end, like the
following.

example (Check Point V-80 & OpenWrt):

- Linux Kernel 6.6

  [    0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio32,0xf0512000 crashkernel=30M mvpp2x.queue_mode=1 blkdevparts=mmcblk1:48M@10M(kernel-1),1M(dtb-1),720M(rootfs-1),48M(kernel-2),1M(dtb-2),720M(rootfs-2),300M(default_sw),650M(logs),1M(preset_cfg),1M(adsl),-(storage) maxcpus=4
  ...
  [    0.884016] mmc1: new HS200 MMC card at address 0001
  [    0.889951] mmcblk1: mmc1:0001 004GA0 3.69 GiB
  [    0.895043] cmdline partition format is invalid.
  [    0.895704]  mmcblk1: p1
  [    0.903447] mmcblk1boot0: mmc1:0001 004GA0 2.00 MiB
  [    0.908667] mmcblk1boot1: mmc1:0001 004GA0 2.00 MiB
  [    0.913765] mmcblk1rpmb: mmc1:0001 004GA0 512 KiB, chardev (248:0)

  1. "48M@10M(kernel-1),..." is passed to strscpy() with length=17
     from parse_parts()
  2. strscpy() returns -E2BIG and the destination buffer has
     "48M@10M(kernel-1\0"
  3. "48M@10M(kernel-1\0" is passed to parse_subpart()
  4. parse_subpart() fails to find ')' when parsing a partition name,
     and returns error

- Linux Kernel 6.1

  [    0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio32,0xf0512000 crashkernel=30M mvpp2x.queue_mode=1 blkdevparts=mmcblk1:48M@10M(kernel-1),1M(dtb-1),720M(rootfs-1),48M(kernel-2),1M(dtb-2),720M(rootfs-2),300M(default_sw),650M(logs),1M(preset_cfg),1M(adsl),-(storage) maxcpus=4
  ...
  [    0.953142] mmc1: new HS200 MMC card at address 0001
  [    0.959114] mmcblk1: mmc1:0001 004GA0 3.69 GiB
  [    0.964259]  mmcblk1: p1(kernel-1) p2(dtb-1) p3(rootfs-1) p4(kernel-2) p5(dtb-2) 6(rootfs-2) p7(default_sw) p8(logs) p9(preset_cfg) p10(adsl) p11(storage)
  [    0.979174] mmcblk1boot0: mmc1:0001 004GA0 2.00 MiB
  [    0.984674] mmcblk1boot1: mmc1:0001 004GA0 2.00 MiB
  [    0.989926] mmcblk1rpmb: mmc1:0001 004GA0 512 KiB, chardev (248:0

By the way, strscpy() takes a length of destination buffer and it is
often confusing when copying characters with a specified length. Using
strsep() helps to separate the string by the specified character. Then,
we can use strscpy() naturally with the size of the destination buffer.

Separating the string on the fly is also useful to omit the redundant
string copy, reducing memory usage and improve the code readability.

Fixes: 146afeb235 ("block: use strscpy() to instead of strncpy()")
Suggested-by: Naohiro Aota <naota@elisp.net>
Signed-off-by: INAGAKI Hiroshi <musashino.open@gmail.com>
Reviewed-by: Daniel Golle <daniel@makrotopia.org>
Link: https://lore.kernel.org/r/20240421074005.565-1-musashino.open@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03 09:57:53 -06:00
Christoph Hellwig
0c12028aec block: refine the EOF check in blkdev_iomap_begin
blkdev_iomap_begin rounds down the offset to the logical block size
before stashing it in iomap->offset and checking that it still is
inside the inode size.

Check the i_size check to the raw pos value so that we don't try a
zero size write if iter->pos is unaligned.

Fixes: 487c607df7 ("block: use iomap for writes to block devices")
Reported-by: syzbot+0a3683a0a6fecf909244@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+0a3683a0a6fecf909244@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20240503081042.2078062-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03 09:05:11 -06:00
Christoph Hellwig
a4217c6740 block: add a partscan sysfs attribute for disks
Userspace had been unknowingly relying on a non-stable interface of
kernel internals to determine if partition scanning is enabled for a
given disk. Provide a stable interface for this purpose instead.

Cc: stable@vger.kernel.org # 6.3+
Depends-on: 140ce28dd3 ("block: add a disk_has_partscan helper")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-block/ZhQJf8mzq_wipkBH@gardel-login/
Link: https://lore.kernel.org/r/20240502130033.1958492-3-hch@lst.de
[axboe: add links and commit message from Keith]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03 09:00:07 -06:00
Christoph Hellwig
140ce28dd3 block: add a disk_has_partscan helper
Add a helper to check if partition scanning is enabled instead of
open coding the check in a few places.  This now always checks for
the hidden flag even if all but one of the callers are never reachable
for hidden gendisks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240502130033.1958492-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03 08:59:59 -06:00
Al Viro
203c1ce0bb RIP ->bd_inode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-03 02:36:56 -04:00
Al Viro
df65f1660b block/bdev.c: use the knowledge of inode/bdev coallocation
Here we know that bdevfs inodes are coallocated with struct block_device
and we can get to ->bd_inode value without any dereferencing.  Introduce
an inlined helper (static, *not* exported, purely internal for bdev.c)
that gets an associated inode by block_device - BD_INODE(bdev).

NOTE: leave it static; nobody outside of block/bdev.c has any business
playing with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-03 02:36:55 -04:00
Al Viro
881494ed03 blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20240411145346.2516848-6-viro@zeniv.linux.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03 02:36:51 -04:00
Al Viro
224941e837 use ->bd_mapping instead of ->bd_inode->i_mapping
Just the low-hanging fruit...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03 02:36:51 -04:00
Al Viro
e33aef2c58 block_device: add a pointer to struct address_space (page cache of bdev)
points to ->i_data of coallocated inode.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20240411145346.2516848-1-viro@zeniv.linux.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03 02:36:50 -04:00
Al Viro
2638c20876 missing helpers: bdev_unhash(), bdev_drop()
bdev_unhash(): make block device invisible to lookups by device number
bdev_drop(): drop reference to associated inode.

Both are internal, for use by genhd and partition-related code - similar
to bdev_add().  The logics in there (especially the lifetime-related
parts of it) ought to be cleaned up, but that's a separate story; here
we just encapsulate getting to associated inode.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-03 02:36:21 -04:00
Yu Kuai
186ddac207 block: move two helpers into bdev.c
disk_live() and block_size() access bd_inode directly, prepare to remove
the field bd_inode from block_device, and only access bd_inode in block
layer.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20240411145346.2516848-8-viro@zeniv.linux.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03 02:36:21 -04:00
Al Viro
39c3b4e7d0 blkdev_write_iter(): saner way to get inode and bdev
... same as in other methods - bdev_file_inode() and I_BDEV() of that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20240411145346.2516848-5-viro@zeniv.linux.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03 02:35:57 -04:00
Al Viro
811ba89a88 bdev: move ->bd_make_it_fail to ->__bd_flags
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 20:04:18 -04:00
Al Viro
49a43dae93 bdev: move ->bd_ro_warned to ->__bd_flags
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 20:04:17 -04:00
Al Viro
ac2b6f9dee bdev: move ->bd_has_subit_bio to ->__bd_flags
In bdev_alloc() we have all flags initialized to false, so
assignment to ->bh_has_submit_bio n there is a no-op unless
we have partno != 0 and flag already set on entire device.

In device_add_disk() we have just allocated the block_device
in question and it had been a full-device one, so the flag
is guaranteed to be still clear when we get to assignment.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 20:00:37 -04:00
Al Viro
4c80105e39 bdev: move ->bd_write_holder into ->__bd_flags
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 19:50:29 -04:00
Al Viro
01e198f01d bdev: move ->bd_read_only to ->__bd_flags
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 19:50:29 -04:00
Al Viro
1116b9fa15 bdev: infrastructure for flags
Replace bd_partno with a 32bit field (__bd_flags).  The lower 8 bits
contain the partition number, the upper 24 are for flags.

Helpers: bdev_{test,set,clear}_flag(bdev, flag), with atomic_or()
and atomic_andnot() used to set/clear.

NOTE: this commit does not actually move any flags over there - they
are still bool fields.  As the result, it shifts the fields wrt
cacheline boundaries; that's going to be restored once the first
3 flags are dealt with.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 19:50:11 -04:00
Al Viro
b8c873edbf wrapper for access to ->bd_partno
On the next step it's going to get folded into a field where flags will go.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 17:48:09 -04:00
Al Viro
3f9b8fb46e Use bdev_is_paritition() instead of open-coding it
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 17:48:09 -04:00
Al Viro
d18a867958 make set_blocksize() fail unless block device is opened exclusive
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 17:39:44 -04:00
Al Viro
ead083aeee set_blocksize(): switch to passing struct file *
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 17:39:44 -04:00
Damien Le Moal
d7580149ef block: Cleanup blk_revalidate_zone_cb()
Define the code for checking conventional and sequential write required
zones suing the functions blk_revalidate_conv_zone() and
blk_revalidate_seq_zone() respectively. This simplifies the zone type
switch-case in blk_revalidate_zone_cb().

No functional changes.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240501110907.96950-15-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:43 -06:00
Damien Le Moal
c9c8aea03c block: Simplify zone write plug BIO abort
When BIOs plugged in a zone write plug are aborted,
blk_zone_wplug_bio_io_error() clears the BIO BIO_ZONE_WRITE_PLUGGING
flag so that bio_io_error(bio) does not end up calling
blk_zone_write_plug_bio_endio() and we thus need to manually drop the
reference on the zone write plug held by the aborted BIO.

Move the call to disk_put_zone_wplug() that is alwasy following the call
to blk_zone_wplug_bio_io_error() inside that function to simplify the
code.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240501110907.96950-14-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:43 -06:00
Damien Le Moal
b5a64ec2ea block: Simplify blk_zone_write_plug_bio_endio()
We already have the disk variable obtained from the bio when calling
disk_get_zone_wplug(). So use that variable instead of dereferencing the
bio bdev again for the disk argument of disk_get_zone_wplug().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240501110907.96950-13-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:43 -06:00
Damien Le Moal
347bde9da1 block: Improve zone write request completion handling
blk_zone_complete_request() must be called to handle the completion of a
zone write request handled with zone write plugging. This function is
called from blk_complete_request(), blk_update_request() and also in
blk_mq_submit_bio() error path. Improve this by moving this function
call into blk_mq_finish_request() as all requests are processed with
this function when they complete as well as when they are freed without
being executed. This also improves blk_update_request() used by scsi
devices as these may repeatedly call this function to handle partial
completions.

To be consistent with this change, blk_zone_complete_request() is
renamed to blk_zone_finish_request() and
blk_zone_write_plug_complete_request() is renamed to
blk_zone_write_plug_finish_request().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240501110907.96950-12-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:43 -06:00
Damien Le Moal
c4c3ffdab2 block: Improve blk_zone_write_plug_bio_merged()
Improve blk_zone_write_plug_bio_merged() to check that we succefully get
a reference on the zone write plug of the merged BIO, as expected since
for a merge we already have at least one request and one BIO referencing
the zone write plug. Comments in this function are also improved to
better explain the references to the BIO zone write plug.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240501110907.96950-11-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:43 -06:00
Damien Le Moal
096bc7ea33 block: Fix handling of non-empty flush write requests to zones
Zone write plugging ignores empty (no data) flush operations but handles
flush BIOs that have data to ensure that the flush machinery generated
write is processed in order. However, the call to
blk_zone_write_plug_attempt_merge() which sets a request
RQF_ZONE_WRITE_PLUGGING flag is called after blk_insert_flush(), thus
missing indicating that a non empty flush request completion needs
handling by zone write plugging.

Fix this by moving the call to blk_zone_write_plug_attempt_merge()
before blk_insert_flush(). And while at it, rename that function as
blk_zone_write_plug_init_request() to be clear that it is not just about
merging plugged BIOs in the request. While at it, also add a WARN_ONCE()
check that the zone write plug for the request is not NULL.

Fixes: dd291d77cc ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240501110907.96950-10-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:43 -06:00
Damien Le Moal
af147b740f block: Fix flush request sector restore
Make sure that a request bio is not NULL before trying to restore the
request start sector.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 6f8fd758de ("block: Restore sector of flush requests")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240501110907.96950-9-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:43 -06:00
Damien Le Moal
7b29518728 block: Do not remove zone write plugs still in use
Large write BIOs that span a zone boundary are split in
blk_mq_submit_bio() before being passed to blk_zone_plug_bio() for zone
write plugging. Such split BIO will be chained with one fragment
targeting one zone and the remainder of the BIO targeting the next
zone. The two BIOs can be executed in parallel, without a predetermine
order relative to eachother and their completion may be reversed: the
remainder first completing and the first fragment then completing. In
such case, bio_endio() will not immediately execute
blk_zone_write_plug_bio_endio() for the parent BIO (the remainder of the
split BIO) as the BIOs are chained. blk_zone_write_plug_bio_endio() for
the parent BIO will be executed only once the first fragment completes.

In the case of a device with small zones and very large BIOs, uch
completion pattern can lead to disk_should_remove_zone_wplug() to return
true for the zone of the parent BIO when the parent BIO request
completes and blk_zone_write_plug_complete_request() is executed. This
triggers the removal of the zone write plug from the hash table using
disk_remove_zone_wplug(). With the zone write plug of the parent BIO
missing, the call to disk_get_zone_wplug() in
blk_zone_write_plug_bio_endio() returns NULL and triggers a warning.

This patterns can be recreated fairly easily using a scsi_debug device
with small zone and btrfs. E.g.

modprobe scsi_debug delay=0 dev_size_mb=1024 sector_size=4096 \
	zbc=host-managed zone_cap_mb=3 zone_nr_conv=0 zone_size_mb=4
mkfs.btrfs -f -O zoned /dev/sda
mount -t btrfs /dev/sda /mnt
fio --name=wrtest --rw=randwrite --direct=1 --ioengine=libaio \
	--bs=4k --iodepth=16 --size=1M --directory=/mnt --time_based \
	--runtime=10
umount /dev/sda

Will result in the warning:

[   29.035538] WARNING: CPU: 3 PID: 37 at block/blk-zoned.c:1207 blk_zone_write_plug_bio_endio+0xee/0x1e0
...
[   29.058682] Call Trace:
[   29.059095]  <TASK>
[   29.059473]  ? __warn+0x80/0x120
[   29.059983]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
[   29.060728]  ? report_bug+0x160/0x190
[   29.061283]  ? handle_bug+0x36/0x70
[   29.061830]  ? exc_invalid_op+0x17/0x60
[   29.062399]  ? asm_exc_invalid_op+0x1a/0x20
[   29.063025]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
[   29.063760]  bio_endio+0xb7/0x150
[   29.064280]  btrfs_clone_write_end_io+0x2b/0x60 [btrfs]
[   29.065049]  blk_update_request+0x17c/0x500
[   29.065666]  scsi_end_request+0x27/0x1a0 [scsi_mod]
[   29.066356]  scsi_io_completion+0x5b/0x690 [scsi_mod]
[   29.067077]  blk_complete_reqs+0x3a/0x50
[   29.067692]  __do_softirq+0xcf/0x2b3
[   29.068248]  ? sort_range+0x20/0x20
[   29.068791]  run_ksoftirqd+0x1c/0x30
[   29.069339]  smpboot_thread_fn+0xcc/0x1b0
[   29.069936]  kthread+0xcf/0x100
[   29.070438]  ? kthread_complete_and_exit+0x20/0x20
[   29.071314]  ret_from_fork+0x31/0x50
[   29.071873]  ? kthread_complete_and_exit+0x20/0x20
[   29.072563]  ret_from_fork_asm+0x11/0x20
[   29.073146]  </TASK>

either when fio executes or when unmount is executed.

Fix this by modifying disk_should_remove_zone_wplug() to check that the
reference count to a zone write plug is not larger than 2, that is, that
the only references left on the zone are the caller held reference
(blk_zone_write_plug_complete_request()) and the initial extra reference
for the zone write plug taken when it was initialized (and that is
dropped when the zone write plug is removed from the hash table).

To be consistent with this change, make sure to drop the request or BIO
held reference to the zone write plug before calling
disk_zone_wplug_unplug_bio(). All references are also dropped using
disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone
write plug is freed if it needs to be.

Comments are also improved to clarify zone write plugs reference
handling.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240501110907.96950-8-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:43 -06:00
Damien Le Moal
79ae35a423 block: Unhash a zone write plug only if needed
Fix disk_remove_zone_wplug() to ensure that a zone write plug already
removed from a disk hash table of zone write plugs is not removed
again. Do this by checking the BLK_ZONE_WPLUG_UNHASHED flag of the plug
and calling hlist_del_init_rcu() only if the flag is not set.

Furthermore, since BIO completions can happen at any time, that is,
decrementing of the zone write plug reference count can happen at any
time, make sure to use disk_put_zone_wplug() instead of atomic_dec() to
ensure that the zone write plug is freed when its last reference is
dropped. In order to do this, disk_remove_zone_wplug() is moved after
the definition of disk_put_zone_wplug(). disk_should_remove_zone_wplug()
is moved as well to keep it together with disk_remove_zone_wplug().

To be consistent with this change, add a check in disk_put_zone_wplug()
to ensure that a zone write plug being freed was already removed from
the disk hash table.

Fixes: dd291d77cc ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240501110907.96950-7-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:43 -06:00
Damien Le Moal
9e78c38ab3 block: Hold a reference on zone write plugs to schedule submission
Since a zone write plug BIO work is a field of struct blk_zone_wplug, we
must ensure that a zone write plug is never freed when its BIO
submission work is queued or running. Do this by holding a reference on
the zone write plug when the submission work is scheduled for execution
with queue_work() and releasing the reference at the end of the
execution of the work function blk_zone_wplug_bio_work().
The helper function disk_zone_wplug_schedule_bio_work() is introduced to
get a reference on a zone write plug and queue its work. This helper is
used in disk_zone_wplug_unplug_bio() and disk_zone_wplug_handle_error().

Fixes: dd291d77cc ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240501110907.96950-6-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:43 -06:00
Damien Le Moal
19aad274c2 block: Fix reference counting for zone write plugs in error state
When zone is reset or finished, disk_zone_wplug_set_wp_offset() is
called to update the zone write plug write pointer offset and to clear
the zone error state (BLK_ZONE_WPLUG_ERROR flag) if it is set.
However, this processing is missing dropping the reference to the zone
write plug that was taken in disk_zone_wplug_set_error() when the error
flag was first set. Furthermore, the error state handling must release
the zone write plug lock to first execute a report zones command. When
the report zone races with a reset or finish operation that clears the
error, we can end up decrementing the zone write plug reference count
twice: once in disk_zone_wplug_set_wp_offset() for the reset/finish
operation and one more time in disk_zone_wplugs_work() once
disk_zone_wplug_handle_error() completes.

Fix this by introducing disk_zone_wplug_clear_error() as the symmetric
function of disk_zone_wplug_set_error(). disk_zone_wplug_clear_error()
decrements the zone write plug reference count obtained in
disk_zone_wplug_set_error() only if the error handling has not started
yet, that is, only if disk_zone_wplugs_work() has not yet taken the zone
write plug off the error list. This ensure that either
disk_zone_wplug_clear_error() or disk_zone_wplugs_work() drop the zone
write plug reference count.

Fixes: dd291d77cc ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240501110907.96950-5-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:43 -06:00
Damien Le Moal
74b7ae5f48 block: Fix zone write plug initialization from blk_revalidate_zone_cb()
When revalidating the zones of a zoned block device,
blk_revalidate_zone_cb() must allocate a zone write plug for any
sequential write required zone that is not empty nor full. However, the
current code tests the latter case by comparing the zone write pointer
offset to the zone size instead of the zone capacity. Furthermore,
disk_get_and_lock_zone_wplug() is called with a sector argument equal to
the zone start instead of the current zone write pointer position.
This commit fixes both issues by calling disk_get_and_lock_zone_wplug()
for a zone that is not empty and with a write pointer offset lower than
the zone capacity and use the zone capacity sector as the sector
argument for disk_get_and_lock_zone_wplug().

Fixes: dd291d77cc ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240501110907.96950-4-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:42 -06:00
Damien Le Moal
6b7593b5fb block: Exclude conventional zones when faking max open limit
For a device that has no limits for the maximum number of open and
active zones, we default to using the number of zones, limited to
BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE (128), for the maximum number of open
zones indicated to the user. However, for a device that has conventional
zones and less zones than BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE, we should
not account conventional zones and set the limit to the number of
sequential write required zones. Furthermore, for cases where the limit
is equal to the number of sequential write required zones, we can
advertize a limit of 0 to indicate "no limits".

Fix this by moving the zone write plug mempool resizing from
disk_revalidate_zone_resources() to disk_update_zone_resources() where
we can safely compute the number of conventional zones and update the
limits.

Fixes: 843283e96e ("block: Fake max open zones limit when there is no limit")
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240501110907.96950-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:42 -06:00
Linus Torvalds
52034cae02 vfs-6.9-rc6.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZiulnAAKCRCRxhvAZXjc
 ogO+AP9z3+WAvgGmJkWOjT1aOrcQWVe+ZEdEUdK26ufkHhM5vAD/RXmdUBVHcYWk
 3oE1hG8bONOASUc6dUIATPHBDjvqFg8=
 =LtmL
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "This contains a few small fixes for this merge window and the attempt
  to handle the ntfs removal regression that was reported a little while
  ago:

   - After the removal of the legacy ntfs driver we received reports
     about regressions for some people that do mount "ntfs" explicitly
     and expect the driver to be available. Since ntfs3 is a drop-in for
     legacy ntfs we alias legacy ntfs to ntfs3 just like ext3 is aliased
     to ext4.

     We also enforce legacy ntfs is always mounted read-only and give it
     custom file operations to ensure that ioctl()'s can't be abused to
     perform write operations.

   - Fix an unbalanced module_get() in bdev_open().

   - Two smaller fixes for the netfs work done earlier in this cycle.

   - Fix the errno returned from the new FS_IOC_GETUUID and
     FS_IOC_GETFSSYSFSPATH ioctls. Both commands just pull information
     out of the superblock so there's no need to call into the actual
     ioctl handlers.

     So instead of returning ENOIOCTLCMD to indicate to fallback we just
     return ENOTTY directly avoiding that indirection"

* tag 'vfs-6.9-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  netfs: Fix the pre-flush when appending to a file in writethrough mode
  netfs: Fix writethrough-mode error handling
  ntfs3: add legacy ntfs file operations
  ntfs3: enforce read-only when used as legacy ntfs driver
  ntfs3: serve as alias for the legacy ntfs driver
  block: fix module reference leakage from bdev_open_by_dev error path
  fs: Return ENOTTY directly if FS_IOC_GETUUID or FS_IOC_GETFSSYSFSPATH fail
2024-04-26 11:01:28 -07:00
Arnd Bergmann
597bc741e5 block/partitions/ldm: convert strncpy() to strscpy()
The strncpy() here can cause a non-terminated string, which older gcc
versions such as gcc-9 warn about:

In function 'ldm_parse_tocblock',
    inlined from 'ldm_validate_tocblocks' at block/partitions/ldm.c:386:7,
    inlined from 'ldm_partition' at block/partitions/ldm.c:1457:7:
block/partitions/ldm.c:134:2: error: 'strncpy' specified bound 16 equals destination size [-Werror=stringop-truncation]
  134 |  strncpy (toc->bitmap1_name, data + 0x24, sizeof (toc->bitmap1_name));
      |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
block/partitions/ldm.c:145:2: error: 'strncpy' specified bound 16 equals destination size [-Werror=stringop-truncation]
  145 |  strncpy (toc->bitmap2_name, data + 0x46, sizeof (toc->bitmap2_name));
      |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

New versions notice that the code is correct after all because of the
following termination, but replacing the strncpy() with strscpy_pad()
or strcpy() avoids the warning and simplifies the code at the same time.

Use the padding version here to keep the existing behavior, in case
the code relies on not including uninitialized data.

Link: https://lkml.kernel.org/r/20240409140059.3806717-4-arnd@kernel.org
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexey Starikovskiy <astarikovskiy@suse.de>
Cc: Bob Moore <robert.moore@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Len Brown <lenb@kernel.org>
Cc: Lin Ming <ming.m.lin@intel.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Richard Russon (FlatCap)" <ldm@flatcap.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 21:07:07 -07:00
Johannes Thumshirn
57787fa42f block: check if zone_wplugs_hash exists in queue_zone_wplugs_show
Changhui reported a kernel crash when running this simple shell
reproducer:
 # cd /sys/kernel/debug/block && find  . -type f   -exec grep -aH . {} \;

The above results in a NULL pointer dereference if a device does not have
a zone_wplugs_hash allocated.

To fix this, return early if we don't have a zone_wplugs_hash.

Reported-by: Changhui Zhong <czhong@redhat.com>
Fixes: a98b05b02f ("block: Replace zone_wlock debugfs entry with zone_wplugs entry")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/e5fec079dfca448cc21c425cfa5d7b291f5faa67.1714046443.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-25 07:47:46 -06:00
Damien Le Moal
a8f59e5a5d block: use a per disk workqueue for zone write plugging
A zone write plug BIO work function blk_zone_wplug_bio_work() calls
submit_bio_noacct_nocheck() to execute the next unplugged BIO. This
function may block. So executing zone plugs BIO works using the block
layer global kblockd workqueue can potentially lead to preformance or
latency issues as the number of concurrent work for a workqueue is
limited to WQ_DFL_ACTIVE (256).
1) For a system with a large number of zoned disks, issuing write
   requests to otherwise unused zones may be delayed wiating for a work
   thread to become available.
2) Requeue operations which use kblockd but are independent of zone
   write plugging may alsoi end up being delayed.

To avoid these potential performance issues, create a workqueue per
zoned device to execute zone plugs BIO work. The workqueue max active
parameter is set to the maximum number of zone write plugs allocated
with the zone write plug mempool. This limit is equal to the maximum
number of open zones of the disk and defaults to 128 for disks that do
not have a limit on the number of open zones.

Fixes: dd291d77cc ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240420075811.1276893-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-23 09:46:34 -06:00
Linus Torvalds
977b1ef518 block-6.9-20240420
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmYj3soQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpin7D/9wn9XnjUJVp8Pw2by2gY+j9V4Mlr30+HIW
 7XT51PHEYWpHQfMCm2gMdc16hb2w+GKkf40ymU9/fnEuqWfs2MAiw3tD7Seo8vTr
 qiLCp4hYQNM6YF6MQA4It2HJE2r0o8QRPjwFQSY0pnPZe+NPT/cdyKgJAlns4VW8
 d5kMOw7hLfZvN6iLjOW0hz0dQqoFdOGP9/QrXFgNzaexnJxDA+N8D7E5WEGjXvJj
 mHCqXXZEKMj2phuUlKfSeRDGGVDL8Zv2/whPD1TlNHn/8683lSwHXISEaw5KCBb2
 9dVFPMQv4eFY0yCBbqmfxOBki/0KElYKZ+ri3A0kdEnJG67F7LCIWEyGhIfZuGXl
 MGjzaSI8HSdUfUPgn0b4Ad1/cTpUaeHIu7b+x63KlbBO5sBbwh4tKUkuj30s7wP4
 FC9egqFL+B4JyzuMPvWtDKvA8v+KMRYsMBNUkYEy/DfQUuf2lmf6dtGSDBK94QvX
 n7Vdzxkm0gTHuJPnrkt4esS2dwCgMqgk6BpQDJ6ODkMWLtebw7ZYMIoFDJknbWgT
 W8uovm1uejUbsdjzvvG1ioL/ry3GiaP6sN8TEWHeq0RZrFGPwDjjpu4HVeEXrD0Z
 PpglL9LDj5bE2IJVCpaEyn86O3eqVeFfoHatAoFrbAKuJjSDALGM9wpC4UNgBFvN
 CSZ/ZiTKlA==
 =EMuq
 -----END PGP SIGNATURE-----

Merge tag 'block-6.9-20240420' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Just two minor fixes that should go into the 6.9 kernel release, one
  fixing a regression with partition scanning errors, and one fixing a
  WARN_ON() that can get triggered if we race with a timer"

* tag 'block-6.9-20240420' of git://git.kernel.dk/linux:
  blk-iocost: do not WARN if iocg was already offlined
  block: propagate partition scanning errors to the BLKRRPART ioctl
2024-04-20 11:28:02 -07:00
Jiapeng Chong
8294d49adb block/mq-deadline: Remove some unused functions
These functions are defined in the mq-deadline.c file, but not called
elsewhere, so delete these unused functions.

block/mq-deadline.c:134:1: warning: unused function 'deadline_earlier_request'.
block/mq-deadline.c:148:1: warning: unused function 'deadline_latter_request'.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8803
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240419025610.34298-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-19 08:10:36 -06:00
Li Nan
01bc4fda9e blk-iocost: do not WARN if iocg was already offlined
In iocg_pay_debt(), warn is triggered if 'active_list' is empty, which
is intended to confirm iocg is active when it has debt. However, warn
can be triggered during a blkcg or disk removal, if iocg_waitq_timer_fn()
is run at that time:

  WARNING: CPU: 0 PID: 2344971 at block/blk-iocost.c:1402 iocg_pay_debt+0x14c/0x190
  Call trace:
  iocg_pay_debt+0x14c/0x190
  iocg_kick_waitq+0x438/0x4c0
  iocg_waitq_timer_fn+0xd8/0x130
  __run_hrtimer+0x144/0x45c
  __hrtimer_run_queues+0x16c/0x244
  hrtimer_interrupt+0x2cc/0x7b0

The warn in this situation is meaningless. Since this iocg is being
removed, the state of the 'active_list' is irrelevant, and 'waitq_timer'
is canceled after removing 'active_list' in ioc_pd_free(), which ensures
iocg is freed after iocg_waitq_timer_fn() returns.

Therefore, add the check if iocg was already offlined to avoid warn
when removing a blkcg or disk.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240419093257.3004211-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-19 08:06:24 -06:00
Christoph Hellwig
752863bdda block: propagate partition scanning errors to the BLKRRPART ioctl
Commit 4601b4b130 ("block: reopen the device in blkdev_reread_part")
lost the propagation of I/O errors from the low-level read of the
partition table to the user space caller of the BLKRRPART.

Apparently some user space relies on, so restore the propagation.  This
isn't exactly pretty as other block device open calls explicitly do not
are about these errors, so add a new BLK_OPEN_STRICT_SCAN to opt into
the error propagation.

Fixes: 4601b4b130 ("block: reopen the device in blkdev_reread_part")
Reported-by: Saranya Muruganandam <saranyamohan@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20240417144743.2277601-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-18 09:34:34 -06:00
Damien Le Moal
99a9476b27 block: Do not special-case plugging of zone write operations
With the block layer zone write plugging being automatically done for
any write operation to a zone of a zoned block device, a regular request
plugging handled through current->plug can only ever see at most a
single write request per zone. In such case, any potential reordering
of the plugged requests will be harmless. We can thus remove the special
casing for write operations to zones and have these requests plugged as
well. This allows removing the function blk_mq_plug and instead directly
using current->plug where needed.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-29-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
97abee507b block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED
Now that zone block device write ordering control does not depend
anymore on mq-deadline and zone write locking, there is no need to force
select the mq-deadline scheduler when CONFIG_BLK_DEV_ZONED is enabled.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-28-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
02ccd7c360 block: Remove zone write locking
Zone write locking is now unused and replaced with zone write plugging.
Remove all code that was implementing zone write locking, that is, the
various helper functions controlling request zone write locking and
the gendisk attached zone bitmaps.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-27-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
a98b05b02f block: Replace zone_wlock debugfs entry with zone_wplugs entry
In preparation to completely remove zone write locking, replace the
"zone_wlock" mq-debugfs entry that was listing zones that are
write-locked with the zone_wplugs entry which lists the zones that
currently have a write plug allocated.

The write plug information provided is: the zone number, the zone write
plug flags, the zone write plug write pointer offset and the number of
BIOs currently waiting for execution in the zone write plug BIO list.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-26-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
d9f1439a30 block: Move zone related debugfs attribute to blk-zoned.c
block/blk-mq-debugfs-zone.c contains a single debugfs attribute
function. Defining this outside of block/blk-zoned.c does not really
help in any way, so move this zone related debugfs attribute to
block/blk-zoned.c and delete block/blk-mq-debugfs-zone.c.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-25-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
bca150f0d4 block: Do not check zone type in blk_check_zone_append()
Zone append operations are only allowed to target sequential write
required zones. blk_check_zone_append() uses bio_zone_is_seq() to check
this. However, this check is not necessary because:
1) For NVMe ZNS namespace devices, only sequential write required zones
   exist, making the zone type check useless.
2) For null_blk, the driver will fail the request anyway, thus notifying
   the user that a conventional zone was targeted.
3) For all other zoned devices, zone append is now emulated using zone
   write plugging, which checks that a zone append operation does not
   target a conventional zone.

In preparation for the removal of zone write locking and its
conventional zone bitmap (used by bio_zone_is_seq()), remove the
bio_zone_is_seq() call from blk_check_zone_append().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-24-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
e4eb37cc0f block: Remove elevator required features
The only elevator feature ever implemented is ELEVATOR_F_ZBD_SEQ_WRITE
for signaling that a scheduler implements zone write locking to tightly
control the dispatching order of write operations to zoned block
devices. With the removal of zone write locking support in mq-deadline
and the reliance of all block device drivers on the block layer zone
write plugging to control ordering of write operations to zones, the
elevator feature ELEVATOR_F_ZBD_SEQ_WRITE is completely unused.
Remove it, and also remove the now unused code for filtering the
possible schedulers for a block device based on required features.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-23-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
fde02699c2 block: mq-deadline: Remove support for zone write locking
With the block layer generic plugging of write operations for zoned
block devices, mq-deadline, or any other scheduler, can only ever
see at most one write operation per zone at any time. There is thus no
sequentiality requirements for these writes and thus no need to tightly
control the dispatching of write requests using zone write locking.

Remove all the code that implement this control in the mq-deadline
scheduler and remove advertizing support for the
ELEVATOR_F_ZBD_SEQ_WRITE elevator feature.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-22-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
9b3c08b90f block: Simplify blk_revalidate_disk_zones() interface
The only user of blk_revalidate_disk_zones() second argument was the
SCSI disk driver (sd). Now that this driver does not require this
update_driver_data argument, remove it to simplify the interface of
blk_revalidate_disk_zones(). Also update the function kdoc comment to
be more accurate (i.e. there is no gendisk ->revalidate method).

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-21-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
63b5385e78 block: Remove BLK_STS_ZONE_RESOURCE
The zone append emulation of the scsi disk driver was the only driver
using BLK_STS_ZONE_RESOURCE. With this code removed,
BLK_STS_ZONE_RESOURCE is now unused. Remove this macro definition and
simplify blk_mq_dispatch_rq_list() where this status code was handled.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-20-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
946dd71ed8 block: Allow BIO-based drivers to use blk_revalidate_disk_zones()
In preparation for allowing BIO based device drivers to use zone write
plugging and its zone append emulation, allow these drivers to call
blk_revalidate_disk_zones() so that all zone resources necessary to zone
write plugging can be initialized.

To do so, remove the check in blk_revalidate_disk_zones() restricting
the use of this function to mq request-based drivers to allow also
BIO-based drivers to use it. This is safe to do as long as the
BIO-based block device queue is already setup and usable, as it should,
and can be safely frozen.

The helper function disk_need_zone_resources() is added to control the
allocation and initialization of the zone write plug hash table and
of the conventional zone bitmap only for mq devices and for BIO-based
devices that require zone append emulation.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-12-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
9b1ce7f0c6 block: Implement zone append emulation
Given that zone write plugging manages all writes to zones of a zoned
block device and tracks the write pointer position of all zones that are
not full nor empty, emulating zone append operations using regular
writes can be implemented generically, without relying on the underlying
device driver to implement such emulation. This is needed for devices
that do not natively support the zone append command (e.g. SMR
hard-disks).

A device may request zone append emulation by setting its
max_zone_append_sectors queue limit to 0. For such device, the function
blk_zone_wplug_prepare_bio() changes zone append BIOs into
non-mergeable regular write BIOs. Modified zone append BIOs are flagged
with the new BIO flag BIO_EMULATES_ZONE_APPEND. This flag is checked
on completion of the BIO in blk_zone_write_plug_bio_endio() to restore
the original REQ_OP_ZONE_APPEND operation code of the BIO.

The block layer internal inline helper function bio_is_zone_append() is
added to test if a BIO is either a native zone append operation
(REQ_OP_ZONE_APPEND operation code) or if it is flagged with
BIO_EMULATES_ZONE_APPEND. Given that both native and emulated zone
append BIO completion handling should be similar, The functions
blk_update_request() and blk_zone_complete_request_bio() are modified to
use bio_is_zone_append() to execute blk_zone_update_request_bio() for
both native and emulated zone append operations.

This commit contains contributions from Christoph Hellwig <hch@lst.de>.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-11-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
ccdbf0aad2 block: Allow zero value of max_zone_append_sectors queue limit
In preparation for adding a generic zone append emulation using zone
write plugging, allow device drivers supporting zoned block device to
set a the max_zone_append_sectors queue limit of a device to 0 to
indicate the lack of native support for zone append operations and that
the block layer should emulate these operations using regular write
operations.

blk_queue_max_zone_append_sectors() is modified to allow passing 0 as
the max_zone_append_sectors argument. The function
queue_max_zone_append_sectors() is also modified to ensure that the
minimum of the max_hw_sectors and chunk_sectors limit is used whenever
the max_zone_append_sectors limit is 0. This minimum is consistent with
the value set for the max_zone_append_sectors limit by the function
blk_validate_zoned_limits() when limits for a queue are validated.

The helper functions queue_emulates_zone_append() and
bdev_emulates_zone_append() are added to test if a queue (or block
device) emulates zone append operations.

In order for blk_revalidate_disk_zones() to accept zoned block devices
relying on zone append emulation, the direct check to the
max_zone_append_sectors queue limit of the disk is replaced by a check
using the value returned by queue_max_zone_append_sectors(). Similarly,
queue_zone_append_max_show() is modified to use the same accessor so
that the sysfs attribute advertizes the non-zero limit that will be
used, regardless if it is for native or emulated commands.

For stacking drivers, a top device should not need to care if the
underlying devices have native or emulated zone append operations.
blk_stack_limits() is thus modified to set the top device
max_zone_append_sectors limit using the new accessor
queue_limits_max_zone_append_sectors(). queue_max_zone_append_sectors()
is modified to use this function as well. Stacking drivers that require
zone append emulation, e.g. dm-crypt, can still request this feature by
calling blk_queue_max_zone_append_sectors() with a 0 limit.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-10-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
843283e96e block: Fake max open zones limit when there is no limit
For a zoned block device that has no limit on the number of open zones
and no limit on the number of active zones, the zone write plug mempool
is created with a size of 128 zone write plugs. For such case, set the
device max_open_zones queue limit to this value to indicate to the user
the potential performance penalty that may happen when writing
simultaneously to more zones than the mempool size.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-9-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
dd291d77cc block: Introduce zone write plugging
Zone write plugging implements a per-zone "plug" for write operations
to control the submission and execution order of write operations to
sequential write required zones of a zoned block device. Per-zone
plugging guarantees that at any time there is at most only one write
request per zone being executed. This mechanism is intended to replace
zone write locking which implements a similar per-zone write throttling
at the scheduler level, but is implemented only by mq-deadline.

Unlike zone write locking which operates on requests, zone write
plugging operates on BIOs. A zone write plug is simply a BIO list that
is atomically manipulated using a spinlock and a kblockd submission
work. A write BIO to a zone is "plugged" to delay its execution if a
write BIO for the same zone was already issued, that is, if a write
request for the same zone is being executed. The next plugged BIO is
unplugged and issued once the write request completes.

This mechanism allows to:
 - Untangle zone write ordering from block IO schedulers. This allows
   removing the restriction on using mq-deadline for writing to zoned
   block devices. Any block IO scheduler, including "none" can be used.
 - Zone write plugging operates on BIOs instead of requests. Plugged
   BIOs waiting for execution thus do not hold scheduling tags and thus
   are not preventing other BIOs from executing (reads or writes to
   other zones). Depending on the workload, this can significantly
   improve the device use (higher queue depth operation) and
   performance.
 - Both blk-mq (request based) zoned devices and BIO-based zoned devices
   (e.g.  device mapper) can use zone write plugging. It is mandatory
   for the former but optional for the latter. BIO-based drivers can
   use zone write plugging to implement write ordering guarantees, or
   the drivers can implement their own if needed.
 - The code is less invasive in the block layer and is mostly limited to
   blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
   bio.c.

Zone write plugging is implemented using struct blk_zone_wplug. This
structure includes a spinlock, a BIO list and a work structure to
handle the submission of plugged BIOs. Zone write plugs structures are
managed using a per-disk hash table.

Plugging of zone write BIOs is done using the function
blk_zone_write_plug_bio() which returns false if a BIO execution does
not need to be delayed and true otherwise. This function is called
from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
spanning multiple zones which would cause mishandling of zone write
plugs. This ichange enables by default zone write plugging for any mq
request-based block device. BIO-based device drivers can also use zone
write plugging by expliclty calling blk_zone_write_plug_bio() in their
->submit_bio method. For such devices, the driver must ensure that a
BIO passed to blk_zone_write_plug_bio() is already split and not
straddling zone boundaries.

Only write and write zeroes BIOs are plugged. Zone write plugging does
not introduce any significant overhead for other operations. A BIO that
is being handled through zone write plugging is flagged using the new
BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
The completion of BIOs and requests flagged trigger respectively calls
to the functions blk_zone_write_bio_endio() and
blk_zone_write_complete_request(). The latter function is used to
trigger submission of the next plugged BIO using the zone plug work.
blk_zone_write_bio_endio() does the same for BIO-based devices.
This ensures that at any time, at most one request (blk-mq devices) or
one BIO (BIO-based devices) is being executed for any zone. The
handling of zone write plugs using a per-zone plug spinlock maximizes
parallelism and device usage by allowing multiple zones to be writen
simultaneously without lock contention.

Zone write plugging ignores flush BIOs without data. Hovever, any flush
BIO that has data is always plugged so that the write part of the flush
sequence is serialized with other regular writes.

Given that any BIO handled through zone write plugging will be the only
BIO in flight for the target zone when it is executed, the unplugging
and submission of a BIO will have no chance of successfully merging with
plugged requests or requests in the scheduler. To overcome this
potential performance degradation, blk_mq_submit_bio() calls the
function blk_zone_write_plug_attempt_merge() to try to merge other
plugged BIOs with the one just unplugged and submitted. Successful
merging is signaled using blk_zone_write_plug_bio_merged(), called from
bio_attempt_back_merge(). Furthermore, to avoid recalculating the number
of segments of plugged BIOs to attempt merging, the number of segments
of a plugged BIO is saved using the new struct bio field
__bi_nr_segments. To avoid growing the size of struct bio, this field is
added as a union with the bio_cookie field. This is safe to do as
polling is always disabled for plugged BIOs.

When BIOs are plugged in a zone write plug, the device request queue
usage counter is always incremented. This reference is kept and reused
for blk-mq devices when the plugged BIO is unplugged and submitted
again using submit_bio_noacct_nocheck(). For this case, the unplugged
BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
blk_mq_submit_bio() proceeds directly to allocating a new request for
the BIO, re-using the usage reference count taken when the BIO was
plugged. This extra reference count is dropped in
blk_zone_write_plug_attempt_merge() for any plugged BIO that is
successfully merged. Given that BIO-based devices will not take this
path, the extra reference is dropped after a plugged BIO is unplugged
and submitted.

Zone write plugs are dynamically allocated and managed using a hash
table (an array of struct hlist_head) with RCU protection.
A zone write plug is allocated when a write BIO is received for the
zone and not freed until the zone is fully written, reset or finished.
To detect when a zone write plug can be freed, the write state of each
zone is tracked using a write pointer offset which corresponds to the
offset of a zone write pointer relative to the zone start. Write
operations always increment this write pointer offset. Zone reset
operations set it to 0 and zone finish operations set it to the zone
size.

If a write error happens, the wp_offset value of a zone write plug may
become incorrect and out of sync with the device managed write pointer.
This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
The function blk_zone_wplug_handle_error() is called from the new disk
zone write plug work when this flag is set. This function executes a
report zone to update the zone write pointer offset to the current
value as indicated by the device. The disk zone write plug work is
scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
write. Once scheduled, the disk zone write plugs work keeps running
until all zone errors are handled.

To match the new data structures used for zoned disks, the function
disk_free_zone_bitmaps() is renamed to the more generic
disk_free_zone_resources(). The function disk_init_zone_resources() is
also introduced to initialize zone write plugs resources when a gendisk
is allocated.

In order to guarantee that the user can simultaneously write up to a
number of zones equal to a device max active zone limit or max open zone
limit, zone write plugs are allocated using a mempool sized to the
maximum of these 2 device limits. For a device that does not have
active and open zone limits, 128 is used as the default mempool size.

If a change to the device active and open zone limits is detected, the
disk mempool is resized when blk_revalidate_disk_zones() is executed.

This commit contains contributions from Christoph Hellwig <hch@lst.de>.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240408014128.205141-8-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
ecfe43b11b block: Remember zone capacity when revalidating zones
In preparation for adding zone write plugging, modify
blk_revalidate_disk_zones() to get the capacity of zones of a zoned
block device. This capacity value as a number of 512B sectors is stored
in the gendisk zone_capacity field.

Given that host-managed SMR disks (including zoned UFS drives) and all
known NVMe ZNS devices have the same zone capacity for all zones
blk_revalidate_disk_zones() returns an error if different capacities are
detected for different zones.

This also adds check to verify that the values reported by the device
for zone capacities are correct, that is, that the zone capacity is
never 0, does not exceed the zone size and is equal to the zone size for
conventional zones.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240408014128.205141-7-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:02 -06:00
Damien Le Moal
dd850ff3ee block: Allow using bio_attempt_back_merge() internally
Remove "static" from the definition of bio_attempt_back_merge() and
declare this function in block/blk.h to allow using it internally from
other block layer files. The definition of enum bio_merge_status is
also moved to block/blk.h.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240408014128.205141-6-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:02 -06:00
Damien Le Moal
a0508c36ef block: Introduce blk_zone_update_request_bio()
On completion of a zone append request, the request sector indicates the
location of the written data. This value must be returned to the user
through the BIO iter sector. This is done in 2 places: in
blk_complete_request() and in blk_update_request(). Introduce the inline
helper function blk_zone_update_request_bio() to avoid duplicating
this BIO update for zone append requests, and to compile out this
helper call when CONFIG_BLK_DEV_ZONED is not enabled.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240408014128.205141-4-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:02 -06:00
Damien Le Moal
c0da26f950 block: Remove req_bio_endio()
Moving req_bio_endio() code into its only caller, blk_update_request(),
allows reducing accesses to and tests of bio and request fields. Also,
given that partial completions of zone append operations is not
possible and that zone append operations cannot be merged, the update
of the BIO sector using the request sector for these operations can be
moved directly before the call to bio_endio().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240408014128.205141-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:02 -06:00
Damien Le Moal
6f8fd758de block: Restore sector of flush requests
On completion of a flush sequence, blk_flush_restore_request() restores
the bio of a request to the original submitted BIO. However, the last
use of the request in the flush sequence may have been for a POSTFLUSH
which does not have a sector. So make sure to restore the request sector
using the iter sector of the original BIO. This BIO has not changed yet
since the completions of the flush sequence intermediate steps use
requeueing of the request until all steps are completed.

Restoring the request sector ensures that blk_mq_end_request() will see
a valid sector as originally set when the flush BIO was submitted.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240408014128.205141-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:02 -06:00
John Garry
de4c7bef9d block: Call blkdev_dio_unaligned() from blkdev_direct_IO()
blkdev_dio_unaligned() is called from __blkdev_direct_IO(),
__blkdev_direct_IO_simple(), and __blkdev_direct_IO_async(), and all these
are only called from blkdev_direct_IO().

Move the blkdev_dio_unaligned() call to the common callsite,
blkdev_direct_IO().

Pass those functions the bdev pointer from blkdev_direct_IO(), as it is
non-trivial to look up.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240415122020.1541594-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:12:22 -06:00
Linus Torvalds
d7ad058156 block-6.9-20240412
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmYZWKYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpl7VD/0VyVykpWz9ydtEQvlTLndMASi4U0RpSPPy
 mPX8ydz5k8IywiaAaKxCWK0hCSEXWceQKrbqJqQau9M33JD1Ss7x44BQR/XBZD5J
 vRV5PbnMpIFL1knGzf9xgXKAeEHI2Hi/VKuW3dMoXcXHABfHlbS2ulvpQ97S9oAo
 iCS3728sFOiqcKh2SSSMI04f0o+48F2mh3g8CAd45Kx/kwbEogaxc6U590wbnA+5
 +TTVb+XRT0x55DCq0awtXDxLs4030IkcqJ+S5891m5Dfm8LUt0m0ekr8R9ks/sA+
 +axbcLLw3M3B6QTsT8QCytpAyW5eNBdyGSj4BFl5AOGjUDjuassdzXFtKevItgEd
 kPsuYexNHERnntb+zEXfO7wVvyewy4Dby/WjoYi2hOhvkt73UrR9y93cp3Lq3iI4
 G1+JNLNVkQo/bI0ZtyN5+q39odHcvhGvAr6zAygR+fqU6qOymRvQBpV/bnnMrv2j
 wxlGC5r7tGd7kEwTg/UBK+sVR8OyOm8IJAszK372KRORDRiYLlw7cF+4d5jpG9P5
 frk6z6QrJPoH15iaOjZY62PKEEXiuJbQm2FJYMZvD5upR+UHVvg2xVdLMy/tSnaX
 zgNzlPFPKXXQ72IS8shJ9JE0XZWfZxirbUNIL+1Z2HUUj3t5SduBV/dWEeupPvuv
 5Ml9Lg2uMA==
 =rCiN
 -----END PGP SIGNATURE-----

Merge tag 'block-6.9-20240412' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - MD pull request via Song:
       - UAF fix (Yu)

 - Avoid out-of-bounds shift in blk-iocost (Rik)

 - Fix for q->blkg_list corruption (Ming)

 - Relax virt boundary mask/size segment checking (Ming)

* tag 'block-6.9-20240412' of git://git.kernel.dk/linux:
  block: fix that blk_time_get_ns() doesn't update time after schedule
  block: allow device to have both virt_boundary_mask and max segment size
  block: fix q->blkg_list corruption during disk rebind
  blk-iocost: avoid out of bounds shift
  raid1: fix use-after-free for original bio in raid1_write_request()
2024-04-12 10:22:33 -07:00
Yu Kuai
3ec4848913 block: fix that blk_time_get_ns() doesn't update time after schedule
While monitoring the throttle time of IO from iocost, it's found that
such time is always zero after the io_schedule() from ioc_rqos_throttle,
for example, with the following debug patch:

+       printk("%s-%d: %s enter %llu\n", current->comm, current->pid, __func__, blk_time_get_ns());
        while (true) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (wait.committed)
                        break;
                io_schedule();
        }
+       printk("%s-%d: %s exit  %llu\n", current->comm, current->pid, __func__, blk_time_get_ns());

It can be observerd that blk_time_get_ns() always return the same time:

[ 1068.096579] fio-1268: ioc_rqos_throttle enter 1067901962288
[ 1068.272587] fio-1268: ioc_rqos_throttle exit  1067901962288
[ 1068.274389] fio-1268: ioc_rqos_throttle enter 1067901962288
[ 1068.472690] fio-1268: ioc_rqos_throttle exit  1067901962288
[ 1068.474485] fio-1268: ioc_rqos_throttle enter 1067901962288
[ 1068.672656] fio-1268: ioc_rqos_throttle exit  1067901962288
[ 1068.674451] fio-1268: ioc_rqos_throttle enter 1067901962288
[ 1068.872655] fio-1268: ioc_rqos_throttle exit  1067901962288

And I think the root cause is that 'PF_BLOCK_TS' is always cleared
by blk_flush_plug() before scheduel(), hence blk_plug_invalidate_ts()
will never be called:

blk_time_get_ns
 plug->cur_ktime = ktime_get_ns();
 current->flags |= PF_BLOCK_TS;

io_schedule:
 io_schedule_prepare
  blk_flush_plug
   __blk_flush_plug
    /* the flag is cleared, while time is not */
    current->flags &= ~PF_BLOCK_TS;
 schedule
 sched_update_worker
  /* the flag is not set, hence plug->cur_ktime is not cleared */
  if (tsk->flags & PF_BLOCK_TS)
   blk_plug_invalidate_ts()

blk_time_get_ns
 /* got the time stashed before schedule */
 return plug->cur_ktime;

Fix the problem by clearing cached time in __blk_flush_plug().

Fixes: 06b23f92af ("block: update cached timestamp post schedule/preemption")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240411032349.3051233-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-12 08:31:54 -06:00
Christoph Hellwig
ec84ca4025 scsi: block: Remove now unused queue limits helpers
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240409143748.980206-24-hch@lst.de
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2024-04-12 06:32:01 -04:00
Christoph Hellwig
4373d2ecca scsi: bsg: Pass queue_limits to bsg_setup_queue()
This allows bsg_setup_queue() to pass them to blk_mq_alloc_queue() and thus
set up the limits at queue allocation time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240409143748.980206-3-hch@lst.de
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2024-04-11 21:37:48 -04:00
Yu Kuai
9617cd6f24
block: fix module reference leakage from bdev_open_by_dev error path
At the time bdev_may_open() is called, module reference is grabbed
already, hence module reference should be released if bdev_may_open()
failed.

This problem is found by code review.

Fixes: ed5cc702d3 ("block: Add config option to not allow writing to mounted devices")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240406090930.2252838-22-yukuai1@huaweicloud.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-11 11:14:36 +02:00
Ming Lei
b561ea56a2 block: allow device to have both virt_boundary_mask and max segment size
When one stacking device is over one device with virt_boundary_mask and
another one with max segment size, the stacking device have both limits
set. This way is allowed before d690cb8ae1 ("block: add an API to
atomically update queue limits").

Relax the limit so that we won't break such kind of stacking setting.

Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218687
Reported-by: janpieter.sollie@edpnet.be
Fixes: d690cb8ae1 ("block: add an API to atomically update queue limits")
Link: https://lore.kernel.org/linux-block/ZfGl8HzUpiOxCLm3@fedora/
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: dm-devel@lists.linux.dev
Cc: Song Liu <song@kernel.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20240407131931.4055231-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-07 15:50:33 -06:00
Ming Lei
8b8ace0803 block: fix q->blkg_list corruption during disk rebind
Multiple gendisk instances can allocated/added for single request queue
in case of disk rebind. blkg may still stay in q->blkg_list when calling
blkcg_init_disk() for rebind, then q->blkg_list becomes corrupted.

Fix the list corruption issue by:

- add blkg_init_queue() to initialize q->blkg_list & q->blkcg_mutex only
- move calling blkg_init_queue() into blk_alloc_queue()

The list corruption should be started since commit f1c006f1c6 ("blk-cgroup:
synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
which delays removing blkg from q->blkg_list into blkg_free_workfn().

Fixes: f1c006f1c6 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Fixes: 1059699f87 ("block: move blkcg initialization/destroy into disk allocation/release handler")
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240407125910.4053377-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-07 15:50:13 -06:00
Christian Brauner
210a03c9d5
fs: claw back a few FMODE_* bits
There's a bunch of flags that are purely based on what the file
operations support while also never being conditionally set or unset.
IOW, they're not subject to change for individual files. Imho, such
flags don't need to live in f_mode they might as well live in the fops
structs itself. And the fops struct already has that lonely
mmap_supported_flags member. We might as well turn that into a generic
fop_flags member and move a few flags from FMODE_* space into FOP_*
space. That gets us four FMODE_* bits back and the ability for new
static flags that are about file ops to not have to live in FMODE_*
space but in their own FOP_* space. It's not the most beautiful thing
ever but it gets the job done. Yes, there'll be an additional pointer
chase but hopefully that won't matter for these flags.

I suspect there's a few more we can move into there and that we can also
redirect a bunch of new flag suggestions that follow this pattern into
the fop_flags field instead of f_mode.

Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-07 13:49:02 +02:00
Rik van Riel
beaa51b360 blk-iocost: avoid out of bounds shift
UBSAN catches undefined behavior in blk-iocost, where sometimes
iocg->delay is shifted right by a number that is too large,
resulting in undefined behavior on some architectures.

[  186.556576] ------------[ cut here ]------------
UBSAN: shift-out-of-bounds in block/blk-iocost.c:1366:23
shift exponent 64 is too large for 64-bit type 'u64' (aka 'unsigned long long')
CPU: 16 PID: 0 Comm: swapper/16 Tainted: G S          E    N 6.9.0-0_fbk700_debug_rc2_kbuilder_0_gc85af715cac0 #1
Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A23 12/08/2020
Call Trace:
 <IRQ>
 dump_stack_lvl+0x8f/0xe0
 __ubsan_handle_shift_out_of_bounds+0x22c/0x280
 iocg_kick_delay+0x30b/0x310
 ioc_timer_fn+0x2fb/0x1f80
 __run_timer_base+0x1b6/0x250
...

Avoid that undefined behavior by simply taking the
"delay = 0" branch if the shift is too large.

I am not sure what the symptoms of an undefined value
delay will be, but I suspect it could be more than a
little annoying to debug.

Signed-off-by: Rik van Riel <riel@surriel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240404123253.0f58010f@imladris.surriel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-05 20:07:40 -06:00
Linus Torvalds
8a05ef7087 block-6.9-20240405
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmYQT2oQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpt8dEADbrvjvMvjTSfskku0sof/Yv+0RkQfleRjD
 9nch6bcYHmnbgSNpKsf62gDKmGWWLfjiWaxzBy2u6ZJ+m/Yg7QWSqPZqM15Ayy05
 SsgJtb6N7AgTEOy3fpNLwLaQpSp0Mtx3lGPNJpahJmL9Wl+ZKl8EKoBL1GrvJpc7
 DCPenrbEtrXb+uunm8AnyDHgYhVmRx6S3K41JeINTC7ZiG5hc01xkh5DXNCXMF9I
 c+0asZDsADltbh6jA3tud12pnhdJFpSkHM3nsnFWB0rNKsXKRRSSj/Eexbq5+tYU
 38GQgwDtl8bwvCxmYRLj1PISrOROBiKC0or3wCTWW/3PInj4BS3Qry3j7r7HpYrJ
 4uy8REgHp2inZACZToBaRoZK2wrJeCHJDogZag3VAuthIsetRqb+uj+qGd+yQK/G
 XEIy2d9KwFC1mqXeUKy0jZVS2IfE4YQ8ZRB76ZP8wCK3a9mrfAv/WmONeZu9NFGs
 qvvpCoNJLDRT2WoygsbeXTmSRxGX2FK9F7VIfKFzpD6/JGY58S/N7QznMOVZKmBe
 Gnb7c+7tCVpCEpcRInN3UrUawKVWX/0x5YMcxi6vCkrf/asIuqqSkxL48vfPQ1b+
 r5rEsnXsMzbr6o7zbbowIFFZzdFbIutKHFmN8mPAThKf5JmX+Ccm7g3YMxVUqVut
 dOmIJZeclg==
 =ns3r
 -----END PGP SIGNATURE-----

Merge tag 'block-6.9-20240405' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
      - Atomic queue limits fixes (Christoph)
      - Fabrics fixes (Hannes, Daniel)

 - Discard overflow fix (Li)

 - Cleanup fix for null_blk (Damien)

* tag 'block-6.9-20240405' of git://git.kernel.dk/linux:
  nvme-fc: rename free_ctrl callback to match name pattern
  nvmet-fc: move RCU read lock to nvmet_fc_assoc_exists
  nvmet: implement unique discovery NQN
  nvme: don't create a multipath node for zero capacity devices
  nvme: split nvme_update_zone_info
  nvme-multipath: don't inherit LBA-related fields for the multipath node
  block: fix overflow in blk_ioctl_discard()
  nullblk: Fix cleanup order in null_add_dev() error path
2024-04-05 17:04:11 -07:00
Linus Torvalds
fae0268777 vfs-6.9-rc3.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZg/C8wAKCRCRxhvAZXjc
 oljxAQCneq62ginESgeQLw88fzSBTV4C50xXUA+Qz18AEgA/fgD+J3DlWquEHhMM
 tJmfs3aUn9w7+wDpukcsLjJfJEiSYA8=
 =f2Z6
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "This contains a few small fixes. This comes with some delay because I
  wanted to wait on people running their reproducers and the Easter
  Holidays meant that those replies came in a little later than usual:

   - Fix handling of preventing writes to mounted block devices.

     Since last kernel we allow to prevent writing to mounted block
     devices provided CONFIG_BLK_DEV_WRITE_MOUNTED isn't set and the
     block device is opened with restricted writes. When we switched to
     opening block devices as files we altered the mechanism by which we
     recognize when a block device has been opened with write
     restrictions.

     The detection logic assumed that only read-write mounted
     filesystems would apply write restrictions to their block devices
     from other openers. That of course is not true since it also makes
     sense to apply write restrictions for filesystems that are
     read-only.

     Fix the detection logic using an FMODE_* bit. We still have a few
     left since we freed up a couple a while ago. I also picked up a
     patch to free up four additional FMODE_* bits scheduled for the
     next merge window.

   - Fix counting the number of writers to a block device. This just
     changes the logic to be consistent.

   - Fix a bug in aio causing a NULL pointer derefernce after we
     implemented batched processing in aio.

   - Finally, add the changes we discussed that allows to yield block
     devices early even though file closing itself is deferred.

     This also allows us to remove two holder operations to get and
     release the holder to align lifetime of file and holder of the
     block device"

* tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  aio: Fix null ptr deref in aio_complete() wakeup
  fs,block: yield devices early
  block: count BLK_OPEN_RESTRICT_WRITES openers
  block: handle BLK_OPEN_RESTRICT_WRITES correctly
2024-04-05 09:47:26 -07:00
Kefeng Wang
688c8b9208 blk-cgroup: use group allocation/free of per-cpu counters API
Use group allocation/free of per-cpu counters api to accelerate
blkg_rwstat_init/exit() and simplify code.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Link: https://lore.kernel.org/r/20240325035955.50019-1-wangkefeng.wang@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-03 09:10:17 -06:00
Li Nan
22d24a544b block: fix overflow in blk_ioctl_discard()
There is no check for overflow of 'start + len' in blk_ioctl_discard().
Hung task occurs if submit an discard ioctl with the following param:
  start = 0x80000000000ff000, len = 0x8000000000fff000;
Add the overflow validation now.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240329012319.2034550-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02 07:43:24 -06:00
Christoph Hellwig
7a324d8389 blk-cgroup: use bio_list_merge_init
Use bio_list_merge_init instead of open coding bio_list_merge and
bio_list_init.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240328084147.2954434-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-01 11:53:37 -06:00
John Garry
d3a3a086ad blk-throttle: Only use seq_printf() in tg_prfill_limit()
Currently tg_prfill_limit() uses a combination of snprintf() and strcpy()
to generate the values parts of the limits string, before passing them as
arguments to seq_printf().

Convert to use only a sequence of seq_printf() calls per argument, which is
simpler.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240327094020.3505514-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-01 11:53:37 -06:00
Ming Lei
a46c27026d blk-mq: don't schedule block kworker on isolated CPUs
Kernel parameter of `isolcpus=` or 'nohz_full=' are used to isolate CPUs
for specific task, and it isn't expected to let block IO disturb these CPUs.
blk-mq kworker shouldn't be scheduled on isolated CPUs. Also if isolated
CPUs is run for blk-mq kworker, long block IO latency can be caused.

Kernel workqueue only respects CPU isolation for WQ_UNBOUND, for bound
WQ, the responsibility is on user because CPU is specified as WQ API
parameter, such as mod_delayed_work_on(cpu), queue_delayed_work_on(cpu)
and queue_work_on(cpu).

So not run blk-mq kworker on isolated CPUs by removing isolated CPUs
from hctx->cpumask. Meantime use queue map to check if all CPUs in this
hw queue are offline instead of hctx->cpumask, this way can avoid any
cost in fast IO code path, and is safe since hctx->cpumask are only
used in the two cases.

Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Andrew Theurer <atheurer@redhat.com>
Cc: Joe Mario <jmario@redhat.com>
Cc: Sebastian Jug <sejug@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Tejun Heo <tj@kernel.org>
Tesed-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Link: https://lore.kernel.org/r/20240322021244.1056223-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-01 11:53:36 -06:00
Linus Torvalds
033e8088a4 block-6.9-20240329
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmYG3agQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvX+EADYziHzPZjy4Ik30nRhc2H3J9GvI0YGSPhB
 srRWO/ElJa3CvEXmDW7gCdHoc4vKnryyC/rXkTCrTmffzYp1BBKZb9kApUvFo1xc
 ax2Pww1hFTIDf5YsTCgg9wR+953lCwNyolfND+nQj/2LLOmfbypqSCkG7bDfwuhA
 vIiwxbgBZM4yi/354xoIUTEimbSHuRzyLXyvCZo5nBxiEFTBXIQMY8UgIXRiNb7v
 zi0LRxcbtfkcUcxs1seyE3Lke8P+gkx0SPo7r9LLRTNPuJ/fwIfqHvYPDTAj2toj
 P71kELJLdDNLivmxX5kbC4EqGueo9L6aaKkYQHD4RRlM98LtxuhNHdzYt2YoXVDk
 2gg58VNZh7aNzpPlqa4FDf2Sjp0M0k3G/LX9IpmySTL0VVLYvQWhr1qyji2O1yAj
 m4W6RK1mYE98rt66cxqKKtrWYl6oJj3J0P/KcfPNe6nIdYYgefQxJwh+B4LMSfrr
 sgDUXxYIwfsbKuSeagIXEWw8FMlFO3nfSOu6BIGdRRcwNl/fJzrVODYoMrAq7+sP
 mGRnhYAz4HKfWImFFsIla+A6gJEFVGrftqeWLbeB/mjvV/kPhZT+5n+Wv8zS3bjv
 9/upgnnaIoRZo6/RA/ev/HXR3YEygZKSHK162lTzGmhw85BSxmFMq/WvB99U3m17
 Qeh+J+yviw==
 =zEIA
 -----END PGP SIGNATURE-----

Merge tag 'block-6.9-20240329' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Small round of minor fixes or cleanups for the 6.9-rc2 kernel, one
  fixing an issue introduced in 6.8"

* tag 'block-6.9-20240329' of git://git.kernel.dk/linux:
  block: Do not force full zone append completion in req_bio_endio()
  block: don't reject too large max_user_sectors in blk_validate_limits
  block: Make blk_rq_set_mixed_merge() static
2024-03-29 09:40:22 -07:00
Damien Le Moal
55251fbdf0 block: Do not force full zone append completion in req_bio_endio()
This reverts commit 748dc0b65e.

Partial zone append completions cannot be supported as there is no
guarantees that the fragmented data will be written sequentially in the
same manner as with a full command. Commit 748dc0b65e ("block: fix
partial zone append completion handling in req_bio_endio()") changed
req_bio_endio() to always advance a partially failed BIO by its full
length, but this can lead to incorrect accounting. So revert this
change and let low level device drivers handle this case by always
failing completely zone append operations. With this revert, users will
still see an IO error for a partially completed zone append BIO.

Fixes: 748dc0b65e ("block: fix partial zone append completion handling in req_bio_endio()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240328004409.594888-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-28 17:04:48 -06:00
Christian Brauner
22650a9982
fs,block: yield devices early
Currently a device is only really released once the umount returns to
userspace due to how file closing works. That ultimately could cause
an old umount assumption to be violated that concurrent umount and mount
don't fail. So an exclusively held device with a temporary holder should
be yielded before the filesystem is gone. Add a helper that allows
callers to do that. This also allows us to remove the two holder ops
that Linus wasn't excited about.

Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org
Fixes: f3a608827d ("bdev: open block device as files") # mainline only
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-27 13:17:15 +01:00
Christian Brauner
3ff56e285d
block: count BLK_OPEN_RESTRICT_WRITES openers
The original changes in v6.8 do allow for a block device to be reopened
with BLK_OPEN_RESTRICT_WRITES provided the same holder is used as per
bdev_may_open(). I think this has a bug.

The first opener @f1 of that block device will set bdev->bd_writers to
-1. The second opener @f2 using the same holder will pass the check in
bdev_may_open() that bdev->bd_writers must not be greater than zero.

The first opener @f1 now closes the block device and in bdev_release()
will end up calling bdev_yield_write_access() which calls
bdev_writes_blocked() and sets bdev->bd_writers to 0 again.

Now @f2 holds a file to that block device which was opened with
exclusive write access but bdev->bd_writers has been reset to 0.

So now @f3 comes along and succeeds in opening the block device with
BLK_OPEN_WRITE betraying @f2's request to have exclusive write access.

This isn't a practical issue yet because afaict there's no codepath
inside the kernel that reopenes the same block device with
BLK_OPEN_RESTRICT_WRITES but it will be if there is.

Fix this by counting the number of BLK_OPEN_RESTRICT_WRITES openers. So
we only allow writes again once all BLK_OPEN_RESTRICT_WRITES openers are
done.

Link: https://lore.kernel.org/r/20240323-abtauchen-klauen-c2953810082d@brauner
Fixes: ed5cc702d3 ("block: Add config option to not allow writing to mounted devices")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-27 12:59:25 +01:00
Christian Brauner
ddd65e19c6
block: handle BLK_OPEN_RESTRICT_WRITES correctly
Last kernel release we introduce CONFIG_BLK_DEV_WRITE_MOUNTED. By
default this option is set. When it is set the long-standing behavior
of being able to write to mounted block devices is enabled.

But in order to guard against unintended corruption by writing to the
block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned
off. In that case it isn't possible to write to mounted block devices
anymore.

A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES
which disallows concurrent BLK_OPEN_WRITE access. When we still had the
bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because
the mode was passed around. Since we managed to get rid of the bdev
handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based
on whether the file was opened writable and writes to that block device
are blocked. That logic doesn't work because we do allow
BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE.

Fix the detection logic and use an FMODE_* bit. We could've also abused
O_EXCL as an indicator that BLK_OPEN_RESTRICT_WRITES has been requested.
For userspace open paths O_EXCL will never be retained but for internal
opens where we open files that are never installed into a file
descriptor table this is fine. But it would be a gamble that this
doesn't cause bugs. Note that BLK_OPEN_RESTRICT_WRITES is an internal
only flag that cannot directly be raised by userspace. It is implicitly
raised during mounting.

Passes xftests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and
unset.

Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org
Link: https://lore.kernel.org/r/20240323-zielbereich-mittragen-6fdf14876c3e@brauner
Fixes: 321de651fa ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access")
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-27 09:31:41 +01:00
Christoph Hellwig
038105a200 block: don't reject too large max_user_sectors in blk_validate_limits
We already cap down the actual max_sectors to the max of the hardware
and user limit, so don't reject the configuration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240326060745.2349154-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-26 11:28:52 -06:00
John Garry
dc53d9eac1 block: Make blk_rq_set_mixed_merge() static
Since commit 8e756373d7 ("block: Move bio merge related functions into
blk-merge.c"), blk_rq_set_mixed_merge() has only been referenced in
blk-merge.c, so make it static.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240325083501.2816408-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-26 11:28:20 -06:00
Linus Torvalds
0a7b0acece vfs-6.9-rc1.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZfglxgAKCRCRxhvAZXjc
 ovK9APsF7/TMFhNbtW+JsghSyrEk0cOVPizi8JkRDDWNW3qY+wEAxtydhbmWpbKq
 MpIjMHqwjPx3zXBL8Ec/b4vAoJqpJwQ=
 =NgvO
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "This contains a few small fixes for this merge window:

   - Undo the hiding of silly-rename files in afs. If they're hidden
     they can't be deleted by rm manually anymore causing regressions

   - Avoid caching the preferred address for an afs server to avoid
     accidently overriding an explicitly specified preferred server
     address

   - Fix bad stat() and rmdir() interaction in afs

   - Take a passive reference on the superblock when opening a block
     device so the holder is available to concurrent callers from the
     block layer

   - Clear private data pointer in fscache_begin_operation() to avoid it
     being falsely treated as valid"

* tag 'vfs-6.9-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fscache: Fix error handling in fscache_begin_operation()
  fs,block: get holder during claim
  afs: Fix occasional rmdir-then-VNOVNODE with generic/011
  afs: Don't cache preferred address
  afs: Revert "afs: Hide silly-rename files from userspace"
2024-03-18 09:15:50 -07:00
Christian Brauner
59a55a63c2
fs,block: get holder during claim
Now that we open block devices as files we need to deal with the
realities that closing is a deferred operation. An operation on the
block device such as e.g., freeze, thaw, or removal that runs
concurrently with umount, tries to acquire a stable reference on the
holder. The holder might already be gone though. Make that reliable by
grabbing a passive reference to the holder during bdev_open() and
releasing it during bdev_release().

Fixes: f3a608827d ("bdev: open block device as files") # mainline only
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/r/ZfEQQ9jZZVes0WCZ@infradead.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: https://lore.kernel.org/r/CAHj4cs8tbDwKRwfS1=DmooP73ysM__xAb2PQc6XsAmWR+VuYmg@mail.gmail.com
Link: https://lore.kernel.org/r/20240315-freibad-annehmbar-ca68c375af91@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-18 10:32:44 +01:00
Jiapeng Chong
4c4ab8ae41 block: fix mismatched kerneldoc function name
No functional modification involved.

block/blk-settings.c:281: warning: expecting prototype for queue_limits_commit_set(). Prototype was for queue_limits_set() instead.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8539
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20240314025615.71269-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-14 09:40:47 -06:00
Christoph Hellwig
bf5e3a30f7 Revert "blk-lib: check for kill signal"
This reverts commit 8a08c5fd89.

It turns out while this is a perfectly valid and long overdue thing to do
for user initiated discards / zeroing from the ioctl handler, it actually
breaks file system use of the discard helper by interrupting in places
the file system doesn't expect, and by leaving the bio chain in a state
that the file system callers of (at least) __blkdev_issue_discard do
not expect.

Revert the change for now, we'll redo it for the next merge window
after refactoring the code to better split the file system vs ioctl
callers and cleaning up a few other loose ends.

Reported-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240314021623.1908895-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13 20:35:48 -06:00
Bart Van Assche
256aab46e3 Revert "block/mq-deadline: use correct way to throttling write requests"
The code "max(1U, 3 * (1U << shift)  / 4)" comes from the Kyber I/O
scheduler. The Kyber I/O scheduler maintains one internal queue per hwq
and hence derives its async_depth from the number of hwq tags. Using
this approach for the mq-deadline scheduler is wrong since the
mq-deadline scheduler maintains one internal queue for all hwqs
combined. Hence this revert.

Cc: stable@vger.kernel.org
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Cc: Zhiguo Niu <Zhiguo.Niu@unisoc.com>
Fixes: d47f9717e5 ("block/mq-deadline: use correct way to throttling write requests")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240313214218.1736147-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13 15:56:14 -06:00
Jens Axboe
b874d4aae5 block: limit block time caching to in_task() context
We should not have any callers of this from non-task context, but Jakub
ran [1] into one from blk-iocost. Rather than risk running into others,
or future ones, just limit blk_time_get_ns() to when it is called from
a task. Any other usage is invalid.

[1] https://lore.kernel.org/lkml/CAHk-=wiOaBLqarS2uFhM1YdwOvCX4CZaWkeyNDY1zONpbYw2ig@mail.gmail.com/

Fixes: da4c8c3d09 ("block: cache current nsec time in struct blk_plug")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13 14:12:53 -06:00
Linus Torvalds
bff4b74625 Revert "dm: use queue_limits_set"
This reverts commit 8e0ef41286.

It's broken, and causes the boot to fail on encrypted volumes.

Reported-and-bisected-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/all/20240311235023.GA1205@cmpxchg.org/
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-11 17:11:28 -07:00
Linus Torvalds
1ddeeb2a05 for-6.9/block-20240310
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmXuFO4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq33D/9hyNyBce2A9iyo026eK8EqLDoed6BPzuvB
 kLKj5tsGvX4YlfuswvP86M5dgibTASXclnfUK394TijW/JPOfJ3mNhi9gMnHzRoK
 ZaR1di0Lum56dY1FkpMmWiGmE4fB79PAtXYKtajOkuoIcNzylncEAAACUY4/Ouhg
 Cm+LMg2prcc+m9g8rKDNQ51pUFg4U21KAUTl35XLMUAaQk1ahW3EDEVYhweC/zwE
 V/5hJsv8UY72+oQGY2Dc/YgQk/Zj4ZDh7C+oHR9XeB/ro99kr3/Vopagu0gBMLZi
 Rq6qqz6PVMhVcuz8uN2rsTQKXmXhsBn9/adsl4AKtdxcW5D5moWb5BLq1P0WQylc
 nzMxa1d6cVcTKZpaUQQv3Rj6ZMrLuDwP277UYHfn5x1oPWYRZCG7FtHuOo1gNcpG
 DrSNwVG6BSDcbABqI+MIS2oD1JoUMyevjwT7e2hOXukZhc6GLO5F3ODWE5j3KnCR
 S/aGSAmcdR4fTcgavULqWdQVt7SYl4f1IxT8KrUirJGVhc2LgahaWj69ooklVHoU
 fPDFRiruwJ5YkH4RWCSDm9mi4kAz6eUf+f4yE06wZOFOb2fT8/1ZK2Snpz2KeXuZ
 INO0RejtFzT8L0OUlu7dBmF20y6rgAYt87lR8mIt71yuuATIrVhzlX1VdsvhdrAo
 VLHGV1Ncgw==
 =WlVL
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - MD pull requests via Song:
      - Cleanup redundant checks (Yu Kuai)
      - Remove deprecated headers (Marc Zyngier, Song Liu)
      - Concurrency fixes (Li Lingfeng)
      - Memory leak fix (Li Nan)
      - Refactor raid1 read_balance (Yu Kuai, Paul Luse)
      - Clean up and fix for md_ioctl (Li Nan)
      - Other small fixes (Gui-Dong Han, Heming Zhao)
      - MD atomic limits (Christoph)

 - NVMe pull request via Keith:
      - RDMA target enhancements (Max)
      - Fabrics fixes (Max, Guixin, Hannes)
      - Atomic queue_limits usage (Christoph)
      - Const use for class_register (Ricardo)
      - Identification error handling fixes (Shin'ichiro, Keith)

 - Improvement and cleanup for cached request handling (Christoph)

 - Moving towards atomic queue limits. Core changes and driver bits so
   far (Christoph)

 - Fix UAF issues in aoeblk (Chun-Yi)

 - Zoned fix and cleanups (Damien)

 - s390 dasd cleanups and fixes (Jan, Miroslav)

 - Block issue timestamp caching (me)

 - noio scope guarding for zoned IO (Johannes)

 - block/nvme PI improvements (Kanchan)

 - Ability to terminate long running discard loop (Keith)

 - bdev revalidation fix (Li)

 - Get rid of old nr_queues hack for kdump kernels (Ming)

 - Support for async deletion of ublk (Ming)

 - Improve IRQ bio recycling (Pavel)

 - Factor in CPU capacity for remote vs local completion (Qais)

 - Add shared_tags configfs entry for null_blk (Shin'ichiro

 - Fix for a regression in page refcounts introduced by the folio
   unification (Tony)

 - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid,
   Ricardo, Roman, Tang, Uwe)

* tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits)
  block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC
  block/swim: Convert to platform remove callback returning void
  cdrom: gdrom: Convert to platform remove callback returning void
  block: remove disk_stack_limits
  md: remove mddev->queue
  md: don't initialize queue limits
  md/raid10: use the atomic queue limit update APIs
  md/raid5: use the atomic queue limit update APIs
  md/raid1: use the atomic queue limit update APIs
  md/raid0: use the atomic queue limit update APIs
  md: add queue limit helpers
  md: add a mddev_is_dm helper
  md: add a mddev_add_trace_msg helper
  md: add a mddev_trace_remap helper
  bcache: move calculation of stripe_size and io_opt into bcache_device_init
  virtio_blk: Do not use disk_set_max_open/active_zones()
  aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts
  block: move capacity validation to blkpg_do_ioctl()
  block: prevent division by zero in blk_rq_stat_sum()
  drbd: atomically update queue limits in drbd_reconsider_queue_parameters
  ...
2024-03-11 11:43:44 -07:00
Linus Torvalds
910202f00a vfs-6.9.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4DwAKCRCRxhvAZXjc
 ooTRAQDRI6Qz6wJym5Yblta8BScMGbt/SgrdgkoCvT6y83MtqwD+Nv/AZQzi3A3l
 9NdULtniW1reuCYkc8R7dYM8S+yAwAc=
 =Y1qX
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull block handle updates from Christian Brauner:
 "Last cycle we changed opening of block devices, and opening a block
  device would return a bdev_handle. This allowed us to implement
  support for restricting and forbidding writes to mounted block
  devices. It was accompanied by converting and adding helpers to
  operate on bdev_handles instead of plain block devices.

  That was already a good step forward but ultimately it isn't necessary
  to have special purpose helpers for opening block devices internally
  that return a bdev_handle.

  Fundamentally, opening a block device internally should just be
  equivalent to opening files. So now all internal opens of block
  devices return files just as a userspace open would. Instead of
  introducing a separate indirection into bdev_open_by_*() via struct
  bdev_handle bdev_file_open_by_*() is made to just return a struct
  file. Opening and closing a block device just becomes equivalent to
  opening and closing a file.

  This all works well because internally we already have a pseudo fs for
  block devices and so opening block devices is simple. There's a few
  places where we needed to be careful such as during boot when the
  kernel is supposed to mount the rootfs directly without init doing it.
  Here we need to take care to ensure that we flush out any asynchronous
  file close. That's what we already do for opening, unpacking, and
  closing the initramfs. So nothing new here.

  The equivalence of opening and closing block devices to regular files
  is a win in and of itself. But it also has various other advantages.
  We can remove struct bdev_handle completely. Various low-level helpers
  are now private to the block layer. Other helpers were simply
  removable completely.

  A follow-up series that is already reviewed build on this and makes it
  possible to remove bdev->bd_inode and allows various clean ups of the
  buffer head code as well. All places where we stashed a bdev_handle
  now just stash a file and use simple accessors to get to the actual
  block device which was already the case for bdev_handle"

* tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  block: remove bdev_handle completely
  block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access
  bdev: remove bdev pointer from struct bdev_handle
  bdev: make struct bdev_handle private to the block layer
  bdev: make bdev_{release, open_by_dev}() private to block layer
  bdev: remove bdev_open_by_path()
  reiserfs: port block device access to file
  ocfs2: port block device access to file
  nfs: port block device access to files
  jfs: port block device access to file
  f2fs: port block device access to files
  ext4: port block device access to file
  erofs: port device access to file
  btrfs: port device access to file
  bcachefs: port block device access to file
  target: port block device access to file
  s390: port block device access to file
  nvme: port block device access to file
  block2mtd: port device access to files
  bcache: port block device access to files
  ...
2024-03-11 10:52:34 -07:00
Linus Torvalds
54126fafea vfs-6.9.iomap
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4UQAKCRCRxhvAZXjc
 ouERAQDg63R9s3bKmUgGqngf9cfr//VCTE+WVARwOUTdn2iDbwEA1IME7X1kL/Vz
 EdhEjyqO6xom+ao/Vqxe0XIDNz70vgs=
 =8RdE
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull iomap updates from Christian Brauner:

 - Restore read-write hints in struct bio through the bi_write_hint
   member for the sake of UFS devices in mobile applications. This can
   result in up to 40% lower write amplification in UFS devices. The
   patch series that builds on this will be coming in via the SCSI
   maintainers (Bart)

 - Overhaul the iomap writeback code. Afterwards ->map_blocks() is able
   to map multiple blocks at once as long as they're in the same folio.
   This reduces CPU usage for buffered write workloads on e.g., xfs on
   systems with lots of cores (Christoph)

 - Record processed bytes in iomap_iter() trace event (Kassey)

 - Extend iomap_writepage_map() trace event after Christoph's
   ->map_block() changes to map mutliple blocks at once (Zhang)

* tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
  iomap: Add processed for iomap_iter
  iomap: add pos and dirty_len into trace_iomap_writepage_map
  block, fs: Restore the per-bio/request data lifetime fields
  fs: Propagate write hints to the struct block_device inode
  fs: Move enum rw_hint into a new header file
  fs: Split fcntl_rw_hint()
  fs: Verify write lifetime constants at compile time
  fs: Fix rw_hint validation
  iomap: pass the length of the dirty region to ->map_blocks
  iomap: map multiple blocks at a time
  iomap: submit ioends immediately
  iomap: factor out a iomap_writepage_map_block helper
  iomap: only call mapping_set_error once for each failed bio
  iomap: don't chain bios
  iomap: move the iomap_sector sector calculation out of iomap_add_to_ioend
  iomap: clean up the iomap_alloc_ioend calling convention
  iomap: move all remaining per-folio logic into iomap_writepage_map
  iomap: factor out a iomap_writepage_handle_eof helper
  iomap: move the PF_MEMALLOC check to iomap_writepages
  iomap: move the io_folios field out of struct iomap_ioend
  ...
2024-03-11 10:07:03 -07:00
Colin Ian King
5205a4aa8f block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC
The helper function mac_fix_string is only required with CONFIG_PPC_PMAC,
add #if CONFIG_PPC_PMAC and #endif around the function.

Cleans up clang scan build warning:
block/partitions/mac.c:23:20: warning: unused function 'mac_fix_string' [-Wunused-function]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20240308133921.2058227-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-09 07:31:42 -07:00
Jens Axboe
d37977f0af Merge tag 'md-6.9-20240306' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block
Pull MD atomic queue limits changes from Song.

* tag 'md-6.9-20240306' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  block: remove disk_stack_limits
  md: remove mddev->queue
  md: don't initialize queue limits
  md/raid10: use the atomic queue limit update APIs
  md/raid5: use the atomic queue limit update APIs
  md/raid1: use the atomic queue limit update APIs
  md/raid0: use the atomic queue limit update APIs
  md: add queue limit helpers
  md: add a mddev_is_dm helper
  md: add a mddev_add_trace_msg helper
  md: add a mddev_trace_remap helper
2024-03-06 11:15:24 -07:00
Christoph Hellwig
dd27a84b06 block: remove disk_stack_limits
disk_stack_limits is unused now, remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed--by: Song Liu <song@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240303140150.5435-12-hch@lst.de
2024-03-06 08:59:54 -08:00
Li Lingfeng
b9355185d2 block: move capacity validation to blkpg_do_ioctl()
Commit 6d4e80db4e ("block: add capacity validation in
bdev_add_partition()") add check of partition's start and end sectors to
prevent exceeding the size of the disk when adding partitions. However,
there is still no check for resizing partitions now.
Move the check to blkpg_do_ioctl() to cover resizing partitions.

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240305032132.548958-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:32:06 -07:00
Roman Smirnov
93f52fbeaf block: prevent division by zero in blk_rq_stat_sum()
The expression dst->nr_samples + src->nr_samples may
have zero value on overflow. It is necessary to add
a check to avoid division by zero.

Found by Linux Verification Center (linuxtesting.org) with Svace.

Signed-off-by: Roman Smirnov <r.smirnov@omp.ru>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/20240305134509.23108-1-r.smirnov@omp.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:31:54 -07:00
Li kunyu
5f2ad31fbb sed-opal: Remove the ret variable from the function
The ret variable in the function has not yet been effective and can be
removed.

Signed-off-by: Li kunyu <kunyu@nfschina.com>
Link: https://lore.kernel.org/r/20240306101444.1244-1-kunyu@nfschina.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:29:49 -07:00
Li kunyu
2449be8c8c sed-opal: Remove unnecessary ‘0’ values from ret
ret is assigned first, so it does not need to initialize the assignment.

Signed-off-by: Li kunyu <kunyu@nfschina.com>
Link: https://lore.kernel.org/r/20240306100659.106521-1-kunyu@nfschina.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:29:43 -07:00
Li zeming
217fcc4807 sed-opal: Remove unnecessary ‘0’ values from err
err is assigned first, so it does not need to initialize the assignment.

Signed-off-by: Li zeming <zeming@nfschina.com>
Link: https://lore.kernel.org/r/20240306100216.69340-1-zeming@nfschina.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:29:38 -07:00
Li zeming
147fe61334 sed-opal: Remove unnecessary ‘0’ values from error
error is assigned first, so it does not need to initialize the assignment.

Signed-off-by: Li zeming <zeming@nfschina.com>
Link: https://lore.kernel.org/r/20240306095608.26839-1-zeming@nfschina.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:29:31 -07:00
Ricardo B. Marliere
f8c7511db0 block: make block_class constant
Since commit 43a7206b09 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the block_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240305-class_cleanup-block-v1-1-130bb27b9c72@marliere.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:29:20 -07:00
Tony Battersby
38b43539d6 block: Fix page refcounts for unaligned buffers in __bio_release_pages()
Fix an incorrect number of pages being released for buffers that do not
start at the beginning of a page.

Fixes: 1b151e2435 ("block: Remove special-casing of compound pages")
Cc: stable@vger.kernel.org
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Tested-by: Greg Edwards <gedwards@ddn.com>
Link: https://lore.kernel.org/r/86e592a9-98d4-4cff-a646-0c0084328356@cybernetics.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:26:42 -07:00
Christian Brauner
86835c39e0 vfs-6.9.rw_hint
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZdcKwQAKCRCRxhvAZXjc
 oldXAP4uzKixPvJeJmmuLs8Yl2X4g4SnxXFoLwMjCOxGSH1DWQD+Oj0nGs81lIKm
 iLCZwk09JzfVEat/6KVmkjiqLLTwNgw=
 =TmTQ
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZeYHOAAKCRCRxhvAZXjc
 opwvAP0fqxfEAS04/MNdYSf0dA5GMr8v+8RBablWtkVuOMMbRQD/RMFJKXK02afq
 B4YUemRHtYETdbV69+yzninHy8y4gQQ=
 =ThqF
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9.rw_hint' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull write hint fix from Christian Brauner:

UFS devices are widely used in mobile applications, e.g. in smartphones.
UFS vendors need data lifetime information to achieve good performance.
Providing data lifetime information to UFS devices can result in up to
40% lower write amplification. Hence this patch series that restores the
bi_write_hint member in struct bio. After this patch series has been
merged, patches that implement data lifetime support in the SCSI disk
(sd) driver will be sent to the Linux kernel SCSI maintainer.

The following changes are included in this patch series:

- Improvements for the F_GET_RW_HINT and F_SET_RW_HINT fcntls.
- Move enum rw_hint into a new header file.
- Support F_SET_RW_HINT for block devices to make it easy to test data
  lifetime support.
- Restore the bio.bi_write_hint member and restore support in the VFS
  layer and also in the block layer for data lifetime information.

The shell script that has been used to test the patch series combined
with the SCSI patches is available at the end of this cover letter.

* tag 'vfs-6.9.rw_hint' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
  block, fs: Restore the per-bio/request data lifetime fields
  fs: Propagate write hints to the struct block_device inode
  fs: Move enum rw_hint into a new header file
  fs: Split fcntl_rw_hint()
  fs: Verify write lifetime constants at compile time
  fs: Fix rw_hint validation

Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-04 18:35:21 +01:00
Christoph Hellwig
8e0ef41286 dm: use queue_limits_set
Use queue_limits_set which validates the limits and takes care of
updating the readahead settings instead of directly assigning them to
the queue.  For that make sure all limits are actually updated before
the assignment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20240228225653.947152-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01 08:54:42 -07:00
Christoph Hellwig
c1373f1cf4 block: add a queue_limits_stack_bdev helper
Add a small wrapper around blk_stack_limits that allows passing a bdev
for the bottom device and prints an error in case of misaligned
device. The name fits into the new queue limits API and the intent is
to eventually replace disk_stack_limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240228225653.947152-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01 08:54:42 -07:00
Christoph Hellwig
631d4efb80 block: add a queue_limits_set helper
Add a small wrapper around queue_limits_commit_update for stacking
drivers that don't want to update existing limits, but set an
entirely new set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240228225653.947152-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01 08:54:42 -07:00